LLM recommendation bias gives famous brands an edge

Shopping search used to be a shelf: ranked links, star ratings, ads, filters, and your own patience. Now the shelf talks. That makes the person who builds the shelf more powerful, and it makes the copy you feed into it more dangerous.

LLM recommendation bias is now measurable enough to plan around: Chu and Hou found well known skincare brands were recommended 100 percent of the time when products had identical specs.

A new paper, submitted to arXiv on June 16, 2026, tests how large language models recommend consumer products when brand reputation, ratings, and marketing language collide. The primary keyword is LLM recommendation bias, and the key number is brutal: 100 percent recommendation share for the incumbent under equal product specs. The paper tests three commercial models, GPT-4o-mini, Claude Sonnet, and Gemini 3 Flash, mostly in skincare, a category where buyers often cannot judge quality before purchase.

This matters because AI shopping is leaving the demo stage. OpenAI says ChatGPT can show product options with images, product details, and purchase links when a query suggests shopping intent. Google said in May 2025 that AI Mode was being built for shopping from inspiration through buying. Adobe’s January 2025 survey found that one in five consumers used generative AI for product discovery at least once per week.

The headline read is simple. If you sell a product, your brand page is becoming model input. If you run a marketplace, your recommender is becoming a competition regulator. If you build an AI assistant, you now own a slice of the consumer decision stack whether you wanted that job or not.

What did the paper actually test in LLM recommendations?

Chu and Hou call the first result a “Conditional Monopoly,” and their abstract says well known brands were recommended 100 percent of the time with an Incumbent Advantage Index of 10.0 when all products had the same specifications. That is the part marketers will screenshot. The more useful part is the condition: the monopoly disappeared when a competitor gained less than a 0.1 star rating advantage.

That second number changes the interpretation. The models did favor the familiar brand, but they did not treat brand as destiny. They behaved like systems that use brand as a proxy when other signals tie, then drop it quickly when a cleaner quality signal appears. For a builder, that is the difference between a rigged game and a brittle heuristic.

The authors then tested what happens when product descriptions use authority style marketing language. Their abstract says fabricated clinical evidence claims broke the incumbent monopoly at a Bias Surplus Value of plus 0.17 rating points. Put plainly: a small rating edge from reputation could be offset by language that made the challenger sound more evidence backed, even when the clinical claims were fabricated in the experiment.

The third experiment is the one that should worry every growth team. The paper says that in multi brand GEO competition, the payoff proxy fell from plus 0.802 to plus 0.007 when all brands adopted the same optimization strategy. It also says non participating brands received zero recommendations in the tests.

The chart below shows that collapse. The useful story is not the exact payoff formula, which is the authors’ proxy. The useful story is the shape: optimization works when few people do it, then becomes table stakes when everyone piles in.

Source: Chu and Hou, arXiv 2606.17443. The GEO payoff proxy falls from 0.802 when one brand optimizes to 0.007 when all brands optimize. Data Today benchmark.

This matches the older GEO literature more than the SEO analogy most people reach for. The first paper to formalize generative engine optimization described GEO as a black box framework for improving source visibility in generative engine responses. Chu and Hou move the problem from “can a page get cited?” to “can a brand steer a recommendation?” That is a more expensive question because it touches revenue, product trust, and consumer law at once.

If you want the background on why builders started caring about answer engine visibility in the first place, Data Today covered the mechanics in how to rank inside an LLM. The new paper adds a harsher lesson: once ranking becomes recommendation, content optimization starts to look like market design.

Why should builders care about a skincare experiment?

Skincare sounds soft until you map it to your own funnel. The paper picked skincare because it is an experience good, meaning buyers struggle to verify quality before using it. That describes plenty of software too: security tools, AI agents, analytics platforms, workflow automation, and most vertical SaaS bought by a non technical executive.

When the buyer asks an LLM “which tool should I use?” the model has to compress messy signals into a short list. Brand becomes a shortcut. Ratings become a shortcut. Claims like “clinically tested,” “SOC 2 ready,” or “benchmarked on production workloads” become shortcuts too. The danger is that shortcuts scale better than diligence.

Here is the builder consequence:

If you are an incumbent, the paper’s 100 percent result says brand memory can still carry you when evidence is tied, but the less than 0.1 star reversal says a small public quality signal can puncture that edge.
If you are a challenger, the plus 0.17 Bias Surplus Value result says language can move the model, but fabricated authority claims create legal and trust debt you may never pay down.
If you run a marketplace or assistant, the zero recommendation result for non participants says your system can accidentally punish brands that refuse to play the optimization game.
If you sell through AI search, the 0.802 to 0.007 payoff collapse says GEO will become a cost center once competitors copy the tactic.

That last point is underrated. A growth hack with a private payoff becomes a tax when the whole category uses it. SEO taught this lesson over 20 years. GEO may teach it faster because LLMs compress the surface area: many links become one answer, many products become three cards, many brands become a sentence.

The legal edge is sharper than the marketing edge. The Federal Trade Commission says advertisers need a reasonable basis for objective product claims, and it says health or safety claims must be backed by scientific evidence. The agency’s health products guidance says randomized controlled human clinical trials are generally the most reliable evidence for health benefit claims. That matters because the paper’s manipulated language includes fabricated clinical evidence claims, exactly the kind of copy a desperate growth team might rationalize as “model friendly.”

For AI product teams, there is also a system design problem. OpenAI says product results in ChatGPT Search are selected independently and are not ads. The same help page says product review summaries and some labels are generated from available information and are not guarantees or verified statements. That gap between independent selection and unverified input is where manipulation lives.

The business risk is boring in the way expensive things are boring. Your roadmap suddenly needs recommendation audits, product claim verification, merchant data freshness, and appeal paths for excluded brands. Your support team will get the complaint when the model recommends the wrong serum, shoe, laptop, or compliance vendor. Your sales team will ask why a competitor with worse specs keeps showing up in the assistant. Your legal team will ask who approved the “clinically proven” sentence the model kept repeating.

What should you change before your product depends on AI shopping?

Start by separating visibility work from evidence work. GEO is real enough to budget for, but the paper’s numbers argue against treating it as magic. A 100 percent incumbent edge under identical specs can vanish with a tiny rating advantage, and a 0.802 payoff can collapse to 0.007 when everyone copies the tactic.

For sellers, the first move is a claim ledger. Every product claim you want an LLM to repeat should map to a source, owner, and review date. If the claim says “clinical,” “certified,” “fastest,” “most secure,” or “best price,” it needs evidence that can survive a regulator, a journalist, or a competitor. The FTC’s updated endorsement guides say review practices can be unfair or deceptive when companies boost, organize, suppress, or edit reviews to distort consumer opinion, so do not launder weak claims through review snippets.

For assistant builders, use adversarial product evals. Do not only test whether the model recommends reasonable products. Test whether a synthetic challenger can win by adding authority language, fake studies, or inflated review summaries. Chu and Hou’s paper uses 4 figures and 11 tables, which is a reminder that recommendation behavior deserves measurement, not vibes.

A practical evaluation suite should include at least four cases:

Equal specs with incumbent and challenger brands, so you can measure default brand pull.
Slight rating deltas such as 0.1 stars, so you can see when quality signals beat reputation.
Authority language with unsupported claims, so you can test whether the model rewards bogus evidence.
Multi brand optimization, so you can see whether the system collapses into everyone saying the same thing.

For marketplaces, give brands a clean data path. OpenAI says merchants can provide a direct product feed so ChatGPT can better reflect current product information. That is the right instinct, but feeds should not become a paid or privileged moat unless you want the recommender to hard code incumbent advantage under a nicer name.

For investors and operators, watch costs. If your category moves from SEO to GEO, content budgets will shift from pages and backlinks to structured product data, evidence libraries, evaluator runs, and monitoring. The winners will not be the teams with the loudest adjectives. They will be the teams whose claims are easiest for a model to verify and hardest for a competitor to imitate.

What happens when every brand writes for the model?

The paper’s social dilemma result is the best reason to stay calm. When one brand optimizes, it can gain. When every brand optimizes, the private advantage nearly disappears. The authors report a payoff proxy drop from 0.802 to 0.007, which is roughly a 99 percent collapse by simple calculation from their two values.

That does not mean GEO dies. It means GEO becomes infrastructure. Schema markup did not disappear because everyone learned it. Product feeds did not disappear because every retailer had one. They became the boring floor beneath the real contest.

The real contest will be evidence quality. If an LLM has to choose among five brands that all say “dermatologist tested,” the assistant needs a way to know which claim points to a real study, which points to a survey, which points to a paid influencer, and which points to air with a lab coat on. OpenAI’s shopping research announcement says the system researches across the internet, reviews quality sources, and builds a personalized buyer’s guide in minutes. Minutes are useful. Verification still has to be designed.

This is where product teams can build a moat. Make your public evidence machine readable. Publish current specs. Keep prices synchronized. Expose certifications. Put review provenance in order. If your claims require a sales call to verify, do not be surprised when a model prefers the competitor whose proof is dull, public, and structured.

Who owns the shelf when the shelf talks back?

LLM recommendation bias gives incumbents an edge, but the edge is fragile enough to fight and risky enough to audit. The paper’s 100 percent result should scare challengers. The plus 0.17 result should scare regulators. The 0.802 to 0.007 result should scare anyone planning to sell GEO as a permanent moat.

The next shopping interface will not feel like search. It will feel like advice. Advice has a higher duty than ranking.

LLM recommendation bias gives famous brands an edge

What did the paper actually test in LLM recommendations?

Why should builders care about a skincare experiment?

What should you change before your product depends on AI shopping?

What happens when every brand writes for the model?

Who owns the shelf when the shelf talks back?

Sources

More from Research

SysAdmin benchmark: reasoning models seek power 58% of the time

Anthropic J-space reveals LLM thoughts, but not fully

Format Sensitivity Index exposes LLM benchmark gaps