Why one ChatGPT query tells you almost nothing about your brand visibility

We wanted to know how much that single screenshot is actually worth as evidence. So we took 19 real consumer prompts pulled from anonymized panel data of how people talk to AI chatbots, ran each one through ChatGPT ten times under identical conditions (same model, same US location, same default settings, web search enabled), and looked at how stable the brand recommendations were across the ten runs.

The short version: two runs of the same prompt share, on average, only 55% of the brands that get recommended. About one in four brands that ChatGPT recommends shows up in only one of ten runs. Fewer than one in six show up in all ten. The single screenshot is hiding more than it shows.

Here is what we did and what we found.

The setup

We selected 19 prompts from publicly available research-purpose data on real chatbot conversations. Every prompt was chosen to be a category-level brand recommendation question that a real person actually asked an AI chatbot. Examples: “what is the best phone”, “Best natural multivitamin and minerals”, “give me a configuration for a good gaming pc”, “best hotel in toronto”, “Give me the top 10 highest yield savings accounts available right now”. We excluded any prompt that named a specific brand, anything from professional or research contexts (consultant queries, market research, business plan questions), and anything outside English.

Each prompt was sent to ChatGPT ten times. Same model, same settings, US location, web search enabled in every call. Then we used a separate, cheaper model to extract the list of brand names that the response recommended in each run (filtering out brands mentioned only as comparisons or warnings), and normalized minor naming variations into canonical brand entries. The methodology follows our broader hypothesis-first testing approach: pre-stated hypothesis, controlled variables, full results published.
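The normalization step can be pictured with a small sketch. This is an illustration only: the alias table, the function names, and the choice of canonical spellings are hypothetical, and the mapping actually used in the experiment is not published here.

```python
import re

# Hypothetical alias table (assumption): maps cleaned-up spellings to one
# canonical brand entry. The experiment's real mapping is not shown in the article.
ALIASES = {
    "wd": "Western Digital",
    "western digital": "Western Digital",
    "amex": "American Express",
    "american express": "American Express",
}

def canonical_brand(raw):
    """Lowercase, drop punctuation, collapse whitespace, then look up an alias."""
    key = re.sub(r"[^a-z0-9 ]", "", raw.lower())
    key = re.sub(r"\s+", " ", key).strip()
    # Unknown brands pass through unchanged apart from trimmed whitespace.
    return ALIASES.get(key, raw.strip())

def canonical_set(brands):
    """Reduce one run's extracted brand list to a set of canonical entries."""
    return {canonical_brand(b) for b in brands}

run_brands = canonical_set(["WD", "Western Digital", " Amex ", "SoFi"])
print(sorted(run_brands))
```

Working with sets per run means a brand counts once per response no matter how many times the text repeats it.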

Seven of the 19 prompts produced few or no brand recommendations. Most of these were questions about destinations or supplement categories rather than products from named companies (“Which of the islands just south of Tokyo are most populated”, “Top vitamins for over 55 years in age”). That is itself a useful finding (some prompts simply don't surface branded recommendations), but it makes them unsuitable for measuring variance in brand recommendations, so the rest of the analysis is on the 12 prompts that did surface brands. That is 120 individual responses.

Headline numbers

Across the 12 prompts:

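Mean pairwise Jaccard similarity: 0.55. Jaccard similarity is the number of brands two runs have in common divided by the number of brands named in either run. So for any two runs of the same prompt, just over half of the brands named appear in both, on average. The rest appear in only one of the two.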
27% of recommended brands are “ghosts.” They appear in exactly one of the ten runs and never again.
Only 16% of recommended brands are stable. These are the brands that show up in all ten runs.
57% are in the unstable middle: they show up in two to nine of the ten runs.
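
The three buckets above can be computed directly from per-run brand sets. A minimal sketch, assuming each run has already been reduced to a set of canonical brand names; the function name and the toy data are illustrative and not taken from the experiment:

```python
from collections import Counter

def bucket_shares(runs):
    """Return the share of brands that are ghosts, middle, or stable.

    ghost  = mentioned in exactly one run
    stable = mentioned in every run
    middle = everything in between
    """
    n_runs = len(runs)
    # Count each brand at most once per run.
    counts = Counter(brand for run in runs for brand in set(run))
    if not counts:
        return {"ghost": 0.0, "middle": 0.0, "stable": 0.0}
    buckets = Counter()
    for mentions in counts.values():
        if mentions == 1:
            buckets["ghost"] += 1
        elif mentions == n_runs:
            buckets["stable"] += 1
        else:
            buckets["middle"] += 1
    total = len(counts)
    return {name: buckets[name] / total for name in ("ghost", "middle", "stable")}

# Toy data (made up): Alpha in all ten runs, Beta in five, Gamma and Delta once each.
toy_runs = [{"Alpha"} for _ in range(10)]
for i in range(5):
    toy_runs[i].add("Beta")
toy_runs[0].add("Gamma")
toy_runs[9].add("Delta")

print(bucket_shares(toy_runs))
```

In the toy data, two of the four brands are ghosts, one is in the middle, and one is stable.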

The mention rate distribution makes the shape clear:

The 1/10 bucket (ghost brands) and the 10/10 bucket (stable brands) are both bigger than any middle bucket. AI search recommendations look more like sampling from a candidate pool than a settled ranking.

The most common bucket is 1 of 10 (47 brand-prompt cells), and the second most common is 10 of 10 (28). Almost every frequency in between is also represented, but none is as large as either extreme. That two-peaked shape is what you'd expect if AI search is doing something more like sampling from a candidate pool than producing a settled ranking.

Three concrete examples

To make this less abstract, here are three of the 12 prompts.

“what is the best phone” - Jaccard 0.90

The most stable prompt in the set. Apple, Samsung, and Google show up in every single run. Two ghosts (ASUS once, OnePlus once) appear and disappear in the margins, but the core triopoly is locked in.

Brand	r1	r2	r3	r4	r5	r6	r7	r8	r9	r10	n/10
Apple	●	●	●	●	●	●	●	●	●	●	10/10
Google	●	●	●	●	●	●	●	●	●	●	10/10
Samsung	●	●	●	●	●	●	●	●	●	●	10/10
OnePlus	·	·	·	·	●	·	·	·	·	·	1/10
ASUS	·	·	·	·	·	·	●	·	·	·	1/10

This is what most marketing teams imagine AI search looks like. A small, stable set of incumbents. It is the exception, not the rule.
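
The 0.90 figure can be reproduced from the table above. A minimal sketch, assuming mean pairwise Jaccard is the plain average over all 45 pairs of runs; the function names are ours, for illustration:

```python
from itertools import combinations

def jaccard(a, b):
    """Shared brands divided by all brands named in either run."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def mean_pairwise_jaccard(runs):
    """Average Jaccard similarity over every unordered pair of runs."""
    pairs = list(combinations(runs, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Brand sets transcribed from the "what is the best phone" table.
core = {"Apple", "Google", "Samsung"}
phone_runs = [set(core) for _ in range(10)]
phone_runs[4].add("OnePlus")  # r5
phone_runs[6].add("ASUS")     # r7

score = mean_pairwise_jaccard(phone_runs)
print(round(score, 2))  # 0.9
```

The 28 pairs of identical runs score 1.0, the 16 pairs that differ by one ghost score 0.75, and the single pair containing both ghosts scores 0.6, which averages to about 0.90.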

“give me a configuration for a good gaming pc” - Jaccard 0.66

AMD and NVIDIA are anchored at 10 of 10. So are Corsair, Fractal Design, Lian Li, and NZXT (the case and cooling end of the market is more concentrated than you might think). But the rest appear in some runs and not others: MSI in 8 of 10, ASUS in 6, Noctua in 6, DeepCool in 4, EVGA in 4, followed by a long tail of Samsung, WD, Thermalright, Gigabyte, and Arctic.

A vendor of cooling components would look at this and conclude very different things depending on which run they happened to see. One screenshot might surface them. The next one might not. In one of the ten runs, neither Noctua nor DeepCool shows up at all. In another, both do.

“Give me the top 10 highest yield savings accounts available right now” - Jaccard 0.40

Twenty-six different banks were recommended across ten runs. Only SoFi appeared in all ten. Ally, Marcus, and Synchrony showed up in eight or nine. After that, a long, busy middle: Western Alliance Bank in 7 of 10, CIT Bank in 6, Discover in 6, American Express in 6, Bread Savings in 5, Newtek Bank in 4. Then a graveyard of one-time appearances: EverBank, Flagstar, Jenius Bank, My Banking Direct, Poppy Bank, Popular Direct, TAB Bank, Varo.

If you are a head of growth at Newtek Bank, your mention rate is 40%. If you are at Jenius Bank, your mention rate is 10%. A single screenshot will tell you almost nothing about which of those two you are.

By category

Some categories are dramatically more volatile than others.

Distribution of per-prompt mean pairwise Jaccard across the 12 prompts. Mean = 0.55. The heaviest cluster sits between 0.3 and 0.5 - the kind of overlap you'd be uncomfortable calling reproducible.

Consumer electronics is the most stable category in our sample (mean Jaccard 0.65, 30% of brands stable across all runs). Wellness and travel are in the middle (Jaccard 0.61 and 0.59). Fintech and banking is the least stable (Jaccard 0.38, with just 2% of brands stable across all 10 runs).

In every category, a meaningful share of brands appears in only one of ten runs.

Share of recommended brands per category that are stable (10/10), middle (2–9/10), or ghosts (1/10). Fintech is dominated by inconsistency and ghosts; consumer electronics has the largest stable core.

The pattern lines up with how concentrated each category is in the real world. Phones are a triopoly. PC components have a handful of dominant vendors. High yield savings is fragmented across dozens of mid-sized digital banks competing for the same SEO real estate, and the model rotates through them. Stability of AI search recommendations appears to be a function of how concentrated the underlying market is, which is exactly what you'd expect if the model is drawing from web sources that themselves rotate their recommendations. (We've written separately about how to find those underlying sub-searches.)

What this means in practice

A few things follow from this if you care about how your brand shows up in AI search.

A brand's “presence” is a rate, not a binary. Asking “does ChatGPT mention us” is asking the wrong question. The right question is “what share of runs mention us, and how does that compare to competitors.” Apple is mentioned in 100% of “best phone” runs. Newtek is mentioned in 40% of “best HYSA” runs. Both are real. Neither is captured by a single screenshot.

A single test is a sample of size one. If you only see one run and your brand is there, you don't know if you're at 100% or 20%. If you only see one run and you aren't there, you don't know if you're at 0% or 80% with a missed coin flip. The variance in our data is large enough that a single test gives almost no information about the underlying rate.
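
To put rough numbers on the sample-of-one problem, here is a sketch of the binomial arithmetic. It assumes each run is an independent draw with a fixed mention rate, which is a simplification the experiment did not test; the function name is illustrative.

```python
from math import comb

def prob_k_mentions(rate, n_runs, k):
    """Binomial probability of exactly k mentions in n_runs independent runs."""
    return comb(n_runs, k) * rate**k * (1 - rate) ** (n_runs - k)

# One run, brand present: how likely is that under different true rates?
for rate in (0.2, 0.8, 1.0):
    print(f"true rate {rate:.0%}: P(seen in a single run) = {prob_k_mentions(rate, 1, 1):.2f}")

# One run, brand absent: the "missed coin flip" for a brand at 80%.
print(f"P(absent in a single run at 80%) = {prob_k_mentions(0.8, 1, 0):.2f}")
```

Under this assumption a brand with a true rate of 20% still appears in one screenshot out of five, and a brand at 80% is missing from one in five, so a single observation is compatible with almost any underlying rate.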

The right minimum is ten runs per prompt, not three. With ten runs, mention rates resolve well enough to sort brands into bands. The brands you'd actually want to defend are the ones in the 8–10 of 10 band. The ones you'd want to monitor are in the 2–7 of 10 band, where small changes in your content or visibility could move the rate up or down. The ghosts (1 of 10) are usually noise but not always; some are real edge-case competitors who got mentioned because of a particular search result.
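
One way to see how much the run count matters is to put an interval around an observed mention rate. The sketch below uses the Wilson score interval, which is our choice for illustration and not a method the experiment reports using; it also assumes independent runs.

```python
from math import sqrt

def wilson_interval(mentions, runs, z=1.96):
    """Approximate 95% Wilson score interval for a mention rate."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = mentions / runs
    denom = 1 + z**2 / runs
    center = (p + z**2 / (2 * runs)) / denom
    half = z * sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot.
    return max(0.0, center - half), min(1.0, center + half)

for mentions, runs in [(1, 1), (3, 3), (4, 10), (10, 10)]:
    low, high = wilson_interval(mentions, runs)
    print(f"{mentions}/{runs}: {low:.0%} to {high:.0%}")
```

Under these assumptions, one hit in one run is consistent with a true rate anywhere from roughly 21% to 100%, while 4 of 10 narrows the range to roughly 17% to 69%. That is still wide, which fits treating ten runs as a floor and not a target.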

Variance is highest where the market is most fragmented. If you operate in a concentrated category like phones, GPUs, or major hotel chains, your visibility is probably either solid or essentially absent, and one good audit will tell you which. If you operate in a fragmented category like high yield savings, supplements outside the household brands, or boutique hotels, you need a rate, not a snapshot. The single screenshot will mislead you in either direction.

Caveats and limits

This is a small experiment: 19 prompts (12 of which produced usable brand lists), 4 categories, English only, US location, one model, one snapshot in time. We didn't run a temperature-zero control, so some of the variance we see could be reduced by parameter tweaks (though the prompts that real users send don't come with temperature settings). And we measured brand recommendations specifically: there are other things you might want to track in AI search responses, like factual claims about a brand, sentiment, or which sources the model cites.

What this experiment can tell you, defensibly: a single ChatGPT run is not a measurement. It is one observation drawn from a distribution that, for most prompts that surface brand recommendations, is wider than the marketing tooling currently sold around AI search visibility implicitly assumes.

If you've been auditing AI search by taking one screenshot, you're not measuring what you think you are. Aiso's brand visibility platform runs ten replicates per prompt by default for exactly this reason - it's the floor below which the data isn't strong enough to make decisions on.