The first time we ran a presence-rate test against ChatGPT API for a client, we got a reading of 0.8. We ran the same prompt against the same model thirty seconds later and got 0.2. We then watched a real customer of theirs open ChatGPT and ask roughly the same thing during the same hour. The brand was mentioned in the answer.
Same prompt. Same model. Same morning. Three answers, three different stories. That is the problem every AI-visibility tool has to solve and the reason a BrandCrux scan does what looks like too much work.
The math first
From the screenshot above: 40 tracked prompts, four AI engines, 160 completed executions, 830 readings. The 160 is easy: 40 prompts × 4 engines = 160 (prompt, engine) cells. The 830 is the part we get asked about.
Each cell is sampled multiple times. On the default confidence setting we shoot for 5 readings per cell. So 160 × 5 = 800 planned. The remaining 30 are retries on calls that came back rate-limited, truncated, or with a schema we could not parse. The screenshot caught a scan where the retries cost us 30 extra readings; on a Sunday afternoon when the engines are quiet we have seen 5.
Why we sample more than once per cell
Large language models are not deterministic at the API level. ChatGPT API at temperature 1.0 (the default for OpenAI's recommendation traffic) draws from a sampling distribution, so the same prompt-and-system-context can name your brand on one call and skip it on the next. The variance is largest exactly where it matters most: for brands that are borderline mentions. A brand that AI never mentions returns presence 0/N every time. A brand AI almost always mentions returns N/N. The interesting middle, where someone is on the cusp of breaking through, is where one sample tells you almost nothing.
We tested this with a labelled set of forty Indian banks across the four engines. A single reading per (prompt, engine) cell agreed with the five-reading consensus only 61% of the time on borderline cells. Three readings got us to 84%. Five readings got us to 92%. Going beyond five gave us diminishing returns: seven readings only added another two points and cost 40% more credits.
That is why five is the default. It is the smallest sample size that produces a presence rate you can sign your name to.
Why we run all four engines, every time
Engines disagree more than people think. Claude Opus has its own pre-training cutoff and its own bias toward the sources it has indexed; Gemini leans on Google's knowledge graph for entity grounding; Perplexity follows live web citations; OpenAI's GPT family weighs whatever combination of pre-training and search snippets the variant happens to use.
On the same forty-bank dataset, the engine with the highest disagreement against the four-engine average was Claude, which under-mentioned smaller private-sector banks by about 12 points. The smallest disagreement was ChatGPT API, which sat within 4 points of the average on most prompts. None of the four engines was a good standalone proxy. If you scanned just Claude you would have undersold a real-traffic brand by a full grade.
So we do the work of running all four. The cost is a credit per call. The reward is one number you can defend.
What "readings" rolls up to
A reading is a single API call to one engine for one prompt. We store it raw, with the engine name, the completion text, the cited URLs, the inferred sentiment, and the timestamp. The downstream pipeline computes:
- Presence rate for that (prompt, engine) cell, with a Wilson confidence interval that widens when the sample is small.
- Engine score for the prompt: the per-cell presence rates averaged across the engines you chose to test, weighted by your active-engine config.
- AI Visibility score for the topic: every prompt mapped to its authority topic, scores averaged within the topic.
- Citation refs: every URL the engines cited across all 830 readings, deduped and attributed back to the prompts that surfaced them.
The score you read on the dashboard is the top of a pyramid that has 830 raw measurements underneath it. The dashboard never lies to you with a single sample.
What we deliberately do not do
We do not run forty samples per cell. Some tools we have looked at do; the marginal information past five is not worth the credit bill, and we would rather spend that budget running more distinct prompts than re-sampling the same one.
We also do not silently swap engines when one is rate limited. If Perplexity is down for an hour, the scan surfaces it as failed cells in that row, not as a quietly-filled-in placeholder. The credit estimate on the approval screen accounts for the four engines you picked; a partial scan refunds the unused budget.
Why this matters for your bill
A 40-prompt scan against four engines on default confidence spends roughly 800 credits. Bump to high confidence (7x per cell) and you are at 1,120 credits. Drop to two engines and you halve both numbers. The credit-approval card on every scan run shows the exact breakdown before you click Run, so you can pick the confidence level that fits the question. Quarterly board report? Run high. Daily health check? Run low and trust the trend over the absolute value.
Five readings × four engines × forty prompts × a few retries is what good measurement of a non-deterministic system looks like. The 830 number on the scan card is not waste. It is the smallest number we found that lets us tell you something true.
