Updated April 2026
TL;DR: Run 30 prompts, five times each, across four engines (600 data points, about four hours of work). Score five metrics: Citation Rate, Share of Model, Prominence, Sentiment Polarity, and Citation Source URL. Triage the gaps into foundational, platform-specific, or prominence, fix each with its own playbook, and rerun monthly, tracking audit-over-audit deltas rather than absolute numbers.
Let me back up. Most AI visibility audits fail not from measurement error but from expectation mismatch. A founder runs the audit hoping it will tell them why their organic revenue is down. It will not. It tells them whether their brand shows up when an AI assistant is asked a category-defining question, and where they sit in the citation ranking. That's it.
Most teams I've talked to don't run their second audit. Either the first one didn't surface anything actionable, or the cadence collapsed under the rest of their workload. They run 30 prompts once, see a single number ("we appear in 23% of responses"), and have no idea what to do with it. Was 23% good? Did Perplexity and ChatGPT agree? They don't know — a one-shot audit is the wrong unit of analysis.
Tim Soulo (CMO at Ahrefs) put it cleanly in his Feb 2026 roundup: AI visibility tracking is "still very early-stage" and brands building processes around it now have a first-mover window. He's right, but I'd add the unglamorous version: the methodology is also still wobbly. If you want a number you can defend in a board meeting, you need to know what the audit can and cannot prove.
What it answers: whether you're cited in AI responses for category queries, how often relative to competitors, where (first mention, mid-list, footnote), the sentiment, and which engines treat you well versus badly. What it does not answer: revenue causation, share of attention vs share of citation, churn risk, or whether AI search will be 5% or 50% of your funnel by 2027. Anyone selling you a tool that claims to answer those is selling a forecast dressed as a measurement.
Five columns. If your audit captures these and nothing else, you're 90% of the way there.
| Metric | Formula | What it tells you |
|---|---|---|
| Citation Rate | (Prompts where brand appears / Total prompts) × 100 | Baseline visibility. The "do they know I exist" number. |
| Share of Model | (Your mentions / Total brand mentions in response set) × 100 | Competitive position. Your slice of the pie when the pie is "all brands the model named". |
| Prominence Score | Weighted: first-mention=3, listed=2, mentioned=1, averaged across appearances | Quality of citation. Being mentioned third in a list of seven is not the same as being the headline answer. |
| Sentiment Polarity | +1 (positive) / 0 (neutral) / -1 (negative), averaged | How the model frames you when it cites you. |
| Citation Source URL | The URL the model attributes the claim to (when given) | Which of your pages (or competitors' pages) is feeding the model. |
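If you'd rather compute these in code than eyeball them, here's a minimal sketch of the formulas. The field names are mine (they match the spreadsheet schema further down), not a standard; Share of Model follows the worked example later in this piece, where the denominator counts brands named in the responses where you appear.

```python
from dataclasses import dataclass

# Prominence weights from the table above: first mention = 3, listed = 2, passing mention = 1.
WEIGHTS = {"first": 3, "listed": 2, "mentioned": 1}

@dataclass
class Response:
    appeared: bool          # did the brand show up in this response?
    prominence: str | None  # "first" | "listed" | "mentioned" when it appeared
    sentiment: int | None   # +1 / 0 / -1 when it appeared
    brands_named: int       # distinct brands named in this response

def citation_rate(rows: list[Response]) -> float:
    """(Prompts where brand appears / total prompts) x 100."""
    return 100 * sum(r.appeared for r in rows) / len(rows)

def share_of_model(rows: list[Response]) -> float:
    """(Your mentions / total brand mentions in responses where you appear) x 100."""
    hits = [r for r in rows if r.appeared]
    total = sum(r.brands_named for r in hits)
    return 100 * len(hits) / total if total else 0.0

def prominence_score(rows: list[Response]) -> float:
    """Weighted prominence, averaged across appearances only."""
    hits = [r for r in rows if r.appeared]
    return sum(WEIGHTS[r.prominence] for r in hits) / len(hits) if hits else 0.0

def sentiment_polarity(rows: list[Response]) -> float:
    """Mean of the +1/0/-1 codes across appearances."""
    hits = [r for r in rows if r.appeared]
    return sum(r.sentiment for r in hits) / len(hits) if hits else 0.0
```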
A note on sentiment baselines. Spotlight's analysis of 1.8 million AI responses (Feb 2026) found 80.6% of mentions are neutral, 18.4% positive, only 1% negative. Wildly different from product-review sentiment, where negativity skews much higher. The implication: if your sentiment polarity sits around +0.18, you're at the platform median. People panic about neutral mentions. They shouldn't. Neutral is the baseline.
Citation Source URL is the metric most teams skip and the one I'd argue is the most actionable. If the model cites a Reddit thread instead of your homepage, that's a retrieval signal you can fix. If it cites a competitor's comparison page where they bury you, that's a content gap. The number tells you visibility; the URL tells you why.
Most audit guides hand-wave this step. "Build 30 prompts." Cool. Which 30? In what proportion? Phrased how?
The structure that works is 10 informational, 10 comparison, 10 high-intent. Below 20 prompts is statistical noise; above 60 hits diminishing returns. Thirty is the sweet spot.
Worked example for a B2B SaaS sending transactional email (a real category I audited with a customer last month, anonymized). These are illustrative prompts, not the customer's actual bank:
- Informational: "how does a transactional email API work", "why do password reset emails land in spam"
- Comparison: "best alternatives to [market leader] for transactional email", "[tool A] vs [tool B] for sending receipts"
- High-intent: "best transactional email API for a startup in 2026", "cheapest transactional email service with good deliverability"
Voice matters. Write prompts the way an actual user would type them, not the way an SEO would phrase a keyword. "Transactional email API alternatives 2026 best" is keyword salad. No human types that. AI assistants ingest training data written by humans, so your prompts must mimic real user voice. (Side note: I keep a Google Doc of prompts pulled verbatim from customer support tickets. Cleanest source of "what does a real user actually ask" I've found.)
Required four: ChatGPT, Perplexity, Google AI Mode (and AI Overviews), Gemini. That's most of the public-facing surface area in 2026.
| Engine | Run it? | Why |
|---|---|---|
| ChatGPT | Always | Largest consumer surface; "frozen brand list" effect makes it the hardest to crack |
| Perplexity | Always | Fresh retrieval per query; surfaces niche brands; flattering numbers but useful diagnostics |
| Google AI Mode + Overviews | Always | Closest tie to traditional Google rankings; biggest organic-traffic substitution risk |
| Gemini | Always | Important for any brand whose buyers live in Workspace |
| Claude | Enterprise only | Small consumer surface; noisy data unless you sell into Claude-for-Business orgs |
| Grok / DeepSeek | Skip by default | Audience-specific; only run if you can articulate why beyond "trend" |
One thing I keep saying to customers: from audits we've run through SEOJuice's AI Visibility Checker, the most common surprise teams report is that their Perplexity citation rate beats their ChatGPT rate by multiples — even when their ChatGPT-rank-tracking dashboards say otherwise. The mistake everyone makes is treating ChatGPT as the proxy for all AI search. It isn't. Perplexity surfaces niche brands far more aggressively because its retrieval pulls fresh web content per query, while ChatGPT's training cutoff creates a "frozen brand list" effect for anything outside the top 20 in a category.
If you only have time to run one engine in week one, run Perplexity. You'll see your most flattering numbers there, which sounds bad but is useful: it tells you whether the retrieval pipeline can find you at all.
This is the section I'd put in bold if I could only keep one.
Rand Fishkin and Patrick O'Donnell ran an experiment in early 2026: same 12 prompts, 2,961 runs across major AI assistants. The finding (published on SparkToro): the probability of two responses producing the exact same ordered list of brands was less than 1 in 1,000. Less than 1 in 1,000. Same prompt, same model, minutes apart.
If you run a prompt once and write down what you saw, you have not measured your visibility. You have measured one Monte Carlo draw from a distribution you don't yet understand. The audit you publish based on that single draw is wrong because you stopped sampling too early.
The fix is N=5 minimum. Five runs per prompt, on different days, fresh sessions, cleared cookies. (I should mention: I'd love to recommend N=10 but most teams won't do it, and N=5 is enough to stabilize the headline metrics within ~10% relative error based on what I've seen.) Total: 30 × 5 × 4 = 600 data points. Sounds like a lot. With a checklist and a spreadsheet, it's about four hours.
Run sessions on different days, not different hours of the same day. Models cache aggressively at the inference layer. Two runs 10 minutes apart can return identical responses for caching reasons unrelated to the actual probability distribution. Day-spaced runs sample more fairly.
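A quick way to check whether five runs were enough: compare the citation rate each run produces on its own. This is a rough heuristic, not a formal test, and the function names are mine; the ~10% threshold is the same one I cited above.

```python
def per_run_citation_rates(rows: list[dict], n_runs: int = 5) -> list[float]:
    """Citation rate computed separately for each run 1..n_runs.

    Each row needs a `run` number (1-5) and an `appeared` flag (0/1),
    matching the spreadsheet schema below.
    """
    rates = []
    for run in range(1, n_runs + 1):
        batch = [r for r in rows if r["run"] == run]
        rates.append(100 * sum(r["appeared"] for r in batch) / len(batch))
    return rates

def relative_spread(rates: list[float]) -> float:
    """(max - min) / mean across runs. Well above ~0.10 after five
    day-spaced runs means the headline number hasn't converged: keep sampling."""
    mean = sum(rates) / len(rates)
    return (max(rates) - min(rates)) / mean if mean else float("inf")
```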
The operational details are where every other guide hand-waves. Here's the actual sequence.
Pre-flight (15 min). Open four incognito tabs: ChatGPT, Perplexity, Google with AI Mode enabled, Gemini. Logged-in ChatGPT remembers you've asked about your own brand 40 times and will lean toward you in ways the average user's session won't. Incognito for everything.
Spreadsheet schema. Nine columns: prompt_id (P01-P30), prompt_text, tier, engine, run_number (1-5), brand_appeared (1/0), position (integer or null), sentiment (+1/0/-1), cited_url. Don't get fancy. Filled rows look like this:
| prompt_id | tier | engine | run | appeared | position | sentiment | cited_url |
|---|---|---|---|---|---|---|---|
| P03 | informational | Perplexity | 1 | 1 | 2 | 0 | acme.com/guide |
| P03 | informational | ChatGPT | 1 | 0 | — | — | — |
| P11 | comparison | Perplexity | 2 | 1 | 1 | +1 | reddit.com/r/saas/... |
| P11 | comparison | Google AI Mode | 2 | 1 | 4 | 0 | g2.com/categories/... |
| P24 | high-intent | Gemini | 3 | 1 | 3 | -1 | competitor.com/vs-acme |
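Exported as CSV, the sheet aggregates in a few lines of pandas. Column names follow the schema above; the file name is a placeholder, and non-appearance cells should be left empty in the export rather than filled with dashes.

```python
import pandas as pd

df = pd.read_csv("audit_q2.csv")  # placeholder file name

# Citation rate per engine: mean of the 0/1 `appeared` column, as a percentage.
print(df.groupby("engine")["appeared"].mean().mul(100).round(1))

# Sentiment polarity and average position, computed over appearances only.
hits = df[df["appeared"] == 1]
print(hits.groupby("engine")[["sentiment", "position"]].mean().round(2))
```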
Execution loop. For each prompt, run on engine 1, copy the response, fill the row, move to engine 2. Do all four engines for prompt 1 before moving to prompt 2. Cross-engine comparison is the most useful read; cluster, don't scatter.
Coding sentiment. +1 if the response recommends you or lists you among "the best". 0 if it just names you with no qualifier. -1 if it says you're worse than alternatives or warns the reader away. Most rows will be 0. That's normal (don't over-rotate on neutral mentions).
Position rules. Position 1 is the first brand named in the response body. If you're listed as item 4 in a "top 10" list, your position is 4. If you're mentioned twice, take the better position. Citation URL: Perplexity gives these directly; ChatGPT only when browsing was used; Google AI Mode shows source cards; Gemini varies. Record when given. Don't infer.
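If you'd rather enforce the coding rules than remember them, a small validator helps. The row shape matches the schema above; the rules are the ones just stated.

```python
def validate_row(row: dict) -> None:
    """Raise if a row violates the coding rules before it goes in the sheet."""
    if not row["appeared"]:
        # No appearance: position, sentiment, and URL stay empty.
        assert row["position"] is None and row["sentiment"] is None and row["cited_url"] is None
        return
    assert row["sentiment"] in (-1, 0, 1)
    assert isinstance(row["position"], int) and row["position"] >= 1
    # cited_url may legitimately be None: record it only when the engine gives one.

validate_row({"appeared": 1, "position": 2, "sentiment": 0, "cited_url": "acme.com/guide"})
```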
Four hours, end to end, for a 30 × 5 × 4 audit. I've timed it. The first hour is slowest while you tune your sentiment-coding intuition; from the third run onward it's mechanical.
Numbers on a fictional brand, Acme Analytics (Acme is fake; the structure is what I see in real audits).
Citation Rate. Across 30 × 5 × 4 = 600 measurements, brand appeared in 138. Citation Rate = 23%. For a B2B SaaS in a competitive category, 15-30% is normal first-audit territory.
Share of Model. The 138 responses that included Acme named 7 distinct brands each, on average. Total brand mentions: 966. Share of Model = 138 / 966 = 14.3%. When the model includes Acme, Acme is one of seven brands named.
Prominence Score. 21 first-mentions (×3), 67 listed (×2), 50 mentioned (×1). Weighted total 247, divided by 138 = 1.79. Closer to "listed in the middle" than "the headline answer". This is where the most actionable feedback lives. I used to weight first-mention 3× and stop there. The teams who actually moved their visibility numbers were the ones tracking the next-mention positions too — first-mention can be a paid co-marketing artifact, while consistent presence in positions 2-4 is genuine authority.
Sentiment Polarity. 25 positive mentions and 4 negative: +25 - 4 = +21 net across 138 appearances. Average = 21 / 138 = +0.15. Slightly positive, just below the platform median of +0.18 (Spotlight 2026). Not bad. Not great.
If you want one number, multiply Citation Rate by (1 + Sentiment Polarity) to get a sentiment-weighted visibility score. Acme: 23 × (1 + 0.15) = 26.5. No industry benchmark for this composite yet; the methodology is six months old. Your audit-over-audit trend is more meaningful than any absolute threshold.
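The whole Acme scorecard, reproduced end to end (the counts are the fictional ones above):

```python
responses, appearances = 600, 138

citation_rate = 100 * appearances / responses            # 23.0
share_of_model = 100 * appearances / (appearances * 7)   # 14.3, given 7 brands per response
prominence = (21 * 3 + 67 * 2 + 50 * 1) / appearances    # 247 / 138 = 1.79
sentiment = (25 - 4) / appearances                       # +0.15
composite = citation_rate * (1 + sentiment)              # 26.5

print(round(citation_rate, 1), round(share_of_model, 1),
      round(prominence, 2), round(sentiment, 2), round(composite, 1))
```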
Numbers without diagnosis are vanity. The triage framework I use splits gaps into three buckets, and each has a different fix that takes a different amount of time.
Foundational gap. Absent across all platforms on informational prompts. The model has not learned you exist as a category-relevant entity. Symptom: Citation Rate near zero on tier 1, occasional appearances on tier 3 only when your brand name is in the prompt. Fix: off-site authority — digital PR in publications the model crawls, Wikipedia presence, Reddit mentions in active subreddits. 60-90 day program. Our multisource SEO guide has the full off-site playbook.
Platform-specific gap. Present on Perplexity, missing on ChatGPT (or vice versa). The model has heard of you but the retrieval layer can't find you reliably. Symptom: a 5x-or-greater Citation Rate split between engines. Fix: retrieval signal repair — schema markup (Organization, Product, FAQ), llms.txt, server-side rendering so non-JS crawlers (GPTBot, PerplexityBot) can read your pages. The AI crawler playbook covers the access-side fixes.
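For the schema piece specifically, here's a minimal Organization block, generated from Python to stay in one language. Every value is a placeholder for Acme, the fictional brand from the worked example; embed the output in a `<script type="application/ld+json">` tag in the page head and extend with Product and FAQ types as needed.

```python
import json

# Minimal Organization JSON-LD. All values are placeholders; swap in your own.
org = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Acme Analytics",
    "url": "https://acme.com",
    "sameAs": [
        "https://www.linkedin.com/company/acme-analytics",
        "https://github.com/acme-analytics",
    ],
}
print(json.dumps(org, indent=2))
```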
Prominence gap. Mentioned, but always at position 4-7. The model knows you matter; it doesn't think you're the headline answer. Fix: comparison content and first-mention positioning. Publish "X vs you" pages where you control the framing. Build the canonical "best [category] tools" listicle on your own domain (with honest competitor coverage). 30-60 day fix; the most common gap I see.
Don't treat all three the same way. Foundational gaps don't get fixed by schema. Platform gaps don't get fixed by digital PR. Triage first, then fix.
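The triage itself is mechanical enough to encode. A sketch with my working thresholds; none of these numbers are industry standards:

```python
def triage(rates_by_engine: dict[str, float], prominence: float) -> str:
    """Classify the dominant gap from per-engine citation rates (%) and prominence score."""
    rates = list(rates_by_engine.values())
    if max(rates) < 5:
        return "foundational: off-site authority program, 60-90 days"
    if min(rates) == 0 or max(rates) / min(rates) >= 5:
        return "platform-specific: schema, llms.txt, server-side rendering"
    if prominence < 2.0:
        return "prominence: comparison content, first-mention positioning, 30-60 days"
    return "no dominant gap: keep tracking audit-over-audit deltas"

# Strong on Perplexity, weak on ChatGPT -> a platform-specific gap.
print(triage({"perplexity": 34.0, "chatgpt": 5.0, "google": 21.0, "gemini": 18.0}, 1.79))
```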
Once is a snapshot. Twice is a trend. Five times is a process. The cadence I run with customers and use on SEOJuice itself: weekly spot checks on your five highest-stakes prompts, a monthly run of the full 30-prompt bank, and a quarterly rebuild of the prompt bank from scratch.
Category meaning drifts in AI training data faster than you'd expect. "AI SEO tool" meant something different in 2024 than it does in 2026. If your prompt bank is six months old and you've only been refreshing the runs, you're measuring against a stale concept of your category. The quarterly rebuild is non-negotiable.
Track audit-over-audit deltas, not absolute numbers. "Our Citation Rate moved from 23% to 28% over Q1" is a real signal. "Our Citation Rate is 23%" alone tells you nothing because there's no industry benchmark with enough data yet. Your past self is the benchmark.
The limits matter as much as the methodology. If I sell you on the audit and you discover the limits later, you'll feel cheated. So, in plain terms: the audit measures citation, not revenue. It cannot prove that AI visibility drives pipeline, cannot separate share of attention from share of citation, cannot flag churn risk, and cannot forecast whether AI search will be 5% or 50% of your funnel by 2027. It tells you whether the models cite you, where you sit, and how they frame you. Everything downstream of that is inference.
If running the audit manually feels like overhead, the AI Visibility Checker automates the prompt runs, the variance sampling, and the scorecard, using the same methodology described above. The point of this article is the methodology, not the tool. If you want to do it in a spreadsheet, the spreadsheet works.
Monthly for the full 30-prompt bank, weekly for your five highest-stakes prompts. Quarterly, rebuild the prompt bank from scratch because user search behavior and category language drift faster than the runs detect.
No. A spreadsheet, four browser tabs (ChatGPT, Perplexity, Google AI Mode, Gemini), and four hours of focused work will give you a usable audit. Tools save time on variance sampling and scoring, but the methodology is the methodology either way.
There is no industry-wide benchmark yet. For B2B SaaS in a mid-competitive category, 15-30% is typical first-audit territory. The trend over time matters more than the absolute number. If you moved from 18% to 26% in a quarter, you're winning.
Perplexity does fresh retrieval per query and surfaces newer or smaller brands more easily. ChatGPT relies more on training-time signals, which creates a "frozen brand list" effect for anything outside the top 20 in a category. This is a platform-specific gap; the fix is retrieval signals (schema, llms.txt, server-side rendering), not more PR.
A regular SEO audit measures how well search engines crawl, render, and rank your pages. An AI visibility audit measures whether LLM assistants cite your brand when answering category questions. Different signals, different metrics, different problems. You need both. The shift from SEO to GEO covers the conceptual difference.
Related reading: the multisource SEO guide (the full off-site authority playbook), the AI crawler playbook (access-side fixes for GPTBot and PerplexityBot), and the shift from SEO to GEO (the conceptual difference between the two audits).
If you'd rather skip the spreadsheet, try the AI Visibility Checker. Same methodology, same metrics, automated. No credit card required.