Updated April 2026
TL;DR: Run 30 prompts, five times each, across four engines (600 data points, about four hours of work). Score five metrics: Citation Rate, Share of Model, Prominence, Sentiment Polarity, and Citation Source URL. Triage the gaps into foundational, platform-specific, or prominence, fix each with its own playbook, and rerun monthly, tracking audit-over-audit deltas rather than absolute numbers.
Let me back up. Most AI visibility audits fail not from measurement error but from expectation mismatch. A founder runs the audit hoping it will tell them why their organic revenue is down. It will not. It tells them whether their brand shows up when an AI assistant is asked a category-defining question, and where they sit in the citation ranking. That's it.
Most teams I've talked to don't run their second audit. Either the first one didn't surface anything actionable, or the cadence collapsed under the rest of their workload. They run 30 prompts once, see a single number ("we appear in 23% of responses"), and have no idea what to do with it. Was 23% good? Did Perplexity and ChatGPT agree? They don't know — a one-shot audit is the wrong unit of analysis.
Tim Soulo (CMO at Ahrefs) put it cleanly in his Feb 2026 roundup: AI visibility tracking is "still very early-stage" and brands building processes around it now have a first-mover window. He's right, but I'd add the unglamorous version: the methodology is also still wobbly. If you want a number you can defend in a board meeting, you need to know what the audit can and cannot prove.
What it answers: whether you're cited in AI responses for category queries, how often relative to competitors, where (first mention, mid-list, footnote), the sentiment, and which engines treat you well versus badly. What it does not answer: revenue causation, share of attention vs share of citation, churn risk, or whether AI search will be 5% or 50% of your funnel by 2027. Anyone selling you a tool that claims to answer those is selling a forecast dressed as a measurement.
Five columns. If your audit captures these and nothing else, you're 90% of the way there.
| Metric | Formula | What it tells you |
|---|---|---|
| Citation Rate | (Prompts where brand appears / Total prompts) × 100 | Baseline visibility. The "do they know I exist" number. |
| Share of Model | (Your mentions / Total brand mentions in response set) × 100 | Competitive position. Your slice of the pie when the pie is "all brands the model named". |
| Prominence Score | Weighted: first-mention=3, listed=2, mentioned=1, averaged across appearances | Quality of citation. Being mentioned third in a list of seven is not the same as being the headline answer. |
| Sentiment Polarity | +1 (positive) / 0 (neutral) / -1 (negative), averaged | How the model frames you when it cites you. |
| Citation Source URL | The URL the model attributes the claim to (when given) | Which of your pages (or competitors' pages) is feeding the model. |
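If you'd rather compute these in code than eyeball them, here's a minimal sketch of the formulas. The field names are mine (they match the spreadsheet schema further down), not a standard; Share of Model follows the worked example later in this piece, where the denominator counts brands named in the responses where you appear.

```python
from dataclasses import dataclass

# Prominence weights from the table above: first mention = 3, listed = 2, passing mention = 1.
WEIGHTS = {"first": 3, "listed": 2, "mentioned": 1}

@dataclass
class Response:
    appeared: bool          # did the brand show up in this response?
    prominence: str | None  # "first" | "listed" | "mentioned" when it appeared
    sentiment: int | None   # +1 / 0 / -1 when it appeared
    brands_named: int       # distinct brands named in this response

def citation_rate(rows: list[Response]) -> float:
    """(Prompts where brand appears / total prompts) x 100."""
    return 100 * sum(r.appeared for r in rows) / len(rows)

def share_of_model(rows: list[Response]) -> float:
    """(Your mentions / total brand mentions in responses where you appear) x 100."""
    hits = [r for r in rows if r.appeared]
    total = sum(r.brands_named for r in hits)
    return 100 * len(hits) / total if total else 0.0

def prominence_score(rows: list[Response]) -> float:
    """Weighted prominence, averaged across appearances only."""
    hits = [r for r in rows if r.appeared]
    return sum(WEIGHTS[r.prominence] for r in hits) / len(hits) if hits else 0.0

def sentiment_polarity(rows: list[Response]) -> float:
    """Mean of the +1/0/-1 codes across appearances."""
    hits = [r for r in rows if r.appeared]
    return sum(r.sentiment for r in hits) / len(hits) if hits else 0.0
```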
A note on sentiment baselines. Spotlight's analysis of 1.8 million AI responses (Feb 2026) found 80.6% of mentions are neutral, 18.4% positive, only 1% negative. Wildly different from product-review sentiment, where negativity skews much higher. The implication: if your sentiment polarity sits around +0.18, you're at the platform median. People panic about neutral mentions. They shouldn't. Neutral is the baseline.
Citation Source URL is the metric most teams skip and the one I'd argue is the most actionable. If the model cites a Reddit thread instead of your homepage, that's a retrieval signal you can fix. If it cites a competitor's comparison page where they bury you, that's a content gap. The number tells you visibility; the URL tells you why.
Most audit guides hand-wave this step. "Build 30 prompts." Cool. Which 30? In what proportion? Phrased how?
The structure that works is 10 informational, 10 comparison, 10 high-intent. Below 20 prompts is statistical noise; above 60 hits diminishing returns. Thirty is the sweet spot.
Worked example for a B2B SaaS sending transactional email (a real category I audited with a customer last month, anonymized). These are illustrative prompts, not the customer's actual bank:
- Informational: "how does a transactional email API work", "why do password reset emails land in spam"
- Comparison: "best alternatives to [market leader] for transactional email", "[tool A] vs [tool B] for sending receipts"
- High-intent: "best transactional email API for a startup in 2026", "cheapest transactional email service with good deliverability"
Voice matters. Write prompts the way an actual user would type them, not the way an SEO would phrase a keyword. "Transactional email API alternatives 2026 best" is keyword salad. No human types that. AI assistants ingest training data written by humans, so your prompts must mimic real user voice. (Side note: I keep a Google Doc of prompts pulled verbatim from customer support tickets. Cleanest source of "what does a real user actually ask" I've found.)
Required four: ChatGPT, Perplexity, Google AI Mode (and AI Overviews), Gemini. That's most of the public-facing surface area in 2026.
| Engine | Run it? | Why |
|---|---|---|
| ChatGPT | Always | Largest consumer surface; "frozen brand list" effect makes it the hardest to crack |
| Perplexity | Always | Fresh retrieval per query; surfaces niche brands; flattering numbers but useful diagnostics |
| Google AI Mode + Overviews | Always | Closest tie to traditional Google rankings; biggest organic-traffic substitution risk |
| Gemini | Always | Important for any brand whose buyers live in Workspace |
| Claude | Enterprise only | Small consumer surface; noisy data unless you sell into Claude-for-Business orgs |
| Grok / DeepSeek | Skip by default | Audience-specific; only run if you can articulate why beyond "trend" |
One thing I keep saying to customers: from audits we've run through SEOJuice's AI Visibility Checker, the most common surprise teams report is that their Perplexity citation rate beats their ChatGPT rate by multiples — even when their ChatGPT-rank-tracking dashboards say otherwise. The mistake everyone makes is treating ChatGPT as the proxy for all AI search. It isn't. Perplexity surfaces niche brands far more aggressively because its retrieval pulls fresh web content per query, while ChatGPT's training cutoff creates a "frozen brand list" effect for anything outside the top 20 in a category.
If you only have time to run one engine in week one, run Perplexity. You'll see your most flattering numbers there, which sounds bad but is useful: it tells you whether the retrieval pipeline can find you at all.
This is the section I'd put in bold if I could only keep one.
Rand Fishkin and Patrick O'Donnell ran an experiment in early 2026: same 12 prompts, 2,961 runs across major AI assistants. The finding (published on SparkToro): the probability of two responses producing the exact same ordered list of brands was less than 1 in 1,000. Less than 1 in 1,000. Same prompt, same model, minutes apart.
If you run a prompt once and write down what you saw, you have not measured your visibility. You have measured one Monte Carlo draw from a distribution you don't yet understand. The audit you publish based on that single draw is wrong because you stopped sampling too early.
The fix is N=5 minimum. Five runs per prompt, on different days, fresh sessions, cleared cookies. (I should mention: I'd love to recommend N=10 but most teams won't do it, and N=5 is enough to stabilize the headline metrics within ~10% relative error based on what I've seen.) Total: 30 × 5 × 4 = 600 data points. Sounds like a lot. With a checklist and a spreadsheet, it's about four hours.
Run sessions on different days, not different hours of the same day. Models cache aggressively at the inference layer. Two runs 10 minutes apart can return identical responses for caching reasons unrelated to the actual probability distribution. Day-spaced runs sample more fairly.
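A quick way to check whether five runs were enough: compare the citation rate each run produces on its own. This is a rough heuristic, not a formal test, and the function names are mine; the ~10% threshold is the same one I cited above.

```python
def per_run_citation_rates(rows: list[dict], n_runs: int = 5) -> list[float]:
    """Citation rate computed separately for each run 1..n_runs.

    Each row needs a `run` number (1-5) and an `appeared` flag (0/1),
    matching the spreadsheet schema below.
    """
    rates = []
    for run in range(1, n_runs + 1):
        batch = [r for r in rows if r["run"] == run]
        rates.append(100 * sum(r["appeared"] for r in batch) / len(batch))
    return rates

def relative_spread(rates: list[float]) -> float:
    """(max - min) / mean across runs. Well above ~0.10 after five
    day-spaced runs means the headline number hasn't converged: keep sampling."""
    mean = sum(rates) / len(rates)
    return (max(rates) - min(rates)) / mean if mean else float("inf")
```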
The operational details are where every other guide hand-waves. Here's the actual sequence.
Pre-flight (15 min). Open four incognito tabs: ChatGPT, Perplexity, Google with AI Mode enabled, Gemini. Logged-in ChatGPT remembers you've asked about your own brand 40 times and will lean toward you in ways the average user's session won't. Incognito for everything.
Spreadsheet schema. Nine columns: prompt_id (P01-P30), prompt_text, tier, engine, run_number (1-5), brand_appeared (1/0), position (integer or null), sentiment (+1/0/-1), cited_url. Don't get fancy. Filled rows look like this:
| prompt_id | tier | engine | run | appeared | position | sentiment | cited_url |
|---|---|---|---|---|---|---|---|
| P03 | informational | Perplexity | 1 | 1 | 2 | 0 | acme.com/guide |
| P03 | informational | ChatGPT | 1 | 0 | — | — | — |
| P11 | comparison | Perplexity | 2 | 1 | 1 | +1 | reddit.com/r/saas/... |
| P11 | comparison | Google AI Mode | 2 | 1 | 4 | 0 | g2.com/categories/... |
| P24 | high-intent | Gemini | 3 | 1 | 3 | -1 | competitor.com/vs-acme |
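Exported as CSV, the sheet aggregates in a few lines of pandas. Column names follow the schema above; the file name is a placeholder, and non-appearance cells should be left empty in the export rather than filled with dashes.

```python
import pandas as pd

df = pd.read_csv("audit_q2.csv")  # placeholder file name

# Citation rate per engine: mean of the 0/1 `appeared` column, as a percentage.
print(df.groupby("engine")["appeared"].mean().mul(100).round(1))

# Sentiment polarity and average position, computed over appearances only.
hits = df[df["appeared"] == 1]
print(hits.groupby("engine")[["sentiment", "position"]].mean().round(2))
```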
Execution loop. For each prompt, run on engine 1, copy the response, fill the row, move to engine 2. Do all four engines for prompt 1 before moving to prompt 2. Cross-engine comparison is the most useful read; cluster, don't scatter.
Coding sentiment. +1 if the response recommends you or lists you among "the best". 0 if it just names you with no qualifier. -1 if it says you're worse than alternatives or warns the reader away. Most rows will be 0. That's normal (don't over-rotate on neutral mentions).
Position rules. Position 1 is the first brand named in the response body. If you're listed as item 4 in a "top 10" list, your position is 4. If you're mentioned twice, take the better position. Citation URL: Perplexity gives these directly; ChatGPT only when browsing was used; Google AI Mode shows source cards; Gemini varies. Record when given. Don't infer.
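If you'd rather enforce the coding rules than remember them, a small validator helps. The row shape matches the schema above; the rules are the ones just stated.

```python
def validate_row(row: dict) -> None:
    """Raise if a row violates the coding rules before it goes in the sheet."""
    if not row["appeared"]:
        # No appearance: position, sentiment, and URL stay empty.
        assert row["position"] is None and row["sentiment"] is None and row["cited_url"] is None
        return
    assert row["sentiment"] in (-1, 0, 1)
    assert isinstance(row["position"], int) and row["position"] >= 1
    # cited_url may legitimately be None: record it only when the engine gives one.

validate_row({"appeared": 1, "position": 2, "sentiment": 0, "cited_url": "acme.com/guide"})
```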
Four hours, end to end, for a 30 × 5 × 4 audit. I've timed it. The first hour is slowest while you tune your sentiment-coding intuition; from the third run onward it's mechanical.
Numbers on a fictional brand, Acme Analytics (Acme is fake; the structure is what I see in real audits).
Citation Rate. Across 30 × 5 × 4 = 600 measurements, brand appeared in 138. Citation Rate = 23%. For a B2B SaaS in a competitive category, 15-30% is normal first-audit territory.
Share of Model. The 138 responses that included Acme named 7 distinct brands each, on average. Total brand mentions: 966. Share of Model = 138 / 966 = 14.3%. When the model includes Acme, Acme is one of seven brands named.
Prominence Score. 21 first-mentions (×3), 67 listed (×2), 50 mentioned (×1). Weighted total 247, divided by 138 = 1.79. Closer to "listed in the middle" than "the headline answer". This is where the most actionable feedback lives. I used to weight first-mention 3× and stop there. The teams who actually moved their visibility numbers were the ones tracking the next-mention positions too — first-mention can be a paid co-marketing artifact, while consistent presence in positions 2-4 is genuine authority.
Sentiment Polarity. 25 positive mentions and 4 negative: +25 - 4 = +21 net across 138 appearances. Average = 21 / 138 = +0.15. Slightly positive, just below the platform median of +0.18 (Spotlight 2026). Not bad. Not great.
If you want one number, multiply Citation Rate by (1 + Sentiment Polarity) to get a sentiment-weighted visibility score. Acme: 23 × (1 + 0.15) = 26.5. No industry benchmark for this composite yet; the methodology is six months old. Your audit-over-audit trend is more meaningful than any absolute threshold.
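The whole Acme scorecard, reproduced end to end (the counts are the fictional ones above):

```python
responses, appearances = 600, 138

citation_rate = 100 * appearances / responses            # 23.0
share_of_model = 100 * appearances / (appearances * 7)   # 14.3, given 7 brands per response
prominence = (21 * 3 + 67 * 2 + 50 * 1) / appearances    # 247 / 138 = 1.79
sentiment = (25 - 4) / appearances                       # +0.15
composite = citation_rate * (1 + sentiment)              # 26.5

print(round(citation_rate, 1), round(share_of_model, 1),
      round(prominence, 2), round(sentiment, 2), round(composite, 1))
```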
Numbers without diagnosis are vanity. The triage framework I use splits gaps into three buckets, and each has a different fix that takes a different amount of time.
Foundational gap. Absent across all platforms on informational prompts. The model has not learned you exist as a category-relevant entity. Symptom: Citation Rate near zero on tier 1, occasional appearances on tier 3 only when your brand name is in the prompt. Fix: off-site authority — digital PR in publications the model crawls, Wikipedia presence, Reddit mentions in active subreddits. 60-90 day program. Our multisource SEO guide has the full off-site playbook.
Platform-specific gap. Present on Perplexity, missing on ChatGPT (or vice versa). The model has heard of you but the retrieval layer can't find you reliably. Symptom: a 5x-or-greater Citation Rate split between engines. Fix: retrieval signal repair — schema markup (Organization, Product, FAQ), llms.txt, server-side rendering so non-JS crawlers (GPTBot, PerplexityBot) can read your pages. The AI crawler playbook covers the access-side fixes.
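For the schema piece specifically, here's a minimal Organization block, generated from Python to stay in one language. Every value is a placeholder for Acme, the fictional brand from the worked example; embed the output in a `<script type="application/ld+json">` tag in the page head and extend with Product and FAQ types as needed.

```python
import json

# Minimal Organization JSON-LD. All values are placeholders; swap in your own.
org = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Acme Analytics",
    "url": "https://acme.com",
    "sameAs": [
        "https://www.linkedin.com/company/acme-analytics",
        "https://github.com/acme-analytics",
    ],
}
print(json.dumps(org, indent=2))
```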
Prominence gap. Mentioned, but always at position 4-7. The model knows you matter; it doesn't think you're the headline answer. Fix: comparison content and first-mention positioning. Publish "X vs you" pages where you control the framing. Build the canonical "best [category] tools" listicle on your own domain (with honest competitor coverage). 30-60 day fix; the most common gap I see.
Don't treat all three the same way. Foundational gaps don't get fixed by schema. Platform gaps don't get fixed by digital PR. Triage first, then fix.
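The triage itself is mechanical enough to encode. A sketch with my working thresholds; none of these numbers are industry standards:

```python
def triage(rates_by_engine: dict[str, float], prominence: float) -> str:
    """Classify the dominant gap from per-engine citation rates (%) and prominence score."""
    rates = list(rates_by_engine.values())
    if max(rates) < 5:
        return "foundational: off-site authority program, 60-90 days"
    if min(rates) == 0 or max(rates) / min(rates) >= 5:
        return "platform-specific: schema, llms.txt, server-side rendering"
    if prominence < 2.0:
        return "prominence: comparison content, first-mention positioning, 30-60 days"
    return "no dominant gap: keep tracking audit-over-audit deltas"

# Strong on Perplexity, weak on ChatGPT -> a platform-specific gap.
print(triage({"perplexity": 34.0, "chatgpt": 5.0, "google": 21.0, "gemini": 18.0}, 1.79))
```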
Once is a snapshot. Twice is a trend. Five times is a process. The cadence I run with customers and use on SEOJuice itself: weekly spot checks on your five highest-stakes prompts, a monthly run of the full 30-prompt bank, and a quarterly rebuild of the prompt bank from scratch.
Category meaning drifts in AI training data faster than you'd expect. "AI SEO tool" meant something different in 2024 than it does in 2026. If your prompt bank is six months old and you've only been refreshing the runs, you're measuring against a stale concept of your category. The quarterly rebuild is non-negotiable.
Track audit-over-audit deltas, not absolute numbers. "Our Citation Rate moved from 23% to 28% over Q1" is a real signal. "Our Citation Rate is 23%" alone tells you nothing because there's no industry benchmark with enough data yet. Your past self is the benchmark.
The limits matter as much as the methodology. If I sell you on the audit and you discover the limits later, you'll feel cheated. So, in plain terms: the audit measures citation, not revenue. It cannot prove that AI visibility drives pipeline, cannot separate share of attention from share of citation, cannot flag churn risk, and cannot forecast whether AI search will be 5% or 50% of your funnel by 2027. It tells you whether the models cite you, where you sit, and how they frame you. Everything downstream of that is inference.
If running the audit manually feels like overhead, the AI Visibility Checker automates the prompt runs, the variance sampling, and the scorecard, using the same methodology described above. The point of this article is the methodology, not the tool. If you want to do it in a spreadsheet, the spreadsheet works.
Monthly for the full 30-prompt bank, weekly for your five highest-stakes prompts. Quarterly, rebuild the prompt bank from scratch because user search behavior and category language drift faster than the runs detect.
No. A spreadsheet, four browser tabs (ChatGPT, Perplexity, Google AI Mode, Gemini), and four hours of focused work will give you a usable audit. Tools save time on variance sampling and scoring, but the methodology is the methodology either way.
There is no industry-wide benchmark yet. For B2B SaaS in a mid-competitive category, 15-30% is typical first-audit territory. The trend over time matters more than the absolute number. If you moved from 18% to 26% in a quarter, you're winning.
Perplexity does fresh retrieval per query and surfaces newer or smaller brands more easily. ChatGPT relies more on training-time signals, which creates a "frozen brand list" effect for anything outside the top 20 in a category. This is a platform-specific gap; the fix is retrieval signals (schema, llms.txt, server-side rendering), not more PR.
A regular SEO audit measures how well search engines crawl, render, and rank your pages. An AI visibility audit measures whether LLM assistants cite your brand when answering category questions. Different signals, different metrics, different problems. You need both. The shift from SEO to GEO covers the conceptual difference.
Related reading: the multisource SEO guide (the full off-site authority playbook), the AI crawler playbook (access-side fixes for GPTBot and PerplexityBot), and the shift from SEO to GEO (the conceptual difference between the two audits).
If you'd rather skip the spreadsheet, try the AI Visibility Checker. Same methodology, same metrics, automated. No credit card required.