seojuice

AI Crawler Playbook 2025: How to Identify and Win Traffic from AI Bots

Vadim Kravcenko
Jul 18, 2025 · 4 min read

TL;DR: AI bots now pull roughly 4% of all HTML requests on the web, close to what Googlebot pulls, and the roster crawling you changes every few weeks. Find them in your logs, decide per-bot whether to allow or block in robots.txt, and structure pages so language models can quote you cleanly. Treat it as a quarterly chore, not a one-time fix.

I asked ChatGPT to recommend an SEO tool last spring. We weren't on the list. Three competitors were, including one I'd never heard of with a fraction of our backlinks. That stung, and it also told me exactly where the next fight is. It isn't blue-link rankings anymore. It's whether GPTBot, ClaudeBot, PerplexityBot, Google-Extended, and the two dozen bots behind them decide your paragraph is worth repeating.

The Number That Changed How I Read Server Logs

Here's the stat that reorganised my mental model of who's crawling us. Cloudflare's 2025 Year in Review puts AI bots (everything except Googlebot) at an average of 4.2% of HTML requests across the web in 2025, bouncing between 2.4% and 6.4% month to month. Googlebot alone was about 4.5%.

Sit with that for a second. The combined AI crawler swarm now pulls almost as much HTML out of the web as Google's main spider does. For a decade I treated Googlebot as the robot, the one whose crawl budget I obsessed over, whose render path I debugged at 1am. It turns out there's a second Googlebot-sized appetite eating my pages, and until last year I wasn't watching it at all. OpenAI and Anthropic sit consistently in Cloudflare's top five most active.

(The first time I actually grepped our access log for AI user-agents, I assumed I'd find a trickle. GPTBot had hit us more times that week than Bingbot had all month. I'd been blind to a channel that was already bigger than one I'd spent years optimising for.)

That's the whole argument for paying attention, and it's why I structured this piece around one decision rather than a tour of the ecosystem. The decision is: which of these bots do you let in, and what do you feed them? Everything else (the directory, the log filters, the markup tips) is in service of getting that one call right.

Upfront caveat: a lot of this is still unsettled, including at SEOJuice. We've watched AI crawler behavior across customer sites since early 2025, and the data shifts month to month. Some of what's below is confirmed across hundreds of sites. Some is me reading server logs and timing correlations and making a call. I'll flag which is which.

What AI Crawlers Actually Want From You

Traditional search bots (Googlebot, Bingbot) visit your pages to decide how they rank. AI crawlers read your content to teach large language models how to answer questions. When GPTBot ingests your article, it isn't judging whether you deserve position 1. It's deciding whether your paragraph deserves to be quoted the next time someone asks ChatGPT for advice. Different distribution channel, different rules.

From what I can see across the sites in our AI visibility monitoring, the ones that welcome these bots and write content that's easy to parse do show up more often inside AI answers. I'm deliberately not putting a percentage on that. The methodology has holes (spot-check sampling, manual verification, selection bias from sites that opted into monitoring), so I'd be inventing precision I don't have. The direction is consistent; the magnitude is a guess.

Meanwhile most competitors are still staring at Search Console, not realising a meaningful slice of their server logs is LLM crawlers quietly indexing, or skipping, their expertise.

AI crawler ecosystem diagram showing GPTBot, ClaudeBot, PerplexityBot and others feeding into ChatGPT, Claude, Perplexity, and Google AI Overviews
Fig. 1 — The same HTML page feeds multiple parallel AI pipelines. Blocking one does not affect the others.

How big this channel gets, and how fast, nobody honestly knows. I've talked to founders who swear 15% of their traffic now comes from AI referrals, and others in the same niche who've seen almost none. Be careful with the high end: industry-wide figures from Search Engine Land and Adobe put AI referrals closer to 1% of total visits as of late 2025. That share is growing fast (up several hundred percent year over year) but from a tiny base. Treat the 15% as one founder's outlier, not a benchmark you should expect to hit.

AI Crawler Directory 2025: Cheat Sheet


How to use: paste this table into any internal doc or robots.txt planning sheet. Search your logs for these user-agent strings to find which AI bots already hit your site. Versions change often, so re-check the official docs each quarter.

Vendor | Crawler Name | Full User-Agent String | Primary Purpose
OpenAI | GPTBot | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.3; +https://openai.com/gptbot | Train and refresh ChatGPT core models
OpenAI | OAI-SearchBot | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot | Real-time web search for ChatGPT citations
OpenAI | ChatGPT-User | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot | Fetches a page when a human posts the link in a chat
Anthropic | ClaudeBot | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ClaudeBot/1.0; +claudebot@anthropic.com | Training data for Claude models
Anthropic | Claude-SearchBot | Mozilla/5.0 (compatible; Claude-SearchBot/1.0; +https://www.anthropic.com/claude-searchbot) | Search indexing for live Claude citations
Anthropic | Claude-User | Mozilla/5.0 (compatible; Claude-User/1.0; +https://www.anthropic.com/claude-user) | User-initiated fetches inside a Claude session
Anthropic | anthropic-ai (deprecated) | Mozilla/5.0 (compatible; anthropic-ai/1.0; +http://www.anthropic.com/bot.html) | Legacy training string; phased out Feb 2026
Anthropic | claude-web (deprecated) | Mozilla/5.0 (compatible; claude-web/1.0; +http://www.anthropic.com/bot.html) | Legacy fresh-web string; phased out Feb 2026
Perplexity | PerplexityBot | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot) | Indexes for Perplexity citations (not model training)
Perplexity | Perplexity-User | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Perplexity-User/1.0; +https://www.perplexity.ai/useragent) | Loads pages when users click answers
Google | Google-Extended | Mozilla/5.0 (compatible; Google-Extended/1.0; +http://www.google.com/bot.html) | Feeds Gemini/Vertex AI; separate from search
Google | GoogleOther | GoogleOther | Internal R&D crawler
Microsoft | BingBot (Copilot) | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/W.X.Y.Z Safari/537.36 | Powers Bing search and Copilot AI
Amazon | Amazonbot | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) | Alexa Q&A and product recs
Apple | Applebot | Mozilla/5.0 (compatible; Applebot/1.0; +http://www.apple.com/bot.html) | Siri / Spotlight search
Apple | Applebot-Extended | Mozilla/5.0 (compatible; Applebot-Extended/1.0; +http://www.apple.com/bot.html) | Apple AI training; block via robots.txt to opt out
Meta | FacebookBot | Mozilla/5.0 (compatible; FacebookBot/1.0; +http://www.facebook.com/bot.html) | Link previews across Meta apps
Meta | meta-externalagent | Mozilla/5.0 (compatible; meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)) | Backup Meta crawler
LinkedIn | LinkedInBot | LinkedInBot/1.0 (compatible; Mozilla/5.0; Jakarta Commons-HttpClient/3.1 +http://www.linkedin.com) | Professional content previews
ByteDance | ByteSpider | Mozilla/5.0 (compatible; Bytespider/1.0; +http://www.bytedance.com/bot.html) | LLM training behind TikTok / Toutiao recs; ignores robots.txt
DuckDuckGo | DuckAssistBot | Mozilla/5.0 (compatible; DuckAssistBot/1.0; +http://www.duckduckgo.com/bot.html) | Private AI answer engine
Cohere | cohere-ai | Mozilla/5.0 (compatible; cohere-ai/1.0; +http://www.cohere.ai/bot.html) | Enterprise language-model training
Mistral | MistralAI-User | Mozilla/5.0 (compatible; MistralAI-User/1.0; +https://mistral.ai/bot) | European LLM crawler
Allen Institute | AI2Bot | Mozilla/5.0 (compatible; AI2Bot/1.0; +http://www.allenai.org/crawler) | Academic research scraping
Common Crawl | CCBot | CCBot/2.0 (https://commoncrawl.org/faq/) | Open corpus used by many AIs; monthly crawl cycles
Diffbot | Diffbot | Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 (.NET CLR 3.5.30729; Diffbot/0.1; +http://www.diffbot.com) | Structured-data extraction
Omgili | omgili | Mozilla/5.0 (compatible; omgili/1.0; +http://www.omgili.com/bot.html) | Forums and discussion scraping
Timpi | TimpiBot | Timpibot/0.8 (+http://www.timpi.io) | Decentralised search
You.com | YouBot | Mozilla/5.0 (compatible; YouBot (+http://www.you.com)) | You.com AI search
DeepSeek | DeepSeek (unconfirmed) | No official crawler UA published | Reported traffic arrives without a named UA; do not trust circulated strings
xAI | Grok (unverified) | No official UA; observed traffic spoofs Chrome / Safari / Go-http-client | Block requires fingerprinting or rate rules, not robots.txt
Apple (Vision) | Applebot-Image | Mozilla/5.0 (compatible; Applebot-Image/1.0; +http://www.apple.com/bot.html) | Image-focused AI ingestion

Tip: paste these strings into a log filter or grep command to find the AI crawlers already hitting your site, then adjust robots.txt and content accordingly. Two rows carry no real user-agent at all, because Grok and DeepSeek don't publish one. If a third-party registry hands you a confident GrokBot/1.0 string, be skeptical: independent reports say xAI's crawler rotates residential IPs and spoofs browser strings instead.

Reading the Logs: Spotting AI Bots

Your server logs already know which AI crawlers hit you yesterday. Grab a raw access log and pipe it through grep. Each pattern matches the bot's product token and version number inside its user-agent string, so every hit comes back with its exact timestamp, the URL fetched, and the status code.

# GPTBot (OpenAI)
grep -E "GPTBot/([0-9.]+)" access.log

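# ClaudeBot (Anthropic, training)
grep -E "ClaudeBot/([0-9.]+)" access.log

# Claude-SearchBot (Anthropic, live citations)
grep -E "Claude-SearchBot/([0-9.]+)" access.log

# PerplexityBot
grep -E "PerplexityBot/([0-9.]+)" access.log

# Google-Extended (Gemini)
# Note: Google's crawler docs describe Google-Extended as a robots.txt control
# token, with fetches made under Google's existing user agents, so an empty
# result here does not mean Google is ignoring you.
grep -E "Google-Extended/([0-9.]+)" access.log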

Sample hit (truncated):

66.102.12.34 - - [18/Jul/2025:06:14:22 +0000] "GET /blog/ai-crawlers-guide HTTP/1.1" 200 8429 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.3; +https://openai.com/gptbot"

On Nginx or Apache with combined logging, the first whitespace-separated field is the client IP and the ninth is the status code, both useful for spotting 4xx blocks. Pipe to cut or awk to build a daily crawl-frequency report.
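If cut and awk get unwieldy, the same daily report can be sketched in Python. This is a minimal sketch, assuming the combined log format shown in the sample hit; the AI_BOTS list and the crawl_report name are illustrative choices, not part of any official tool, and the list only covers bots that announce themselves.

```python
import re
from collections import Counter

# Product tokens from the directory table (an illustrative subset).
# Extend this list as the roster changes.
AI_BOTS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
           "Claude-SearchBot", "PerplexityBot", "CCBot", "Bytespider"]

# Combined log format:
# ip ident user [day:time zone] "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<day>[^:]+):[^\]]+\] '
    r'"[^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def crawl_report(lines):
    """Return (hits per (day, bot), count of 4xx responses per bot)."""
    hits, blocked = Counter(), Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # not a combined-format line, skip it
        ua = match.group("ua").lower()
        bot = next((b for b in AI_BOTS if b.lower() in ua), None)
        if bot is None:
            continue  # ordinary browser or a bot we are not tracking
        hits[(match.group("day"), bot)] += 1
        if match.group("status").startswith("4"):
            blocked[bot] += 1
    return hits, blocked


# Usage (hypothetical file name):
# with open("access.log", encoding="utf-8", errors="replace") as fh:
#     hits, blocked = crawl_report(fh)
# for (day, bot), count in sorted(hits.items()):
#     print(day, bot, count)
```

The second counter is the one to watch: a bot with a rising 4xx count is being turned away by something, whether that is a robots rule, a WAF, or a broken cache.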

One thing the logs won't help with: ByteSpider, Grok, and DeepSeek. ByteSpider announces itself but ignores your robots.txt, so a grep match tells you it's there, not that a rule will stop it. You need a WAF or firewall block. Grok and DeepSeek mostly arrive disguised as ordinary browser traffic, so they never show up in a user-agent filter.

(I spent a whole afternoon chasing what I was sure was a Grok crawl in our logs. It was a misconfigured uptime monitor hammering one endpoint. The real Grok traffic, if it was even there, looked exactly like Chrome. Which is precisely the problem.)

Tip: a spike of 4xx responses to an AI bot is a lost branding opportunity. Fix robots rules or caching errors before the crawler downgrades your domain in its freshness queue.

The Real Decision: Block GPTBot, Allow It, or Segment?

This is the call everything else hangs on, so I'll be candid: nobody has a clean answer yet, and I'm skeptical of anyone who claims they do.

The debate is loud. Some owners block GPTBot in robots.txt, reasoning that OpenAI trains on their content without payment or attribution. That's a legitimate stance, and publishers like the New York Times have taken it. Others allow GPTBot freely, betting that early inclusion in the model's knowledge compounds into a visibility advantage. Both camps have smart people in them.

Here's what I've actually been able to pin down, and what I haven't.

What I'm confident about: blocking GPTBot does not hurt your Google rankings. GPTBot is OpenAI's crawler and plays no part in Google's index. The same holds if you block Google-Extended: Google treats it as a control separate from Googlebot, so opting out of one doesn't touch the other. Google documents this plainly. So the "but won't I tank my SEO?" fear, which I hear constantly, is unfounded.

What I think I'm seeing but can't prove: sites that allow GPTBot and have well-structured content seem to surface more often in ChatGPT responses for related questions. I measure that through manual spot-checks and our AISO monitoring, not an official API. The correlation could be coincidental. My sample for this specific observation is roughly 40 sites, nowhere near enough to claim an effect size, so I won't.

What I genuinely don't know comes down to three things. First, whether blocking GPTBot now and unblocking later leaves any lasting mark on how the model treats your domain. Second, whether GPTBot honors robots.txt every single time. The official line is yes, but a September 2025 paper on robots.txt governance found measurable non-compliance edge cases across crawlers. (Worth flagging: OpenAI quietly revised its docs in December 2025 to drop the claim that ChatGPT-User respects robots.txt; the guarantee now only covers OAI-SearchBot and GPTBot.) Third, whether being in the training data actually drives citations, or whether citations come only from the real-time search layer.

My current play, and it's a bet rather than a certainty: allow the search and citation bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot) plus GPTBot on your public content, and block everything on gated or proprietary material. The logic is asymmetric risk. If AI search becomes a major channel, you want to be in the index and the training set. If it fizzles, you've lost nothing but a few crawl requests. ByteSpider is the exception: if you want it gone, robots.txt won't do it, so reach for a firewall rule. Ask me again in six months and the answer might have moved.
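To make that play concrete, here is a minimal sketch that writes the policy as a robots.txt and checks it with Python's standard urllib.robotparser before anything ships. The /members/ path and the example.com host are hypothetical stand-ins for whatever gated material you have, and the allowed helper is an illustrative name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: named search/citation bots plus GPTBot may read public
# pages; the gated area (/members/ is an assumed path) is closed to everyone.
# Python's parser applies the first matching rule in a group, so the specific
# Disallow comes before the broad Allow. Crawlers that use longest-match
# precedence reach the same result here.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
Disallow: /members/
Allow: /

User-agent: *
Disallow: /members/
"""


def allowed(agent, path, robots_txt=ROBOTS_TXT):
    """Return True if `agent` may fetch `path` under the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, "https://example.com" + path)


if __name__ == "__main__":
    for agent in ("GPTBot", "PerplexityBot", "SomeOtherBot"):
        for path in ("/blog/ai-crawlers-guide", "/members/report"):
            print(agent, path, allowed(agent, path))
```

This only tells you what a compliant crawler would do. It says nothing about ByteSpider or anything else that skips robots.txt, which is why the firewall rule stays in the plan.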

Decision flow chart for robots.txt AI crawler rules: public content leads to allow training and search crawlers; gated content leads to disallow; ByteSpider requires WAF block
Fig. 2 — robots.txt controls training and search crawlers independently. ByteSpider requires a firewall block — it does not honour robots.txt.

What Different Crawlers Value

This table is built from what I've watched in log analysis across SEOJuice customer sites. The "content priority" and "media appetite" columns are my interpretation of behavior patterns, not official documentation. None of these companies publish detailed specs on what their crawlers prefer. Read it as a working hypothesis.

Crawler | Content Priority | JS Rendering | Freshness Bias | Media Appetite
GPTBot (OpenAI) | Text, code snippets, metadata | No (HTML only) | Revisits updated pages often | Low (images skipped much of the time)
ClaudeBot (Anthropic) | Context-rich text and images | No | Prefers new articles (under 30 days) | High (a meaningful share of requests are images)
PerplexityBot | Factual paragraphs, clear headings | No | Moderate; real-time for news | Medium; looks for diagrams
Google-Extended | Well-structured HTML, schema | Yes (renders JS) | Mirrors Google crawl cadence | Medium
BingBot (Copilot) | Long-form text and sitemap hints | Yes | High for frequently updated sites | Medium
CCBot (Common Crawl) | Bulk text for open corpora | No | Monthly crawl cycles | Low

If the table is right even directionally, the moves it implies are unglamorous. Text-heavy bots like GPTBot and PerplexityBot reward clear headings and a concise summary up top. The image-hungry one (ClaudeBot) actually parses alt text, so descriptive tags earn you context you'd otherwise lose. The JS-capable bots (Google-Extended, BingBot) still prefer server-rendered speed, and the rest can't run your scripts at all, so heavy client-side rendering costs you with every one of them. None of this is exotic. It's the same hygiene that's always made pages legible, except you're now serving a second audience that can't see your JavaScript.

Four-phase AI visibility loop diagram: Identify crawlers in logs, Allow via robots.txt, Optimise content structure, Measure citations — with key stats: 4% HTML traffic, 700% growth, 93% zero-click
Fig. 3 — Run this loop every 4–8 weeks. AI crawler behavior and bot rosters change fast; what worked in Q1 needs revalidation by Q3.

Building Pages AI Crawlers Can Quote

Designing for AI visibility starts in the markup and ends on the server. Get either layer wrong and GPTBot, ClaudeBot, or Google-Extended will skim, stumble, and move on. The good news is the work is mostly things a careful editor already does.

Content Architecture

Headline hierarchy. Treat H1 through H3 as a table of contents for language models. One H1 that states the topic, H2 sections that each answer a discrete sub-question, optional H3s for supporting detail. Skip levels or stack multiple H1s and the crawler loses the thread.

<h1>AI Crawler Directory 2025</h1>
  <h2>What Is an AI Crawler?</h2>
  <h2>Complete List of AI User-Agents</h2>
    <h3>OpenAI GPTBot</h3>
    <h3>Anthropic ClaudeBot</h3>
  <h2>How to Optimise Your Site</h2>
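The two rules above (exactly one H1, no skipped levels) are easy to lint. A minimal sketch using Python's standard html.parser, assuming you already have the rendered HTML as a string; HeadingOutline and heading_problems are illustrative names.

```python
from html.parser import HTMLParser


class HeadingOutline(HTMLParser):
    """Collect heading levels (1-6) in document order."""

    def __init__(self):
        super().__init__()
        self.levels = []

    def handle_starttag(self, tag, attrs):
        # Matches h1..h6 only; tags such as <hr> or <html> are ignored.
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.levels.append(int(tag[1]))


def heading_problems(html):
    """Return a list of human-readable outline problems (empty if clean)."""
    outline = HeadingOutline()
    outline.feed(html)
    problems = []
    h1_count = outline.levels.count(1)
    if h1_count != 1:
        problems.append(f"expected exactly one h1, found {h1_count}")
    for prev, cur in zip(outline.levels, outline.levels[1:]):
        if cur > prev + 1:  # e.g. an h3 directly under an h1
            problems.append(f"h{prev} followed by h{cur} skips a level")
    return problems


if __name__ == "__main__":
    print(heading_problems("<h1>Guide</h1><h3>Detail</h3>"))
```

Jumping back up (an h3 followed by an h2) is fine and is not flagged; only downward jumps of more than one level are.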

Lead summaries. Open every article with two or three sentences that state the answer up front. Models often clip only the first 300–500 characters for a citation. Bury the lead and they'll quote someone who didn't.

Schema and FAQ blocks. Wrap definitions, how-tos, and product specs in FAQPage, HowTo, or Product schema. Structured data works like a signpost in an otherwise dim crawl. For FAQ, embed the Q&A inline so crawlers capture the context in one request.

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [{
    "@type": "Question",
    "name": "What is GPTBot?",
    "acceptedAnswer": {
      "@type": "Answer",
      "text": "GPTBot is OpenAI's primary web crawler used to train ChatGPT."
    }
  }]
}
</script>
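Hand-writing that JSON gets error-prone past two or three questions. A minimal sketch that builds the same FAQPage block from a list of question and answer pairs; faq_jsonld is an illustrative helper name, and the structure simply mirrors the snippet above.

```python
import json


def faq_jsonld(pairs):
    """Build a FAQPage JSON-LD <script> block from (question, answer) pairs."""
    data = {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }
    body = json.dumps(data, indent=2, ensure_ascii=False)
    # "\/" is a valid JSON escape for "/", so this keeps the payload intact
    # while stopping answer text from closing the script element early.
    body = body.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{body}\n</script>'


if __name__ == "__main__":
    print(faq_jsonld([
        ("What is GPTBot?",
         "GPTBot is OpenAI's primary web crawler used to train ChatGPT."),
    ]))
```

Going through json.dumps means quotes and apostrophes in answers are escaped for you, which is where hand-edited blocks usually break.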

(Honest aside on why listicles and definition pages keep winning here: it isn't a trick, it's that their shape (numbered H2s, a clean "What is X?" answer in paragraph one) happens to match the question-answer pairs an LLM is assembling. The format that's always been easy for skimming humans is the format that's easy for the model too.)

Serving It Fast

Server-side rendering. Most AI bots can't, or won't, run client-side JavaScript. Pre-render critical content and ship complete HTML. Next.js or Nuxt with SSR turned on solves this without a rebuild. The one apparent exception is Google-Extended, which does seem to render JS, judging by the JS-heavy pages it indexes across our customer sites. I wouldn't bet the others do, so serve complete HTML rather than crossing your fingers.

Alt text. ClaudeBot pulls images at high rates and reads the alt attribute for context. "GPTBot crawling diagram showing request paths" earns you a description; an empty alt makes the graphic invisible to the crawler.
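Finding the images that ship without a description is a quick audit. A minimal sketch with Python's standard html.parser, assuming the page HTML is already in a string; AltAudit and images_without_alt are illustrative names.

```python
from html.parser import HTMLParser


class AltAudit(HTMLParser):
    """Record the src of every <img> whose alt is missing or blank."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        # A valueless attribute (bare `alt`) arrives as None, so fall back to "".
        alt_text = (attributes.get("alt") or "").strip()
        if not alt_text:
            self.missing.append(attributes.get("src") or "(no src)")


def images_without_alt(html):
    """Return the src values of images a crawler could not describe."""
    audit = AltAudit()
    audit.feed(html)
    return audit.missing


if __name__ == "__main__":
    page = '<img src="/a.png" alt="GPTBot crawling diagram"><img src="/b.png">'
    print(images_without_alt(page))
```

An empty alt is the correct choice for purely decorative images, so treat the output as a review list, not a list of errors.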

Clean URLs and light assets. Short hyphenated slugs (/ai-crawler-list, not /blog?id=12345&ref=xyz) reduce crawl friction. Bots that hit a slow server tend to back off their crawl frequency, so enable Brotli or Gzip, use WebP/AVIF, and lazy-load below-fold media. The performance targets are the same ones your human visitors benefit from:

Metric | Target
LCP (Largest Contentful Paint) | < 2.5 s
INP (Interaction to Next Paint) | < 200 ms
CLS (Cumulative Layout Shift) | < 0.1

Hit those and both human readers and AI crawlers consume your content without friction.

Frequently Asked Questions

Does blocking GPTBot hurt my Google rankings?

No. GPTBot and Googlebot are separate crawlers, and so is Google-Extended (the bot that feeds Gemini). Blocking GPTBot or Google-Extended in robots.txt has no documented effect on traditional search rankings. You're only choosing whether your content trains or grounds AI models, not whether it ranks.

Which AI crawler is the most active on the average site?

In my logs and in Cloudflare's 2025 data, OpenAI's bots (GPTBot and OAI-SearchBot) and Anthropic's ClaudeBot are consistently the most active, all sitting in the top five AI crawlers by request volume. Across the web, AI bots combined averaged about 4.2% of HTML requests in 2025.

How do I block a crawler that ignores robots.txt?

Some crawlers, ByteSpider being the documented example, don't honor robots.txt, so you can't stop them with a rule in that file. Block them at the server level instead: a WAF, a firewall rule, or a user-agent deny rule in Nginx or Apache. Grok and DeepSeek complicate this further by arriving with spoofed or absent user agents, so they need behavioral or rate-based detection, not a name match.
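Both halves of that answer, the name match and the rate-based check, can be sketched in a few lines of Python. This is an illustration of the logic only, not a substitute for a real WAF: CrawlerGate, the deny list, and the 60-requests-per-minute budget are all hypothetical values chosen for the example.

```python
import time
from collections import defaultdict, deque

DENY_TOKENS = ("bytespider",)  # user-agent substrings to refuse outright
MAX_REQUESTS = 60              # hypothetical per-IP budget inside one window
WINDOW_SECONDS = 60.0


class CrawlerGate:
    """Deny named bots by user agent and throttle any IP that exceeds a budget."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._recent = defaultdict(deque)  # ip -> timestamps inside the window

    def allow(self, ip, user_agent):
        ua = (user_agent or "").lower()
        if any(token in ua for token in DENY_TOKENS):
            return False  # announces itself, so a name match is enough
        now = self._clock()
        hits = self._recent[ip]
        while hits and now - hits[0] >= WINDOW_SECONDS:
            hits.popleft()  # forget requests that left the sliding window
        if len(hits) >= MAX_REQUESTS:
            return False  # behaves like a crawler regardless of its claimed UA
        hits.append(now)
        return True


if __name__ == "__main__":
    gate = CrawlerGate()
    print(gate.allow("203.0.113.7", "Mozilla/5.0 (compatible; Bytespider/1.0)"))
    print(gate.allow("203.0.113.7", "Mozilla/5.0 Chrome/120.0"))
```

A per-IP window catches a crawler hammering from a handful of addresses. It will not catch traffic that rotates through residential IPs, which is the case where fingerprinting at the WAF layer earns its keep.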

Is being in an LLM's training data the same as being cited in answers?

Not necessarily, and this is where I'd push back on confident claims. Training data (GPTBot, ClaudeBot, CCBot) shapes what a model knows; the live search and citation bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot) decide what gets quoted in real time. The two layers overlap but aren't identical, and I can't yet measure how much training inclusion drives citations on its own.

Where This Leaves You

AI crawlers stopped being side traffic the moment they started pulling Googlebot-sized volume. They're feeder pipes into every chat window, voice assistant, and AI search panel your customers consult, harvesting text, schema, and images to decide which brands speak for the category.

The upside is concrete and the inputs are boring: server-side rendering, clean headings, AI-friendly schema, a deliberate robots.txt. Get those right while only a slice of sites bother, and your expertise becomes the quote the assistants repeat.

Temper the urgency with realism, though. We don't fully understand how these models weight sources, and the roster turns over every quarter as new crawlers launch and old ones (anthropic-ai, claude-web, CCBot/1.0) get retired. What I'll say without hedging is that the basics (clean HTML, fast servers, descriptive headings, a thought-through robots.txt) pay off no matter where AI search lands. Worst case, you also improve your traditional SEO.

So audit your logs this week. Welcome the right bots, fix the content signals they crave, and watch how often your brand shows up in AI answers over the next quarter.

Run the free AI Crawler Inspector to see exactly which AI bots are hitting your site right now and whether your robots.txt is letting the right ones through.
