TL;DR: AI bots now pull roughly 4% of all HTML requests on the web, close to what Googlebot pulls, and the roster crawling you changes every few weeks. Find them in your logs, decide per-bot whether to allow or block in robots.txt, and structure pages so language models can quote you cleanly. Treat it as a quarterly chore, not a one-time fix.
I asked ChatGPT to recommend an SEO tool last spring. We weren't on the list. Three competitors were, including one I'd never heard of with a fraction of our backlinks. That stung, and it also told me exactly where the next fight is. It isn't blue-link rankings anymore. It's whether GPTBot, ClaudeBot, PerplexityBot, Google-Extended, and the two dozen bots behind them decide your paragraph is worth repeating.
Here's the stat that reorganised my mental model of who's crawling us. Cloudflare's 2025 Year in Review puts AI bots (everything except Googlebot) at an average of 4.2% of HTML requests across the web in 2025, bouncing between 2.4% and 6.4% month to month. Googlebot alone was about 4.5%.
Sit with that for a second. The combined AI crawler swarm now pulls almost as much HTML out of the web as Google's main spider does. For a decade I treated Googlebot as the robot, the one whose crawl budget I obsessed over, whose render path I debugged at 1am. It turns out there's a second Googlebot-sized appetite eating my pages, and until last year I wasn't watching it at all. OpenAI and Anthropic sit consistently among Cloudflare's five most active AI crawlers.
(The first time I actually grepped our access log for AI user-agents, I assumed I'd find a trickle. GPTBot had hit us more times that week than Bingbot had all month. I'd been blind to a channel that was already bigger than one I'd spent years optimising for.)
That's the whole argument for paying attention, and it's why I structured this piece around one decision rather than a tour of the ecosystem. The decision is: which of these bots do you let in, and what do you feed them? Everything else (the directory, the log filters, the markup tips) is in service of getting that one call right.
Upfront caveat: a lot of this is still unsettled, including at SEOJuice. We've watched AI crawler behavior across customer sites since early 2025, and the data shifts month to month. Some of what's below is confirmed across hundreds of sites. Some is me reading server logs and timing correlations and making a call. I'll flag which is which.
Traditional search bots (Googlebot, Bingbot) visit your pages to decide how they rank. AI crawlers read your content to teach large language models how to answer questions. When GPTBot ingests your article, it isn't judging whether you deserve position 1. It's deciding whether your paragraph deserves to be quoted the next time someone asks ChatGPT for advice. Different distribution channel, different rules.
From what I can see across the sites in our AI visibility monitoring, the ones that welcome these bots and write content that's easy to parse do show up more often inside AI answers. I'm deliberately not putting a percentage on that. The methodology has holes (spot-check sampling, manual verification, selection bias from sites that opted into monitoring), so I'd be inventing precision I don't have. The direction is consistent; the magnitude is a guess.
Meanwhile most competitors are still staring at Search Console, not realising a meaningful slice of their server logs is LLM crawlers quietly indexing, or skipping, their expertise.
How big this channel gets, and how fast, nobody honestly knows. I've talked to founders who swear 15% of their traffic now comes from AI referrals, and others in the same niche who've seen almost none. Be careful with the high end: industry-wide figures from Search Engine Land and Adobe put AI referrals closer to 1% of total visits as of late 2025, growing fast (up several hundred percent year over year) but from a tiny base. Treat the 15% as one founder's outlier, not a benchmark you should expect to hit.
How to use: paste this table into any internal doc or robots.txt planning sheet. Search your logs for these user-agent strings to find which AI bots already hit your site. Versions change often, so re-check the official docs each quarter.
| Vendor | Crawler Name | Full User-Agent String | Primary Purpose |
|---|---|---|---|
| OpenAI | GPTBot | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.3; +https://openai.com/gptbot | Train and refresh ChatGPT core models |
| OpenAI | OAI-SearchBot | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot | Real-time web search for ChatGPT citations |
| OpenAI | ChatGPT-User | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot | Fetches a page when a human posts the link in a chat |
| Anthropic | ClaudeBot | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ClaudeBot/1.0; +claudebot@anthropic.com | Training data for Claude models |
| Anthropic | Claude-SearchBot | Mozilla/5.0 (compatible; Claude-SearchBot/1.0; +https://www.anthropic.com/claude-searchbot) | Search indexing for live Claude citations |
| Anthropic | Claude-User | Mozilla/5.0 (compatible; Claude-User/1.0; +https://www.anthropic.com/claude-user) | User-initiated fetches inside a Claude session |
| Anthropic | anthropic-ai (deprecated) | Mozilla/5.0 (compatible; anthropic-ai/1.0; +http://www.anthropic.com/bot.html) | Legacy training string; phased out Feb 2026 |
| Anthropic | claude-web (deprecated) | Mozilla/5.0 (compatible; claude-web/1.0; +http://www.anthropic.com/bot.html) | Legacy fresh-web string; phased out Feb 2026 |
| Perplexity | PerplexityBot | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot) | Indexes for Perplexity citations (not model training) |
| Perplexity | Perplexity-User | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Perplexity-User/1.0; +https://www.perplexity.ai/useragent) | Loads pages when users click answers |
| Google | Google-Extended | Mozilla/5.0 (compatible; Google-Extended/1.0; +http://www.google.com/bot.html) | Feeds Gemini/Vertex AI; separate from search |
| Google | GoogleOther | GoogleOther | Internal R&D crawler |
| Microsoft | BingBot (Copilot) | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/W.X.Y.Z Safari/537.36 | Powers Bing search and Copilot AI |
| Amazon | Amazonbot | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) | Alexa Q&A and product recs |
| Apple | Applebot | Mozilla/5.0 (compatible; Applebot/1.0; +http://www.apple.com/bot.html) | Siri / Spotlight search |
| Apple | Applebot-Extended | Mozilla/5.0 (compatible; Applebot-Extended/1.0; +http://www.apple.com/bot.html) | Apple AI training; block via robots.txt to opt out |
| Meta | FacebookBot | Mozilla/5.0 (compatible; FacebookBot/1.0; +http://www.facebook.com/bot.html) | Link previews across Meta apps |
| Meta | meta-externalagent | Mozilla/5.0 (compatible; meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)) | Backup Meta crawler |
| LinkedIn | LinkedInBot | LinkedInBot/1.0 (compatible; Mozilla/5.0; Jakarta Commons-HttpClient/3.1 +http://www.linkedin.com) | Professional content previews |
| ByteDance | ByteSpider | Mozilla/5.0 (compatible; Bytespider/1.0; +http://www.bytedance.com/bot.html) | LLM training behind TikTok / Toutiao recs; ignores robots.txt |
| DuckDuckGo | DuckAssistBot | Mozilla/5.0 (compatible; DuckAssistBot/1.0; +http://www.duckduckgo.com/bot.html) | Private AI answer engine |
| Cohere | cohere-ai | Mozilla/5.0 (compatible; cohere-ai/1.0; +http://www.cohere.ai/bot.html) | Enterprise language-model training |
| Mistral | MistralAI-User | Mozilla/5.0 (compatible; MistralAI-User/1.0; +https://mistral.ai/bot) | European LLM crawler |
| Allen Institute | AI2Bot | Mozilla/5.0 (compatible; AI2Bot/1.0; +http://www.allenai.org/crawler) | Academic research scraping |
| Common Crawl | CCBot | CCBot/2.0 (https://commoncrawl.org/faq/) | Open corpus used by many AIs; monthly crawl cycles |
| Diffbot | Diffbot | Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 (.NET CLR 3.5.30729; Diffbot/0.1; +http://www.diffbot.com) | Structured-data extraction |
| Omgili | omgili | Mozilla/5.0 (compatible; omgili/1.0; +http://www.omgili.com/bot.html) | Forums and discussion scraping |
| Timpi | TimpiBot | Timpibot/0.8 (+http://www.timpi.io) | Decentralised search |
| You.com | YouBot | Mozilla/5.0 (compatible; YouBot (+http://www.you.com)) | You.com AI search |
| DeepSeek | DeepSeek (unconfirmed) | No official crawler UA published | Reported traffic arrives without a named UA; do not trust circulated strings |
| xAI | Grok (unverified) | No official UA; observed traffic spoofs Chrome / Safari / Go-http-client | Block requires fingerprinting or rate rules, not robots.txt |
| Apple (Vision) | Applebot-Image | Mozilla/5.0 (compatible; Applebot-Image/1.0; +http://www.apple.com/bot.html) | Image-focused AI ingestion |
Tip: paste these strings into a log filter or grep command to find the AI crawlers already hitting your site, then adjust robots.txt and content accordingly. Two rows carry no real user-agent at all, because Grok and DeepSeek don't publish one. If a third-party registry hands you a confident GrokBot/1.0 string, be skeptical: independent reports say xAI's crawler rotates residential IPs and spoofs browser strings instead.
Your server logs already know which AI crawlers hit you yesterday. Grab a raw access log and pipe it through grep. Each pattern matches the official user-agent string, so you'll see exact timestamps, URLs fetched, and status codes.
```bash
# GPTBot (OpenAI)
grep -E "GPTBot/([0-9.]+)" access.log

# ClaudeBot (Anthropic, training)
grep -E "ClaudeBot/([0-9.]+)" access.log

# Claude-SearchBot (Anthropic, live citations)
grep -E "Claude-SearchBot/([0-9.]+)" access.log

# PerplexityBot
grep -E "PerplexityBot/([0-9.]+)" access.log

# Google-Extended (Gemini)
grep -E "Google-Extended/([0-9.]+)" access.log
```
Sample hit (truncated):
```
66.102.12.34 - - [18/Jul/2025:06:14:22 +0000] "GET /blog/ai-crawlers-guide HTTP/1.1" 200 8429 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.3; +https://openai.com/gptbot"
```
On Nginx or Apache with combined logging, the first whitespace-separated field is the IP and the ninth is the status code, both useful for spotting 4xx blocks. Pipe to cut or awk to build a daily crawl-frequency report.
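If you'd rather not chain cut and awk, here is a minimal Python sketch of the same daily crawl-frequency report. It assumes combined log format, and the `AI_BOT_TOKENS` tuple is my own shortlist taken from the directory table, not an official registry, so extend it to match the bots you care about.

```python
import re
from collections import Counter

# Assumption: a hand-picked subset of tokens from the directory table.
AI_BOT_TOKENS = ("GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
                 "Claude-SearchBot", "PerplexityBot", "Bytespider", "CCBot")

# Combined log format: ip ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<day>[^:]+):[^\]]+\] '
    r'"[^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

def daily_bot_hits(lines):
    """Count requests per (day, bot) for lines whose user-agent names a known AI bot."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in combined format
        ua = match.group("ua").lower()
        for token in AI_BOT_TOKENS:
            if token.lower() in ua:
                hits[(match.group("day"), token)] += 1
                break  # count each request once, under the first matching token
    return hits
```

Pass it any iterable of log lines (an open file handle works) and sort the resulting counter by day to see which bots are ramping up. Like the grep patterns, it only sees bots that announce themselves.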
One thing the logs won't help with: ByteSpider, Grok, and DeepSeek. ByteSpider announces itself but ignores your robots.txt, so a grep match tells you it's there, not that a rule will stop it. You need a WAF or firewall block. Grok and DeepSeek mostly arrive disguised as ordinary browser traffic, so they never show up in a user-agent filter.
(I spent a whole afternoon chasing what I was sure was a Grok crawl in our logs. It was a misconfigured uptime monitor hammering one endpoint. The real Grok traffic, if it was even there, looked exactly like Chrome. Which is precisely the problem.)
Tip: a spike of 4xx responses to an AI bot is a lost branding opportunity. Fix robots rules or caching errors before the crawler downgrades your domain in its freshness queue.
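To put a number on that, a rough sketch like the one below reports what share of one bot's requests came back 4xx. It assumes combined log format again; what counts as a "spike" is your call, since I don't have a threshold I'd defend.

```python
import re

# Pulls the status code and user-agent out of the tail of a combined-format log line.
STATUS_AND_UA = re.compile(r'" (\d{3}) \S+ "[^"]*" "([^"]*)"$')

def client_error_share(lines, bot_token):
    """Return the fraction of a bot's requests that got a 4xx, or None if it made none."""
    total = errors = 0
    for line in lines:
        match = STATUS_AND_UA.search(line.rstrip("\n"))
        if not match or bot_token.lower() not in match.group(2).lower():
            continue  # unparseable line, or a different client
        total += 1
        if match.group(1).startswith("4"):
            errors += 1
    return errors / total if total else None
```

Run it per bot and per day; a share that jumps after a deploy or a robots.txt change is the thing worth chasing.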
This is the call everything else hangs on, so I'll be candid: nobody has a clean answer yet, and I'm skeptical of anyone who claims they do.
The debate is loud. Some owners block GPTBot in robots.txt, reasoning that OpenAI trains on their content without payment or attribution. That's a legitimate stance, and publishers like the New York Times have taken it. Others allow GPTBot freely, betting that early inclusion in the model's knowledge compounds into a visibility advantage. Both camps have smart people in them.
Here's what I've actually been able to pin down, and what I haven't.
What I'm confident about: blocking GPTBot does not hurt your Google rankings. GPTBot is OpenAI's crawler and plays no part in Google's index. The same goes for Google-Extended, which is controlled separately from Googlebot: blocking one doesn't touch the other. Google documents this plainly. So the "but won't I tank my SEO?" fear, which I hear constantly, is unfounded.
What I think I'm seeing but can't prove: sites that allow GPTBot and have well-structured content seem to surface more often in ChatGPT responses for related questions. I measure that through manual spot-checks and our AISO monitoring, not an official API. The correlation could be coincidental. My sample for this specific observation is roughly 40 sites, nowhere near enough to claim an effect size, so I won't.
What I genuinely don't know: whether blocking GPTBot now and unblocking later leaves any lasting mark on how the model treats your domain; whether GPTBot honors robots.txt every single time; and whether being in the training data actually drives citations, or whether citations come only from the real-time search layer. On compliance, the official line is yes, but a September 2025 paper on robots.txt governance found measurable non-compliance edge cases across crawlers. (Worth flagging: OpenAI quietly revised its docs in December 2025 to drop the claim that ChatGPT-User respects robots.txt; the guarantee now only covers OAI-SearchBot and GPTBot.)
My current play, and it's a bet rather than a certainty: allow the search and citation bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot) plus GPTBot on your public content, and block everything on gated or proprietary material. The logic is asymmetric risk. If AI search becomes a major channel, you want to be in the index and the training set. If it fizzles, you've lost nothing but a few crawl requests. ByteSpider is the exception: if you want it gone, robots.txt won't do it, so reach for a firewall rule. Ask me again in six months and the answer might have moved.
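Here is one way that bet could look as a robots.txt, with a quick local check using Python's standard `urllib.robotparser`. The `/members/` path is a hypothetical stand-in for whatever gated material you have, and the bot list is just the four named above. Treat it as a sketch for sanity-checking your own rules, not a recommended file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: "/members/" stands in for your gated or proprietary paths.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
Disallow: /members/

User-agent: *
Disallow: /members/
"""

def build_parser(robots_txt):
    """Parse robots.txt text locally, without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

def policy_report(robots_txt, bots, paths):
    """Map each (bot, path) pair to True if the rules allow the fetch, else False."""
    parser = build_parser(robots_txt)
    return {(bot, path): parser.can_fetch(bot, path) for bot in bots for path in paths}
```

Two caveats. Python's parser matches user-agent names by substring and applies the first matching rule, which may differ from how a given crawler reads the same file. And a passing check only tells you what the file says, not whether the bot obeys it.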
This table is built from what I've watched in log analysis across SEOJuice customer sites. The "content priority" and "media appetite" columns are my interpretation of behavior patterns, not official documentation. None of these companies publish detailed specs on what their crawlers prefer. Read it as a working hypothesis.
| Crawler | Content Priority | JS Rendering | Freshness Bias | Media Appetite |
|---|---|---|---|---|
| GPTBot (OpenAI) | Text, code snippets, metadata | No (HTML only) | Revisits updated pages often | Low (images skipped much of the time) |
| ClaudeBot (Anthropic) | Context-rich text and images | No | Prefers new articles (under 30 days) | High (a meaningful share of requests are images) |
| PerplexityBot | Factual paragraphs, clear headings | No | Moderate; real-time for news | Medium; looks for diagrams |
| Google-Extended | Well-structured HTML, schema | Yes (renders JS) | Mirrors Google crawl cadence | Medium |
| BingBot (Copilot) | Long-form text and sitemap hints | Yes | High for frequently updated sites | Medium |
| CCBot (Common Crawl) | Bulk text for open corpora | No | Monthly crawl cycles | Low |
If the table is right even directionally, the moves it implies are unglamorous. Text-heavy bots like GPTBot and PerplexityBot reward clear headings and a concise summary up top. The image-hungry one (ClaudeBot) actually parses alt text, so descriptive alt attributes earn you context you'd otherwise lose. And the JS-capable bots (Google-Extended, BingBot) still prefer server-rendered speed, while heavy client-side rendering hides content from everyone else. None of this is exotic. It's the same hygiene that's always made pages legible, except you're now serving a second audience that can't see your JavaScript.
Designing for AI visibility starts in the markup and ends on the server. Get either layer wrong and GPTBot, ClaudeBot, or Google-Extended will skim, stumble, and move on. The good news is the work is mostly things a careful editor already does.
Headline hierarchy. Treat H1 through H3 as a table of contents for language models. One H1 that states the topic, H2 sections that each answer a discrete sub-question, optional H3s for supporting detail. Skip levels or stack multiple H1s and the crawler loses the thread.
```html
<h1>AI Crawler Directory 2025</h1>
<h2>What Is an AI Crawler?</h2>
<h2>Complete List of AI User-Agents</h2>
<h3>OpenAI GPTBot</h3>
<h3>Anthropic ClaudeBot</h3>
<h2>How to Optimise Your Site</h2>
```
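The outline rule (exactly one H1, no skipped levels) is easy to check mechanically. A small sketch using Python's built-in `html.parser` follows; the function name and the wording of the messages are mine, and it only looks at heading tags, not at whether the headings say anything useful.

```python
from html.parser import HTMLParser

class HeadingCollector(HTMLParser):
    """Collect heading levels (1-6) in document order."""
    def __init__(self):
        super().__init__()
        self.levels = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.levels.append(int(tag[1]))

def heading_problems(html):
    """Return a list of human-readable problems with the heading outline."""
    collector = HeadingCollector()
    collector.feed(html)
    problems = []
    h1_count = collector.levels.count(1)
    if h1_count != 1:
        problems.append(f"expected exactly one h1, found {h1_count}")
    # Going deeper by more than one level at a time means a level was skipped.
    for previous, current in zip(collector.levels, collector.levels[1:]):
        if current > previous + 1:
            problems.append(f"h{previous} is followed by h{current}, skipping a level")
    return problems
```

An empty list means the outline passes both checks; the example hierarchy in this section does.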
Lead summaries. Open every article with two or three sentences that state the answer up front. Models often clip only the first 300–500 characters for a citation. Bury the lead and they'll quote someone who didn't.
Schema and FAQ blocks. Wrap definitions, how-tos, and product specs in FAQPage, HowTo, or Product schema. Structured data is a signpost in an otherwise dim crawl. For FAQ, embed the Q&A inline so crawlers capture the context in one request.
```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [{
    "@type": "Question",
    "name": "What is GPTBot?",
    "acceptedAnswer": {
      "@type": "Answer",
      "text": "GPTBot is OpenAI's primary web crawler used to train ChatGPT."
    }
  }]
}
</script>
```
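If your FAQ lives in a CMS or a data file, you can generate that block instead of hand-writing it. A minimal sketch with Python's standard `json` module; `faq_jsonld` is a hypothetical helper name, and it covers only the FAQPage shape shown above.

```python
import json

def faq_jsonld(pairs):
    """Build a schema.org FAQPage script tag from (question, answer) pairs."""
    data = {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }
    # Escape "</" so answer text cannot close the script element early.
    body = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{body}\n</script>'
```

Render the returned string into the same HTML response as the visible Q&A, so the markup and the text arrive in one request.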
(Honest aside on why listicles and definition pages keep winning here: it isn't a trick, it's that their shape (numbered H2s, a clean "What is X?" answer in paragraph one) happens to match the question-answer pairs an LLM is assembling. The format that's always been easy for skimming humans is the format that's easy for the model too.)
Server-side rendering. Most AI bots can't, or won't, run client-side JavaScript. Pre-render critical content and ship complete HTML. Next.js or Nuxt with SSR turned on solves this without a rebuild. The one apparent exception is Google-Extended, which does seem to render JS based on the JS-heavy pages it indexes across our customer sites, but I wouldn't bet the others do, so serve HTML and don't cross your fingers.
Alt text. ClaudeBot pulls images at high rates and reads the alt attribute for context. "GPTBot crawling diagram showing request paths" earns you a description; an empty alt makes the graphic invisible to the crawler.
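A quick way to find the invisible graphics is to list every image whose alt is missing or blank. This sketch uses Python's built-in `html.parser`; the class and function names are illustrative.

```python
from html.parser import HTMLParser

class AltTextAudit(HTMLParser):
    """Record the src of every <img> whose alt attribute is missing or blank."""
    def __init__(self):
        super().__init__()
        self.missing_alt = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        alt = attributes.get("alt") or ""  # covers a missing alt and a valueless alt
        if not alt.strip():
            self.missing_alt.append(attributes.get("src", "(no src)"))

def images_missing_alt(html):
    """Return the src values of images that give a crawler no alt text to read."""
    audit = AltTextAudit()
    audit.feed(html)
    return audit.missing_alt
```

Treat the output as a review list, not a bug list: an empty alt is the correct choice for purely decorative images, so only the ones that carry meaning need a description.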
Clean URLs and light assets. Short hyphenated slugs (/ai-crawler-list, not /blog?id=12345&ref=xyz) reduce crawl friction. Bots that hit a slow server tend to back off their crawl frequency, so enable Brotli or Gzip, use WebP/AVIF, and lazy-load below-fold media. The performance targets are the same ones your human visitors benefit from:
| Metric | Target |
|---|---|
| LCP | < 2.5 s |
| INP | < 200 ms |
| CLS | < 0.1 |
Hit those and both human readers and AI crawlers consume your content without friction.
Blocking AI crawlers does not hurt your Google rankings. GPTBot and Googlebot are separate crawlers, and so is Google-Extended (the bot that feeds Gemini). Blocking GPTBot or Google-Extended in robots.txt has no documented effect on traditional search rankings. You're only choosing whether your content trains or grounds AI models, not whether it ranks.
In my logs and in Cloudflare's 2025 data, OpenAI's bots (GPTBot and OAI-SearchBot) and Anthropic's ClaudeBot are consistently the most active, all sitting in the top five AI crawlers by request volume. Across the web, AI bots combined averaged about 4.2% of HTML requests in 2025.
Some crawlers, ByteSpider being the documented example, don't honor robots.txt, so you can't stop them with a rule in that file. Block them at the server level instead: a WAF, a firewall rule, or a user-agent deny rule in Nginx or Apache. Grok and DeepSeek complicate this further by arriving with spoofed or absent user agents, so they need behavioral or rate-based detection, not a name match.
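If you can't touch the WAF or the web server config, the same user-agent deny idea can live in the application. Below is a minimal sketch as WSGI middleware, assuming a Python WSGI app. `DENIED_TOKENS` is a hypothetical list, and as with any name match it only catches bots that announce themselves.

```python
# Hypothetical deny list; spoofed or absent user-agents will sail straight past it.
DENIED_TOKENS = ("bytespider",)

def is_denied(user_agent):
    """True if the User-Agent header contains a denied token (case-insensitive)."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in DENIED_TOKENS)

def deny_bots(app):
    """Wrap a WSGI app so denied user-agents get a 403 before the app runs."""
    def middleware(environ, start_response):
        if is_denied(environ.get("HTTP_USER_AGENT")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware
```

Blocking at the edge is still cheaper, because the request never reaches your app at all. Use this as a fallback, not a replacement for a firewall rule.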
Being in the training data doesn't necessarily get you cited, and this is where I'd push back on confident claims. Training data (GPTBot, ClaudeBot, CCBot) shapes what a model knows; the live search and citation bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot) decide what gets quoted in real time. The two layers overlap but aren't identical, and I can't yet measure how much training inclusion drives citations on its own.
AI crawlers stopped being side traffic the moment they started pulling Googlebot-sized volume. They're feeder pipes into every chat window, voice assistant, and AI search panel your customers consult, harvesting text, schema, and images to decide which brands speak for the category.
The upside is concrete and the inputs are boring: server-side rendering, clean headings, AI-friendly schema, a deliberate robots.txt. Get those right while only a slice of sites bother, and your expertise becomes the quote the assistants repeat.
Temper the urgency with realism, though. We don't fully understand how these models weight sources, and the roster turns over every quarter as new crawlers launch and old ones (anthropic-ai, claude-web, CCBot/1.0) get retired. What I'll say without hedging is that the basics (clean HTML, fast servers, descriptive headings, a thought-through robots.txt) pay off no matter where AI search lands. Worst case, you also improve your traditional SEO.
So audit your logs this week. Welcome the right bots, fix the content signals they crave, and watch how often your brand shows up in AI answers over the next quarter.
Run the free AI Crawler Inspector to see exactly which AI bots are hitting your site right now and whether your robots.txt is letting the right ones through.