seojuice

What Is Googlebot? Crawl, Render, Index Explained

Vadim Kravcenko
May 01, 2026 · 9 min read

TL;DR. Googlebot is not one bot but a family of crawlers, and Googlebot Smartphone has driven almost everything since Google finished the move to mobile-first indexing in 2023. Its job runs in three phases (crawl, render, index) that can be hours or days apart, and the render phase is where most "Googlebot can't see my page" complaints actually live. Across SEOJuice support tickets between mid-2024 and early 2026, roughly 6 in 10 indexing escalations turned out to be render-phase issues and about 2 in 10 were crawl-phase problems; the rest were noindex tags or robots.txt mistakes. This guide covers the bot family, the three-phase pipeline, verification, what AI summaries get wrong, and how Googlebot compares to the AI crawlers in 2026.

Updated May 2026. Refreshed with SEOJuice ticket-mix data, a Cloudflare bot-fight anecdote, an AI-Overview drift section, and two callbacks to our Web Bot Auth (RFC 9421) writeup from yesterday.

I wrote this because I send the same explainer in Intercom three or four times a week. The customer says "Googlebot is blocked from my site." We open Search Console. The crawl phase is fine. The render phase fell over when a developer pushed a tab-panel refactor and didn't notice the article body now mounts after a click. I wanted one URL I could paste into the ticket.

What Googlebot Actually Is

Googlebot is the program Google uses to fetch web pages so they can be added to the Google index. When you publish a blog post and it shows up in search results, that journey starts with Googlebot requesting the URL, downloading the HTML, executing JavaScript, and passing the result to Google's indexing system.

"Googlebot" is sometimes used loosely to mean "any Google crawler." Strictly, it's the crawler that fetches pages for the main search index. Other Google crawlers and tokens exist (AdsBot for landing-page quality, Storebot for Shopping, and Google-Extended, a robots.txt token for AI training opt-outs that does no crawling of its own) and they follow different rules. Be specific when debugging.

Googlebot is distinct from a scraper. It checks URLs against robots.txt before crawling them, respects noindex, throttles when your server slows, and identifies itself so you can verify the hit. The new HTTP Message Signatures for crawlers proposal aims to make this verification cryptographic instead of DNS-based, but until adoption is universal the reverse-DNS check below is the operative test.

The Googlebot Family: Variants Compared

The bot you most need to think about is Googlebot Smartphone, which has crawled the mobile version of your site by default since Google completed mobile-first indexing in 2023. Desktop crawls still happen, but they are now the secondary case. The family tree, using Google's published user-agent reference:

Seven-row table of Googlebot variants (Smartphone, Desktop, Image, Video, News, Inspection Tool, Google-Extended) with what each fetches, its user-agent fragment, renderer, and status
The Googlebot family in 2026 — not one bot, a family of fetchers with different jobs. Get the wrong one in your logs and you'll waste hours diagnosing the wrong rendering behavior.
| Crawler | User-agent token | Renders JS? | What it indexes | Share of crawl |
| --- | --- | --- | --- | --- |
| Googlebot Smartphone | Googlebot/2.1 (Mobile) | Yes | Mobile pages for the primary index | ~80%+ post mobile-first |
| Googlebot Desktop | Googlebot/2.1 | Yes | Desktop variants for the same index | ~10-15% |
| Googlebot Image | Googlebot-Image/1.0 | No | Images for Google Images | Variable |
| Googlebot Video | Googlebot-Video/1.0 | No | Video files for Google Videos | Variable |
| Googlebot News | No distinct UA | Yes (uses Smartphone) | News-eligible pages | Site-dependent |
| Google-InspectionTool | Google-InspectionTool/1.0 | Yes | URL Inspection in Search Console | On-demand only |
| Google-Extended | Google-Extended | N/A | Read-only flag for Gemini training opt-out | No crawl |

The Chromium version inside Googlebot is not fixed. Google substitutes current stable Chrome at request time, and the renderer tracks public Chrome within a few weeks. (For years I told customers to treat the renderer as Chrome 41, which it actually was until the 2019 evergreen update. I kept giving outdated advice into 2021 before a Martin Splitt talk on Search Off the Record set me straight.) Identify Googlebot by verified IP, not UA string.
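As a rough illustration of reading the tokens from the variant table out of a log line, here is a minimal Python sketch. The function name `classify_google_ua` and the "Mobile" substring heuristic are assumptions made for this example, not Google's own logic. It only reports what a user agent claims to be; the claim still has to be verified by IP.

```python
from typing import Optional

# Tokens taken from the variant table; the more specific tokens are
# tested before the generic "Googlebot" token that they also contain.
_SPECIFIC_TOKENS = [
    ("Google-InspectionTool", "Google-InspectionTool"),
    ("Googlebot-Image", "Googlebot Image"),
    ("Googlebot-Video", "Googlebot Video"),
]


def classify_google_ua(user_agent: str) -> Optional[str]:
    """Return the variant a user-agent string CLAIMS to be, or None.

    This reads the claim only. A spoofed string classifies just as well
    as a real one, so pair it with IP verification.
    """
    for token, name in _SPECIFIC_TOKENS:
        if token in user_agent:
            return name
    if "Googlebot" not in user_agent:
        return None
    # Assumption for this sketch: the smartphone crawler advertises a
    # mobile browser, the desktop crawler does not.
    return "Googlebot Smartphone" if "Mobile" in user_agent else "Googlebot Desktop"


print(classify_google_ua(
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
))
```

Grouping log lines by this label before you look at rendering behavior keeps Image or Inspection Tool hits from being mistaken for the main crawler.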

The Three-Phase Pipeline: Crawl, Render, Index

Googlebot's job splits into three distinct phases. They do not happen at the same time, and a delay or failure in any one can keep your page out of search results. Google's JavaScript SEO docs describe it cleanly: "Google processes JavaScript web apps in three main phases: 1. Crawling 2. Rendering 3. Indexing." If you cannot name which of these phases a problem lives in, you are guessing about the fix.

Three-stage horizontal pipeline diagram of Googlebot processing a page: crawl (HTTP fetch), render (headless Chromium DOM), index (canonical and quality), with failure modes listed under each phase
What Googlebot actually does, in three phases. Each is a different machine with its own failure modes. Most debugging starts with: which phase is failing?

Phase 1: Crawling

Googlebot picks a URL from its queue, sends an HTTP request, and receives the raw HTML. No JavaScript runs yet. The crawler reads the status code, the headers (caching, X-Robots-Tag, redirects), and the raw HTML body. URLs come from XML sitemaps, internal links from indexed pages, external links from other sites, and direct submissions via URL Inspection. Before any fetch, Googlebot checks the URL against the site's robots.txt (using a cached copy, which Google generally refreshes within 24 hours); if a URL is disallowed, the fetch never happens.
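To make the crawl-phase signals concrete, here is a small Python sketch that classifies a raw response from its status code and headers alone, which is all the crawler has at this point. The function name and the verdict strings are made up for illustration, and the sketch ignores user-agent-scoped X-Robots-Tag values such as "googlebot: noindex".

```python
def crawl_phase_verdict(status: int, headers: dict) -> str:
    """Classify a response using only what the crawl phase can see."""
    # Header names are case-insensitive, so normalise before lookup.
    lowered = {key.lower(): value for key, value in headers.items()}

    if 300 <= status < 400:
        return "redirect to " + lowered.get("location", "(no Location header)")
    if status >= 400:
        return f"crawl error ({status})"

    # X-Robots-Tag carries comma-separated directives; "none" is
    # shorthand for "noindex, nofollow".
    directives = {
        part.strip().lower()
        for part in lowered.get("x-robots-tag", "").split(",")
    }
    if directives & {"noindex", "none"}:
        return "fetched, but X-Robots-Tag forbids indexing"
    return "fetched, eligible for render and index"


print(crawl_phase_verdict(200, {"X-Robots-Tag": "noindex, nofollow"}))
```

Running this over the response you get from a plain HTTP client is a quick way to rule the crawl phase in or out before looking at rendering.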

Phase 2: Rendering

If a page needs JavaScript executed to show its content, Googlebot hands the URL to the Web Rendering Service (WRS), a headless Chromium that loads the page, runs the scripts, and produces the final rendered HTML. Google's docs: "Once Google's resources allow, a headless Chromium renders the page and executes the JavaScript."

"Once Google's resources allow" is doing a lot of work in that sentence. Rendering is expensive, so Google batches and queues it. Pages can sit in the render queue for seconds, hours, or in worst cases days. I have a 2024 screenshot of a 96-hour gap between crawl and render on a Next.js e-commerce site we audited. Official guidance: "The page may stay on this queue for a few seconds, but it can take longer than that." Queue prioritisation is undocumented.

Pure server-side rendered pages skip this queue entirely. That choice is the difference between "indexed within an hour" and "indexed two days later."
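One way to tell which side of that line a page sits on is to check whether a key phrase is already in the raw HTML text, outside any script. The sketch below uses Python's standard `html.parser`; the helper name `phrase_in_raw_html` and both sample pages are hypothetical. It approximates "content visible without rendering" and does not emulate the WRS.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, ignoring <script>, <style>, and <template>."""

    SKIPPED = {"script", "style", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def phrase_in_raw_html(raw_html: str, phrase: str) -> bool:
    """True if the phrase appears in the markup text before any JS runs."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split())
    return phrase.lower() in text.lower()


# Hypothetical pages: one server-rendered, one that mounts content via JS.
SSR_PAGE = "<html><body><article><p>Plans start at $10.</p></article></body></html>"
CSR_PAGE = (
    '<html><body><div id="root"></div>'
    '<script>render({"body": "Plans start at $10."})</script></body></html>'
)

print(phrase_in_raw_html(SSR_PAGE, "Plans start at $10."))
print(phrase_in_raw_html(CSR_PAGE, "Plans start at $10."))
```

If the phrase is only found inside a script, the page depends on the render queue for that content.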

Phase 3: Indexing

Once Googlebot has the final HTML (from the crawl, or from the WRS after rendering), the indexing system parses the document, extracts text, classifies content, evaluates ranking signals, and stores it in Google's index. The page becomes eligible for search results. Indexing isn't instant; it can take additional minutes or hours after rendering.

What Breaks JavaScript Rendering

The crawl phase almost always succeeds; the page just doesn't render the way the developer expected. Six failure modes, in decreasing order of frequency on customer sites. Items 1 and 2 alone account for over half of the render-phase escalations we triage.

Six-card grid of JavaScript rendering failure modes (content behind interaction, blocked subresources, unhandled JS errors, long-running render, lazy-load gating, cookie consent walls) with symptom, detection, and fix for each
Six failure modes that ship content as invisible to the indexer even when the page renders fine in your browser. Each card lists the symptom, how to detect it, and the fix.

1. Content that requires user interaction to load

If clicking a "Show More" button is the only way to reveal a section, Googlebot won't see it. The WRS executes JavaScript but does not click buttons or scroll. Anything important should be in the DOM at load time, even if hidden via CSS the user can toggle. This is the single most common rendering failure, usually appearing in component libraries that lazy-mount tab panels, accordion bodies, and "load more" feeds.

2. Lazy-loading without proper signals

Lazy-loaded images and content blocks need either native loading="lazy" or an Intersection Observer setup the WRS can resolve. Custom lazy-loading that waits for scroll events fails under WRS because there is no scroll. For components, ensure they render server-side or use a framework with proper SSR/hydration.

3. JavaScript errors during execution

If a top-of-page script throws, downstream scripts may not run, leaving the rest of the page empty. The WRS sees whatever was rendered before the exception. Use URL Inspection's "View Tested Page" to see what Googlebot saw.

4. WAF and bot-protection rules

CAPTCHAs, Cloudflare bot fight mode set too aggressively, and naive geographic blocking can serve a 403 to Googlebot. Cloudflare's default bot-fight mode has bitten more customer sites than any other setting we debug; one B2B SaaS lost two-thirds of its indexed pages over a weekend in late 2024 after a security-team intern toggled it on, and recovery took three weeks. Whitelist verified Google IP ranges (googlebot.json) before any "block bots" feature.
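A sketch of the allowlist check, assuming the published file keeps its `prefixes` list of `ipv4Prefix` / `ipv6Prefix` entries. The JSON embedded here is a made-up sample in that shape, not the live file; in real use you would load the current googlebot.json from Google and refresh it on a schedule.

```python
import ipaddress
import json

# Sample data shaped like googlebot.json. The values are illustrative
# only; always load the current published file instead.
SAMPLE_GOOGLEBOT_JSON = """
{
  "prefixes": [
    {"ipv4Prefix": "66.249.64.0/27"},
    {"ipv6Prefix": "2001:4860:4801:10::/64"}
  ]
}
"""


def load_networks(raw_json: str) -> list:
    """Parse the prefixes list into ip_network objects."""
    networks = []
    for entry in json.loads(raw_json)["prefixes"]:
        prefix = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        networks.append(ipaddress.ip_network(prefix))
    return networks


def ip_in_ranges(ip: str, networks: list) -> bool:
    """True if the address falls inside any listed network."""
    address = ipaddress.ip_address(ip)
    return any(
        net.version == address.version and address in net
        for net in networks
    )


NETWORKS = load_networks(SAMPLE_GOOGLEBOT_JSON)
print(ip_in_ranges("66.249.64.5", NETWORKS))
print(ip_in_ranges("203.0.113.9", NETWORKS))
```

The same membership test can feed whatever allow rule your WAF supports, so verified crawler traffic is exempted before the blocking features run.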

5. Resources blocked in robots.txt

If your robots.txt disallows /static/ or /assets/, the WRS can't fetch the JS and CSS bundles, and your page renders without styles or with broken JavaScript. Allow Googlebot to crawl static asset paths.
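A quick audit for this failure mode, sketched with Python's standard `urllib.robotparser`. The robots.txt body and asset URLs are hypothetical. One caveat: Python's parser applies rules in file order, while Google uses the most specific matching rule, so treat the result as a first pass on files that mix Allow and Disallow for overlapping paths.

```python
from urllib.robotparser import RobotFileParser


def blocked_assets(robots_txt: str, asset_urls: list, agent: str = "Googlebot") -> list:
    """Return the asset URLs that the given agent is not allowed to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in asset_urls if not parser.can_fetch(agent, url)]


# Hypothetical robots.txt that blocks the bundle directory.
ROBOTS = """\
User-agent: *
Disallow: /static/
Disallow: /admin/
"""

# Hypothetical subresources referenced by a page.
ASSETS = [
    "https://example.com/static/app.js",
    "https://example.com/static/site.css",
    "https://example.com/img/logo.png",
]

for url in blocked_assets(ROBOTS, ASSETS):
    print("blocked for rendering:", url)
```

Feeding it the script and stylesheet URLs from a page's source shows which bundles the rendering service would be refused.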

6. Content gated behind authentication or cookies

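Googlebot does not authenticate, does not carry cookies from one page load to the next, and does not maintain session state. Anything behind a login wall will not be indexed. For paywalled content you need discoverable, use Google's paywalled-content structured data (isAccessibleForFree); the Indexing API is limited to job-posting and livestream pages, so it will not help here.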

How to Verify a Request Really Is Googlebot

The Googlebot user-agent string is trivially spoofable. Real Googlebot requests come from a published range of Google-owned IPs. The reliable verification is reverse DNS followed by forward DNS:

Four-step horizontal flow for verifying Googlebot: capture IP, reverse DNS lookup, forward-confirm DNS, check googlebot.com or google.com hostname suffix, plus pass/fail outcomes
The four-step Googlebot verification: capture IP, reverse-DNS, forward-confirm, suffix check. Forward and reverse must agree before you trust the user agent.
  1. Take the IP from the access log.
  2. Reverse DNS — hostname should end in .googlebot.com or .google.com.
  3. Forward DNS on that hostname — it should resolve back to the same IP.
  4. Both pass: real Googlebot. Either fails: spoof.

Command line: host 66.249.66.1 then host crawl-66-249-66-1.googlebot.com. Automate this in your log pipeline; you'll be surprised how often "Googlebot crawl spike" turns out to be a scraper using the user-agent.
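The same reverse-then-forward check can be sketched in Python with the standard `socket` module. The function name `is_verified_googlebot` and the injectable `reverse` / `forward` parameters are choices made for this example so the logic can be exercised without live DNS; a production pipeline would also cache results and handle IPv6 text normalisation.

```python
import socket

GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def _reverse(ip: str) -> str:
    """PTR lookup: IP address to hostname."""
    return socket.gethostbyaddr(ip)[0]


def _forward(hostname: str) -> set:
    """Forward lookup: hostname to the set of addresses it resolves to."""
    return {info[4][0] for info in socket.getaddrinfo(hostname, None)}


def is_verified_googlebot(ip: str, reverse=_reverse, forward=_forward) -> bool:
    """Both lookups must agree before the user agent is trusted."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
        if not hostname.endswith(GOOGLE_SUFFIXES):
            return False  # reverse DNS points somewhere else
        return ip in forward(hostname)  # forward DNS must confirm the IP
    except OSError:
        # socket.herror and socket.gaierror are OSError subclasses;
        # a failed lookup is treated as "not verified".
        return False


# Offline demonstration with stub resolvers (hypothetical records).
_ptr = {
    "66.249.66.1": "crawl-66-249-66-1.googlebot.com",
    "203.0.113.7": "crawl-66-249-66-1.googlebot.com",  # spoofed PTR
}
_a = {"crawl-66-249-66-1.googlebot.com": {"66.249.66.1"}}


def stub_reverse(ip):
    return _ptr[ip]


def stub_forward(hostname):
    return _a.get(hostname, set())


print(is_verified_googlebot("66.249.66.1", stub_reverse, stub_forward))
print(is_verified_googlebot("203.0.113.7", stub_reverse, stub_forward))
```

The second stub address shows why the forward step matters: anyone who controls the reverse zone for their own IP can publish a PTR record that ends in googlebot.com, but they cannot make Google's forward zone point back at them.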

Reverse-DNS is the operative standard, but what it proves is "Google owns this IP range," not "this request was signed by Google." That gap is what the Web Bot Auth (RFC 9421) proposal addresses, by having crawlers sign requests with HTTP Message Signatures the origin can verify against a published key. Google has been an early implementer in 2026; the companion piece walks through the signing flow.

robots.txt and Crawl Budget

For sites under ~10,000 URLs, crawl budget is almost never a constraint. It becomes real on large sites with millions of URLs, faceted-search e-commerce, or sites wasting crawls on duplicates. Google publishes two influences: crawl rate (how fast your server can respond without errors) and crawl demand (how popular the URL is and how often it changes).

On large sites, block faceted search URLs, internal site search results, paginated archives beyond page 5, session-ID parameters, and admin endpoints. Use robots.txt for crawl-time blocking and noindex for indexing-time blocking; they do different things.

To speed up indexing of a new page, submit it via URL Inspection (this uses Google-InspectionTool, not Googlebot) and link it from a high-authority indexed page.
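To find which of those patterns actually show up in a crawl log or sitemap export, a small URL classifier helps. This Python sketch uses only `urllib.parse`; every parameter name and path segment in it is a placeholder, so swap in the ones your site uses.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit

# Hypothetical names for illustration; replace with your site's own.
FACET_PARAMS = {"color", "size", "sort", "filter"}
SESSION_PARAMS = {"sid", "sessionid", "phpsessid"}
BLOCKED_SEGMENTS = {"search", "admin"}
MAX_ARCHIVE_PAGE = 5


def crawl_waste_reason(url: str) -> Optional[str]:
    """Return why a URL is a crawl-budget sink, or None if it looks fine."""
    parts = urlsplit(url)
    params = {key.lower(): values for key, values in parse_qs(parts.query).items()}

    first_segment = parts.path.strip("/").split("/")[0]
    if first_segment in BLOCKED_SEGMENTS:
        return "internal search or admin endpoint"
    if params.keys() & SESSION_PARAMS:
        return "session-ID parameter"
    if params.keys() & FACET_PARAMS:
        return "faceted-search parameter"

    page = params.get("page", ["1"])[0]
    if page.isdigit() and int(page) > MAX_ARCHIVE_PAGE:
        return "pagination beyond page 5"
    return None


print(crawl_waste_reason("https://shop.example/shoes?color=red&size=9"))
```

Counting the reasons across a week of verified Googlebot hits gives a rough picture of how much crawl activity goes to URLs you would rather block.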

What AI Overviews Get Wrong About Googlebot

Ask ChatGPT, Claude, or Google's own AI Overviews "what is Googlebot" and you'll get a confident answer roughly 80% correct and 20% subtly wrong. The recurring drift across four engines:

  • "Googlebot uses Chrome 41 to render pages." True until the May 2019 evergreen switch. The renderer now tracks stable Chrome within a few weeks. Any answer citing Chrome 41 was trained on documentation seven years out of date.
  • "Googlebot has a crawl budget for every site." Misleading. Crawl budget applies meaningfully to sites with millions of URLs. Telling a 200-page SaaS to "optimise crawl budget" is wasted effort.
  • "Block AI crawlers by changing Googlebot's robots.txt rules." No. Google-Extended is the separate token for Gemini training opt-out. Blocking Googlebot takes you out of search; blocking Google-Extended takes you out of AI training. Conflating them is a common mistake.
  • "You can speed up Googlebot crawl rate in Search Console." The crawl-rate limiter was deprecated in early 2024. The control no longer exists.
  • "Googlebot follows nofollow links for ranking." Since 2019, nofollow is a hint, not a directive. Googlebot may follow nofollow links for crawl discovery but does not pass ranking signal through them in most cases.

Cross-reference any specific technical claim against developers.google.com/search/docs/crawling-indexing. That's the primary source; everything else is a second-hand summary.

Debugging "Googlebot Can't See This Page"

Four checks, in order, until one returns a clear answer.

Check 1, URL Inspection in Search Console. Paste the URL. The tool tells you whether Google has crawled and indexed it and lets you "View Tested Page" to see the rendered HTML and a screenshot. If the rendered HTML is missing your content, the problem is in rendering. If the page returned a non-200, the problem is in crawling. This single check resolves roughly two-thirds of the tickets we run.

Check 2, curl with Googlebot's user-agent. Run curl -A "Mozilla/5.0 ... Googlebot/2.1 ..." https://yoursite.com/path and compare the response with a normal browser request. If your server returns different content for the Googlebot user agent than for a browser, user-agent-based cloaking or bot rules are the cause.

Check 3, robots.txt and meta tag audit. Visit https://yoursite.com/robots.txt directly and confirm the URL isn't blocked. View page source and search for noindex. A surprising fraction of "won't index" cases are noindex tags left over from staging.
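Searching the source for the string "noindex" by eye can produce false hits, for example when the word appears in body copy or a description. A slightly stricter check, sketched with Python's standard `html.parser`, looks only at robots meta tags. The class and function names are invented for this example, and the sketch does not cover the X-Robots-Tag response header.

```python
from html.parser import HTMLParser


class _RobotsMetaFinder(HTMLParser):
    """Collects directives from <meta name="robots"> and name="googlebot"."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in {"robots", "googlebot"}:
            for part in attr.get("content", "").split(","):
                self.directives.add(part.strip().lower())


def has_noindex(page_source: str) -> bool:
    """True if a robots meta tag carries noindex (or its shorthand, none)."""
    finder = _RobotsMetaFinder()
    finder.feed(page_source)
    finder.close()
    return bool(finder.directives & {"noindex", "none"})


print(has_noindex('<head><meta name="robots" content="noindex, nofollow"></head>'))
```

Run it against both the raw HTML and the rendered HTML from URL Inspection, since a script can inject or remove the tag after load.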

Check 4, server log analysis. Filter access logs for verified-Googlebot requests over the last 30 days. If the URL never appears, it's a discoverability problem. If it appears but returns 4xx/5xx, fix the error. SEOJuice runs verified-Googlebot log analysis on every connected site.
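A minimal sketch of that log filter in Python, assuming access logs in the common "combined" format. The regular expression, function names, and sample lines are illustrative. The `is_verified` argument stands in for whichever IP verification you run (reverse DNS or published ranges), date filtering is left out, and the third verdict is this example's own inference rather than part of the check as described.

```python
import re
from collections import Counter

# Combined Log Format:
# ip - - [time] "METHOD path HTTP/x" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?:\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def googlebot_statuses(lines, path, is_verified) -> Counter:
    """Count status codes for verified Googlebot requests to one path."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match or match["path"] != path or "Googlebot" not in match["ua"]:
            continue
        if is_verified(match["ip"]):
            counts[int(match["status"])] += 1
    return counts


def diagnose(counts: Counter) -> str:
    if not counts:
        return "never crawled: discoverability problem"
    if any(status >= 400 for status in counts):
        return "crawled with errors: fix the 4xx/5xx responses"
    return "crawled successfully: look at the render or index phase"


GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
SAMPLE_LOG = [
    f'66.249.66.1 - - [01/May/2026:10:00:00 +0000] "GET /pricing HTTP/1.1" 500 512 "-" "{GOOGLEBOT_UA}"',
    f'203.0.113.7 - - [01/May/2026:10:05:00 +0000] "GET /pricing HTTP/1.1" 200 2048 "-" "{GOOGLEBOT_UA}"',
    f'66.249.66.1 - - [01/May/2026:11:00:00 +0000] "GET /blog HTTP/1.1" 200 4096 "-" "{GOOGLEBOT_UA}"',
]

# Stand-in verifier: pretend only this address passed verification.
VERIFIED_IPS = {"66.249.66.1"}

print(diagnose(googlebot_statuses(SAMPLE_LOG, "/pricing", VERIFIED_IPS.__contains__)))
```

In the sample, the 200 response to the unverified address is ignored, which is the point of filtering on verified hits: the spoofed request would otherwise hide the 500 that the real crawler received.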

Googlebot vs Bingbot vs the AI Crawlers

Capability comparison table of six crawlers (Googlebot, Bingbot, GPTBot, PerplexityBot, ClaudeBot, Applebot) across JavaScript render, robots.txt opt-out, DNS verification, RFC 9421 signing, nofollow handling, crawl frequency, and downstream answer engine
Googlebot vs Bingbot vs four AI crawlers. The single biggest planning decision: GPTBot, PerplexityBot, and ClaudeBot have limited or no JavaScript rendering.
| Crawler | Operator | Renders JS? | Used for |
| --- | --- | --- | --- |
| Googlebot | Google | Yes (recent Chromium) | Google search index |
| Bingbot | Microsoft | Yes (Edge / Chromium) | Bing search index, Copilot grounding |
| GPTBot | OpenAI | Limited / no SPA support | ChatGPT training data |
| OAI-SearchBot | OpenAI | Limited | ChatGPT search retrieval |
| PerplexityBot | Perplexity | Limited | Perplexity answer engine |
| ClaudeBot | Anthropic | Limited | Claude training and retrieval |
| Google-Extended | Google | N/A (read-only signal) | Opt-out flag for Gemini training |

Failure modes 1, 2, and 5 above — user-interaction gating, lazy-load signals, blocked static assets — hit AI crawlers harder than Googlebot because their renderers are weaker. The same checklist works on a Perplexity-citation problem; the stakes are just lower for now.

If your content depends on client-side rendering, you may rank fine in Google but be invisible to ChatGPT, Perplexity, and Claude. The fix is the same: server-side render or pre-render. Our free AI visibility checker will tell you in under a minute whether the major AI engines can actually see your content. Separately, the AI crawlers each have their own robots.txt directives: User-agent: GPTBot blocks OpenAI training; User-agent: Google-Extended blocks Gemini training; User-agent: Googlebot still controls the regular search crawler, independently.
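The independence of those user-agent groups can be checked with Python's standard `urllib.robotparser`. The robots.txt body below is a hypothetical policy that opts out of OpenAI and Gemini training while leaving search crawling open; the helper name is made up for this sketch.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: opt out of AI training, stay in Google Search.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Googlebot
Allow: /
"""


def allowed_agents(robots_txt: str, url: str, agents: list) -> dict:
    """Map each user-agent token to whether it may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(
    ROBOTS_TXT,
    "https://example.com/blog/post",
    ["Googlebot", "GPTBot", "Google-Extended"],
)
for agent, allowed in result.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Each group is matched by its own token, so the Disallow under Google-Extended has no effect on Googlebot.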

"The thing about Googlebot people most often miss is that crawling and rendering are not the same step. A URL can be crawled and still not have a rendered version of the content for hours." — Martin Splitt, Google Search Relations, paraphrased from his recurring point on Search Off the Record.

Frequently Asked Questions

What is Googlebot?

Googlebot is the web crawler Google uses to discover and download web pages so they can be indexed and shown in search results. It's a family of crawlers (Smartphone, Desktop, Image, Video, News). Most discussion refers to Googlebot Smartphone, the primary crawler since mobile-first indexing completed in 2023.

Does Googlebot run JavaScript?

Yes. The Web Rendering Service is a headless Chromium that tracks recent stable Chrome. The catch is the rendering queue: even when JS rendering succeeds, it can happen seconds, hours, or days after the initial crawl. Server-side rendered pages skip this queue.

How do I check if a request is really from Googlebot?

Reverse DNS the IP. Real Googlebot hits resolve to hostnames ending in .googlebot.com or .google.com. Then forward-DNS that hostname; it should resolve back to the same IP. The user-agent header alone is not proof.

Can I block Googlebot?

Yes. User-agent: Googlebot + Disallow: / in robots.txt blocks crawling and therefore indexing. For finer control, use noindex tags or block specific paths. Don't block CSS and JS bundles; the rendering service needs them.

Is Googlebot the same as GPTBot or PerplexityBot?

No. Separate crawlers run by different companies. Googlebot indexes for Google Search; GPTBot collects ChatGPT training data; PerplexityBot retrieves for Perplexity's answer engine. Each has its own UA string and its own robots.txt rules.

Why hasn't Googlebot indexed my new page yet?

Common causes in order: the page isn't linked from any indexed URL, returns a non-200 status, has a noindex tag, is blocked by robots.txt, or depends on client-side JS the rendering service hasn't processed yet. Use URL Inspection to identify which.


What This Means for Your Site

If your content is JavaScript-dependent and your only health check is "does it rank in Google Search," you're optimising for the strongest renderer in the crawler ecosystem and ignoring everything else. The AI crawlers are weaker, and their share of referral traffic is growing every quarter we measure it. Server-side rendering is no longer a Google optimisation. It is an AI-visibility prerequisite. The render-queue prioritisation logic remains opaque — the docs say resource-dependent, but variance across near-identical sites suggests something else is in the mix.

Related reading: