
What Web Bot Auth Means If You're Already Blocking AI Crawlers

Vadim Kravcenko
May 16, 2026 · 12 min read

TL;DR: Web Bot Auth is RFC 9421 HTTP Message Signatures applied to crawler traffic. Bots sign their requests with a private key, publish public keys at a .well-known directory, and let you verify the signature instead of trusting the User-Agent header. Google publishes keys at agent.bot.goog for its AI-browsing agent today; Googlebot proper still isn't signed. The job for 2026 is to add a signature-verification path without ripping out reverse-DNS, because most Google-claiming traffic is still unsigned and will be for at least another year.

An operator I work with wrote a Cloudflare WAF rule in 2024. It blocks anything whose UA contains GPTBot, ClaudeBot, or PerplexityBot, and allows anything whose UA contains Googlebot. The rule held for two years. Then in spring 2026 a new user agent started showing up in the access logs, Google-Agent, with three unfamiliar headers attached: Signature-Agent, Signature-Input, and Signature. The 2024 rule has no opinion on those headers; it just reads the UA. That gap, between rules written against the old verification model and traffic arriving under the new one, is what this piece is about. Not "what is robots.txt," not "should I block AI crawlers." This is for operators who already maintain a bot-policy ruleset and need to know what Web Bot Auth changes for it.

What Web Bot Auth actually is

Web Bot Auth is a bot-flavored profile of RFC 9421 HTTP Message Signatures. The bot signs each outgoing request with a private key. The site fetches the bot's public key from a .well-known/http-message-signatures-directory URL on a domain the bot controls, then verifies that the signature on the incoming request was produced by the matching private key. If it was, the request's origin claim is cryptographically attested. If it wasn't, the request is forged.

Three new request headers do the work. Signature-Agent points to the bot's key directory. Signature-Input lists what's signed plus metadata: keyid, created and expires timestamps, algorithm, and the literal tag="web-bot-auth" string that marks this as a bot signature (not some other RFC 9421 use case). Signature carries the cryptographic bytes.

"This document describes a mechanism for creating, encoding, and verifying digital signatures or message authentication codes over components of an HTTP message." — A. Backman, J. Richer, M. Sporny, RFC 9421 (HTTP Message Signatures, abstract)

I'm confident about this layer because the spec is settled. RFC 9421 was published as an IETF Proposed Standard in February 2024: stable, implemented, not going to shift under you. The bot profile on top of the RFC is what makes this Web Bot Auth. The IETF draft draft-meunier-web-bot-auth-architecture nails down the conventions: the tag="web-bot-auth" string in the input, the well-known directory URL shape, the recommended covered components (at minimum @authority and signature-agent), and the expectation that the directory is cached against its own Cache-Control header. None of that is in RFC 9421 itself. RFC 9421 is the algebra; Web Bot Auth is the use case. (Side note: the draft is now backed by a chartered IETF working group, not a lone individual submission, which is a stronger standards-track signal than "it's just a draft" suggests.)

Why UA + IP verification stopped being enough

The verification stack has three layers, each with a known failure mode. User-Agent is text; anyone can set it. Reverse DNS works for Googlebot but is awkward for newer agent crawlers routed through general-purpose infrastructure. IP allowlists are brittle because cloud egress ranges shift without warning.

Johannes Ullrich at the SANS Internet Storm Center made the UA-spoofing point bluntly in a September 2025 diary entry: users figured out a long time ago that setting your user agent to Googlebot can get you past some paywalls. It's a one-line header edit that the entire allow-by-UA model trusts anyway.

The IP-allowlist side has a different but related problem. Cloudflare's Thibault Meunier and Mari Galicer, who shepherded the Web Bot Auth proposal at the IETF, framed it this way in their May 2025 post: "connections from the crawling service might be shared by multiple users, such as in the case of privacy proxies and VPNs, and these ranges, often maintained by cloud providers, change over time." An allowlist that was correct on Monday can be wrong by Friday.

The agent-traffic shift makes this worse: when a crawler acts on behalf of an individual user from inside a chat session, the source profile fragments. Cloudflare flagged the change directly: "Bots are no longer directed only by the bot owners, but also by individual end users to act on their behalf."

Four ways to verify a crawler claim. Web Bot Auth is the only rail that survives a UA spoofer behind a proxy with a new egress IP.

| Method | Trust | Operator cost | Fails on | Latency |
| --- | --- | --- | --- | --- |
| User-Agent string | Lowest | Free | Anyone can spoof; SANS notes the Googlebot UA has long bypassed paywalls | 0 ms |
| Reverse DNS + forward confirm | Medium | ~1 ms per request | Only works for crawlers with stable PTR records (Googlebot proper, Bingbot) | ~1-5 ms |
| IP allowlist (CIDR ranges) | Medium | List maintenance | Cloud egress ranges shift; shared with privacy proxies and VPNs | 0 ms |
| Web Bot Auth (RFC 9421) | High | Middleware + key cache | Only the bot operators that have published a key directory | Sub-millisecond (cached key) |

Cryptographic verification is the one rail that survives all three legacy failure modes. It doesn't care about the source IP, doesn't trust the UA, and doesn't need a reverse-DNS lookup. It cares about the key. One note on the latency column, since I want to be precise about where the precision ends: the verify itself is microseconds against a warm cache (ed25519 does tens of thousands of verifications per second per core), but the real added cost depends on your stack. The most credible published number I've found is a SeatGeek writeup on verifying signed ChatGPT traffic at their Kong gateway: roughly 0.6–0.9 ms per request, about 3% of average gateway time. Treat "sub-millisecond" as the honest claim and benchmark your own layer before you quote a number.

What a signed Google-Agent request actually looks like

A signed request from Google's agent follows the shape documented in Cloudflare's reference docs and Google's developer guide. Approximate form, with the keyid abbreviated:

GET /article/example HTTP/1.1
Host: yoursite.com
User-Agent: Mozilla/5.0 (compatible; Google-Agent/1.0; ...)
Signature-Agent: g="https://agent.bot.goog"
Signature-Input: sig=("@authority" "signature-agent")
 ;created=1735689600;keyid="poqkLGiymh_W0uP6PZFw-dvez3QJT5SolqXBCW38r0U"
 ;alg="ed25519";expires=1735693200;tag="web-bot-auth"
Signature: sig=:MEQCIBmw...truncated...:

Read it top to bottom. The literal g="https://agent.bot.goog" in Signature-Agent resolves to a directory at https://agent.bot.goog/.well-known/http-message-signatures-directory. Signature-Input attests the @authority derived component (the host name) and the signature-agent header itself, with keyid picking one key out of the directory. Signature carries the ed25519 signature bytes.

The thing nobody warns you about: that Signature-Input value wraps across physical lines in the raw request, and the first time I eyeballed one in an access log I assumed the log had truncated it. It hadn't. Those leading-space continuation lines are a single logical header field, and if your middleware does naive line-by-line parsing you'll silently drop half the parameters and then spend an afternoon wondering why tag reads as empty. Let a real structured-fields parser handle it; don't split on newlines yourself.
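If you are reading raw captures or logs that preserve those folded lines, the unfolding step can be sketched like this. It is a minimal illustration, assuming the input is a list of physical header lines; the function name is mine, and the joined value should still go to a real structured-fields parser afterwards.

```python
def unfold_header_lines(raw_lines):
    """Join continuation lines (leading space or tab) onto the previous
    header line so each logical field ends up as a single string."""
    logical = []
    for line in raw_lines:
        if line[:1] in (" ", "\t") and logical:
            # Continuation line: glue it to the field above, dropping the fold whitespace.
            logical[-1] += line.strip()
        else:
            logical.append(line.rstrip("\r\n"))
    return logical


if __name__ == "__main__":
    captured = [
        'Signature-Input: sig=("@authority" "signature-agent")',
        ' ;created=1735689600;keyid="example-keyid"',
        ' ;alg="ed25519";expires=1735693200;tag="web-bot-auth"',
    ]
    for field in unfold_header_lines(captured):
        print(field)
```

Only after this step does tag sit on the same logical line as the covered components, which is what the parser expects.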

Figure: annotated anatomy of a signed Web Bot Auth request, with the keyid, expires, tag, and alg parameters called out. Signature-Agent locates the public key, Signature-Input describes what's signed, Signature carries the bytes.

The semantics of each parameter, in table form, since this is the bit operators get wrong on a first read:

| Header / parameter | What it does | What you check |
| --- | --- | --- |
| Signature-Agent | Points to the bot's public-key directory | Is the URL one you trust? (For Google: https://agent.bot.goog) |
| Signature-Input covered components | Lists which parts of the request are signed | At minimum @authority and signature-agent should be present |
| keyid | Picks one key out of the directory (JWK thumbprint) | Does the directory have a key with this thumbprint? |
| created / expires | Validity window in seconds since Unix epoch | Is the request within the window? A request past expires is a hard fail |
| alg | Signature algorithm | Usually ed25519 in Web Bot Auth; your verifier needs that algorithm |
| tag | Profile marker | Must be the literal string web-bot-auth |
| Signature | The signature bytes | Verify with the public key matching keyid |

One caveat Google states directly: "Not all Google user agents are using Web Bot Auth." In May 2026 the user agent that consistently signs is Google-Agent, the AI-browsing agent behind Google's AI Mode features. Googlebot proper, the indexing crawler that drives most of your organic traffic, is not signed yet. Plan your rules accordingly.

The verification flow, end to end

The verification path is four steps, none of them expensive. The moving pieces are the directory cache and the algorithm library, not the math.

When a request arrives, look for a Signature-Agent header. If it's missing, the request is unsigned and you fall through to the legacy path (reverse DNS, UA, IP). Most requests are still in this bucket in 2026.

If the header is there, parse Signature-Input and pull out keyid, created, expires, alg, and tag. Reject anything where tag isn't the literal web-bot-auth string, and anything past expires. Both rejections happen before you touch the public key.
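Those two cheap rejections can be sketched as a single pre-check. This assumes params is a dict of already-parsed Signature-Input parameters (from a structured-fields parser), and the clock-skew allowance is my own assumption rather than anything the spec or Google prescribes.

```python
import time

REQUIRED_TAG = "web-bot-auth"
MAX_CLOCK_SKEW = 60  # seconds; hypothetical tolerance, tune for your fleet


def precheck(params, now=None):
    """Return None when the parameters pass the cheap checks,
    otherwise a short reason string. No key material is touched here."""
    now = int(time.time()) if now is None else now
    if params.get("tag") != REQUIRED_TAG:
        return "wrong-tag"
    created, expires = params.get("created"), params.get("expires")
    if not isinstance(created, int) or not isinstance(expires, int):
        return "missing-window"
    if now > expires:
        return "expired"
    if created > now + MAX_CLOCK_SKEW:
        # Signed "in the future" beyond the skew allowance.
        return "not-yet-valid"
    return None
```

Returning a reason instead of a bare boolean keeps the rejection counters readable later, when you want to know why signed traffic is failing.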

Only then do you fetch the public-key directory at the URL given by Signature-Agent. Honor the response's Cache-Control header; Google's directory sets one. Cache it in memory or Redis, refresh on expiry, delete keys that disappear across refreshes (key rotation), and pull out the key whose JWK thumbprint matches keyid.
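Two small helpers cover the directory-handling part of that step: reading the cache lifetime from Cache-Control, and matching keyid against JWK thumbprints. This is a sketch under assumptions: the directory is treated as a JWK Set with a top-level "keys" list, only OKP (ed25519) keys are considered, and the function names are mine.

```python
import base64
import hashlib
import json
import re


def max_age_seconds(cache_control, default=0):
    """Extract max-age from a Cache-Control value; 0 means 'do not reuse'."""
    value = (cache_control or "").lower()
    if "no-store" in value or "no-cache" in value:
        return 0
    match = re.search(r"(?:^|[,\s])max-age\s*=\s*\"?(\d+)", value)
    return int(match.group(1)) if match else default


def jwk_thumbprint(jwk):
    """RFC 7638 thumbprint: SHA-256 over the required members only, sorted,
    no whitespace. For OKP keys the required members are crv, kty and x."""
    required = {"crv": jwk["crv"], "kty": jwk["kty"], "x": jwk["x"]}
    canonical = json.dumps(required, separators=(",", ":"), sort_keys=True)
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def select_key(directory, keyid):
    """Return the JWK whose thumbprint equals keyid, or None."""
    for jwk in directory.get("keys", []):
        try:
            if jwk.get("kty") == "OKP" and jwk_thumbprint(jwk) == keyid:
                return jwk
        except KeyError:
            continue  # malformed entry: skip it rather than fail the lookup
    return None
```

Computing the thumbprint yourself, rather than trusting a kid field in the directory, means a key only matches if its actual public bytes match the keyid on the request.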

The last step is the verify itself: check the signature against the components named in Signature-Input. If it passes, the request was provably produced by the holder of that private key. If it fails, treat the request as forged.
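What actually gets verified is the RFC 9421 signature base: one line per covered component, then an @signature-params line, joined by newlines with no trailing newline. The sketch below builds that string for the two-component case shown earlier. It assumes the caller has already derived the component values (for @authority, the lowercased host) and leaves the ed25519 check itself to whatever crypto library your stack uses.

```python
def signature_base(components, signature_params):
    """Build the RFC 9421 signature base.

    components: ordered list of (component_name, value) pairs, in the same
        order as they appear in Signature-Input.
    signature_params: the inner list plus parameters exactly as received,
        i.e. everything after 'sig=' in the Signature-Input value.
    """
    lines = ['"%s": %s' % (name, value.strip()) for name, value in components]
    lines.append('"@signature-params": %s' % signature_params)
    # Joined with LF, no trailing newline.
    return "\n".join(lines).encode("ascii")


if __name__ == "__main__":
    base = signature_base(
        [("@authority", "yoursite.com"),
         ("signature-agent", 'g="https://agent.bot.goog"')],
        '("@authority" "signature-agent");created=1735689600'
        ';keyid="example-keyid";alg="ed25519";expires=1735693200'
        ';tag="web-bot-auth"',
    )
    print(base.decode("ascii"))
```

The bytes returned here, the decoded Signature value, and the public key selected by keyid are the three inputs to the library's verify call. If any covered component is reconstructed differently from how the bot serialized it, a perfectly valid signature will fail, so byte-for-byte fidelity matters more than cleverness.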

Figure: the four-step verification flow (request arrives, check Signature-Agent, fetch and cache directory, verify signature, allow or deny). Most of the cost is the directory cache. The cryptographic verify is microseconds.

The directory fetch is where the first production failure lives, and almost nobody hits it during testing. They hit it three weeks later. You ship the middleware. Verification passes in dev, passes in staging, passes for a week in prod. Then one morning your "signature invalid" counter spikes, signed Google-Agent traffic starts silently falling through to the unsigned legacy path, and nothing in your code changed. What changed is on the other side: the bot operator rotated their signing key. The new key is live in their directory. Your origin is still serving requests against the old cached copy, because you set a long cache TTL (or worse, cached the directory once on first sight and never re-read Cache-Control). Every request signed with the new key fails against your stale public key until the cache expires.

The maddening part is that it isn't a bug in your verify logic. Your verify logic is correct; it's faithfully rejecting a signature it can't validate against the key it has. The bug is the cache policy. The fix is unglamorous: treat Cache-Control as authoritative, refresh on expiry rather than on a fixed interval you picked, and reconcile the key set on every refresh so a rotated-out key leaves your cache when it leaves the directory. The same mismatch shows up in JWKS rotation everywhere; SeatGeek called out exactly this when their cache refresh interval drifted out of sync with the upstream rotation. (The opposite failure is just as easy: under-cache, and you refetch far too often, hammering an endpoint that told you how long to wait.) Don't over-cache, don't under-cache, and on a fetch failure with an expired cache, fall back to the unsigned path rather than blocking.
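That cache policy fits in a few lines. This is a sketch, not a drop-in: the fetch callable is hypothetical and is assumed to return the directory's keys plus the max-age it advertised, and returning None is my convention for "route this request down the unsigned path".

```python
import time


class DirectoryCache:
    """Cache key directories for exactly as long as the directory said to."""

    def __init__(self, fetch, clock=time.time):
        self._fetch = fetch      # hypothetical: url -> (keys_by_keyid, max_age_seconds)
        self._clock = clock
        self._entries = {}       # url -> (expires_at, keys_by_keyid)

    def keys_for(self, url):
        now = self._clock()
        entry = self._entries.get(url)
        if entry and now < entry[0]:
            return entry[1]      # still fresh: no network call
        try:
            keys, max_age = self._fetch(url)
        except Exception:
            # Expired cache and a failed fetch: don't verify against stale
            # keys and don't block. Signal the caller to use the legacy path.
            self._entries.pop(url, None)
            return None
        # Reconcile by replacing the whole set, so a rotated-out key leaves
        # the cache when it leaves the directory.
        self._entries[url] = (now + max_age, dict(keys))
        return self._entries[url][1]
```

Replacing the set wholesale, instead of merging new keys into the old ones, is the design choice that makes rotation boring.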

What changes for your AI-bot-block rules

Here's the central operator question. You wrote rules. The protocol changed under them. What do you do?

The reassuring part first. Rules that block by UA contains GPTBot, ClaudeBot, or PerplexityBot are unaffected. The request still arrives with a recognizable UA, and a spoofed GPTBot was always pretending to be the bot you wanted to block anyway.

The part that needs attention. Rules that allow by UA contains Googlebot are now under-specified; a spoofer with a Googlebot UA still passes them. The fix isn't to rewrite the rule overnight (the signed share is too small for that, more below), but to add a parallel rule path: verify the signature on signed Google traffic, treat the unsigned remainder with reverse-DNS verification. Cloudflare's verified-bots team summarized the gap:

"Existing identification methods rely on a combination of IP address range (which may be shared by other services, or change over time) and user-agent header (easily spoofable). These have limitations and deficiencies." — Cloudflare verified-bots team, July 2025

The two-stack model is the right mental picture. One ruleset handles signed traffic: verify the signature, check the keyid against a trusted set, validate the expires, route on the resulting verified identity. The other handles unsigned traffic, doing the legacy reverse-DNS plus UA plus IP work exactly as today. Don't delete the legacy rules; as of mid-2026, most of your real Google traffic still flows through them.

Verifying signatures at your origin (when you're not on Cloudflare)

If you sit behind Cloudflare, the work is small. Cloudflare validates signatures at the edge and exposes the result via cf.verified_bot_category in WAF Custom Rules and Transform Rules. Your rule becomes "if cf.verified_bot_category is the category you want, route accordingly," and the cryptography is somebody else's problem. (So is the key-rotation cache bug from earlier: Cloudflare manages the directory cache for you, which is a real reason to let the edge handle it if you already live there.)

If you don't sit behind a verifying CDN, you do the work at your origin: a small middleware in front of nginx or your app server. It intercepts requests carrying a Signature-Agent header, fetches the bot's .well-known directory on first sight (cached after), verifies per RFC 9421, and sets an internal X-Verified-Bot trust header your downstream rules can read.
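The shape of that middleware, sketched as a WSGI wrapper. Assumptions are loud here: your app server speaks WSGI, the verify callable is a hypothetical function that runs the full check and returns a bot name or None, and the detail I'd insist on is stripping any client-supplied copy of the trust header before doing anything else.

```python
TRUST_HEADER = "HTTP_X_VERIFIED_BOT"      # WSGI spelling of X-Verified-Bot
SIGNATURE_AGENT = "HTTP_SIGNATURE_AGENT"  # WSGI spelling of Signature-Agent


class WebBotAuthMiddleware:
    def __init__(self, app, verify):
        self.app = app
        self.verify = verify  # hypothetical: environ -> verified bot name, or None

    def __call__(self, environ, start_response):
        # A client must never be able to assert its own verification result.
        environ.pop(TRUST_HEADER, None)
        if SIGNATURE_AGENT in environ:
            bot = self.verify(environ)
            if bot:
                environ[TRUST_HEADER] = bot
        # Unsigned or unverified requests continue without the trust header,
        # so downstream rules apply the legacy checks to them.
        return self.app(environ, start_response)
```

Downstream code then reads one header and never needs to know RFC 9421 exists.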

The Cloudflare research team open-sourced the verifying pieces at cloudflareresearch/web-bot-auth. The Rust crate and TypeScript npm package (both named web-bot-auth) carry the verification logic, and the repo ships a Caddy plugin and Cloudflare Worker examples. None of these are audited (the README says so in plain language, and I'd take that seriously), but the verification surface is small, and the alternative is implementing RFC 9421 yourself.

Figure: decision tree for an incoming request, with five end-states: unsigned, signature invalid, expires passed, keyid unknown, and valid signature. Only the rightmost branch (valid signature, fresh keyid, within window) earns the verified-bot trust label.
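The same five end-states as a classifier, in case it's easier to reason about in code than in a diagram. Names and argument shapes are my assumptions: headers is a dict with lowercase keys, params holds the parsed keyid and expires, keys maps keyid to public key, and signature_ok is a hypothetical callable wrapping the actual cryptographic check.

```python
from enum import Enum


class Outcome(Enum):
    UNSIGNED = "unsigned"
    EXPIRED = "expires-passed"
    KEYID_UNKNOWN = "keyid-unknown"
    SIGNATURE_INVALID = "signature-invalid"
    VERIFIED = "verified"


def classify(headers, params, now, keys, signature_ok):
    if "signature-agent" not in headers:
        return Outcome.UNSIGNED            # falls through to the legacy path
    if now > params["expires"]:
        return Outcome.EXPIRED             # rejected before any key lookup
    key = keys.get(params["keyid"])
    if key is None:
        return Outcome.KEYID_UNKNOWN
    if not signature_ok(key):
        return Outcome.SIGNATURE_INVALID
    return Outcome.VERIFIED


def earns_trust_label(outcome):
    # Only a fully verified request gets the verified-bot label.
    return outcome is Outcome.VERIFIED
```

Keeping the outcome separate from the policy decision means you can log all five states long before you decide what to block.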

Either way, don't embed signature verification in business code. Keep it in the front-of-house tier, where it can be audited and replaced.

The reverse-DNS fallback isn't going away

The reason I keep returning to "don't delete the legacy rules" is that the signed share is still small. In May 2026 the Googlebot indexing crawler is not signing requests; only the AI-browsing Google-Agent signs, and for most sites that's a small slice of total Google traffic. Google says so plainly: the same crawler documentation that introduces Web Bot Auth tells operators to continue relying on IP addresses, reverse DNS, and user-agent strings alongside the new protocol. That isn't a hedge, it's the operating model. They run in parallel, and in 2026 the legacy stack carries more weight.

So keep both rails running and treat the cryptographic rail as additive. One anti-pattern to avoid: don't write a rule today that blocks unsigned Googlebot UA traffic. Googlebot proper doesn't sign yet, so that rule blocks the real crawler and you'll de-index yourself within a crawl cycle.
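For completeness, the legacy rail that stays in place is the forward-confirmed reverse DNS check. A minimal sketch follows; the hostname suffixes reflect Google's crawler-verification documentation as I understand it, so confirm them against the current docs. The default resolvers are IPv4-only standard-library calls, and the injectable reverse and forward parameters are my addition for testing.

```python
import socket

# Assumption: suffixes from Google's crawler verification docs; re-check them.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com", ".googleusercontent.com")


def is_forward_confirmed(ip, reverse=None, forward=None, suffixes=GOOGLE_SUFFIXES):
    """True only if the PTR name is in an expected domain AND that name
    resolves back to the same IP."""
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])  # IPv4 only
    try:
        host = reverse(ip).rstrip(".").lower()
        if not host.endswith(suffixes):
            return False
        # The forward lookup is what defeats a forged PTR record.
        return ip in forward(host)
    except OSError:
        # Lookup failure: treat as unverified rather than raising.
        return False
```

The leading dot in each suffix matters: without it, a hostname like fakegooglebot.com would pass the first check.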

What to actually do this quarter

Four-item checklist. None of these requires a vendor.

Start by inventorying your existing bot rules. Tag each one by what it actually verifies (UA, reverse DNS, IP range, or signature) and clean up the duplicate or stale rules most audits surface before adding new ones.

Then add a signature-verification path. On Cloudflare, enable the verified-bots edge validation and branch one rule on cf.verified_bot_category. At your own origin, install the WBA middleware, point it at agent.bot.goog, and surface a trust header your existing rules can read. Set the directory cache to honor Cache-Control from day one; that's the line that saves you from the rotation outage later.

Keep the reverse-DNS path for the much larger pool of unsigned Google-claiming traffic. Don't tighten it, don't replace it, run it alongside the signature path.

Last, schedule the quarterly audit: signed share of Google-claiming traffic, signed share of AI-agent traffic, and the percentage of spoofers the signature path caught that the legacy rules missed. The numbers move slowly through 2026 and, if adoption follows the usual curve, faster through 2027 (though I'd hold that loosely, since the rollout pace is Google's call, not mine).

If your team also runs broader AI-crawler policy, the AI crawler playbook and the piece on disabling Cloudflare's AI-bot block are companion reads on the allow/block side. This one is the identity side.

What this does NOT solve

Web Bot Auth is identity, not authorization. The signature attests that a request was produced by a specific bot; it says nothing about whether that bot is allowed to read the URL. A verified, signed Google-Agent can still scrape your paid content if your rules let it through. The signature buys trust in the source claim; the policy still belongs to you.

It has no opinion on robots.txt either. A signed bot that ignores robots.txt is still a robots violator; signing doesn't grant additional access. If you want the signed AI-browsing agent to skip your paid section, you tell it so via robots and enforce it in your rules. And it doesn't decide between AI-search routing and traditional search routing for you: Web Bot Auth tells you "this is really Google-Agent," but whether Google-Agent gets the same content treatment as Googlebot is your policy call. The piece on optimizing for Perplexity, ChatGPT search, and Google AI Mode covers that routing side.

The honest assessment: this is one rail in a three-rail stack. Signature for identity, reverse DNS for legacy verification, robots policy for authorization. Web Bot Auth hardens the first of those and leaves the other two exactly where they were. The operators I see succeed in 2026 treat all three as load-bearing.

The audit cadence as adoption grows

The quarterly job is the signed share itself: pull a sample of Google-claiming traffic from access logs, bucket it by signed and unsigned, and watch the trend. I want to be honest about the thresholds, because there's no published benchmark for when the signed share becomes the majority of real Google traffic, so any number I'd quote would be one I made up. Track your own logs instead. When the signed bucket stops being a rounding error, the verification path starts pulling real weight; when it becomes the clear majority, the unsigned Googlebot UA rule deserves a genuine hardness review, since the spoofers will be concentrated in that shrinking unsigned minority. You'll see that crossover in your own logs before any blog post tells you it happened.
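The bucketing itself is a few lines once your logs are parsed. A sketch, assuming each record is a dict with hypothetical user_agent and signature_agent fields (your log format will differ). "Signed" here only means the request carried a Signature-Agent header, not that it verified.

```python
from collections import Counter

GOOGLE_MARKERS = ("Googlebot", "Google-Agent")


def signed_share(records):
    """Bucket Google-claiming requests by whether they carried a signature."""
    counts = Counter()
    for rec in records:
        ua = rec.get("user_agent", "")
        if not any(marker in ua for marker in GOOGLE_MARKERS):
            continue  # not Google-claiming traffic
        counts["signed" if rec.get("signature_agent") else "unsigned"] += 1
    total = counts["signed"] + counts["unsigned"]
    return {
        "signed": counts["signed"],
        "unsigned": counts["unsigned"],
        "signed_share": counts["signed"] / total if total else 0.0,
    }
```

Run it on the same sample window each quarter so the trend line compares like with like.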

The longer-cycle job is re-reading the spec. Re-run the rule inventory every two quarters and re-read Google's crawler documentation when you do, because the directory shape and the covered-components list are the two pieces most likely to shift, and a stale assumption about either is how a working verifier quietly stops verifying. For the foundational Googlebot context, our explainer on what Googlebot is is the right starting point; for the agent-traffic side, how to build an agent-friendly website walks the integration picture.

If you'd rather see which crawlers are actually hitting your site before you write a single rule, run a free SEO audit. It surfaces the bot traffic and verification gaps you'd otherwise only find by grepping access logs at 7am.

FAQ

Is Googlebot itself signing requests yet? Not as of mid-2026. Google's current Web Bot Auth rollout covers the Google-Agent AI-browsing agent; Googlebot, the indexing crawler that drives traditional organic traffic, still authenticates via reverse DNS and the documented Googlebot IP ranges. Plan your rules to treat them separately.

Does Web Bot Auth replace robots.txt? No. They answer different questions. Web Bot Auth attests "this request is really from Google." Robots.txt declares "this URL is or isn't allowed for crawling." Both still apply, and a signed bot that ignores robots is still a robots violator.

What signature algorithm does Web Bot Auth use? RFC 9421 supports several. Cloudflare's documented examples and Google's published directory both use ed25519 (EdDSA over Curve25519). Your verifier needs an ed25519 implementation; that's a single library call in most stacks (Go, Rust, Node, Python all have it).

What happens if Google's public key directory is unreachable? If your cache is fresh, verification continues against the cached keys. If the cache is expired and the fetch fails, fall back to the unsigned-traffic path (reverse DNS, UA, IP); don't block on a transient cache miss, because that's how you accidentally de-index yourself when Google's directory has a hiccup.

Should I drop reverse-DNS verification for Googlebot? Not yet, and probably not in 2026. Reverse DNS is your real defense against UA spoofers claiming to be Googlebot, because Googlebot proper is still unsigned. Re-evaluate quarterly; the right time to tighten the unsigned path is when it's the minority of your real Google traffic, not the majority.