TL;DR: If your site has fewer than 10,000 pages, crawl budget is almost certainly not your problem. Stop optimizing for it. But if you run an ecommerce store with 500K product pages, a classifieds site with infinite URL parameters, or anything with faceted navigation — crawl budget is quietly killing your indexing. This guide covers how to diagnose whether you actually have a crawl budget problem, and how to fix it if you do. The answer is usually boring: faster servers, cleaner URLs, better robots.txt.

Your actual crawl budget is the smaller of these two. If Google really wants to crawl 50,000 of your pages today (high demand), but your server can only handle 5,000 fetches without degrading (low rate limit), you get 5,000. If your server can handle 100,000 fetches but Google only cares about 2,000 of your pages (low demand), you get 2,000.
This is the part that most guides get wrong. They treat crawl budget like a fixed pool you need to "save" by blocking unimportant pages. In reality, it's dynamic, it changes daily, and for most sites, it's not the bottleneck at all.
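The min-of-two-constraints idea can be made concrete with a toy sketch. Nothing here reflects real Googlebot internals — the numbers just mirror the two scenarios above:

```python
def effective_crawl_budget(crawl_demand: int, crawl_rate_limit: int) -> int:
    """Toy model: the realized budget is whichever constraint is lower."""
    return min(crawl_demand, crawl_rate_limit)

# High demand, rate-limited server:
print(effective_crawl_budget(50_000, 5_000))    # 5000
# Fast server, low demand:
print(effective_crawl_budget(2_000, 100_000))   # 2000
```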
I need to say this clearly because I've watched agencies sell crawl budget optimization to sites with 200 pages.
If your site has fewer than about 10,000 unique URLs, crawl budget optimization is almost certainly a waste of your time.
Gary Illyes has said this himself, multiple times, including at Google I/O and on Twitter. His exact framing: "If your site has a few thousand URLs, most of the time it will be crawled efficiently." Martin Splitt, Google's Developer Advocate, echoed this in a JavaScript SEO office hours episode when he said that crawl budget only becomes a real concern "once you get into the tens of thousands of pages or more."
Google crawls billions of pages per day. Your 500-page WordPress blog is a rounding error. Google will crawl all of it within days of any change, without you doing anything special.
Where crawl budget actually matters:
If none of those describe you, skip to the FAQ section at the bottom and move on with your life. I'm serious. Spend your time on content quality and internal linking instead. I still think this is something that 90% of people doing "crawl budget optimization" should hear: your problem is probably somewhere else entirely.

2. Check your indexing gap. Compare the number of pages in your sitemap against the number of indexed pages in GSC's "Pages" report. If you have 100,000 URLs in your sitemap but only 40,000 indexed, something is consuming your crawl budget before it gets to the pages that matter.
3. Look at server logs. This is the real diagnostic. GSC gives you aggregated data. Server logs give you the truth — every single request Googlebot made, when, to what URL, and what response it got. If you see Googlebot spending 60% of its crawl on paginated archive pages or filtered URLs, that's your problem, in black and white.
I'll be honest about a limitation here: I'm not confident that the GSC crawl stats report is always accurate. We've seen discrepancies between what GSC reports and what our customers' server logs show. Sometimes significant discrepancies — 30-40% gaps. I don't know if that's a sampling issue on Google's side, a caching artifact, or something else. So I always recommend verifying with server logs if the stakes are high.
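For the indexing-gap check in step 2, I prefer counting sitemap URLs directly rather than trusting whatever total the CMS reports. A minimal sketch — the indexed count still has to come from GSC's "Pages" report by hand, and the inline sitemap is just a stand-in:

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def count_sitemap_urls(xml_text: str) -> int:
    """Count <url><loc> entries in a single (non-index) sitemap."""
    root = ET.fromstring(xml_text)
    return len(root.findall("sm:url/sm:loc", NS))

sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/shoes</loc></url>
</urlset>"""

submitted = count_sitemap_urls(sitemap)
indexed = 1  # manual input from GSC's "Pages" report
print(f"Indexing gap: {1 - indexed / submitted:.0%}")  # Indexing gap: 50%
```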
| Diagnostic Signal | Healthy | Warning | Critical |
|---|---|---|---|
| New page indexed within | 1-3 days | 1-2 weeks | 4+ weeks or never |
| Share of pages crawled per week (GSC crawl stats ÷ total pages) | > 50% | 10-50% | < 10% |
| Average server response time | < 200ms | 200-500ms | > 500ms |
| % of crawl on non-indexable URLs | < 10% | 10-30% | > 30% |
| Redirect chains in crawl | None | < 5% of requests | > 5% hit chains |
| 5xx error rate during crawl | 0% | < 1% | > 1% |
Note: These thresholds are experience-based guidelines drawn from patterns across SEOJuice customer data, not official figures published by Google. Your mileage may vary depending on site size, niche, and server architecture.
If most of your signals are in the "Healthy" column, you don't have a crawl budget problem. Go optimize something else.
This is the crawl budget factor that has the highest impact and gets the least attention. Everyone wants to talk about robots.txt and sitemaps. Nobody wants to talk about why their server takes 1.2 seconds to respond to a simple HTML request.
Googlebot is polite. It monitors your server's response time in real time. If your server starts slowing down, Googlebot reduces its crawl rate to avoid overloading you. This is the crawl rate limit in action. A server that responds in 100ms will get crawled dramatically more than one that responds in 800ms.
"If the site is really fast, Googlebot will be able to use more connections and crawl the site faster. If the site slows down or responds with server errors, it will slow down and crawl less."
— Gary Illyes, Senior Search Analyst, Google (Google Developers Blog)
That's a direct quote from the official crawl budget blog post. "Really fast" to Google means sub-200ms time to first byte (TTFB). Not page load time — TTFB. The time it takes your server to start sending the HTML response.
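For a quick spot check of your own TTFB, curl's `%{time_starttransfer}` is the usual tool; here's a rough Python equivalent I keep in scripts. It measures one request from one location — not what Googlebot sees — and `example.com` is a stand-in URL:

```python
import time
import urllib.request

def ttfb_ms(url: str, timeout: float = 10.0) -> float:
    """Rough time-to-first-byte: elapsed milliseconds until the first
    body byte arrives. Good enough for spot checks, not real monitoring."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read(1)  # block until the response body starts arriving
    return (time.perf_counter() - start) * 1000

# Example (requires network access):
# print(f"{ttfb_ms('https://example.com/'):.0f} ms")
```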
Quick wins for server response time:
On one SEOJuice customer's site (a furniture ecommerce store, roughly 80K product pages), we watched their crawl rate in GSC drop from 15,000 requests/day to 3,000 over two weeks. No changes to content or structure. The cause? Their hosting provider migrated them to a new server cluster and TTFB went from 180ms to 900ms. Once they fixed the hosting, crawl rate recovered within four days. No robots.txt changes. No sitemap updates. Just faster servers.
URL parameters are the single most common source of crawl waste. And the problem is insidious because you often don't know it's happening.
Consider an ecommerce site with filtering. A user browses shoes and selects: size 10, color black, brand Nike, sorted by price, page 2. That's a URL like:
/shoes?size=10&color=black&brand=nike&sort=price&page=2
Now multiply that by every possible combination. 8 sizes, 12 colors, 40 brands, 4 sort options, 50 pages of results. That's 8 × 12 × 40 × 4 × 50 = 768,000 URLs. From one category page. And the content on most of those pages overlaps significantly — size 10 black Nike shoes sorted by price is mostly the same products as size 10 black Nike shoes sorted by newest.
Googlebot doesn't know that. It sees 768,000 unique URLs and starts crawling them. Your actual product pages — the ones that should rank — sit in a queue behind hundreds of thousands of filtered variations that nobody will ever search for.
This is what people mean by "faceted navigation creating crawl traps." It's not that Google gets stuck in an infinite loop (though that can happen with certain pagination setups). It's that Google allocates its limited crawl budget to URLs that provide no unique value.
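You can reproduce the explosion in a few lines of Python. The facet values below are made up, but the counts match the example above:

```python
from itertools import islice, product

# Hypothetical facet values matching the math above: 8 x 12 x 40 x 4 x 50
sizes  = [f"size={n}" for n in range(36, 44)]    # 8 sizes
colors = [f"color=c{i}" for i in range(12)]      # 12 colors
brands = [f"brand=b{i}" for i in range(40)]      # 40 brands
sorts  = [f"sort=s{i}" for i in range(4)]        # 4 sort orders
pages  = [f"page={i}" for i in range(1, 51)]     # 50 result pages

total = len(sizes) * len(colors) * len(brands) * len(sorts) * len(pages)
print(total)  # 768000 crawlable URLs from one category

# A taste of what Googlebot actually discovers:
for combo in islice(product(sizes, colors, brands, sorts, pages), 2):
    print("/shoes?" + "&".join(combo))
```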
I want to be precise about something here: the URL parameter tool in Google Search Console was deprecated and removed in 2022. Google used to let you tell it which parameters to ignore. That option is gone. You now have three tools to handle this:
Each has tradeoffs. I'll cover robots.txt and canonicals in their own sections below.
Your robots.txt is the first file Googlebot checks before crawling your site. It's also the most misunderstood file in SEO. People either leave it empty (missing an opportunity) or go overboard (blocking things they shouldn't).
Here's the key principle: block things that waste crawl budget, not things that are "unimportant." There's a difference. An "unimportant" page might still need to be indexed. A page that wastes crawl budget is one that provides no unique value to search and exists in thousands of parameter variations.
# Block faceted navigation parameters
User-agent: *
Disallow: /*?sort=
Disallow: /*?color=
Disallow: /*?size=
Disallow: /*&sort=
Disallow: /*&color=
Disallow: /*&size=
# Block internal search results
Disallow: /search?
Disallow: /search/
# Block session-based URLs
Disallow: /*?sessionid=
Disallow: /*?ref=
# Block admin, cart, and account pages
Disallow: /admin/
Disallow: /cart/
Disallow: /my-account/
Disallow: /checkout/
# Block print and PDF versions
Disallow: /*?print=
Disallow: /*?format=pdf
# DO NOT block CSS, JS, or images
# Googlebot needs these to render your pages
Allow: /*.css
Allow: /*.js
Allow: /*.jpg
Allow: /*.png
Allow: /*.webp
Sitemap: https://example.com/sitemap.xml
Critical mistakes I've seen:
- Blocking /products/ because they want to block /products?filter=, accidentally cutting Googlebot off from their entire catalog.

That last point is worth repeating because it trips up experienced SEOs too. Robots.txt blocks crawling. It does not block indexing. If you want to prevent indexing, use <meta name="robots" content="noindex"> or an X-Robots-Tag HTTP header. But remember: for Google to see a noindex tag, it first has to crawl the page. So if you block crawling with robots.txt AND add noindex, Google will never see the noindex tag. This creates a paradox that has confused people for years.
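Because one careless Disallow can take out a whole catalog, I test patterns before deploying. The sketch below implements Google-style wildcard matching for Disallow rules (`*` matches any run of characters, a trailing `$` anchors the end). It deliberately ignores Allow rules and longest-match precedence, so treat it as a smoke test, not a full robots.txt evaluator:

```python
import re

def robots_pattern(rule: str) -> "re.Pattern[str]":
    """Compile one robots.txt path rule into a regex: * becomes .*,
    a trailing $ anchors the match, everything else is literal."""
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    return re.compile("^" + regex + ("$" if anchored else ""))

def is_blocked(url_path: str, disallow_rules: list) -> bool:
    return any(robots_pattern(r).match(url_path) for r in disallow_rules)

rules = ["/*?sort=", "/*&sort=", "/search/", "/admin/"]

# Blocks what we intend...
assert is_blocked("/shoes?sort=price", rules)
assert is_blocked("/shoes?color=black&sort=price", rules)
# ...and nothing more: the catalog stays crawlable.
assert not is_blocked("/shoes", rules)
assert not is_blocked("/products", rules)
print("patterns behave as intended")
```

Note that Python's built-in `urllib.robotparser` does plain prefix matching and ignores mid-path wildcards, which is exactly why a Google-style matcher is worth having for patterns like `/*?sort=`.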
A sitemap doesn't guarantee indexing. It doesn't even guarantee crawling. What it does is give Google a hint about which URLs exist, when they were last modified, and (debatably) how important they are relative to each other.
The mistakes people make with sitemaps are almost always about including too much, not too little.
What to include in your sitemap:
- <lastmod> dates — not the current date, not the same date on every page, but the actual last modification date

What to exclude from your sitemap:
I've seen sitemaps with 500,000 URLs where only 80,000 were actually indexable. The other 420,000 were redirects, noindexed pages, parameter variations, and broken URLs. That sitemap isn't helping Google — it's sending it on a scavenger hunt where 84% of the treasure map is wrong.
Martin Splitt has called <lastmod> "one of the most abused tags in sitemaps" because so many CMS platforms set it to the current date on every page. When every page says "I was just modified," Google learns to ignore the signal entirely. If your CMS doesn't track real modification dates, fix that before worrying about anything else sitemap-related.
I'm giving faceted navigation its own section because it's the intersection of crawl budget, duplicate content, and technical architecture — and getting it wrong can tank a site's SEO silently over months.
The problem: faceted navigation (filters on ecommerce, classifieds, job boards) generates exponential URL combinations. We covered the math above. But the solution isn't as simple as "block everything with robots.txt" because some faceted pages have genuine search value.
Think about it: "Nike running shoes size 10" is a real search query that a real person types into Google. A faceted page matching that query could rank for it. Blocking all faceted URLs means you lose that opportunity.
The framework I recommend (and what we implement for SEOJuice customers who have this problem):
| Facet Type | Example | Search Value | Recommended Approach |
|---|---|---|---|
| Category + Brand | /shoes/nike/ | High — people search for brand+category | Index, include in sitemap, use clean URL |
| Category + 1 filter | /shoes?color=black | Medium — depends on search volume | Check search volume. Index if >100 monthly searches, canonical to parent otherwise |
| Category + 2+ filters | /shoes?color=black&size=10 | Low — too specific for most searches | Canonical to the single most relevant filter or parent category |
| Sort variations | /shoes?sort=price-asc | None — nobody searches for "shoes sorted by price" | Block with robots.txt or noindex |
| Pagination deep pages | /shoes?page=47 | None beyond page 2-3 | Noindex after page 3-5, keep crawlable |
| Session/tracking params | /shoes?utm_source=email | None | Block with robots.txt, strip at server level |
The canonical tag implementation for multi-filter pages looks like this:
<!-- On /shoes?color=black&size=10&sort=price -->
<link rel="canonical" href="https://example.com/shoes?color=black" />
<!-- On /shoes?sort=price -->
<link rel="canonical" href="https://example.com/shoes" />
<!-- On /shoes (the clean category page) -->
<link rel="canonical" href="https://example.com/shoes" />
One mistake I've made and haven't fully resolved: what to do with faceted pages that have accumulated backlinks. A customer had thousands of external links pointing to filtered URLs. Canonicalizing them to the parent should flow equity upward — sounds fine in theory.
In practice, we saw a 15% drop in the parent page's rankings after implementing canonicals. I still don't know why. My best guess is the sudden consolidation of thousands of signals confused Google's evaluation, but that's speculation. We rolled back canonicals on the most-linked filtered pages and left them indexable. It's a compromise I'm not comfortable with.
Short version: rel="next" and rel="prev" are deprecated. Google confirmed in 2019 that they hadn't been using the signal for years. So what do you do instead?
Three options, ranked by my preference:
Option 1: Load-more or infinite scroll with pushState. This is the cleanest approach for new sites. Content loads in place while pushState updates the URL, so users stay on a single page and Google can still reach each chunk of content, with no thin pagination templates to waste crawl budget on. But it requires JavaScript, which introduces its own crawl budget costs (more on that below).
Option 2: Traditional pagination with noindex on page 2+. Keep the paginated URLs crawlable (so Google can discover the products/articles linked from them) but noindex them so Google doesn't try to index identical template pages. The canonical on each paginated page should be self-referencing — don't canonical all pages to page 1, because the content is different.
Option 3: View-all page. If your paginated content totals fewer than ~200 items, consider a single view-all page that canonicalizes the paginated series. Google has historically preferred view-all pages. The downside: page load time. If your view-all page takes 8 seconds to load, it hurts more than it helps.
<!-- Page 2 of blog archive -->
<meta name="robots" content="noindex, follow">
<link rel="canonical" href="https://example.com/blog/page/2" />
<!-- Important: use "noindex, follow" — not "noindex, nofollow"
You want Google to follow the links on paginated pages
to discover the actual content pages -->
Note the follow directive. This is crucial. You don't want the paginated page in the index, but you absolutely want Google to follow the links on it to find your actual content. Using nofollow here would orphan every article or product only linked from page 2+ of your archive.
This section is relevant to anyone running a JavaScript-heavy site (React, Vue, Angular, Next.js without proper SSR). If your site is traditional server-rendered HTML, skip ahead.
Google crawls in two waves. First wave: it downloads and processes the raw HTML. Second wave: it renders the page with a headless Chromium browser to execute JavaScript and see the final content. The second wave happens later — sometimes hours later, sometimes days.
Martin Splitt has explained this extensively in his JavaScript SEO office hours. The key insight: rendering is expensive for Google. It takes more resources than a simple HTML fetch. Google has to spin up a Chromium instance, execute your JavaScript, wait for API calls to resolve, and then process the rendered DOM. This means JavaScript-dependent pages get crawled less efficiently than server-rendered pages.
The crawl budget impact:
The fix: server-side rendering (SSR) or static generation (SSG). Next.js, Nuxt, SvelteKit all support this. If you can't do full SSR, use dynamic rendering: serve pre-rendered HTML to Googlebot and the full JS experience to users. Google technically discourages it, but as of early 2026 it works in practice. We've covered the SPA-specific challenges in our guide to SPA SEO best practices.

What to look for in your logs:
Tooling: Screaming Frog Log Analyzer is the most popular dedicated tool. You can also parse logs with command-line tools (grep for Googlebot user agent, pipe through awk). For ongoing monitoring, SEOJuice's crawler analytics feature processes server logs automatically and flags crawl budget issues.
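If you'd rather script it than buy a tool, here's the shape of the analysis in Python. It assumes Apache/Nginx combined log format and filters by user-agent string only; in production you'd also verify Googlebot via reverse DNS, since user agents are trivially spoofed:

```python
import re
from collections import Counter

REQUEST = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP[^"]*"')

def googlebot_path_counts(lines):
    """Tally Googlebot requests by top-level path segment, so crawl
    sinks like /wp-json/ or /feed/ surface immediately."""
    counts = Counter()
    for line in lines:
        if "Googlebot" not in line:
            continue
        m = REQUEST.search(line)
        if m:
            path = m.group(1).split("?")[0]
            top = "/" + path.lstrip("/").split("/")[0]
            counts[top] += 1
    return counts

sample = [  # made-up log lines for illustration
    '1.2.3.4 - - [10/Jan/2026:10:00:00 +0000] "GET /wp-json/wp/v2/posts HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
    '1.2.3.4 - - [10/Jan/2026:10:00:01 +0000] "GET /blog/my-post HTTP/1.1" 200 9000 "-" "Googlebot/2.1"',
    '5.6.7.8 - - [10/Jan/2026:10:00:02 +0000] "GET /blog/other HTTP/1.1" 200 9000 "-" "Mozilla/5.0"',
]
print(googlebot_path_counts(sample).most_common())
```

Feed it a real access log line by line and the top buckets tell you where Googlebot's time is actually going.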
A real example from our data: one client had a WordPress site with ~25,000 published posts. Good content, decent traffic. But their server logs showed Googlebot spending 40% of its crawl budget on /wp-json/ API endpoints and /feed/ URLs. These aren't pages anyone searches for. Adding two lines to robots.txt freed up that 40% for actual content pages. Within three weeks, their crawl rate on article pages increased by 60%, and they saw 12 new pages indexed that had been sitting in the "Discovered — currently not indexed" purgatory for months.
# WordPress-specific crawl budget savings
User-agent: *
Disallow: /wp-json/
Disallow: /feed/
Disallow: /comments/feed/
Disallow: /author/*/feed/
Disallow: /category/*/feed/
Disallow: /tag/*/feed/
Disallow: /*?replytocom=
Disallow: /*&replytocom=
Disallow: /trackback/
Internal links are the strongest signal you control for telling Google which pages matter. Not meta tags. Not sitemap priority values (which Google ignores anyway). Internal links.
Every link is a vote. A page linked from your homepage, your navigation, and 50 other pages will get crawled more frequently than a page linked from one archive page three clicks deep. This is basic PageRank distribution, and it's how Googlebot decides where to spend its crawl budget.
The practical implication: if pages aren't getting indexed and they're 3+ clicks from your homepage, adding internal links from well-connected pages increases their crawl frequency. We've seen this consistently — pages that go from 0-1 internal links to 5+ get their first Googlebot visit within 48 hours, even on large sites where new pages typically wait weeks.
This is why we built automatic internal linking into SEOJuice. Manually managing internal links across 10,000+ pages is impossible. But the crawl priority benefit makes it one of the highest-ROI technical SEO activities.
Flat architecture (every page reachable in 3 clicks or fewer) is better for crawl budget than deep hierarchies. But "flat" doesn't mean "link everything to everything." It means strategic linking — topic clusters, hub pages, contextual links — that creates clear crawl paths to your most important pages.
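Click depth is easy to compute once you have the internal link graph, which any crawler export gives you. A toy sketch with a made-up site graph:

```python
from collections import deque

def click_depths(links: dict, start: str = "/") -> dict:
    """Breadth-first search from the homepage: each page's depth is the
    minimum number of clicks needed to reach it."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

site = {  # hypothetical link graph
    "/": ["/shoes/", "/blog/"],
    "/shoes/": ["/shoes/nike-air/", "/shoes/page/2/"],
    "/shoes/page/2/": ["/shoes/old-model/"],
}
for url, depth in sorted(click_depths(site).items(), key=lambda kv: kv[1]):
    print(depth, url)
# /shoes/old-model/ ends up 3 clicks deep: a prime candidate for new internal links
```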
I want to be specific about what we actually do, rather than vague marketing-speak.
SEOJuice tracks crawl behavior through three inputs: GSC crawl stats (via the Search Console API), our own crawl data (we crawl customer sites to build the internal link graph), and optionally, server log integration.
What we surface:
We don't do full log file analysis in-product yet (parsing arbitrary server log formats at scale is an engineering challenge I haven't cracked to my satisfaction). But GSC data plus our own crawler gives most sites enough to identify and fix crawl budget issues without touching a log file.
If you've read this far and determined you actually have a crawl budget problem, here's what to fix, in order of impact:
Steps 1-3 solve the problem for the majority of sites that actually need crawl budget optimization. Steps 4-9 are for the rest — sites with complex architectures where the basics aren't enough.
No. Crawl budget affects whether your pages get crawled and indexed, not how they rank once indexed. But a page that never gets crawled never gets indexed, and a page that never gets indexed can't rank. So crawl budget is a prerequisite, not a ranking factor. Optimizing it won't improve rankings for already-indexed pages — it helps pages that aren't getting indexed at all.
Almost never. Google deprecated the crawl rate limiter setting in Search Console in early 2024, and even while it existed it could only reduce Googlebot's crawl rate, never increase it. The only legitimate reason to throttle crawling is a server that Googlebot is literally overwhelming, and the supported way to signal that now is returning 429 or 503 responses. I've seen people reduce the rate thinking it would "focus" Google on important pages. It doesn't work that way — Google just crawls less of everything.
Popular, frequently-updated pages: multiple times per day. Static pages unchanged for months: every few weeks. The average is roughly every 1-2 weeks per page, but varies enormously. News sites get crawled within minutes. A rarely-updated "About" page might go a month between visits. The <lastmod> tag can hint that a page changed, but only if you use it accurately — Google ignores it if it's always set to today's date.
You can increase your crawl rate limit (faster server) and reduce crawl waste (so more budget goes to important pages). But you can't directly tell Google to crawl you more. Crawl demand is Google's decision based on perceived content value. The best indirect approach: publish frequently, build backlinks, make content genuinely useful. High-value sites get crawled more aggressively, automatically.
Yes. Google still crawls the page to see the noindex tag. 100,000 noindexed pages means 100,000 crawl budget hits (albeit less frequently than indexed pages). If those pages truly never need indexing, robots.txt is more crawl-efficient — but robots.txt blocks prevent Google from seeing anything on the page, including links. Use noindex+follow when you want link discovery but not indexing. Use robots.txt when you don't want the page crawled at all.
This article is part of our technical SEO series. If you're working through crawl budget issues, these related guides will help:
If you're running a large site and want automated crawl budget monitoring, SEOJuice tracks crawl waste, indexing velocity, and server response time across all your pages. It won't replace log file analysis for complex architectures, but it surfaces the majority of crawl budget issues that matter — continuously, not just when someone remembers to run an audit. Start a free trial (no credit card required) and see where your crawl budget is going within minutes.