Indexation Limits for Decoupled Sites
Decoupled architectures give you precise control over which URLs exist, but that same flexibility means no CMS-level guard prevents thousands of low-value routes from reaching the index. Without explicit scope management, crawl budget drains on parameterized endpoints, pagination tails, and orphaned CMS entries before Googlebot reaches your money pages.
Prerequisites
Before working through any configuration here, confirm the following are in place:
- Framework version: Next.js 14+ (App Router), SvelteKit 2.x, or Nuxt 3.x; middleware APIs differ significantly in earlier releases
- CMS access: API access to query published/draft status flags per entry
- Environment variables: SITE_URL, CMS_API_KEY, NEXT_PUBLIC_BASE_URL (or framework equivalent) set in both local and CI environments
- Tooling: curl (for header inspection), xmllint (sitemap validation), Google Search Console verified and receiving data
- GSC baseline: at least 7 days of Crawl Stats data so you can measure before/after changes to bot throughput
How Indexation Scope Breaks Down in Decoupled Architectures
The diagram below maps the three layers where indexation decisions happen (CMS, framework, and CDN/edge) and shows which signals Googlebot reads at each layer.
This three-layer model is the foundation for every configuration below. Crawl budget in headless deployments is the metric that ties all three layers together: changes at any layer register in GSC Crawl Stats within 1-2 weeks.
Step-by-Step Implementation Workflow
Step 1: Map CMS content types to route patterns
Query your CMS API for every content type and filter to published status before passing slugs to the framework router. This is the single most effective gate: routes that are never generated cannot be crawled.
// lib/cms.ts: fetch only published slugs
export async function getPublishedSlugs(contentType: string): Promise<string[]> {
  const res = await fetch(
    `${process.env.CMS_API_URL}/entries?type=${encodeURIComponent(contentType)}&status=published&fields=slug`,
    { headers: { Authorization: `Bearer ${process.env.CMS_API_KEY}` } }
  );
  // Fail the build rather than silently generating zero routes
  if (!res.ok) {
    throw new Error(`CMS request failed: ${res.status}`);
  }
  const data = await res.json();
  return data.items.map((item: { slug: string }) => item.slug);
}
Validation: Run npm run build and count route directories in .next/server/app/ (Next.js) or build/ (SvelteKit). The total for each content type should match your published entry count and never exceed it.
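Counting directories by hand gets tedious once a site has more than a few content types. The following is a minimal sketch of an automated comparison, assuming you have already collected the built route paths into an array. The function name findUnexpectedRoutes and the /blog/ prefix are illustrative, not part of any framework API.

```typescript
// Hypothetical helper: list built routes under a prefix that have no
// matching published slug. An empty result means the build is in scope.
function findUnexpectedRoutes(
  builtRoutes: string[],
  publishedSlugs: string[],
  prefix = '/blog/', // assumed route prefix for this content type
): string[] {
  const expected = new Set(publishedSlugs.map((slug) => `${prefix}${slug}`));
  return builtRoutes.filter(
    (route) => route.startsWith(prefix) && !expected.has(route),
  );
}

// Example: '/blog/draft-x' was built but is not published.
const unexpected = findUnexpectedRoutes(
  ['/blog/a', '/blog/draft-x', '/about'],
  ['a', 'b'],
);
console.log(unexpected);
```

Routes outside the prefix (such as /about here) are ignored, so the check can run once per content type.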
Step 2: Cap pagination depth in static generation
Unbounded pagination generates O(n) routes from a single content type. Define a hard maximum per content type before the build phase. The ISR vs SSG vs CSR Routing pattern you choose affects whether this cap is enforced at build time or request time.
// app/blog/page/[page]/page.tsx (Next.js App Router)
export async function generateStaticParams() {
  const MAX_STATIC_PAGES = 50;
  return Array.from({ length: MAX_STATIC_PAGES }, (_, i) => ({
    page: String(i + 1),
  }));
}

export const dynamicParams = false; // Return 404 beyond page 50
Validation: Request /blog/page/51 and confirm a clean 404 response with curl -o /dev/null -sw "%{http_code}" https://example.com/blog/page/51.
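The example above always emits 50 pages, even when the content type has fewer posts, which can produce empty pagination routes. One possible refinement is to derive the page count from the post total and then apply the cap. This is a sketch only: POSTS_PER_PAGE and the staticPageParams helper are assumptions, and the post total would come from your CMS.

```typescript
const MAX_STATIC_PAGES = 50; // same cap as the example above
const POSTS_PER_PAGE = 10; // assumed page size

// Hypothetical helper: build params for only the pages that have content,
// never exceeding the hard cap.
function staticPageParams(totalPosts: number): { page: string }[] {
  const needed = Math.max(1, Math.ceil(totalPosts / POSTS_PER_PAGE));
  const count = Math.min(needed, MAX_STATIC_PAGES);
  return Array.from({ length: count }, (_, i) => ({ page: String(i + 1) }));
}

console.log(staticPageParams(35).length); // 35 posts at 10 per page
console.log(staticPageParams(10000).length); // capped
```

Inside generateStaticParams you would return staticPageParams(total) after fetching the total, keeping dynamicParams = false so anything beyond the generated set still returns 404.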
Step 3: Inject X-Robots-Tag for parameterized URLs
Query parameters for filtering, sorting, and session state create URL variants with no unique indexable value. Suppress them at the middleware level, not inside page components, so the header fires regardless of rendering strategy.
Step 4: Emit 410 Gone for deleted content
When a CMS entry is permanently deleted, the framework must return 410 (not 404) so crawlers remove the URL from the index promptly rather than treating it as a soft error.
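The decision between 200, 404, and 410 can be isolated in a small pure function that any framework hook can call. The sketch below assumes you keep two sets of paths, one for published entries and one for permanently deleted ones. The names statusForPath, published, and deleted are illustrative.

```typescript
type RouteStatus = 200 | 404 | 410;

// Hypothetical helper: choose the HTTP status for a requested path.
// Published wins over deleted so a slug that is re-created is served again.
function statusForPath(
  path: string,
  published: ReadonlySet<string>,
  deleted: ReadonlySet<string>,
): RouteStatus {
  if (published.has(path)) return 200;
  if (deleted.has(path)) return 410; // intentional, permanent removal
  return 404; // never existed, or status unknown
}

const published = new Set(['/blog/live-post']);
const deleted = new Set(['/blog/old-post']);
console.log(statusForPath('/blog/old-post', published, deleted));
```

Keeping the check ahead of rendering means a deleted path never reaches the page component.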
Step 5: Partition and submit the sitemap
Split the XML sitemap at 50,000 URLs per file and publish a sitemap_index.xml. Submit only the index file to GSC. For automated generation patterns, see Setting Up Dynamic Sitemaps for Composable CMS.
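As a rough illustration of the partitioning step, the sketch below chunks a URL list at the 50,000 boundary and builds the matching index file. The sitemap-N.xml file naming and both function names are assumptions, and the base URL is assumed to contain no characters that need XML escaping.

```typescript
const MAX_URLS_PER_SITEMAP = 50000;

// Split a flat URL list into chunks of at most `size` entries.
function partitionUrls(urls: string[], size = MAX_URLS_PER_SITEMAP): string[][] {
  const parts: string[][] = [];
  for (let i = 0; i < urls.length; i += size) {
    parts.push(urls.slice(i, i + size));
  }
  return parts;
}

// Build sitemap_index.xml content pointing at one file per partition.
// File names (sitemap-1.xml, sitemap-2.xml, ...) are a hypothetical convention.
function buildSitemapIndex(baseUrl: string, partCount: number): string {
  const entries = Array.from(
    { length: partCount },
    (_, i) => `  <sitemap><loc>${baseUrl}/sitemap-${i + 1}.xml</loc></sitemap>`,
  );
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...entries,
    '</sitemapindex>',
  ].join('\n');
}

const parts = partitionUrls(['https://example.com/a', 'https://example.com/b'], 1);
console.log(buildSitemapIndex('https://example.com', parts.length));
```

Each partition would then be written out as its own urlset file before the index is published.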
Framework-Specific Code Examples
Next.js App Router: Middleware-level X-Robots-Tag
// middleware.ts
import { NextRequest, NextResponse } from 'next/server';

// Query parameters that create variants with no unique indexable value
const LOW_VALUE_PARAMS = ['filter', 'sort', 'color', 'size', 'page'];

export function middleware(req: NextRequest) {
  const { searchParams } = req.nextUrl;
  const hasLowValueParam = LOW_VALUE_PARAMS.some((p) => searchParams.has(p));
  const res = NextResponse.next();
  if (hasLowValueParam) {
    res.headers.set('X-Robots-Tag', 'noindex, follow');
  }
  return res;
}

// Skip Next.js internals, the favicon, and API routes
export const config = {
  matcher: ['/((?!_next|favicon.ico|api).*)'],
};
SEO impact: Suppresses parameter variants across the entire application in one place. No per-page noindex meta tags are needed, and link equity continues to flow through followed links.
Validation: Run curl -I "https://example.com/products?filter=red" and confirm x-robots-tag: noindex, follow in the response.
SvelteKit: Handle hook for parameterized noindex + 410 for deleted content
// src/hooks.server.ts
import type { Handle } from '@sveltejs/kit';
import { deletedSlugs } from '$lib/deleted-slugs';

export const handle: Handle = async ({ event, resolve }) => {
  // Return 410 for permanently deleted slugs
  const path = event.url.pathname;
  if (deletedSlugs.has(path)) {
    return new Response('Gone', { status: 410 });
  }

  const response = await resolve(event);

  // noindex for sort/filter params
  const hasParam = ['sort', 'filter', 'q'].some((p) =>
    event.url.searchParams.has(p)
  );
  if (hasParam) {
    response.headers.set('X-Robots-Tag', 'noindex, follow');
  }
  return response;
};
SEO impact: Centralises both suppression and removal signaling in a single server hook. Crawlers receive 410 on deleted slugs and noindex on parameter variants without any component-level logic.
Validation: curl -I "https://example.com/blog/old-post" returns HTTP/2 410. curl -I "https://example.com/products?sort=price" includes x-robots-tag: noindex, follow.
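The hook above imports deletedSlugs from $lib/deleted-slugs without showing that module. One way it might look is sketched below, assuming the CMS sends a webhook on deletion. The payload shape, the entry.deleted event name, and the recordDeletion function are all hypothetical; check your CMS's webhook documentation for the real fields.

```typescript
// src/lib/deleted-slugs.ts (hypothetical module behind the import above)
// In the real module, export both deletedSlugs and recordDeletion.
const deletedSlugs = new Set<string>();

// Assumed webhook payload shape; real CMS payloads will differ.
interface EntryDeletedEvent {
  event: string;
  slug: string;
  contentType: string;
}

// Record the full path so it matches event.url.pathname in the hook.
function recordDeletion(payload: EntryDeletedEvent): void {
  if (payload.event !== 'entry.deleted') return; // ignore other events
  deletedSlugs.add(`/${payload.contentType}/${payload.slug}`);
}

recordDeletion({ event: 'entry.deleted', slug: 'old-post', contentType: 'blog' });
console.log(deletedSlugs.has('/blog/old-post'));
```

An in-memory Set is lost on every restart or redeploy, so a production setup would need to persist the deleted paths (for example in a key-value store) and load them at startup.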
Nuxt: Server middleware for route-level X-Robots-Tag
// server/middleware/indexation.ts
import { defineEventHandler, getQuery, setHeader } from 'h3';

const SUPPRESSED_PARAMS = ['filter', 'sort', 'session', 'ref'];

export default defineEventHandler((event) => {
  const query = getQuery(event);
  const hasLowValueParam = SUPPRESSED_PARAMS.some((p) => p in query);
  if (hasLowValueParam) {
    setHeader(event, 'X-Robots-Tag', 'noindex, follow');
  }
});
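SEO impact: h3 middleware fires before Nitro renders the page, keeping the suppression layer outside Vue component lifecycle. This applies to SSR and edge-deployed Nuxt builds; fully prerendered output served as static files never passes through server middleware, so set the header at the CDN or host level for those routes.
Validation: Run curl -I "https://example.com/shop?filter=sale" and confirm x-robots-tag: noindex, follow appears in the response headers. Header order does not matter.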
HTTP Headers & CDN Directives Reference
| Header | Required value | Rationale |
|---|---|---|
| X-Robots-Tag | noindex, follow | Suppresses parameterized or low-value URLs without breaking link flow |
| Cache-Control (static routes) | public, max-age=2592000, stale-while-revalidate=86400 | Lets CDN serve cached responses to bots, reducing TTFB |
| Cache-Control (ISR routes) | public, s-maxage=3600, stale-while-revalidate=86400 | Signals freshness window; align with expected crawl frequency |
| Cache-Control (noindex pages) | no-store | Prevents CDN from caching suppressed content |
| Strict-Transport-Security | max-age=31536000; includeSubDomains | Prevents redirect overhead on every bot request |
For the full set of CDN and edge caching directives for SEO, including Cloudflare page rules and Fastly VCL examples, see the dedicated reference on edge caching.
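To keep these values consistent across routes, they can live in one lookup rather than being repeated per handler. The following sketch maps a route tier to the header values from the table. The RouteTier type and headersForRoute helper are illustrative names, not a framework feature.

```typescript
type RouteTier = 'static' | 'isr' | 'noindex';

// Cache-Control values copied from the reference table.
const CACHE_CONTROL: Record<RouteTier, string> = {
  static: 'public, max-age=2592000, stale-while-revalidate=86400',
  isr: 'public, s-maxage=3600, stale-while-revalidate=86400',
  noindex: 'no-store',
};

// Hypothetical helper: return the response headers for a route tier.
function headersForRoute(tier: RouteTier): Record<string, string> {
  const headers: Record<string, string> = {
    'Cache-Control': CACHE_CONTROL[tier],
  };
  if (tier === 'noindex') {
    headers['X-Robots-Tag'] = 'noindex, follow';
  }
  return headers;
}

console.log(headersForRoute('noindex'));
```

A middleware or route handler would classify the request into a tier first and then apply the returned headers to the response.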
Validation Protocol
Run these checks before and after each deployment to confirm correct indexation scope.
# 1. Confirm noindex on parameterized URL
curl -I "https://example.com/products?filter=red" | grep -i x-robots-tag
# Expected: x-robots-tag: noindex, follow
# 2. Confirm 410 on deleted slug
curl -o /dev/null -sw "%{http_code}\n" "https://example.com/deleted-page"
# Expected: 410
# 3. Validate sitemap index XML
curl -s "https://example.com/sitemap_index.xml" | xmllint --noout -
# Expected: no errors
# 4. Confirm static page Cache-Control
curl -I "https://example.com/blog/my-post" | grep -i cache-control
# Expected: public, max-age=...
# 5. Count generated routes (Next.js)
find .next/server/app -name "*.html" | wc -l
In Google Search Console, check:
- Index Coverage > Not indexed > Crawled - currently not indexed: should decrease for parameter variants
- Crawl Stats > By response: 410 count should match your deleted slug count
- Sitemap status: all submitted partition sitemaps show "Success"
Lighthouse CI threshold: Googlebot-simulated TTFB should remain below 200 ms across all route tiers.
Troubleshooting
| Symptom | Root cause | Fix |
|---|---|---|
| Parameterized URLs appearing in GSC Index Coverage | X-Robots-Tag added in component <head>, not HTTP response | Move to server middleware; verify with curl -I |
| Deleted pages still in GSC as 404, not 410 | Framework returns 404 for unknown slugs by default | Maintain a deletedSlugs set in your CMS webhook handler; return 410 explicitly |
| Pagination routes beyond cap showing as 404 in GSC | dynamicParams not set to false; fallback renders a blank 404 | Set dynamicParams = false in Next.js; add [page] 404 handler in SvelteKit |
| Sitemap returning 50,001+ URLs in one file | No partitioning logic in sitemap build script | Split on 50,000-URL boundary; serve via sitemap_index.xml |
| ISR pages indexed with stale content | s-maxage longer than crawl frequency for that route tier | Reduce s-maxage; use on-demand revalidation via revalidatePath for high-priority pages |
| noindex ignored on some parameterized URLs | CDN strips non-standard headers before delivering to bot | Whitelist X-Robots-Tag in your CDN header forwarding rules |
To address stale route accumulation over time, see Preventing Indexation Bloat in Decoupled Sites for automated audit and cleanup workflows.
Child Pages in This Section
- Preventing Indexation Bloat in Decoupled Sites: Automated audit workflows to identify and prune stale, orphaned, and duplicate routes before they erode crawl efficiency.
- Setting Up Dynamic Sitemaps for Composable CMS: Build-time and runtime sitemap generation with per-type partitioning and CI/CD integration for large-scale headless deployments.
FAQ
Does Google enforce a strict URL limit for headless sites?
No hard per-site cap is published. Practical crawl depth scales with site authority, server TTFB, and content freshness. Exhausting crawl budget on low-value routes effectively suppresses indexation of high-value pages, which is the real risk, not a theoretical limit.
How does ISR affect indexation thresholds?
ISR reduces origin load but introduces a revalidation window during which crawlers may fetch a stale version. Set stale-while-revalidate intervals shorter than the expected interval between crawls for critical routes. Mismatched intervals cause stale indexation rather than budget waste.
Should decoupled sites use noindex on parameterized URLs?
Yes, unless parameters produce genuinely unique, high-value content. For faceted navigation and sort parameters, noindex via X-Robots-Tag combined with canonical tags pointing to the clean base URL is the most reliable approach.
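For the canonical half of that approach, the clean base URL can be derived by stripping the suppressed parameters from the requested URL. This is a minimal sketch using the standard URL API. The parameter list mirrors the Next.js middleware example and canonicalFor is a hypothetical helper name.

```typescript
// Parameters treated as low-value; adjust to match your middleware list.
const LOW_VALUE_PARAMS = ['filter', 'sort', 'color', 'size', 'page'];

// Hypothetical helper: strip low-value params and fragments to get the
// URL a canonical tag should point to. Other params are preserved.
function canonicalFor(rawUrl: string): string {
  const url = new URL(rawUrl);
  for (const param of LOW_VALUE_PARAMS) {
    url.searchParams.delete(param);
  }
  url.hash = '';
  return url.toString();
}

console.log(canonicalFor('https://example.com/products?filter=red&sort=price'));
```

The result would be emitted as the href of the page's canonical link element, so every parameter variant points back to the same clean URL.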
What is the correct HTTP status code for a deleted headless CMS entry?
Return 410 Gone for permanently deleted content. Unlike 404, 410 signals intentional removal and prompts crawlers to drop the URL from the index faster, typically within one or two crawl cycles.
Related
- Crawl Budget Impact in Headless: Measuring and managing bot throughput across route tiers
- ISR vs SSG vs CSR Routing: How rendering strategy choice affects route generation limits
- Edge Caching Behavior for SEO: CDN header configuration and its impact on crawler TTFB
- XML Sitemap Generation for Headless: Partitioned sitemap generation for large-scale decoupled deployments
- Canonical URL Enforcement: Preventing duplicate indexation caused by URL variants