Indexation Limits for Decoupled Sites

Decoupled architectures give you precise control over which URLs exist, but that same flexibility means no CMS-level guard prevents thousands of low-value routes from reaching the index. Without explicit scope management, crawl budget drains on parameterized endpoints, pagination tails, and orphaned CMS entries before Googlebot reaches your money pages.

Prerequisites

Before working through any configuration here, confirm the following are in place:

  • Framework version: Next.js 14+ (App Router), SvelteKit 2.x, or Nuxt 3.x; middleware APIs differ significantly in earlier releases
  • CMS access: API access to query published / draft status flags per entry
  • Environment variables: SITE_URL, CMS_API_URL, CMS_API_KEY, NEXT_PUBLIC_BASE_URL (or framework equivalent) set in both local and CI environments
  • Tooling: curl (for header inspection), xmllint (sitemap validation), Google Search Console verified and receiving data
  • GSC baseline: at least 7 days of Crawl Stats data so you can measure before/after changes to bot throughput
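The environment-variable prerequisite can be checked before a build runs. The following is a minimal sketch, not a prescribed setup: `missingEnvVars` is a hypothetical helper, and the variable list simply mirrors the bullet above, so adjust it to your framework.

```typescript
// Hypothetical pre-flight helper: reports which required variables are unset.
// The names mirror the prerequisites list; adjust them for your framework.
const REQUIRED_ENV_VARS = ['SITE_URL', 'CMS_API_KEY', 'NEXT_PUBLIC_BASE_URL'];

function missingEnvVars(
  env: Record<string, string | undefined>,
  required: string[] = REQUIRED_ENV_VARS
): string[] {
  return required.filter((name) => {
    const value = env[name];
    // Treat empty or whitespace-only values as missing.
    return value === undefined || value.trim() === '';
  });
}
```

In a Node build script this could be called as missingEnvVars(process.env), failing the build when the returned list is not empty.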

How Indexation Scope Breaks Down in Decoupled Architectures

The diagram below maps the three layers where indexation decisions happen (CMS, framework, and CDN/edge) and shows which signals Googlebot reads at each layer.

Figure: Indexation scope decision layers in a decoupled architecture. Three horizontal layers show which signals Googlebot reads at each layer to determine whether a URL is indexed. Googlebot reads all three layers.

  • CMS layer: published entries → route generation; draft / archived → excluded from build; deleted entries → 410 Gone
  • Framework layer: getStaticPaths / generateStaticParams (capped route manifest); server middleware (X-Robots-Tag injection); robots.txt + sitemap (discovery signals)
  • CDN / edge layer: Cache-Control headers (bot throughput signal); edge noindex rules (param-based suppression); TTFB < 200 ms (crawl budget efficiency)

This three-layer model is the foundation for every configuration below. Crawl budget in headless deployments is the metric that ties all three layers together: changes at any layer register in GSC Crawl Stats within 1–2 weeks.

Step-by-Step Implementation Workflow

Step 1: Map CMS content types to route patterns

Query your CMS API for every content type and filter to published status before passing slugs to the framework router. This is the single most effective gate: routes that are never generated cannot be crawled.

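// lib/cms.ts: fetch only published slugs
export async function getPublishedSlugs(contentType: string): Promise<string[]> {
  // CMS_API_URL is assumed to be set alongside CMS_API_KEY
  const res = await fetch(
    `${process.env.CMS_API_URL}/entries?type=${encodeURIComponent(contentType)}&status=published&fields=slug`,
    { headers: { Authorization: `Bearer ${process.env.CMS_API_KEY}` } }
  );
  // Fail the build loudly rather than generating routes from an error payload
  if (!res.ok) {
    throw new Error(`CMS request failed: ${res.status} ${res.statusText}`);
  }
  const data: { items: { slug: string }[] } = await res.json();
  return data.items.map((item) => item.slug);
}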

Validation: Run npm run build and count route directories in .next/server/app/ (Next.js) or build/ (SvelteKit). The number of content routes should match your published entry count, with no extras.
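A raw count can hide offsetting errors (one extra route and one missing route still total the same), so the comparison can also be done as a set difference. A minimal sketch, assuming both lists are already available as arrays of slugs; `diffRoutes` is a hypothetical helper, not part of any framework.

```typescript
// Hypothetical audit helper: compares generated route slugs with published CMS slugs.
interface RouteDiff {
  extra: string[];   // generated but not published (should be empty)
  missing: string[]; // published but not generated
}

function diffRoutes(generated: string[], published: string[]): RouteDiff {
  const generatedSet = new Set(generated);
  const publishedSet = new Set(published);
  return {
    extra: generated.filter((slug) => !publishedSet.has(slug)),
    missing: published.filter((slug) => !generatedSet.has(slug)),
  };
}
```

Any slug in extra points to a draft, archived, or orphaned entry that slipped through the published filter.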

Step 2: Cap pagination depth in static generation

Unbounded pagination generates O(n) routes from a single content type. Define a hard maximum per content type before the build phase. The ISR vs SSG vs CSR Routing pattern you choose affects whether this cap is enforced at build time or request time.

// app/blog/page/[page]/page.tsx (Next.js App Router)
export async function generateStaticParams() {
  const MAX_STATIC_PAGES = 50;
  return Array.from({ length: MAX_STATIC_PAGES }, (_, i) => ({
    page: String(i + 1),
  }));
}

export const dynamicParams = false; // Return 404 beyond page 50

Validation: Request /blog/page/51 and confirm a clean 404 response with curl -o /dev/null -sw "%{http_code}" https://example.com/blog/page/51.
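The fixed cap above always emits 50 pages, even when a content type has fewer entries than that. One way to avoid empty trailing pages is to derive the page count from the entry total and clamp it to the maximum. This is a sketch under assumptions: `cappedPageParams`, `totalEntries`, and `perPage` are illustrative names, and the entry total is assumed to come from your CMS.

```typescript
// Hypothetical helper: derive static pagination params from the entry count,
// clamped to a hard maximum so the route manifest stays bounded.
function cappedPageParams(
  totalEntries: number,
  perPage: number,
  maxPages: number
): { page: string }[] {
  if (perPage <= 0) throw new Error('perPage must be positive');
  const needed = Math.ceil(Math.max(totalEntries, 0) / perPage);
  const count = Math.min(needed, maxPages);
  return Array.from({ length: count }, (_, i) => ({ page: String(i + 1) }));
}
```

The result has the same shape as the array returned by generateStaticParams, so it could be returned from that function directly.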

Step 3: Inject X-Robots-Tag for parameterized URLs

Query parameters for filtering, sorting, and session state create URL variants with no unique indexable value. Suppress them at the middleware level, not inside page components, so the header fires regardless of rendering strategy.

Step 4: Emit 410 Gone for deleted content

When a CMS entry is permanently deleted, the framework must return 410 (not 404) so crawlers remove the URL from the index promptly rather than treating it as a soft error.
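The status decision itself can be kept framework-agnostic as a small lookup. A minimal sketch, assuming you maintain one set of published slugs and one set of permanently deleted slugs; `statusForSlug` and both set names are hypothetical.

```typescript
type RouteStatus = 200 | 404 | 410;

// Hypothetical resolver: published wins, then known deletions, then unknown.
function statusForSlug(
  slug: string,
  publishedSlugs: Set<string>,
  deletedSlugs: Set<string>
): RouteStatus {
  if (publishedSlugs.has(slug)) return 200; // live content
  if (deletedSlugs.has(slug)) return 410;   // intentionally removed
  return 404;                               // never existed, or still a draft
}
```

Checking the published set first means a slug that was deleted and later republished serves 200 again even if the deleted set has not been cleaned up yet.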

Step 5: Partition and submit the sitemap

Split the XML sitemap at 50,000 URLs per file (the sitemap protocol also caps each file at 50 MB uncompressed) and publish a sitemap_index.xml. Submit only the index file to GSC. For automated generation patterns, see Setting Up Dynamic Sitemaps for Composable CMS.
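The partitioning step might be sketched as follows. The helper names and the sitemap-N.xml file naming are assumptions for illustration; only the 50,000-URL limit and the sitemapindex format come from the sitemap protocol.

```typescript
const MAX_URLS_PER_SITEMAP = 50000;

// Split a flat URL list into chunks that respect the per-file limit.
function partitionUrls(urls: string[], size: number = MAX_URLS_PER_SITEMAP): string[][] {
  if (size <= 0) throw new Error('size must be positive');
  const chunks: string[][] = [];
  for (let i = 0; i < urls.length; i += size) {
    chunks.push(urls.slice(i, i + size));
  }
  return chunks;
}

// Escape the characters that are reserved in XML text nodes.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Build the index document that points at each partition file.
// The sitemap-N.xml naming scheme is an assumption for illustration.
function buildSitemapIndex(siteUrl: string, partitionCount: number): string {
  const entries: string[] = [];
  for (let i = 1; i <= partitionCount; i += 1) {
    const loc = escapeXml(`${siteUrl}/sitemap-${i}.xml`);
    entries.push(`  <sitemap><loc>${loc}</loc></sitemap>`);
  }
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...entries,
    '</sitemapindex>',
  ].join('\n');
}
```

A build script would call partitionUrls once, write each chunk to its own file, and pass the chunk count to buildSitemapIndex.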

Framework-Specific Code Examples

Next.js App Router: Middleware-level X-Robots-Tag

// middleware.ts
import { NextRequest, NextResponse } from 'next/server';

const LOW_VALUE_PARAMS = ['filter', 'sort', 'color', 'size', 'page'];

export function middleware(req: NextRequest) {
  const { searchParams } = req.nextUrl;
  const hasLowValueParam = LOW_VALUE_PARAMS.some((p) => searchParams.has(p));

  if (hasLowValueParam) {
    const res = NextResponse.next();
    res.headers.set('X-Robots-Tag', 'noindex, follow');
    return res;
  }
  return NextResponse.next();
}

export const config = {
  matcher: ['/((?!_next|favicon.ico|api).*)'],
};

SEO impact: Suppresses parameter variants across the entire application in one place. No per-page noindex meta tags are needed, and link equity continues to flow through followed links.

Validation: Run curl -I "https://example.com/products?filter=red" and confirm x-robots-tag: noindex, follow in the response.

SvelteKit: Handle hook for parameterized noindex + 410 for deleted content

// src/hooks.server.ts
import type { Handle } from '@sveltejs/kit';
import { deletedSlugs } from '$lib/deleted-slugs';

export const handle: Handle = async ({ event, resolve }) => {
  // Return 410 for permanently deleted slugs
  const path = event.url.pathname;
  if (deletedSlugs.has(path)) {
    return new Response('Gone', { status: 410 });
  }

  const response = await resolve(event);

  // noindex for sort/filter params
  const hasParam = ['sort', 'filter', 'q'].some((p) =>
    event.url.searchParams.has(p)
  );
  if (hasParam) {
    response.headers.set('X-Robots-Tag', 'noindex, follow');
  }

  return response;
};

SEO impact: Centralises both suppression and removal signaling in a single server hook. Crawlers receive 410 on deleted slugs and noindex on parameter variants without any component-level logic.

Validation: curl -I "https://example.com/blog/old-post" returns HTTP/2 410. curl -I "https://example.com/products?sort=price" includes x-robots-tag: noindex, follow.

Nuxt: Server middleware for route-level X-Robots-Tag

// server/middleware/indexation.ts
import { defineEventHandler, getQuery, setHeader } from 'h3';

const SUPPRESSED_PARAMS = ['filter', 'sort', 'session', 'ref'];

export default defineEventHandler((event) => {
  const query = getQuery(event);
  const hasLowValueParam = SUPPRESSED_PARAMS.some((p) => p in query);

  if (hasLowValueParam) {
    setHeader(event, 'X-Robots-Tag', 'noindex, follow');
  }
});

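SEO impact: h3 middleware runs before Nitro renders the page, keeping the suppression layer outside the Vue component lifecycle. This applies to SSR and edge-deployed Nuxt builds; fully prerendered output served from static hosting has no runtime middleware, so set the header at the host or CDN in that case.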

Validation: curl -I "https://example.com/shop?filter=sale" should include x-robots-tag: noindex, follow among the response headers; header order does not matter.

HTTP Headers & CDN Directives Reference

| Header | Required value | Rationale |
| --- | --- | --- |
| X-Robots-Tag | noindex, follow | Suppresses parameterized or low-value URLs without breaking link flow |
| Cache-Control (static routes) | public, max-age=2592000, stale-while-revalidate=86400 | Lets CDN serve cached responses to bots, reducing TTFB |
| Cache-Control (ISR routes) | public, s-maxage=3600, stale-while-revalidate=86400 | Signals freshness window; align with expected crawl frequency |
| Cache-Control (noindex pages) | no-store | Prevents CDN from caching suppressed content |
| Strict-Transport-Security | max-age=31536000; includeSubDomains | Prevents redirect overhead on every bot request |

For the full set of directives, including Cloudflare page rules and Fastly VCL examples, see the dedicated reference on CDN and edge caching directives for SEO.
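The table above can be encoded as a single lookup so that every route tier gets a consistent set of headers. A sketch, assuming three hypothetical tier names (static, isr, noindex); the header values are copied from the table, and `headersForTier` is an illustrative helper.

```typescript
type RouteTier = 'static' | 'isr' | 'noindex';

// Values mirror the reference table; tier names are illustrative.
const CACHE_CONTROL_BY_TIER: Record<RouteTier, string> = {
  static: 'public, max-age=2592000, stale-while-revalidate=86400',
  isr: 'public, s-maxage=3600, stale-while-revalidate=86400',
  noindex: 'no-store',
};

// Return the headers a response in the given tier should carry.
function headersForTier(tier: RouteTier): Record<string, string> {
  const headers: Record<string, string> = {
    'Cache-Control': CACHE_CONTROL_BY_TIER[tier],
    'Strict-Transport-Security': 'max-age=31536000; includeSubDomains',
  };
  if (tier === 'noindex') {
    headers['X-Robots-Tag'] = 'noindex, follow';
  }
  return headers;
}
```

Keeping the values in one map makes it harder for middleware and CDN rules to drift apart when a TTL changes.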

Validation Protocol

Run these checks before and after each deployment to confirm correct indexation scope.

# 1. Confirm noindex on parameterized URL
curl -sI "https://example.com/products?filter=red" | grep -i x-robots-tag
# Expected: x-robots-tag: noindex, follow

# 2. Confirm 410 on deleted slug
curl -o /dev/null -sw "%{http_code}\n" "https://example.com/deleted-page"
# Expected: 410

# 3. Validate sitemap index XML
curl -s "https://example.com/sitemap_index.xml" | xmllint --noout -
# Expected: no errors

# 4. Confirm static page Cache-Control
curl -sI "https://example.com/blog/my-post" | grep -i cache-control
# Expected: public, max-age=...

# 5. Count generated routes (Next.js)
find .next/server/app -name "*.html" | wc -l

In Google Search Console, check:

  • Index Coverage → Not indexed → Crawled – currently not indexed: should decrease for parameter variants
  • Crawl Stats → By response: 410 count should match your deleted slug count
  • Sitemap status: all submitted partition sitemaps show "Success"

Lighthouse CI threshold: Googlebot-simulated TTFB should remain below 200 ms across all route tiers.

Troubleshooting

| Symptom | Root cause | Fix |
| --- | --- | --- |
| Parameterized URLs appearing in GSC Index Coverage | X-Robots-Tag added in component <head>, not HTTP response | Move to server middleware; verify with curl -I |
| Deleted pages still in GSC as 404, not 410 | Framework returns 404 for unknown slugs by default | Maintain a deletedSlugs set in your CMS webhook handler; return 410 explicitly |
| Pagination routes beyond cap showing as 404 in GSC | dynamicParams not set to false; fallback renders a blank 404 | Set dynamicParams = false in Next.js; add [page] 404 handler in SvelteKit |
| Sitemap returning 50,001+ URLs in one file | No partitioning logic in sitemap build script | Split on 50,000-URL boundary; serve via sitemap_index.xml |
| ISR pages indexed with stale content | s-maxage longer than crawl frequency for that route tier | Reduce s-maxage; use on-demand revalidation via revalidatePath for high-priority pages |
| noindex ignored on some parameterized URLs | CDN strips non-standard headers before delivering to bot | Whitelist X-Robots-Tag in your CDN header forwarding rules |

To address stale route accumulation over time, see Preventing Indexation Bloat in Decoupled Sites for automated audit and cleanup workflows.
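The deletedSlugs fix from the troubleshooting table might look like the following when driven by CMS webhooks. This is a sketch only: the event shape (type, path) and the `applyCmsEvent` name are assumptions, since webhook payloads differ between CMS vendors.

```typescript
// Assumed webhook payload shape; real CMS payloads differ by vendor.
interface CmsEvent {
  type: 'entry.deleted' | 'entry.published';
  path: string; // route path, e.g. "/blog/old-post"
}

// Update the set that the server hook consults before returning 410.
function applyCmsEvent(deletedSlugs: Set<string>, event: CmsEvent): Set<string> {
  if (event.type === 'entry.deleted') {
    deletedSlugs.add(event.path);
  } else {
    // A re-published entry must stop returning 410.
    deletedSlugs.delete(event.path);
  }
  return deletedSlugs;
}
```

An in-memory set is lost on every redeploy, so a real deployment would likely persist these paths in a durable store and load them at startup.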

FAQ

Does Google enforce a strict URL limit for headless sites?

No hard per-site cap is published. Practical crawl depth scales with site authority, server TTFB, and content freshness. Exhausting crawl budget on low-value routes effectively suppresses indexation of high-value pages, which is the real risk, not a theoretical limit.

How does ISR affect indexation thresholds?

ISR reduces origin load but introduces a revalidation window during which crawlers may fetch a stale version. For critical routes, keep the s-maxage and stale-while-revalidate windows shorter than the expected interval between crawls. Mismatched intervals cause stale indexation rather than budget waste.
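A rough check for this mismatch is to compare a route's shared-cache lifetime with the crawl interval you observe in GSC Crawl Stats. A minimal sketch: `parseSMaxAge` and `exceedsCrawlInterval` are hypothetical helpers, and the crawl interval is an estimate you supply, not something the header reveals.

```typescript
// Extract the s-maxage value (in seconds) from a Cache-Control header, if present.
function parseSMaxAge(cacheControl: string): number | null {
  const match = /(?:^|,)\s*s-maxage\s*=\s*(\d+)/i.exec(cacheControl);
  return match ? Number(match[1]) : null;
}

// Flag routes whose shared-cache lifetime exceeds the expected crawl interval.
function exceedsCrawlInterval(cacheControl: string, crawlIntervalSeconds: number): boolean {
  const sMaxAge = parseSMaxAge(cacheControl);
  return sMaxAge !== null && sMaxAge > crawlIntervalSeconds;
}
```

Routes that are flagged are candidates for a shorter s-maxage or for on-demand revalidation.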

Should decoupled sites use noindex on parameterized URLs?

Yes, unless parameters produce genuinely unique, high-value content. For faceted navigation and sort parameters, noindex via X-Robots-Tag combined with canonical tags pointing to the clean base URL is the most reliable approach.
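The clean base URL for the canonical tag can be computed by stripping the suppressed parameters. A minimal sketch using the standard URL API; `canonicalFor` and the parameter list are illustrative, and in practice the list would be shared with your middleware.

```typescript
// Parameters treated as low-value; illustrative, reuse your middleware list.
const CANONICAL_STRIP_PARAMS = ['filter', 'sort', 'color', 'size', 'page'];

// Strip suppressed query parameters and the fragment to get the clean URL.
function canonicalFor(rawUrl: string, suppressed: string[] = CANONICAL_STRIP_PARAMS): string {
  const url = new URL(rawUrl);
  for (const name of suppressed) {
    url.searchParams.delete(name);
  }
  url.hash = '';
  return url.toString();
}
```

Parameters that are not in the list are preserved, so a URL whose query string identifies distinct content keeps its own canonical.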

What is the correct HTTP status code for a deleted headless CMS entry?

Return 410 Gone for permanently deleted content. Unlike 404, 410 signals intentional removal and prompts crawlers to drop the URL from the index faster, typically within one or two crawl cycles.


Part of: Headless Architecture & Rendering Strategy Fundamentals