Indexation Limits for Decoupled Sites

Decoupled architectures give you precise control over which URLs exist — but that same flexibility means no CMS-level guard prevents thousands of low-value routes from reaching the index. Without explicit scope management, crawl budget drains on parameterized endpoints, pagination tails, and orphaned CMS entries before Googlebot reaches your money pages.

Prerequisites

Before working through any configuration here, confirm the following are in place:

Framework version: Next.js 14+ (App Router), SvelteKit 2.x, or Nuxt 3.x — middleware APIs differ significantly in earlier releases
CMS access: API access to query published / draft status flags per entry
Environment variables: SITE_URL, CMS_API_URL, CMS_API_KEY, NEXT_PUBLIC_BASE_URL (or framework equivalent) set in both local and CI environments
Tooling: curl (for header inspection), xmllint (sitemap validation), Google Search Console verified and receiving data
GSC baseline: at least 7 days of Crawl Stats data so you can measure before/after changes to bot throughput

How Indexation Scope Breaks Down in Decoupled Architectures

The diagram below maps the three layers where indexation decisions happen — CMS, framework, and CDN/edge — and shows which signals Googlebot reads at each layer.

This three-layer model is the foundation for every configuration below. Crawl budget in headless deployments is the metric that ties all three layers together — changes at any layer register in GSC Crawl Stats within 1–2 weeks.

Step-by-Step Implementation Workflow

Step 1 — Map CMS content types to route patterns

Query your CMS API for every content type and filter to published status before passing slugs to the framework router. This is the single most effective gate: routes that are never generated cannot be crawled.

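// lib/cms.ts: fetch only published slugs for one content type
export async function getPublishedSlugs(contentType: string): Promise<string[]> {
  const res = await fetch(
    `${process.env.CMS_API_URL}/entries?type=${encodeURIComponent(contentType)}&status=published&fields=slug`,
    { headers: { Authorization: `Bearer ${process.env.CMS_API_KEY}` } }
  );
  // Fail the build loudly rather than silently generating zero routes
  if (!res.ok) {
    throw new Error(`CMS request failed: ${res.status} ${res.statusText}`);
  }
  const data: { items: { slug: string }[] } = await res.json();
  return data.items.map((item) => item.slug);
}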

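Validation: Run npm run build and count the generated routes in .next/server/app/ (Next.js) or build/ (SvelteKit). For each content type, the route count should match its published entry count; any surplus beyond known static pages such as the home page points to unpublished or draft entries leaking into the build.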

Step 2 — Cap pagination depth in static generation

Unbounded pagination generates O(n) routes from a single content type. Define a hard maximum per content type before the build phase. The ISR vs SSG vs CSR Routing pattern you choose affects whether this cap is enforced at build time or request time.

// app/blog/page/[page]/page.tsx — Next.js App Router
export async function generateStaticParams() {
  const MAX_STATIC_PAGES = 50;
  return Array.from({ length: MAX_STATIC_PAGES }, (_, i) => ({
    page: String(i + 1),
  }));
}

export const dynamicParams = false; // Return 404 beyond page 50

Validation: Request /blog/page/51 — confirm a clean 404 response with curl -o /dev/null -sw "%{http_code}" https://example.com/blog/page/51.
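
A fixed cap also prerenders page numbers that contain no entries when a content type is small. A minimal sketch of deriving the page list from the entry count instead, assuming the published entry total is known at build time; the helper name and the default page size of 10 are illustrative and not part of any framework API:

```typescript
// Hypothetical helper: the name and default values are assumptions.
function staticPageParams(
  totalEntries: number,
  pageSize = 10,
  maxStaticPages = 50,
): { page: string }[] {
  if (totalEntries <= 0 || pageSize <= 0) return [];
  // Number of pages that actually contain at least one entry.
  const pagesWithContent = Math.ceil(totalEntries / pageSize);
  // Respect the hard cap, but never emit pages past the last entry.
  const count = Math.min(pagesWithContent, maxStaticPages);
  return Array.from({ length: count }, (_, i) => ({ page: String(i + 1) }));
}
```

Returning this array from generateStaticParams keeps the cap as an upper bound while leaving empty listing pages ungenerated.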

Step 3 — Inject `X-Robots-Tag` for parameterized URLs

Query parameters for filtering, sorting, and session state create URL variants with no unique indexable value. Suppress them at the middleware level, not inside page components, so the header is set on every response the server handles, whether the page is prerendered or rendered per request.
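
The header is typically paired with a canonical tag that points at the parameter-free URL. The sketch below shows one way to derive both signals from a request URL. The parameter list and the function name are assumptions for illustration; adapt the list to the parameters your site actually uses.

```typescript
// Assumption: an illustrative list of parameters that never create unique content.
const LOW_VALUE_QUERY_PARAMS = ["filter", "sort", "session"];

// Returns the robots directive (or null) and the canonical URL for a request URL.
function indexationSignals(rawUrl: string): { robots: string | null; canonical: string } {
  const url = new URL(rawUrl);
  const hasLowValueParam = LOW_VALUE_QUERY_PARAMS.some((p) => url.searchParams.has(p));
  // Strip only the low-value params so meaningful ones survive in the canonical.
  for (const p of LOW_VALUE_QUERY_PARAMS) url.searchParams.delete(p);
  url.hash = "";
  return {
    robots: hasLowValueParam ? "noindex, follow" : null,
    canonical: url.toString(),
  };
}
```

Keeping the decision in one pure function lets middleware (for the header) and the page head (for the canonical link) share the same parameter list.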

Step 4 — Emit `410 Gone` for deleted content

When a CMS entry is permanently deleted, the framework must return 410 (not 404) so crawlers treat the removal as intentional and drop the URL from the index promptly, rather than rechecking it as a possibly temporary 404.
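
The decision itself is small enough to isolate from any framework. A minimal sketch, assuming two hypothetical sets of full pathnames (published and deleted) that are kept in sync with the CMS:

```typescript
type RouteStatus = 200 | 404 | 410;

// Hypothetical lookup: both sets are assumed to hold full pathnames
// such as "/blog/old-post", not bare slugs.
function statusForPath(
  pathname: string,
  publishedPaths: ReadonlySet<string>,
  deletedPaths: ReadonlySet<string>,
): RouteStatus {
  // Normalise trailing slashes so "/blog/old-post/" matches "/blog/old-post".
  const path = pathname.replace(/\/+$/, "") || "/";
  if (publishedPaths.has(path)) return 200; // a republished entry wins over a stale deletion record
  if (deletedPaths.has(path)) return 410; // intentionally removed
  return 404; // never existed, or only unpublished
}
```

Checking the published set first is a design choice: it prevents a leftover deletion record from hiding an entry that was later restored under the same path.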

Step 5 — Partition and submit the sitemap

Split the XML sitemap at 50,000 URLs per file and publish a sitemap_index.xml. Submit only the index file to GSC. For automated generation patterns, see Setting Up Dynamic Sitemaps for Composable CMS.
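
The partitioning step can be sketched as below. The sitemap-N.xml file naming and the function names are assumptions for illustration; only the 50,000-URL limit and the index file come from the step above.

```typescript
const MAX_URLS_PER_SITEMAP = 50000;

// Split a list into chunks of at most `size` items.
function partition<T>(items: readonly T[], size: number = MAX_URLS_PER_SITEMAP): T[][] {
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Escape characters that may not appear raw in XML text.
function xmlEscape(s: string): string {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Build the index that lists each partition file (hypothetical naming scheme).
function buildSitemapIndex(baseUrl: string, partCount: number): string {
  const entries = Array.from({ length: partCount }, (_, i) =>
    `  <sitemap><loc>${xmlEscape(`${baseUrl}/sitemap-${i + 1}.xml`)}</loc></sitemap>`,
  );
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...entries,
    "</sitemapindex>",
  ].join("\n");
}
```

A build script would call partition on the full URL list, write one sitemap file per chunk, then write the string from buildSitemapIndex to sitemap_index.xml.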

Framework-Specific Code Examples

Next.js App Router — Middleware-level `X-Robots-Tag`

// middleware.ts
import { NextRequest, NextResponse } from 'next/server';

const LOW_VALUE_PARAMS = ['filter', 'sort', 'color', 'size', 'page'];

export function middleware(req: NextRequest) {
  const { searchParams } = req.nextUrl;
  const hasLowValueParam = LOW_VALUE_PARAMS.some((p) => searchParams.has(p));

  if (hasLowValueParam) {
    const res = NextResponse.next();
    res.headers.set('X-Robots-Tag', 'noindex, follow');
    return res;
  }
  return NextResponse.next();
}

export const config = {
  matcher: ['/((?!_next|favicon.ico|api).*)'],
};

SEO impact: Suppresses parameter variants across the entire application in one place. No per-page noindex meta tags are needed, and link equity continues to flow through followed links.

Validation: curl -I "https://example.com/products?filter=red" — confirm x-robots-tag: noindex, follow in the response.

SvelteKit — Handle hook for parameterized noindex + 410 for deleted content

// src/hooks.server.ts
import type { Handle } from '@sveltejs/kit';
import { deletedSlugs } from '$lib/deleted-slugs';

export const handle: Handle = async ({ event, resolve }) => {
  // Return 410 for permanently deleted content; deletedSlugs must hold full pathnames (e.g. '/blog/old-post')
  const path = event.url.pathname;
  if (deletedSlugs.has(path)) {
    return new Response('Gone', { status: 410 });
  }

  const response = await resolve(event);

  // noindex for sort/filter params
  const hasParam = ['sort', 'filter', 'q'].some((p) =>
    event.url.searchParams.has(p)
  );
  if (hasParam) {
    response.headers.set('X-Robots-Tag', 'noindex, follow');
  }

  return response;
};

SEO impact: Centralises both suppression and removal signaling in a single server hook. Crawlers receive 410 on deleted slugs and noindex on parameter variants without any component-level logic.

Validation: curl -I "https://example.com/blog/old-post" returns HTTP/2 410. curl -I "https://example.com/products?sort=price" includes x-robots-tag: noindex, follow.

Nuxt — Server middleware for route-level `X-Robots-Tag`

// server/middleware/indexation.ts
import { defineEventHandler, getQuery, setHeader } from 'h3';

const SUPPRESSED_PARAMS = ['filter', 'sort', 'session', 'ref'];

export default defineEventHandler((event) => {
  const query = getQuery(event);
  const hasLowValueParam = SUPPRESSED_PARAMS.some((p) => p in query);

  if (hasLowValueParam) {
    setHeader(event, 'X-Robots-Tag', 'noindex, follow');
  }
});

SEO impact: h3 middleware runs before Nitro renders the page, keeping the suppression layer outside the Vue component lifecycle. It applies to SSR and edge-deployed Nuxt builds; fully static output (nuxt generate) has no server at request time, so set the header at the CDN or host instead.

Validation: curl -I "https://example.com/shop?filter=sale" should include x-robots-tag: noindex, follow in the response headers. Header order does not matter.

HTTP Headers & CDN Directives Reference

Header	Required value	Rationale
`X-Robots-Tag`	`noindex, follow`	Suppresses parameterized or low-value URLs without breaking link flow
`Cache-Control` (static routes)	`public, max-age=2592000, stale-while-revalidate=86400`	Lets CDN serve cached responses to bots, reducing TTFB
`Cache-Control` (ISR routes)	`public, s-maxage=3600, stale-while-revalidate=86400`	Signals freshness window; align with expected crawl frequency
`Cache-Control` (noindex pages)	`no-store`	Prevents CDN from caching suppressed content
`Strict-Transport-Security`	`max-age=31536000; includeSubDomains`	Prevents redirect overhead on every bot request

For the full set of CDN and edge caching directives, including Cloudflare page rules and Fastly VCL examples, see Edge Caching Behavior for SEO.
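
The advice to align the ISR freshness window with crawl frequency can be checked with simple arithmetic: a shared cache may serve a response for s-maxage seconds while it is fresh, and for a further stale-while-revalidate seconds while it refetches in the background. A minimal sketch that computes this worst case from a header value (the function name is hypothetical):

```typescript
// Worst-case age, in seconds, of a response a shared cache may still serve:
// the freshness lifetime plus the stale-while-revalidate window.
function worstCaseStalenessSeconds(cacheControl: string): number {
  const read = (name: string): number => {
    const match = new RegExp(`(?:^|[,\\s])${name}=(\\d+)`, "i").exec(cacheControl);
    return match ? Number(match[1]) : 0;
  };
  // Shared caches use s-maxage in preference to max-age when both are present.
  const hasSharedMaxAge = /(?:^|[,\s])s-maxage=/i.test(cacheControl);
  const fresh = hasSharedMaxAge ? read("s-maxage") : read("max-age");
  return fresh + read("stale-while-revalidate");
}
```

For the ISR row above this gives 3600 + 86400 = 90000 seconds, or 25 hours, which is the figure to compare against how often the route tier is crawled.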

Validation Protocol

Run these checks before and after each deployment to confirm correct indexation scope.

# 1. Confirm noindex on parameterized URL
curl -sI "https://example.com/products?filter=red" | grep -i x-robots-tag
# Expected: x-robots-tag: noindex, follow

# 2. Confirm 410 on deleted slug
curl -o /dev/null -sw "%{http_code}\n" "https://example.com/deleted-page"
# Expected: 410

# 3. Validate sitemap index XML
curl -s "https://example.com/sitemap_index.xml" | xmllint --noout -
# Expected: no errors

# 4. Confirm static page Cache-Control
curl -sI "https://example.com/blog/my-post" | grep -i cache-control
# Expected: public, max-age=...

# 5. Count generated routes (Next.js)
find .next/server/app -name "*.html" | wc -l

In Google Search Console, check:

Index Coverage → Not indexed → Crawled – currently not indexed: should decrease for parameter variants
Crawl Stats → By response: 410 count should match your deleted slug count
Sitemap status: all submitted partition sitemaps show “Success”

Lighthouse CI threshold: Googlebot-simulated TTFB should remain below 200 ms across all route tiers.

Troubleshooting

Symptom	Root cause	Fix
Parameterized URLs appearing in GSC Index Coverage	`X-Robots-Tag` set as a `<meta http-equiv>` tag in the component `<head>`, which Google does not treat as a header, instead of as an HTTP response header	Move to server middleware; verify with `curl -I`
Deleted pages still in GSC as 404, not 410	Framework returns 404 for unknown slugs by default	Maintain a `deletedSlugs` set in your CMS webhook handler; return `410` explicitly
Pagination routes beyond cap showing as Soft 404 in GSC	`dynamicParams` not set to `false`; pages past the cap render on demand as empty listings with a 200 status	Set `dynamicParams = false` in Next.js; add `[page]` 404 handler in SvelteKit
Sitemap returning 50,001+ URLs in one file	No partitioning logic in sitemap build script	Split on 50,000-URL boundary; serve via `sitemap_index.xml`
ISR pages indexed with stale content	`s-maxage` longer than crawl frequency for that route tier	Reduce `s-maxage`; use on-demand revalidation via `revalidatePath` for high-priority pages
`noindex` ignored on some parameterized URLs	CDN strips non-standard headers before delivering to bot	Whitelist `X-Robots-Tag` in your CDN header forwarding rules
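
The deletedSlugs fix in the table depends on the set being updated whenever the CMS reports a change. A minimal sketch of that update logic, assuming a simplified, hypothetical webhook payload; real CMS payloads differ, so map their fields onto this shape first:

```typescript
// Hypothetical event shape, not any specific CMS's webhook format.
type CmsEvent = { action: "publish" | "unpublish" | "delete"; path: string };

// Keeps the set of permanently deleted paths in sync with CMS events.
function applyCmsEvent(deletedPaths: Set<string>, event: CmsEvent): Set<string> {
  if (event.action === "delete") {
    deletedPaths.add(event.path); // serve 410 from now on
  } else if (event.action === "publish") {
    deletedPaths.delete(event.path); // restored content must not return 410
  }
  // "unpublish" leaves the set untouched: the entry may come back, so 404 is the safer signal.
  return deletedPaths;
}
```

In production the set would need to be persisted, for example in a key-value store, so it survives redeploys.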

To address stale route accumulation over time, see Preventing Indexation Bloat in Decoupled Sites for automated audit and cleanup workflows.

Child Pages in This Section

Preventing Indexation Bloat in Decoupled Sites — Automated audit workflows to identify and prune stale, orphaned, and duplicate routes before they erode crawl efficiency.
Setting Up Dynamic Sitemaps for Composable CMS — Build-time and runtime sitemap generation with per-type partitioning and CI/CD integration for large-scale headless deployments.

FAQ

Does Google enforce a strict URL limit for headless sites?

No hard per-site cap is published. Practical crawl depth scales with site authority, server TTFB, and content freshness. Exhausting crawl budget on low-value routes effectively suppresses indexation of high-value pages — which is the real risk, not a theoretical limit.

How does ISR affect indexation thresholds?

ISR reduces origin load but introduces a revalidation window during which crawlers may fetch a stale version. For critical routes, keep the s-maxage and stale-while-revalidate windows shorter than the expected interval between crawls. Windows that are too long cause stale indexation rather than budget waste.

Should decoupled sites use noindex on parameterized URLs?

Yes, unless parameters produce genuinely unique, high-value content. For faceted navigation and sort parameters, noindex via X-Robots-Tag combined with canonical tags pointing to the clean base URL is the most reliable approach.

What is the correct HTTP status code for a deleted headless CMS entry?

Return 410 Gone for permanently deleted content. Unlike 404, 410 signals intentional removal and prompts crawlers to drop the URL from the index faster — typically within one or two crawl cycles.
