XML Sitemap Generation for Headless

In a headless deployment, your CMS and your rendering layer have no shared filesystem — which means the crawler discovery contract that a monolithic CMS manages automatically becomes your engineering responsibility. This page walks through the full pipeline: fetching routes from a content API, serializing them to a valid XML sitemap, applying CDN caching rules, and validating the result before submission to Search Console.

Prerequisites

Before implementing, confirm the following are in place:

Framework version: Next.js 13.3+ (App Router; the priority field in sitemap.ts needs 13.4.5+), Nuxt 3.x, Astro 3+, or SvelteKit 1.0+
Environment variables: SITE_URL (absolute origin, no trailing slash), CMS_API_URL, and any bearer token for private APIs
Tooling: curl, xmllint (from libxml2), and access to Google Search Console for your property
Slug normalization: all CMS slugs must already be lowercased, hyphenated, and free of trailing slashes before route extraction begins
Dynamic route generation: parameterized routes must be resolved to concrete paths before sitemap serialization
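
The slug prerequisite above can be enforced in code rather than trusted. The following is a minimal sketch, assuming ASCII-only slugs; the name normalizeSlug is hypothetical and not part of any framework or CMS SDK.

```typescript
// Hypothetical helper: normalizes a raw CMS slug to the shape the
// prerequisites describe (lowercase, hyphenated, no trailing slash).
// Assumption: slugs are ASCII; other characters are simply dropped.
export function normalizeSlug(raw: string): string {
  return raw
    .trim()
    .toLowerCase()
    .replace(/[\s_]+/g, '-')       // spaces and underscores become hyphens
    .replace(/[^a-z0-9\/-]/g, '')  // drop anything outside a-z, 0-9, hyphen, slash
    .replace(/-{2,}/g, '-')        // collapse repeated hyphens
    .replace(/\/{2,}/g, '/')       // collapse repeated slashes
    .replace(/^\/+|\/+$/g, '');    // strip leading and trailing slashes
}
```

Running every slug through a helper like this before route extraction means a single malformed CMS entry cannot introduce a URL variant that the sitemap and the rendered page disagree on.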

How the Pipeline Fits Together

The diagram below shows how a CMS publish event propagates through the sitemap pipeline to a crawler discovery event.

Step-by-Step Implementation Workflow

Step 1 — Build a Route Manifest from the CMS API

Fetch all published entries in a single paginated loop. Write a flat array of objects before any XML serialization:

// lib/fetchRoutes.ts
interface RouteEntry {
  url: string;
  lastmod: string;
  priority: number;
}

export async function fetchRoutes(): Promise<RouteEntry[]> {
  const base = process.env.CMS_API_URL!;
  const site = process.env.SITE_URL!;
  const entries: RouteEntry[] = [];
  let page = 1;
  let hasMore = true;

  while (hasMore) {
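    const res = await fetch(`${base}/entries?status=published&page=${page}&limit=200`);
    // Fail loudly instead of serializing a partial manifest from an error response.
    if (!res.ok) throw new Error(`CMS request failed: ${res.status} ${res.statusText}`);
    const data: { items: Array<{ slug: string; updatedAt: string; type: string }> } = await res.json();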
    for (const item of data.items) {
      entries.push({
        url: `${site}/${item.slug}/`,
        lastmod: new Date(item.updatedAt).toISOString(),
        priority: item.type === 'article' ? 0.8 : 0.5,
      });
    }
    hasMore = data.items.length === 200;
    page++;
  }
  return entries;
}

Validation: compare entries.length to the CMS published count via the API dashboard before proceeding.

Step 2 — Filter Non-Indexable Routes

Apply the filter layer shown in the diagram. Strip draft slugs, parameterized paths, and pagination variants before serialization:

// lib/filterRoutes.ts
export function filterRoutes(routes: Array<{ url: string; lastmod: string; priority: number }>) {
  const EXCLUDE = [/\/draft\//, /[?&]/, /\/page\/\d+\/?$/, /\/preview\//];
  return routes.filter((r) => !EXCLUDE.some((re) => re.test(r.url)));
}

This step matters most when your headless pagination uses numeric URL suffixes (for example /page/2/) that are not meant to be indexed independently.

Step 3 — Serialize to XML

Once the manifest is clean, serialize it. The sitemaps protocol requires an XML declaration, the urlset namespace, and well-formed <url> nodes.

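// lib/buildSitemap.ts

// Escape the five XML special characters so a stray & or < cannot break the document.
const escapeXml = (s: string) =>
  s
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');

export function buildSitemapXml(routes: Array<{ url: string; lastmod: string; priority: number }>) {
  const nodes = routes
    .map(
      (r) =>
        `  <url>\n    <loc>${escapeXml(r.url)}</loc>\n    <lastmod>${r.lastmod}</lastmod>\n    <priority>${r.priority}</priority>\n  </url>`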
    )
    .join('\n');
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    nodes,
    '</urlset>',
  ].join('\n');
}

Framework-Specific Builders

Next.js App Router — Metadata API

Next.js 13.3+ supports an app/sitemap.ts file whose default export returns the sitemap entries. With a revalidate interval set, the framework regenerates the output periodically instead of on every request, so publishing new content does not require a full rebuild.

// app/sitemap.ts
import type { MetadataRoute } from 'next';
import { fetchRoutes } from '@/lib/fetchRoutes';
import { filterRoutes } from '@/lib/filterRoutes';

export const revalidate = 3600; // ISR: regenerate every hour

export default async function sitemap(): Promise<MetadataRoute.Sitemap> {
  const raw = await fetchRoutes();
  return filterRoutes(raw).map((r) => ({
    url: r.url,
    lastModified: new Date(r.lastmod),
    priority: r.priority,
  }));
}

SEO impact: ISR keeps the sitemap within one revalidation window of the published state. Combined with crawl budget management for headless, this limits how long Googlebot can be directed at stale or removed URLs.

Validation:

curl -H "Accept: application/xml" https://yourdomain.com/sitemap.xml | xmllint --format -

Nuxt 3 — Nitro Server Route

Nuxt’s Nitro runtime allows you to register a server route at /sitemap.xml without any additional plugin.

// server/routes/sitemap.xml.ts
import { defineEventHandler, setResponseHeader } from 'h3';
import { fetchRoutes } from '~/lib/fetchRoutes';
import { filterRoutes } from '~/lib/filterRoutes';
import { buildSitemapXml } from '~/lib/buildSitemap';

export default defineEventHandler(async (event) => {
  setResponseHeader(event, 'Content-Type', 'application/xml; charset=utf-8');
  setResponseHeader(event, 'Cache-Control', 's-maxage=3600, stale-while-revalidate=86400');
  const raw = await fetchRoutes();
  return buildSitemapXml(filterRoutes(raw));
});

SEO impact: Nitro renders the sitemap entirely on the server, and the s-maxage directive lets the CDN absorb bot traffic spikes without re-running the handler for every request.

Validation:

curl -I https://yourdomain.com/sitemap.xml
# Expect: HTTP/2 200 · Content-Type: application/xml; charset=utf-8

Astro — Sitemap Integration with Content Collections

For Astro sites using Content Collections, the official @astrojs/sitemap integration generates the sitemap at build time from the pages Astro builds, including those rendered from collection entries. It requires the site option to be set. A filter callback excludes draft and preview routes.

// astro.config.mjs
import { defineConfig } from 'astro/config';
import sitemap from '@astrojs/sitemap';

export default defineConfig({
  site: 'https://yoursite.com',
  integrations: [
    sitemap({
      filter: (page) => !page.includes('/draft/') && !page.includes('/preview/'),
      customPages: [], // add any server-rendered paths here
    }),
  ],
});

SEO impact: Automatically excludes non-indexable routes during build. Prevents index bloat from draft, preview, or staging URLs — a common cause of indexation limits in decoupled sites.

Validation:

npm run build
ls dist/sitemap*.xml      # expect sitemap-index.xml + sitemap-0.xml
xmllint --noout dist/sitemap-0.xml && echo "valid XML"

SvelteKit — Server Endpoint

SvelteKit uses file-based routing for server endpoints. A GET handler at src/routes/sitemap.xml/+server.ts serves the sitemap with full control over caching headers.

// src/routes/sitemap.xml/+server.ts
import type { RequestHandler } from './$types';
import { fetchRoutes } from '$lib/fetchRoutes';
import { filterRoutes } from '$lib/filterRoutes';
import { buildSitemapXml } from '$lib/buildSitemap';

export const GET: RequestHandler = async () => {
  const raw = await fetchRoutes();
  const xml = buildSitemapXml(filterRoutes(raw));
  return new Response(xml, {
    headers: {
      'Content-Type': 'application/xml; charset=utf-8',
      'Cache-Control': 's-maxage=3600, stale-while-revalidate=86400',
    },
  });
};

SEO impact: Uses the Fetch API Response directly, which adapts to SvelteKit’s adapter-node or adapter-cloudflare deployment targets without configuration changes.

URL Canonicalization Inside the Sitemap

Every <loc> value must be the exact canonical URL for that page. Before serialization, run the same normalization that canonical URL enforcement applies at the edge:

Strip query strings and UTM parameters
Enforce a trailing slash (or enforce no trailing slash — pick one and be consistent)
Lowercase the path component
Resolve relative URLs to absolute with SITE_URL

Mismatches between <loc> values and the rel="canonical" tag in the page <head> send conflicting signals and leave search engines to choose which URL to index. Your sitemap and your canonical tags must point to the same string.
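
The normalization rules listed above can be sketched with the standard URL class. This is one possible shape, assuming the trailing-slash convention used by fetchRoutes; the function name canonicalizeLoc is hypothetical, and the helper would need an exception for file-like paths such as /feed.xml if your site has them.

```typescript
// Hypothetical helper: applies the four normalization rules to one URL.
// Assumption: the site enforces a trailing slash on every page URL.
export function canonicalizeLoc(input: string, siteUrl: string): string {
  // Resolves relative paths against siteUrl; absolute inputs pass through.
  const url = new URL(input, siteUrl);
  url.search = ''; // strip query strings, including UTM parameters
  url.hash = '';   // fragments never belong in a sitemap
  let path = url.pathname.toLowerCase();
  if (path.charAt(path.length - 1) !== '/') path += '/';
  url.pathname = path;
  return url.toString();
}
```

For instance, canonicalizeLoc('/Blog/Hello?utm_source=x', 'https://example.com') returns https://example.com/blog/hello/. Using the same helper when rendering the rel="canonical" tag is one way to keep both values identical.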

HTTP Headers & CDN Directives Reference

Header	Required Value	Rationale
`Content-Type`	`application/xml; charset=utf-8`	Required for valid XML MIME type recognition
`Cache-Control`	`s-maxage=3600, stale-while-revalidate=86400`	Allows CDN to cache for 1 hour, serve stale for 24 h during revalidation
`X-Content-Type-Options`	`nosniff`	Prevents MIME sniffing on the XML response
`Vary`	`Accept-Encoding`	Keeps Gzip/Brotli and uncompressed variants separate in CDN caches so each client receives an encoding it negotiated
`ETag`	Generated hash of sitemap content	Enables conditional GET so crawlers skip unchanged sitemaps

For Vercel, add these to the headers array in vercel.json (Cloudflare Pages uses a plain-text _headers file instead, with the same header names and values):

{
  "headers": [
    {
      "source": "/sitemap(.*)\\.xml",
      "headers": [
        { "key": "Cache-Control", "value": "s-maxage=3600, stale-while-revalidate=86400" },
        { "key": "X-Content-Type-Options", "value": "nosniff" },
        { "key": "Content-Type", "value": "application/xml; charset=utf-8" }
      ]
    }
  ]
}

Understanding the full picture of edge caching behavior for SEO helps you set revalidation windows that match your content publication frequency.
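
The ETag row in the header table leaves the hashing strategy open. A minimal sketch, assuming a Node.js runtime and a SHA-256 digest truncated to 32 hex characters; both function names are hypothetical, and the route handler is expected to return 304 with no body when isNotModified is true.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical helper: derives a strong ETag from the serialized sitemap.
export function sitemapEtag(xml: string): string {
  const hash = createHash('sha256').update(xml, 'utf8').digest('hex').slice(0, 32);
  return `"${hash}"`; // ETag values are quoted strings
}

// Hypothetical helper: true when the If-None-Match request header
// already names the current ETag, so a 304 can be sent instead of the body.
export function isNotModified(ifNoneMatch: string | null, etag: string): boolean {
  if (!ifNoneMatch) return false;
  const candidates = ifNoneMatch
    .split(',')
    .map((v) => v.trim().replace(/^W\//, '')); // weak comparison ignores the W/ prefix
  return candidates.some((v) => v === etag || v === '*');
}
```

Because the hash depends only on the XML string, the ETag changes exactly when the sitemap content changes, regardless of when the handler ran.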

Sitemap Index Splitting for Large Sites

When a single sitemap would exceed 50,000 URLs or 50 MB uncompressed (the sitemaps.org protocol limits), split the output into child sitemaps referenced by a sitemap index:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://yoursite.com/sitemap-articles.xml</loc>
    <lastmod>2026-06-22T00:00:00Z</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://yoursite.com/sitemap-categories.xml</loc>
    <lastmod>2026-06-22T00:00:00Z</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://yoursite.com/sitemap-authors.xml</loc>
    <lastmod>2026-06-22T00:00:00Z</lastmod>
  </sitemap>
</sitemapindex>

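Submit sitemap_index.xml (not the child sitemaps) to Google Search Console. The index file has its own limits under the sitemaps.org protocol: it may reference at most 50,000 child sitemaps and must stay under 50 MB uncompressed. Each child file is in turn capped at 50,000 URLs.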
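
Splitting can be done mechanically once the manifest exists. The sketch below assumes fixed-size chunking rather than the per-content-type segmentation shown in the XML above; chunkRoutes, buildSitemapIndexXml, and the SitemapRef type are hypothetical names, and the loc values are expected to be XML-safe already.

```typescript
export interface SitemapRef {
  loc: string;     // absolute URL of a child sitemap
  lastmod: string; // W3C datetime of that child's latest change
}

// Splits a route list into chunks that respect the 50,000-URL cap.
export function chunkRoutes<T>(routes: T[], size: number = 50000): T[][] {
  if (size < 1) throw new Error('chunk size must be at least 1');
  const chunks: T[][] = [];
  for (let i = 0; i < routes.length; i += size) {
    chunks.push(routes.slice(i, i + size));
  }
  return chunks;
}

// Serializes the index document that points at each child sitemap.
export function buildSitemapIndexXml(children: SitemapRef[]): string {
  const nodes = children
    .map((c) => `  <sitemap>\n    <loc>${c.loc}</loc>\n    <lastmod>${c.lastmod}</lastmod>\n  </sitemap>`)
    .join('\n');
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    nodes,
    '</sitemapindex>',
  ].join('\n');
}
```

Each chunk would then be serialized with buildSitemapXml and served under its own URL, with the index listing those URLs. The URL cap alone does not guarantee the 50 MB limit, so very long URLs may call for a smaller chunk size.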

Validation Protocol

Run this sequence before every deployment and after any CMS schema change:

# 1. Confirm HTTP 200 and correct Content-Type
curl -sI https://yourdomain.com/sitemap.xml | grep -iE '^HTTP|^content-type'

# 2. Validate XML is well-formed (requires libxml2)
curl -s https://yourdomain.com/sitemap.xml -o /tmp/sitemap.xml
xmllint --noout /tmp/sitemap.xml && echo "XML is well-formed"

# 3. Count URLs and compare to CMS published count
grep -c '<loc>' /tmp/sitemap.xml

# 4. Check for stale or draft URLs leaking in (only <loc> lines, so the ?> in the XML declaration is ignored)
grep '<loc>' /tmp/sitemap.xml | grep -E '/draft/|/preview/|\?' && echo "LEAK FOUND" || echo "Clean"

# 5. Verify robots.txt references the sitemap
curl -s https://yourdomain.com/robots.txt | grep Sitemap

# 6. Submit to Google Search Console via API (the access token must carry the https://www.googleapis.com/auth/webmasters scope)
curl -X PUT \
  "https://www.googleapis.com/webmasters/v3/sites/https%3A%2F%2Fyourdomain.com%2F/sitemaps/https%3A%2F%2Fyourdomain.com%2Fsitemap.xml" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)"

After submission, monitor the Sitemaps diagnostic panel in Search Console for URL count discrepancies and any XML parsing errors the crawler reports.
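
Steps 3 and 4 of the sequence can also run as a unit test or CI check against the serialized string. A rough sketch, assuming the sitemap fits in memory and uses plain <loc> elements without a namespace prefix; auditSitemap and SitemapAudit are hypothetical names.

```typescript
export interface SitemapAudit {
  urlCount: number; // number of <loc> entries found
  leaks: string[];  // URLs that look like drafts, previews, or query-string variants
}

// Hypothetical helper: counts URLs and flags entries that should have been filtered.
export function auditSitemap(xml: string): SitemapAudit {
  const locs: string[] = [];
  const re = /<loc>([^<]*)<\/loc>/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(xml)) !== null) {
    locs.push(m[1].trim());
  }
  // Same patterns as the shell check, applied to <loc> values only.
  const leakPattern = /\/draft\/|\/preview\/|\?/;
  return {
    urlCount: locs.length,
    leaks: locs.filter((u) => leakPattern.test(u)),
  };
}
```

A deployment gate could fail the build when leaks is non-empty or when urlCount drifts from the CMS published count by more than an agreed tolerance.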

Troubleshooting

Symptom	Root Cause	Fix
Sitemap URL count lower than CMS count	Pagination loop exits early	Add `hasMore` logic with a `limit` sentinel; log total from API header
`lastmod` values are all identical	Timestamp is taken at build or serialization time, not from the content's update time	Use `item.updatedAt` from the API, not `new Date()` at serialization time
Draft URLs appear in sitemap	Filter runs after serialization	Move `filterRoutes()` call before `buildSitemapXml()` in the pipeline
Crawler reports XML parse error	String interpolation introduced unescaped `&` in URLs	Run URLs through `encodeURI()` before placing in `<loc>`, then escape `&` as `&amp;`
CDN serves stale sitemap after CMS publish	Cache not purged on webhook	Add a webhook handler that calls the CDN purge API for `/sitemap*.xml` on every publish event
`Content-Type: text/html` on sitemap endpoint	Framework default MIME type	Explicitly set `Content-Type: application/xml; charset=utf-8` in the route handler
Sitemap not found in robots.txt	Robots generated separately from sitemap config	Generate both from the same environment variable: `Sitemap: ${SITE_URL}/sitemap.xml`

Frequently Asked Questions

Should sitemaps be generated at build time or runtime in headless setups?

Build time suits static sites with infrequent content changes. Runtime generation via ISR or a serverless handler is required for high-velocity CMS environments where new content must be discoverable within minutes of publication, not hours. A mixed approach — ISR with a short revalidation window — works well for most content teams.

How do I handle sitemap index splitting for large headless sites?

Implement a sitemap index file (sitemap_index.xml) that references segmented child sitemaps by content type or section. Cap each child file at 50,000 URLs or 50 MB uncompressed to comply with the sitemaps.org protocol. Submit only the index URL to Search Console.

Does headless architecture require manual robots.txt updates for sitemaps?

No. Generate robots.txt dynamically using the same serverless route or framework handler as your sitemap. Inject the correct sitemap URL from an environment variable — Sitemap: ${SITE_URL}/sitemap.xml — so it updates automatically across staging and production without manual edits.
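
A framework-neutral sketch of that idea follows, assuming an allow-all crawl policy; buildRobotsTxt is a hypothetical name, and the siteUrl argument is expected to come from the SITE_URL environment variable.

```typescript
// Hypothetical helper: builds robots.txt from the same origin the sitemap uses.
// Assumption: every path is crawlable; add Disallow rules as your site requires.
export function buildRobotsTxt(siteUrl: string): string {
  const origin = siteUrl.replace(/\/+$/, ''); // tolerate an accidental trailing slash
  return [
    'User-agent: *',
    'Allow: /',
    '',
    `Sitemap: ${origin}/sitemap.xml`,
    '',
  ].join('\n');
}
```

Returning this string from a route handler with Content-Type: text/plain keeps the Sitemap line in step with whichever origin the current deployment is configured for.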

Related

Dynamic Route Generation — resolving parameterized slugs to concrete paths before manifest assembly
Slug Normalization Strategies — enforcing consistent URL shapes that match sitemap <loc> values
Canonical URL Enforcement — aligning edge-level canonical headers with sitemap declarations
Pagination Handling in Headless — deciding which paginated URLs belong in the sitemap
Indexation Limits for Decoupled Sites — understanding why sitemap hygiene directly affects crawl quota allocation