XML Sitemap Generation for Headless

In a headless deployment, your CMS and your rendering layer have no shared filesystem, which means the crawler discovery contract that a monolithic CMS manages automatically becomes your engineering responsibility. This page walks through the full pipeline: fetching routes from a content API, serializing them to a valid XML sitemap, applying CDN caching rules, and validating the result before submission to Search Console.

Prerequisites

Before implementing, confirm the following are in place:

  • Framework version: Next.js 13.3+ (App Router, required for app/sitemap.ts), Nuxt 3.x, Astro 3+, or SvelteKit 1.0+
  • Environment variables: SITE_URL (absolute origin, no trailing slash), CMS_API_URL, and any bearer token for private APIs
  • Tooling: curl, xmllint (from libxml2), and access to Google Search Console for your property
  • Slug normalization: all CMS slugs must already be lowercased, hyphenated, and free of trailing slashes before route extraction begins
  • Dynamic route generation: parameterized routes must be resolved to concrete paths before sitemap serialization
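
The slug normalization prerequisite can be sketched as a small pure function. This is a minimal illustration, assuming that "hyphenated" means spaces and underscores become hyphens; normalizeSlug is a hypothetical helper name, not part of any CMS SDK.

```typescript
// Lowercase, hyphenate, and strip leading/trailing slashes from a raw CMS slug.
// Assumption: spaces and underscores are the only separators that need replacing.
function normalizeSlug(raw: string): string {
  return raw
    .trim()
    .toLowerCase()
    .replace(/[\s_]+/g, '-') // spaces and underscores become hyphens
    .replace(/-{2,}/g, '-') // collapse repeated hyphens
    .replace(/^\/+|\/+$/g, ''); // drop leading and trailing slashes
}

console.log(normalizeSlug(' Blog/My_First Post/ ')); // blog/my-first-post
```

Running this once at ingestion time means every later step (filtering, serialization, canonical tags) sees the same string.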

How the Pipeline Fits Together

The diagram below shows how a CMS publish event propagates through the sitemap pipeline to a crawler discovery event.

[Figure: XML Sitemap Generation Pipeline. A publish event in the headless CMS fires a webhook that triggers route manifest generation (url, lastmod, priority). A filter layer removes drafts and query-parameter URLs, the XML builder serializes sitemap.xml or a sitemap index, and the CDN edge caches the result with s-maxage=3600 and serves it to Googlebot for discovery.]

Step-by-Step Implementation Workflow

Step 1: Build a Route Manifest from the CMS API

Fetch all published entries in a single paginated loop. Write a flat array of objects before any XML serialization:

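// lib/fetchRoutes.ts
interface RouteEntry {
  url: string;
  lastmod: string;
  priority: number;
}

// Page size requested from the CMS; also used to detect the last page.
const PAGE_SIZE = 200;

export async function fetchRoutes(): Promise<RouteEntry[]> {
  const base = process.env.CMS_API_URL!;
  const site = process.env.SITE_URL!;
  const entries: RouteEntry[] = [];
  let page = 1;
  let hasMore = true;

  while (hasMore) {
    const res = await fetch(`${base}/entries?status=published&page=${page}&limit=${PAGE_SIZE}`);
    // Fail loudly rather than emit a truncated sitemap when the CMS returns an error.
    if (!res.ok) {
      throw new Error(`CMS request for page ${page} failed with status ${res.status}`);
    }
    const data: { items: Array<{ slug: string; updatedAt: string; type: string }> } = await res.json();
    for (const item of data.items) {
      entries.push({
        url: `${site}/${item.slug}/`,
        // Use the content update time from the CMS, not the build time.
        lastmod: new Date(item.updatedAt).toISOString(),
        priority: item.type === 'article' ? 0.8 : 0.5,
      });
    }
    // A short page means the last page has been reached.
    hasMore = data.items.length === PAGE_SIZE;
    page++;
  }
  return entries;
}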

Validation: compare entries.length to the CMS published count via the API dashboard before proceeding.
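
The prerequisites mention a bearer token for private APIs, which the loop above does not send. The following is a minimal sketch of an authenticated page request, assuming the CMS accepts a standard Authorization: Bearer header. buildEntriesRequest and its token parameter are hypothetical names chosen for illustration.

```typescript
interface CmsRequest {
  url: string;
  headers: Record<string, string>;
}

// Builds the URL and headers for one page of published entries.
// The query parameters mirror the ones used in fetchRoutes.
function buildEntriesRequest(baseUrl: string, page: number, token?: string): CmsRequest {
  const params = new URLSearchParams({
    status: 'published',
    page: String(page),
    limit: '200',
  });
  const headers: Record<string, string> = { Accept: 'application/json' };
  if (token) {
    // Only attach credentials when a token is configured.
    headers['Authorization'] = `Bearer ${token}`;
  }
  return { url: `${baseUrl}/entries?${params.toString()}`, headers };
}

const req = buildEntriesRequest('https://cms.example.com/api', 3, 'secret-token');
console.log(req.url);
```

Passing the result to fetch(req.url, { headers: req.headers }) inside the pagination loop keeps the token out of the URL and out of access logs.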


Step 2: Filter Non-Indexable Routes

Apply the filter layer shown in the diagram. Strip draft slugs, parameterized paths, and pagination variants before serialization:

// lib/filterRoutes.ts
export function filterRoutes(routes: Array<{ url: string; lastmod: string; priority: number }>) {
  const EXCLUDE = [/\/draft\//, /[?&]/, /\/page\/\d+\/?$/, /\/preview\//];
  return routes.filter((r) => !EXCLUDE.some((re) => re.test(r.url)));
}

This step is required when pagination in headless uses numeric URL suffixes (for example /blog/page/2/) that are not meant to be indexed independently.
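
To see what the exclusion patterns actually remove, the sketch below runs the same filter over a hypothetical sample manifest. The example.com URLs are placeholders standing in for SITE_URL.

```typescript
type Route = { url: string; lastmod: string; priority: number };

// Same exclusion patterns as lib/filterRoutes.ts.
const EXCLUDE: RegExp[] = [/\/draft\//, /[?&]/, /\/page\/\d+\/?$/, /\/preview\//];

function filterRoutes(routes: Route[]): Route[] {
  return routes.filter((r) => !EXCLUDE.some((re) => re.test(r.url)));
}

// Hypothetical sample data covering each exclusion rule once.
const lastmod = '2026-06-22T00:00:00.000Z';
const sample: Route[] = [
  { url: 'https://example.com/blog/hello-world/', lastmod, priority: 0.8 },
  { url: 'https://example.com/draft/unfinished/', lastmod, priority: 0.5 },
  { url: 'https://example.com/blog/page/2/', lastmod, priority: 0.5 },
  { url: 'https://example.com/blog/hello-world/?utm_source=x', lastmod, priority: 0.8 },
  { url: 'https://example.com/preview/abc/', lastmod, priority: 0.5 },
];

const kept = filterRoutes(sample).map((r) => r.url);
console.log(kept); // only the first URL survives the filter
```

Note that the /[?&]/ pattern also rejects any URL containing a literal ampersand, so it assumes clean paths without query strings.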


Step 3: Serialize to XML

Once the manifest is clean, serialize it. The sitemaps protocol requires an XML declaration, the urlset namespace, and well-formed <url> nodes.

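// lib/buildSitemap.ts

// Escape the five XML special characters so a stray & or < in a URL
// cannot produce a malformed document.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

export function buildSitemapXml(routes: Array<{ url: string; lastmod: string; priority: number }>) {
  // One <url> node per route, indented for readability.
  const nodes = routes
    .map(
      (r) =>
        `  <url>\n    <loc>${escapeXml(r.url)}</loc>\n    <lastmod>${r.lastmod}</lastmod>\n    <priority>${r.priority}</priority>\n  </url>`
    )
    .join('\n');
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    nodes,
    '</urlset>',
  ].join('\n');
}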

Framework-Specific Builders

Next.js App Router: Metadata API

Next.js 13.3+ supports an app/sitemap.ts file whose default export the framework calls to generate /sitemap.xml, either per request or on a revalidation interval depending on the route's caching configuration. This avoids running a full rebuild when new content is published.

// app/sitemap.ts
import type { MetadataRoute } from 'next';
import { fetchRoutes } from '@/lib/fetchRoutes';
import { filterRoutes } from '@/lib/filterRoutes';

export const revalidate = 3600; // ISR: regenerate every hour

export default async function sitemap(): Promise<MetadataRoute.Sitemap> {
  const raw = await fetchRoutes();
  return filterRoutes(raw).map((r) => ({
    url: r.url,
    lastModified: new Date(r.lastmod),
    priority: r.priority,
  }));
}

SEO impact: ISR keeps the sitemap within one revalidation window of the CMS publication state. Combined with crawl budget management for headless, this limits how long Googlebot can be directed at stale or removed URLs.

Validation:

curl -H "Accept: application/xml" https://yourdomain.com/sitemap.xml | xmllint --format -

Nuxt 3: Nitro Server Route

Nuxt's Nitro runtime allows you to register a server route at /sitemap.xml without any additional plugin.

// server/routes/sitemap.xml.ts
import { defineEventHandler, setResponseHeader } from 'h3';
import { fetchRoutes } from '~/lib/fetchRoutes';
import { filterRoutes } from '~/lib/filterRoutes';
import { buildSitemapXml } from '~/lib/buildSitemap';

export default defineEventHandler(async (event) => {
  setResponseHeader(event, 'Content-Type', 'application/xml; charset=utf-8');
  setResponseHeader(event, 'Cache-Control', 's-maxage=3600, stale-while-revalidate=86400');
  const raw = await fetchRoutes();
  return buildSitemapXml(filterRoutes(raw));
});

SEO impact: Nitro generates the sitemap entirely on the server, so crawlers receive plain XML with no client-side JavaScript to execute, and the CDN cache headers absorb bot traffic spikes.

Validation:

curl -I https://yourdomain.com/sitemap.xml
# Expect: HTTP/2 200 and Content-Type: application/xml; charset=utf-8

Astro: Sitemap Integration with Content Collections

For Astro sites using Content Collections, the official @astrojs/sitemap integration generates the sitemap at build time from the pages Astro builds, including those rendered from your collection entries. A filter callback excludes draft and preview routes.

// astro.config.mjs
import { defineConfig } from 'astro/config';
import sitemap from '@astrojs/sitemap';

export default defineConfig({
  site: 'https://yoursite.com',
  integrations: [
    sitemap({
      filter: (page) => !page.includes('/draft/') && !page.includes('/preview/'),
      customPages: [], // add any server-rendered paths here
    }),
  ],
});

SEO impact: Automatically excludes non-indexable routes during build. Prevents index bloat from draft, preview, or staging URLs, a common cause of indexation limits in decoupled sites.

Validation:

npm run build
ls dist/sitemap*.xml      # expect sitemap-index.xml + sitemap-0.xml
xmllint --noout dist/sitemap-0.xml && echo "valid XML"

SvelteKit: Server Endpoint

SvelteKit uses file-based routing for server endpoints. A GET handler at src/routes/sitemap.xml/+server.ts serves the sitemap with full control over caching headers.

// src/routes/sitemap.xml/+server.ts
import type { RequestHandler } from './$types';
import { fetchRoutes } from '$lib/fetchRoutes';
import { filterRoutes } from '$lib/filterRoutes';
import { buildSitemapXml } from '$lib/buildSitemap';

export const GET: RequestHandler = async () => {
  const raw = await fetchRoutes();
  const xml = buildSitemapXml(filterRoutes(raw));
  return new Response(xml, {
    headers: {
      'Content-Type': 'application/xml; charset=utf-8',
      'Cache-Control': 's-maxage=3600, stale-while-revalidate=86400',
    },
  });
};

SEO impact: Uses the Fetch API Response directly, which adapts to SvelteKit's adapter-node or adapter-cloudflare deployment targets without configuration changes.


URL Canonicalization Inside the Sitemap

Every <loc> value must be the exact canonical URL for that page. Before serialization, run the same normalization that canonical URL enforcement applies at the edge:

  1. Strip query strings and UTM parameters
  2. Enforce a trailing slash (or enforce no trailing slash; pick one and be consistent)
  3. Lowercase the path component
  4. Resolve relative URLs to absolute with SITE_URL

Mismatches between <loc> values and the rel="canonical" tag in the page <head> send search engines conflicting signals about which URL to index. Your sitemap and your canonical tags must contain the identical string.
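
The four normalization steps can be sketched with the standard URL class. This is an illustration only: canonicalizeUrl is a hypothetical helper, it picks the trailing-slash convention used by fetchRoutes, and it assumes page URLs without file extensions.

```typescript
// Normalizes one URL following the four steps listed in this section.
// `siteUrl` plays the role of SITE_URL (absolute origin, no trailing slash).
function canonicalizeUrl(input: string, siteUrl: string): string {
  // Step 4: resolve relative URLs against the site origin.
  const u = new URL(input, `${siteUrl}/`);
  // Step 1: strip the query string (UTM parameters included) and any fragment.
  u.search = '';
  u.hash = '';
  // Step 3: lowercase the path component.
  let path = u.pathname.toLowerCase();
  // Step 2: enforce a trailing slash (the convention chosen for this sketch).
  if (path.slice(-1) !== '/') {
    path += '/';
  }
  u.pathname = path;
  return u.toString();
}

console.log(canonicalizeUrl('/Blog/Hello-World?utm_source=x', 'https://example.com'));
// https://example.com/blog/hello-world/
```

Sharing one such function between the sitemap builder and the code that renders rel="canonical" is one way to keep the two strings from drifting apart.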


HTTP Headers & CDN Directives Reference

Header | Required Value | Rationale
Content-Type | application/xml; charset=utf-8 | Required for valid XML MIME type recognition
Cache-Control | s-maxage=3600, stale-while-revalidate=86400 | Allows CDN to cache for 1 hour, serve stale for 24 h during revalidation
X-Content-Type-Options | nosniff | Prevents MIME sniffing on the XML response
Vary | Accept-Encoding | Enables Gzip/Brotli negotiation at the CDN layer
ETag | Generated hash of sitemap content | Enables conditional GET so crawlers skip unchanged sitemaps

On Cloudflare Pages, set the same values in a plain-text _headers file. On Vercel, add them to the headers config in vercel.json:

{
  "headers": [
    {
      "source": "/sitemap(.*)\\.xml",
      "headers": [
        { "key": "Cache-Control", "value": "s-maxage=3600, stale-while-revalidate=86400" },
        { "key": "X-Content-Type-Options", "value": "nosniff" },
        { "key": "Content-Type", "value": "application/xml; charset=utf-8" }
      ]
    }
  ]
}

Understanding the full picture of edge caching behavior for SEO helps you set revalidation windows that match your content publication frequency.
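
The ETag row in the table above implies hashing the sitemap body and answering conditional requests with 304. A minimal, framework-agnostic sketch follows. It uses a small djb2-style string hash only to stay dependency-free (an assumption of this example; a production handler would more likely use SHA-256), and respondWithEtag is a hypothetical helper name.

```typescript
// djb2-style string hash: non-cryptographic, used here only for illustration.
function hashContent(text: string): string {
  let hash = 5381;
  for (let i = 0; i < text.length; i++) {
    // hash * 33 + charCode, truncated to a 32-bit integer
    hash = ((hash << 5) + hash + text.charCodeAt(i)) | 0;
  }
  return (hash >>> 0).toString(16);
}

interface SitemapResponse {
  status: 200 | 304;
  headers: Record<string, string>;
  body: string | null;
}

// `ifNoneMatch` is the value of the request's If-None-Match header, if any.
// This sketch handles a single exact validator, not lists or weak validators.
function respondWithEtag(xml: string, ifNoneMatch: string | null): SitemapResponse {
  const etag = `"${hashContent(xml)}"`;
  const headers: Record<string, string> = {
    ETag: etag,
    'Cache-Control': 's-maxage=3600, stale-while-revalidate=86400',
  };
  if (ifNoneMatch === etag) {
    // Content unchanged since the client's last fetch: send no body.
    return { status: 304, headers, body: null };
  }
  headers['Content-Type'] = 'application/xml; charset=utf-8';
  return { status: 200, headers, body: xml };
}

const firstFetch = respondWithEtag('<urlset/>', null);
const repeatFetch = respondWithEtag('<urlset/>', firstFetch.headers['ETag'] ?? null);
console.log(firstFetch.status, repeatFetch.status); // 200 304
```

Many CDNs and frameworks generate ETags on their own, so it is worth checking the response headers before adding this logic to a route handler.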


Sitemap Index Splitting for Large Sites

When a single sitemap file would exceed 50,000 URLs or 50 MB uncompressed (the sitemaps.org protocol limits), split the output into child sitemaps referenced from a sitemap index:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://yoursite.com/sitemap-articles.xml</loc>
    <lastmod>2026-06-22T00:00:00Z</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://yoursite.com/sitemap-categories.xml</loc>
    <lastmod>2026-06-22T00:00:00Z</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://yoursite.com/sitemap-authors.xml</loc>
    <lastmod>2026-06-22T00:00:00Z</lastmod>
  </sitemap>
</sitemapindex>

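Submit sitemap_index.xml (not the child sitemaps) to Google Search Console. Each referenced child file is capped at 50,000 URLs. The index file has limits of its own under the sitemaps.org protocol: it may list at most 50,000 sitemaps and must stay under 50 MB uncompressed.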
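
A minimal sketch of the splitting logic, assuming the route manifest from Step 1. The child naming scheme (sitemap-0.xml, sitemap-1.xml, and so on) and both function names are assumptions of this example; segmenting by content type, as in the XML above, would work equally well.

```typescript
interface RouteEntry {
  url: string;
  lastmod: string;
  priority: number;
}

const MAX_URLS_PER_SITEMAP = 50000; // sitemaps.org per-file URL limit

// Split a manifest into chunks no larger than `size`.
function chunkRoutes(routes: RouteEntry[], size: number = MAX_URLS_PER_SITEMAP): RouteEntry[][] {
  if (size < 1) {
    throw new RangeError('chunk size must be at least 1');
  }
  const chunks: RouteEntry[][] = [];
  for (let i = 0; i < routes.length; i += size) {
    chunks.push(routes.slice(i, i + size));
  }
  return chunks;
}

// Build the index XML that points at one child sitemap per chunk.
function buildSitemapIndexXml(siteUrl: string, chunks: RouteEntry[][]): string {
  const nodes = chunks.map((chunk, i) => {
    // ISO 8601 UTC strings sort chronologically, so the last one is the newest.
    const newest = chunk.map((r) => r.lastmod).sort().pop();
    const lastmod = newest ? `\n    <lastmod>${newest}</lastmod>` : '';
    return `  <sitemap>\n    <loc>${siteUrl}/sitemap-${i}.xml</loc>${lastmod}\n  </sitemap>`;
  });
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...nodes,
    '</sitemapindex>',
  ].join('\n');
}
```

Each chunk can then be passed to buildSitemapXml and served at the matching child path.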


Validation Protocol

Run this sequence before every deployment and after any CMS schema change:

# 1. Confirm HTTP 200 and correct Content-Type
curl -sI https://yourdomain.com/sitemap.xml | grep -iE '^HTTP|content-type'

# 2. Validate XML is well-formed (requires libxml2)
curl -s https://yourdomain.com/sitemap.xml -o /tmp/sitemap.xml
xmllint --noout /tmp/sitemap.xml && echo "XML is well-formed"

# 3. Count URLs and compare to CMS published count
grep -c '<loc>' /tmp/sitemap.xml

# 4. Check for stale or draft URLs leaking in
grep -E '/draft/|/preview/|\?' /tmp/sitemap.xml && echo "LEAK FOUND" || echo "Clean"

# 5. Verify robots.txt references the sitemap
curl -s https://yourdomain.com/robots.txt | grep -i '^sitemap:'

# 6. Submit to Google Search Console via API (the access token must carry the Search Console "webmasters" OAuth scope)
curl -X PUT \
  "https://www.googleapis.com/webmasters/v3/sites/https%3A%2F%2Fyourdomain.com%2F/sitemaps/https%3A%2F%2Fyourdomain.com%2Fsitemap.xml" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)"

After submission, monitor the Sitemaps diagnostic panel in Search Console for URL count discrepancies and any XML parsing errors the crawler reports.
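
Steps 3 and 4 of the shell sequence can also run inside a test suite or CI job. The sketch below assumes the sitemap XML has already been fetched into a string. auditSitemap is a hypothetical helper, and its regex-based extraction is a shortcut that suits generated sitemaps, not a general XML parser.

```typescript
interface SitemapAudit {
  urlCount: number;
  leaks: string[];
}

// Counts <loc> entries and reports any that look like draft, preview,
// or query-string URLs (the same patterns as the grep in step 4).
function auditSitemap(xml: string): SitemapAudit {
  const locs: string[] = [];
  const locPattern = /<loc>([^<]*)<\/loc>/g;
  let match: RegExpExecArray | null;
  while ((match = locPattern.exec(xml)) !== null) {
    locs.push(match[1] ?? '');
  }
  const leakPattern = /\/draft\/|\/preview\/|\?/;
  return {
    urlCount: locs.length,
    leaks: locs.filter((u) => leakPattern.test(u)),
  };
}

// Hypothetical sample document with one leaked draft URL.
const sampleXml = [
  '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
  '  <url><loc>https://example.com/blog/a/</loc></url>',
  '  <url><loc>https://example.com/draft/b/</loc></url>',
  '  <url><loc>https://example.com/blog/c/</loc></url>',
  '</urlset>',
].join('\n');

const audit = auditSitemap(sampleXml);
console.log(audit.urlCount, audit.leaks);
```

Failing the deployment when audit.leaks is non-empty, or when audit.urlCount diverges from the CMS published count, turns the manual checklist into a gate.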


Troubleshooting

Symptom | Root Cause | Fix
Sitemap URL count lower than CMS count | Pagination loop exits early | Add hasMore logic with a limit sentinel; log total from API header
lastmod values are all identical | CMS API returns build time, not content update time | Use item.updatedAt from the API, not new Date() at serialization time
Draft URLs appear in sitemap | Filter runs after serialization | Move filterRoutes() call before buildSitemapXml() in the pipeline
Crawler reports XML parse error | String interpolation introduced unescaped & in URLs | Run URLs through encodeURI() before placing in <loc>; escape & as &amp;
CDN serves stale sitemap after CMS publish | Cache not purged on webhook | Add a webhook handler that calls the CDN purge API for /sitemap*.xml on every publish event
Content-Type: text/html on sitemap endpoint | Framework default MIME type | Explicitly set Content-Type: application/xml; charset=utf-8 in the route handler
Sitemap not found in robots.txt | Robots generated separately from sitemap config | Generate both from the same environment variable: Sitemap: ${SITE_URL}/sitemap.xml

Frequently Asked Questions

Should sitemaps be generated at build time or runtime in headless setups?

Build time suits static sites with infrequent content changes. Runtime generation via ISR or a serverless handler is required for high-velocity CMS environments where new content must be discoverable within minutes of publication, not hours. A mixed approach (ISR with a short revalidation window) works well for most content teams.

How do I handle sitemap index splitting for large headless sites?

Implement a sitemap index file (sitemap_index.xml) that references segmented child sitemaps by content type or section. Cap each child file at 50,000 URLs or 50 MB uncompressed to comply with the sitemaps.org protocol. Submit only the index URL to Search Console.

Does headless architecture require manual robots.txt updates for sitemaps?

No. Generate robots.txt dynamically using the same serverless route or framework handler as your sitemap. Inject the correct sitemap URL from an environment variable (Sitemap: ${SITE_URL}/sitemap.xml) so it updates automatically across staging and production without manual edits.
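
A minimal sketch of such a generator, assuming an allow-all crawl policy. buildRobotsTxt is a hypothetical helper; its siteUrl argument stands in for SITE_URL.

```typescript
// Builds a robots.txt body whose Sitemap line is derived from the site origin.
// Assumption: every path is crawlable; adjust the rules for real deployments.
function buildRobotsTxt(siteUrl: string): string {
  const origin = siteUrl.replace(/\/+$/, ''); // tolerate an accidental trailing slash
  return ['User-agent: *', 'Allow: /', '', `Sitemap: ${origin}/sitemap.xml`, ''].join('\n');
}

console.log(buildRobotsTxt('https://example.com'));
```

Returning this string from a route handler with Content-Type: text/plain keeps the robots.txt Sitemap line and the sitemap endpoint tied to the same origin in every environment.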

