XML Sitemap Generation for Headless
In a headless deployment, your CMS and your rendering layer have no shared filesystem, which means the crawler discovery contract that a monolithic CMS manages automatically becomes your engineering responsibility. This page walks through the full pipeline: fetching routes from a content API, serializing them to a valid XML sitemap, applying CDN caching rules, and validating the result before submission to Search Console.
Prerequisites
Before implementing, confirm the following are in place:
- Framework version: Next.js 13.3+ (App Router, the first release with the sitemap.ts file convention), Nuxt 3.x, Astro 3+, or SvelteKit
- Environment variables: SITE_URL (absolute origin, no trailing slash), CMS_API_URL, and any bearer token for private APIs
- Tooling: curl, xmllint (from libxml2), and access to Google Search Console for your property
- Slug normalization: all CMS slugs must already be lowercased, hyphenated, and free of trailing slashes before route extraction begins
- Dynamic route generation: parameterized routes must be resolved to concrete paths before sitemap serialization
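The environment requirements above can be checked once at startup instead of failing halfway through a build. The following is a minimal sketch, not part of any framework: `readSitemapConfig` and the `CMS_API_TOKEN` variable name are hypothetical, and the check assumes SITE_URL should be a bare, lowercase origin.

```typescript
// Hypothetical helper: validates the environment values listed in the prerequisites.
// CMS_API_TOKEN is an assumed name for the optional bearer token.
export interface SitemapConfig {
  siteUrl: string;
  cmsApiUrl: string;
  cmsToken: string | undefined;
}

export function readSitemapConfig(env: Record<string, string | undefined>): SitemapConfig {
  const siteUrl = env.SITE_URL;
  const cmsApiUrl = env.CMS_API_URL;
  if (!siteUrl || !cmsApiUrl) {
    throw new Error('SITE_URL and CMS_API_URL must both be set');
  }
  // new URL() throws on relative or malformed input; .origin never carries
  // a path or trailing slash, so comparing against it rejects both.
  const parsed = new URL(siteUrl);
  if (siteUrl !== parsed.origin) {
    throw new Error(`SITE_URL must be a bare, lowercase origin, got "${siteUrl}"`);
  }
  return { siteUrl, cmsApiUrl, cmsToken: env.CMS_API_TOKEN };
}
```

In a Node-based build this would typically be called as `readSitemapConfig(process.env)`; passing the environment in as an argument keeps the function easy to test.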
How the Pipeline Fits Together
The diagram below shows how a CMS publish event propagates through the sitemap pipeline to a crawler discovery event.
Step-by-Step Implementation Workflow
Step 1: Build a Route Manifest from the CMS API
Fetch all published entries in a single paginated loop. Write a flat array of objects before any XML serialization:
// lib/fetchRoutes.ts
interface RouteEntry {
  url: string;
  lastmod: string;
  priority: number;
}
// Page size requested from the CMS; a shorter page signals the last one.
const PAGE_SIZE = 200;
export async function fetchRoutes(): Promise<RouteEntry[]> {
  const base = process.env.CMS_API_URL!;
  const site = process.env.SITE_URL!;
  const entries: RouteEntry[] = [];
  let page = 1;
  let hasMore = true;
  while (hasMore) {
    const res = await fetch(`${base}/entries?status=published&page=${page}&limit=${PAGE_SIZE}`);
    // Fail loudly rather than emit a truncated sitemap when the CMS errors out.
    if (!res.ok) {
      throw new Error(`CMS request failed on page ${page}: HTTP ${res.status}`);
    }
    const data: { items: Array<{ slug: string; updatedAt: string; type: string }> } = await res.json();
    for (const item of data.items) {
      entries.push({
        url: `${site}/${item.slug}/`,
        lastmod: new Date(item.updatedAt).toISOString(),
        priority: item.type === 'article' ? 0.8 : 0.5,
      });
    }
    hasMore = data.items.length === PAGE_SIZE;
    page++;
  }
  return entries;
}
Validation: compare entries.length to the CMS published count via the API dashboard before proceeding.
Step 2: Filter Non-Indexable Routes
Apply the filter layer shown in the diagram. Strip draft slugs, parameterized paths, and pagination variants before serialization:
// lib/filterRoutes.ts
export function filterRoutes(routes: Array<{ url: string; lastmod: string; priority: number }>) {
  const EXCLUDE = [/\/draft\//, /[?&]/, /\/page\/\d+\/?$/, /\/preview\//];
  return routes.filter((r) => !EXCLUDE.some((re) => re.test(r.url)));
}
This step is required when pagination in your headless setup uses numeric URL suffixes (for example /page/2/) that are not meant to be indexed independently.
Step 3: Serialize to XML
Once the manifest is clean, serialize it. The sitemaps protocol requires an XML declaration, the urlset namespace, and well-formed <url> nodes.
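// lib/buildSitemap.ts
// Escape the five XML special characters so a stray & or < in a URL cannot break the document.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}
export function buildSitemapXml(routes: Array<{ url: string; lastmod: string; priority: number }>) {
  const nodes = routes
    .map(
      (r) =>
        `  <url>\n    <loc>${escapeXml(r.url)}</loc>\n    <lastmod>${r.lastmod}</lastmod>\n    <priority>${r.priority}</priority>\n  </url>`
    )
    .join('\n');
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    nodes,
    '</urlset>',
  ].join('\n');
}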
Framework-Specific Builders
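Next.js App Router: Metadata API
Next.js 13.3+ supports an app/sitemap.ts file whose default export returns the sitemap entries. The framework treats it as a special route handler that is cached by default and can be regenerated on a revalidation interval, so newly published content is picked up without a full rebuild.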
// app/sitemap.ts
import type { MetadataRoute } from 'next';
import { fetchRoutes } from '@/lib/fetchRoutes';
import { filterRoutes } from '@/lib/filterRoutes';
export const revalidate = 3600; // ISR: regenerate every hour
export default async function sitemap(): Promise<MetadataRoute.Sitemap> {
  const raw = await fetchRoutes();
  return filterRoutes(raw).map((r) => ({
    url: r.url,
    lastModified: new Date(r.lastmod),
    priority: r.priority,
  }));
}
SEO impact: ISR keeps the sitemap within one revalidation window (one hour in this example) of the published state. Combined with crawl budget management for headless, this limits how long Googlebot can be pointed at stale or removed URLs.
Validation:
curl -H "Accept: application/xml" https://yourdomain.com/sitemap.xml | xmllint --format -
Nuxt 3: Nitro Server Route
Nuxt's Nitro runtime allows you to register a server route at /sitemap.xml without any additional plugin.
// server/routes/sitemap.xml.ts
import { defineEventHandler, setResponseHeader } from 'h3';
import { fetchRoutes } from '~/lib/fetchRoutes';
import { filterRoutes } from '~/lib/filterRoutes';
import { buildSitemapXml } from '~/lib/buildSitemap';
export default defineEventHandler(async (event) => {
  setResponseHeader(event, 'Content-Type', 'application/xml; charset=utf-8');
  setResponseHeader(event, 'Cache-Control', 's-maxage=3600, stale-while-revalidate=86400');
  const raw = await fetchRoutes();
  return buildSitemapXml(filterRoutes(raw));
});
SEO impact: The handler returns XML straight from the server with no client-side rendering involved. With the s-maxage directive above, a CDN can answer most crawler requests from cache, which keeps the endpoint responsive during bot traffic spikes.
Validation:
curl -I https://yourdomain.com/sitemap.xml
# Expect: HTTP/2 200 and Content-Type: application/xml; charset=utf-8
Astro: Sitemap Integration with Content Collections
For Astro sites using Content Collections, the official @astrojs/sitemap integration generates the sitemap at build time from the statically built pages, including those rendered from your collection entries. A filter callback excludes draft and preview routes.
// astro.config.mjs
import { defineConfig } from 'astro/config';
import sitemap from '@astrojs/sitemap';
export default defineConfig({
  site: 'https://yoursite.com',
  integrations: [
    sitemap({
      filter: (page) => !page.includes('/draft/') && !page.includes('/preview/'),
      customPages: [], // add any server-rendered paths here
    }),
  ],
});
SEO impact: Excludes routes matching the filter during build. This prevents index bloat from draft, preview, or staging URLs, a common cause of indexation limits in decoupled sites.
Validation:
npm run build
ls dist/sitemap*.xml # expect sitemap-index.xml + sitemap-0.xml
xmllint --noout dist/sitemap-0.xml && echo "valid XML"
SvelteKit: Server Endpoint
SvelteKit uses file-based routing for server endpoints. A GET handler at src/routes/sitemap.xml/+server.ts serves the sitemap with full control over caching headers.
// src/routes/sitemap.xml/+server.ts
import type { RequestHandler } from './$types';
import { fetchRoutes } from '$lib/fetchRoutes';
import { filterRoutes } from '$lib/filterRoutes';
import { buildSitemapXml } from '$lib/buildSitemap';
export const GET: RequestHandler = async () => {
  const raw = await fetchRoutes();
  const xml = buildSitemapXml(filterRoutes(raw));
  return new Response(xml, {
    headers: {
      'Content-Type': 'application/xml; charset=utf-8',
      'Cache-Control': 's-maxage=3600, stale-while-revalidate=86400',
    },
  });
};
SEO impact: Uses the Fetch API Response directly, which adapts to SvelteKit's adapter-node or adapter-cloudflare deployment targets without configuration changes.
URL Canonicalization Inside the Sitemap
Every <loc> value must be the exact canonical URL for that page. Before serialization, run the same normalization that canonical URL enforcement applies at the edge:
- Strip query strings and UTM parameters
- Enforce a trailing slash (or enforce no trailing slash; pick one and be consistent)
- Lowercase the path component
- Resolve relative URLs to absolute with SITE_URL
Mismatches between <loc> values and the rel="canonical" tag in the page <head> generate conflicting signals. A search engine has to settle on one URL to index, so your sitemap and your canonical tags must point to the same string.
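The normalization rules above can be collected into a single function that runs on every URL before it reaches <loc>. This is a minimal sketch under stated assumptions: `canonicalizeLoc` is a hypothetical helper, it defaults to the "always trailing slash" policy, and it does not special-case file-like paths such as /feed.xml.

```typescript
// Hypothetical helper: produce the canonical <loc> string for a path or URL.
// Assumes siteUrl is an absolute origin such as "https://example.com".
export function canonicalizeLoc(
  input: string,
  siteUrl: string,
  ensureTrailingSlash = true
): string {
  // Relative inputs are resolved against siteUrl; absolute inputs keep their own origin.
  const url = new URL(input, siteUrl);
  // Lowercase the path and collapse accidental double slashes.
  let path = url.pathname.toLowerCase().replace(/\/{2,}/g, '/');
  if (ensureTrailingSlash) {
    if (!path.endsWith('/')) path += '/';
  } else if (path.length > 1) {
    path = path.replace(/\/+$/, '');
  }
  // Rebuilding from origin + path drops the query string (including UTM
  // parameters) and the fragment.
  return `${url.origin}${path}`;
}
```

For example, `canonicalizeLoc('/Blog/My-Post?utm_source=x', 'https://example.com')` returns `https://example.com/blog/my-post/`. Whichever policy you choose, the same function should ideally feed both the sitemap and the rel="canonical" tag so the two cannot drift apart.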
HTTP Headers & CDN Directives Reference
| Header | Required Value | Rationale |
|---|---|---|
| Content-Type | application/xml; charset=utf-8 | Required for valid XML MIME type recognition |
| Cache-Control | s-maxage=3600, stale-while-revalidate=86400 | Allows CDN to cache for 1 hour, serve stale for 24 h during revalidation |
| X-Content-Type-Options | nosniff | Prevents MIME sniffing on the XML response |
| Vary | Accept-Encoding | Enables Gzip/Brotli negotiation at the CDN layer |
| ETag | Generated hash of sitemap content | Enables conditional GET so crawlers skip unchanged sitemaps |
On Vercel, set these in the headers array of vercel.json (Cloudflare Pages reads a plain-text _headers file instead, so translate the same values into that format):
{
  "headers": [
    {
      "source": "/sitemap(.*)\\.xml",
      "headers": [
        { "key": "Cache-Control", "value": "s-maxage=3600, stale-while-revalidate=86400" },
        { "key": "X-Content-Type-Options", "value": "nosniff" },
        { "key": "Content-Type", "value": "application/xml; charset=utf-8" }
      ]
    }
  ]
}
Understanding the full picture of edge caching behavior for SEO helps you set revalidation windows that match your content publication frequency.
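The ETag row in the table above calls for a hash of the sitemap content. One way to produce it without extra dependencies is sketched below. The helper names are hypothetical, and the hash is an FNV-1a style function over UTF-16 code units, which is an assumption on my part: it is adequate for cache validation but is not cryptographic, so swap in SHA-256 if your platform prefers it.

```typescript
// Hypothetical helper: derive a weak ETag from the serialized sitemap.
export function sitemapEtag(xml: string): string {
  let hash = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (let i = 0; i < xml.length; i++) {
    hash ^= xml.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  const hex = (hash >>> 0).toString(16).padStart(8, '0');
  return `W/"${hex}"`;
}

// Returns true when the If-None-Match request header matches the current
// ETag, using weak comparison (the W/ prefix is ignored on both sides).
export function isNotModified(ifNoneMatch: string | null, etag: string): boolean {
  if (!ifNoneMatch) return false;
  if (ifNoneMatch.trim() === '*') return true;
  const strip = (value: string) => value.trim().replace(/^W\//, '');
  return ifNoneMatch.split(',').some((candidate) => strip(candidate) === strip(etag));
}
```

In a route handler, you would compute the ETag after serialization, return a 304 with an empty body when `isNotModified` is true, and otherwise send the XML with the ETag header attached.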
Sitemap Index Splitting for Large Sites
When a single sitemap file would exceed 50,000 URLs or 50 MB uncompressed (the sitemaps.org protocol limits), split the output into child sitemaps referenced from a sitemap index:
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://yoursite.com/sitemap-articles.xml</loc>
    <lastmod>2026-06-22T00:00:00Z</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://yoursite.com/sitemap-categories.xml</loc>
    <lastmod>2026-06-22T00:00:00Z</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://yoursite.com/sitemap-authors.xml</loc>
    <lastmod>2026-06-22T00:00:00Z</lastmod>
  </sitemap>
</sitemapindex>
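Submit sitemap_index.xml (not the child sitemaps) to Google Search Console. The index file has its own limits under the sitemaps.org protocol: it may list at most 50,000 sitemaps and must stay under 50 MB uncompressed. Each referenced child file is capped at 50,000 URLs.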
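When one content type alone exceeds the per-file cap, the split also has to happen by count. A minimal sketch of that follows; `chunkRoutes`, `buildSitemapIndexXml`, and the `sitemap-N.xml` naming scheme are hypothetical choices rather than anything the protocol prescribes, and the siteUrl is assumed to contain no characters that need XML escaping.

```typescript
const MAX_URLS_PER_SITEMAP = 50000; // sitemaps.org per-file limit

// Split a route manifest into slices that each fit in one child sitemap.
export function chunkRoutes<T>(routes: T[], size: number = MAX_URLS_PER_SITEMAP): T[][] {
  if (size < 1) throw new Error('chunk size must be at least 1');
  const chunks: T[][] = [];
  for (let i = 0; i < routes.length; i += size) {
    chunks.push(routes.slice(i, i + size));
  }
  return chunks;
}

// Build the index document that points at sitemap-0.xml, sitemap-1.xml, and so on.
export function buildSitemapIndexXml(siteUrl: string, childCount: number, lastmod: string): string {
  const nodes = Array.from({ length: childCount }, (_, i) =>
    [
      '  <sitemap>',
      `    <loc>${siteUrl}/sitemap-${i}.xml</loc>`,
      `    <lastmod>${lastmod}</lastmod>`,
      '  </sitemap>',
    ].join('\n')
  ).join('\n');
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    nodes,
    '</sitemapindex>',
  ].join('\n');
}
```

Each chunk would then be passed through your existing serializer and served at the matching `sitemap-N.xml` path, with the index built from `chunks.length`.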
Validation Protocol
Run this sequence before every deployment and after any CMS schema change:
# 1. Confirm HTTP 200 and correct Content-Type
curl -sI https://yourdomain.com/sitemap.xml | grep -iE '^(HTTP|content-type)'
# 2. Validate XML is well-formed (requires libxml2)
curl -s https://yourdomain.com/sitemap.xml -o /tmp/sitemap.xml
xmllint --noout /tmp/sitemap.xml && echo "XML is well-formed"
# 3. Count URLs and compare to CMS published count
grep -c '<loc>' /tmp/sitemap.xml
# 4. Check for stale or draft URLs leaking in (inspect <loc> values only; the XML declaration itself contains "?")
grep -o '<loc>[^<]*</loc>' /tmp/sitemap.xml | grep -E '/draft/|/preview/|\?' && echo "LEAK FOUND" || echo "Clean"
# 5. Verify robots.txt references the sitemap
curl -s https://yourdomain.com/robots.txt | grep Sitemap
# 6. Submit to Google Search Console via API
curl -X PUT \
"https://www.googleapis.com/webmasters/v3/sites/https%3A%2F%2Fyourdomain.com%2F/sitemaps/https%3A%2F%2Fyourdomain.com%2Fsitemap.xml" \
-H "Authorization: Bearer $(gcloud auth print-access-token)"
After submission, monitor the Sitemaps diagnostic panel in Search Console for URL count discrepancies and any XML parsing errors the crawler reports.
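The count and leak checks from steps 3 and 4 can also run inside a test suite or CI job against the serialized string, before anything is deployed. The sketch below assumes a hypothetical `auditSitemapXml` helper and relies on a simple regular expression rather than a full XML parser, which is enough for sitemaps produced by your own serializer.

```typescript
export interface SitemapAudit {
  urlCount: number; // number of <loc> entries found
  leaks: string[];  // <loc> values that look like draft, preview, or query URLs
}

// Same patterns as the grep in step 4 of the protocol above.
const LEAK_PATTERN = /\/draft\/|\/preview\/|\?/;

export function auditSitemapXml(xml: string): SitemapAudit {
  const locPattern = /<loc>([^<]*)<\/loc>/g;
  const locs: string[] = [];
  let match: RegExpExecArray | null;
  // Only <loc> contents are inspected, so the "?" in the XML declaration is ignored.
  while ((match = locPattern.exec(xml)) !== null) {
    locs.push((match[1] ?? '').trim());
  }
  return {
    urlCount: locs.length,
    leaks: locs.filter((loc) => LEAK_PATTERN.test(loc)),
  };
}
```

A CI step could fail the build when `leaks` is non-empty or when `urlCount` differs from the published count reported by the CMS.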
Troubleshooting
| Symptom | Root Cause | Fix |
|---|---|---|
| Sitemap URL count lower than CMS count | Pagination loop exits early | Add hasMore logic with a limit sentinel; log total from API header |
| lastmod values are all identical | CMS API returns build time, not content update time | Use item.updatedAt from the API, not new Date() at serialization time |
| Draft URLs appear in sitemap | Filter runs after serialization | Move filterRoutes() call before buildSitemapXml() in the pipeline |
| Crawler reports XML parse error | String interpolation introduced unescaped & in URLs | Run URLs through encodeURI() before placing in <loc>; escape & as &amp; |
| CDN serves stale sitemap after CMS publish | Cache not purged on webhook | Add a webhook handler that calls the CDN purge API for /sitemap*.xml on every publish event |
| Content-Type: text/html on sitemap endpoint | Framework default MIME type | Explicitly set Content-Type: application/xml; charset=utf-8 in the route handler |
| Sitemap not found in robots.txt | Robots generated separately from sitemap config | Generate both from the same environment variable: Sitemap: ${SITE_URL}/sitemap.xml |
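For the stale-CDN row above, the webhook logic can be kept independent of any particular CDN by injecting the purge call. This is a sketch only: the `action` field, the action names, and the `purge` callback are all hypothetical, since the webhook payload shape and the purge API depend on your CMS and CDN.

```typescript
// Hypothetical payload shape; real CMS webhooks differ per vendor.
export interface CmsWebhookEvent {
  action: string;
}

// Assumed action names that change which URLs belong in the sitemap.
const PURGE_ACTIONS = new Set(['publish', 'unpublish', 'delete']);

// Decide which cached sitemap paths a webhook event invalidates.
export function sitemapPurgePaths(
  event: CmsWebhookEvent,
  sitemapPaths: string[] = ['/sitemap.xml']
): string[] {
  return PURGE_ACTIONS.has(event.action) ? sitemapPaths : [];
}

// The purge function is injected so this handler never hardcodes a CDN API.
export async function handleCmsWebhook(
  event: CmsWebhookEvent,
  purge: (paths: string[]) => Promise<void>
): Promise<number> {
  const paths = sitemapPurgePaths(event);
  if (paths.length > 0) {
    await purge(paths);
  }
  return paths.length; // number of paths purged, useful for logging
}
```

Sites that use a sitemap index would pass the index and every child path as `sitemapPaths` so all of them are refreshed together.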
Frequently Asked Questions
Should sitemaps be generated at build time or runtime in headless setups?
Build time suits static sites with infrequent content changes. Runtime generation via ISR or a serverless handler is required for high-velocity CMS environments where new content must be discoverable within minutes of publication, not hours. A mixed approach (ISR with a short revalidation window) works well for most content teams.
How do I handle sitemap index splitting for large headless sites?
Implement a sitemap index file (sitemap_index.xml) that references segmented child sitemaps by content type or section. Cap each child file at 50,000 URLs or 50 MB uncompressed to comply with the sitemaps.org protocol. Submit only the index URL to Search Console.
Does headless architecture require manual robots.txt updates for sitemaps?
No. Generate robots.txt dynamically using the same serverless route or framework handler as your sitemap. Inject the correct sitemap URL from an environment variable (Sitemap: ${SITE_URL}/sitemap.xml) so it updates automatically across staging and production without manual edits.
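A minimal sketch of that approach is shown below. `buildRobotsTxt` is a hypothetical helper, and the permissive `Allow: /` rule is an assumption; replace it with whatever crawl rules your site actually needs.

```typescript
// Hypothetical helper: render robots.txt with the sitemap URL derived from
// the same origin value (SITE_URL) that the sitemap itself uses.
export function buildRobotsTxt(siteUrl: string): string {
  // Tolerate an accidental trailing slash so the output never contains "//sitemap.xml".
  const origin = siteUrl.replace(/\/+$/, '');
  return [
    'User-agent: *',
    'Allow: /', // assumed default rule
    '',
    `Sitemap: ${origin}/sitemap.xml`,
    '',
  ].join('\n');
}
```

The returned string can be served from a route handler with a `text/plain` content type, in the same way the sitemap handlers above return XML.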
Related
- Dynamic Route Generation: resolving parameterized slugs to concrete paths before manifest assembly
- Slug Normalization Strategies: enforcing consistent URL shapes that match sitemap <loc> values
- Canonical URL Enforcement: aligning edge-level canonical headers with sitemap declarations
- Pagination Handling in Headless: deciding which paginated URLs belong in the sitemap
- Indexation Limits for Decoupled Sites: understanding why sitemap hygiene directly affects crawl quota allocation