Setting Up Dynamic Sitemaps for Composable CMS

A dynamic XML sitemap in a composable CMS must do three things simultaneously: reflect the live published state of the content API, stay cacheable at the edge so crawlers do not hammer the origin, and remain valid XML that Googlebot can parse without error.

When to use this approach

Apply a fully dynamic, API-driven sitemap when:

Content editors publish or unpublish several times a day and a build-time static sitemap would go stale within hours.
Your CMS serves multiple content types (articles, product pages, taxonomy pages) whose URL shapes differ and cannot be safely hardcoded.
You are already using incremental static regeneration and need the sitemap to follow the same revalidation cadence as the pages it references.

Step 1 — Map CMS content types to canonical URLs

Before writing any sitemap code, audit which content types the CMS exposes and what their published URL shapes look like.

# List published article slugs (first page, up to 1000) for a Contentful space via the Delivery API.
# Assumes the content type defines a publishedAt field.
# -g disables curl's URL globbing so the [lte] brackets are sent literally.
curl -sg "https://cdn.contentful.com/spaces/$SPACE_ID/entries?content_type=article&select=fields.slug,sys.updatedAt&limit=1000&fields.publishedAt[lte]=$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  -H "Authorization: Bearer $CONTENTFUL_DELIVERY_TOKEN" \
  | jq -r '.items[].fields.slug'

Validation: Compare the count returned by the API to the number of published entries in the CMS dashboard. Any mismatch signals a pagination gap; close it by iterating with skip and limit until every page has been fetched before you proceed.

Mismatched locale prefixes (e.g., /blog/post vs /en/blog/post) produce duplicate canonical signals; enforce a single prefix at this mapping stage and carry it through every downstream step. The rules for consistent slug shapes are covered in slug normalization strategies.
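One way to enforce a single URL shape at the mapping stage is to route every slug through one function. The sketch below is illustrative only: the ROUTE_PREFIX table, the BASE_URL value, and the canonicalUrl helper are assumptions, not part of any CMS SDK.

```typescript
// Hypothetical mapping from CMS content type to URL prefix; adjust to your routes.
const ROUTE_PREFIX: Record<string, string> = {
  article: '/blog',
  product: '/products',
};

// Assumed policy: one origin and no locale prefix on canonical URLs.
const BASE_URL = 'https://yoursite.com';

function canonicalUrl(contentType: string, slug: string): string {
  const prefix = ROUTE_PREFIX[contentType];
  if (prefix === undefined) {
    // Fail loudly so an unmapped content type never produces a guessed URL.
    throw new Error(`No URL mapping for content type "${contentType}"`);
  }
  // Strip leading and trailing slashes so "/my-post/" and "my-post" resolve identically.
  const cleanSlug = slug.replace(/^\/+|\/+$/g, '');
  // Encode each path segment separately so nested slugs keep their slashes.
  const encoded = cleanSlug.split('/').map((segment) => encodeURIComponent(segment)).join('/');
  return `${BASE_URL}${prefix}/${encoded}`;
}
```

Throwing on an unmapped content type is a deliberate choice here: a missing mapping surfaces in CI rather than as a wrong URL in the sitemap.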

Step 2 — Build the dynamic sitemap endpoint with ISR

The sitemap handler should run server-side and revalidate on a schedule, not at every request. Here is a Next.js App Router implementation that revalidates hourly:

// app/sitemap.ts — Next.js 14+ App Router
import type { MetadataRoute } from 'next';

// Revalidate this route every hour; webhooks can trigger on-demand revalidation
export const revalidate = 3600;

async function fetchAllSlugs(contentType: string): Promise<{ slug: string; updatedAt: string }[]> {
  const pages: { slug: string; updatedAt: string }[] = [];
  let skip = 0;
  const limit = 1000;

  while (true) {
    const res = await fetch(
      `https://cdn.contentful.com/spaces/${process.env.CONTENTFUL_SPACE_ID}/entries` +
        `?content_type=${contentType}&select=fields.slug,sys.updatedAt` +
        `&limit=${limit}&skip=${skip}`,
      { headers: { Authorization: `Bearer ${process.env.CONTENTFUL_DELIVERY_TOKEN}` } }
    );
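    // Fail fast on API errors; otherwise a missing data.total would loop forever.
    if (!res.ok) {
      throw new Error(`Contentful request failed: ${res.status} ${res.statusText}`);
    }
    const data = await res.json();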
    pages.push(
      ...data.items.map((item: any) => ({
        slug: item.fields.slug as string,
        updatedAt: item.sys.updatedAt as string,
      }))
    );
    if (skip + limit >= data.total) break;
    skip += limit;
  }
  return pages;
}

export default async function sitemap(): Promise<MetadataRoute.Sitemap> {
  const [articles, products] = await Promise.all([
    fetchAllSlugs('article'),
    fetchAllSlugs('product'),
  ]);

  const base = 'https://seo-architecture.com';

  return [
    ...articles.map((a) => ({
      url: `${base}/blog/${a.slug}`,
      lastModified: new Date(a.updatedAt),
      priority: 0.8,
    })),
    ...products.map((p) => ({
      url: `${base}/products/${p.slug}`,
      lastModified: new Date(p.updatedAt),
      priority: 0.6,
    })),
  ];
}

Validation: After deploying, request /sitemap.xml and count the <url> entries. The count must equal the total returned by fetchAllSlugs across all content types.
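The count check described above can be scripted. This is a minimal sketch that assumes the sitemap XML has already been downloaded into a string; countUrlEntries and sitemapIsComplete are hypothetical helper names.

```typescript
// Count opening <url> tags. The literal "<url>" does not match "<urlset ...>".
function countUrlEntries(xml: string): number {
  const matches = xml.match(/<url>/g);
  return matches ? matches.length : 0;
}

// Compare the served sitemap against the number of slugs the CMS returned.
function sitemapIsComplete(xml: string, expectedCount: number): boolean {
  return countUrlEntries(xml) === expectedCount;
}
```

A regular expression is enough for a count because the generated sitemap emits plain `<url>` elements; a full XML parser would be the safer choice if the output ever carries attributes on that element.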

The paginated while loop above is the critical difference from a naive fetch. Without it, any content type with more entries than one page holds (100 by default and 1,000 at most on the Contentful Delivery API) is silently truncated, feeding Googlebot an incomplete map. This is one of the common indexation limits for decoupled sites.

Step 3 — Configure CDN cache-control headers

The sitemap endpoint must be cached at the edge; otherwise a crawl spike can exhaust your origin’s rate limit against the CMS API. Set s-maxage to match your revalidation window and add stale-while-revalidate so the CDN continues serving during background refresh:

{
  "headers": [
    {
      "source": "/sitemap.xml",
      "headers": [
        {
          "key": "Cache-Control",
          "value": "public, s-maxage=3600, stale-while-revalidate=86400"
        }
      ]
    }
  ]
}

Register a cache-tag purge webhook in your CMS so that when an editor publishes or unpublishes a page, the CDN immediately invalidates the cached sitemap. Without this, the stale-while-revalidate window can keep a deleted URL in the sitemap for up to 24 hours. The full model for edge caching behaviour in headless SEO explains cache-tag patterns in more detail.
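A webhook receiver for this purge might look like the sketch below. It assumes Contentful-style topic strings delivered in the X-Contentful-Topic header; the purgeTag callback and the 'sitemap' tag name are hypothetical stand-ins for whatever purge call your CDN provides.

```typescript
// Topics that change which URLs belong in the sitemap (assumed Contentful topic names).
const PURGE_TOPICS = new Set<string>([
  'ContentManagement.Entry.publish',
  'ContentManagement.Entry.unpublish',
  'ContentManagement.Entry.delete',
]);

// Decide whether a webhook event should invalidate the cached sitemap.
function shouldPurgeSitemap(topic: string | null | undefined): boolean {
  return topic != null && PURGE_TOPICS.has(topic);
}

// purgeTag is injected so the handler stays independent of any particular CDN API.
async function handleCmsWebhook(
  topic: string | null,
  purgeTag: (tag: string) => Promise<void>,
): Promise<boolean> {
  if (!shouldPurgeSitemap(topic)) return false;
  await purgeTag('sitemap'); // hypothetical cache tag attached to /sitemap.xml
  return true;
}
```

Filtering on the topic keeps draft saves and auto-saves from purging the cache, which would otherwise erode the edge hit ratio.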

Target metrics:

Signal	Target
Edge cache hit ratio for `/sitemap.xml`	> 90 %
Origin requests during a Googlebot crawl spike	< 5 req/min
Cache invalidation latency after a CMS publish event	< 2 s

Step 4 — Validate XML structure and HTTP response in CI

Run these three checks in your CI pipeline as a pre-deploy gate. A broken sitemap that slips to production costs days of recrawl recovery.

SITE=https://yoursite.com

# 1. Confirm the endpoint returns 200 OK (grep exits non-zero on any other status)
curl -sI "$SITE/sitemap.xml" | grep -E '^HTTP/[0-9.]+ 200'

# 2. Validate XML is well-formed (requires libxml2)
curl -s "$SITE/sitemap.xml" -o /tmp/sitemap.xml && \
  xmllint --noout /tmp/sitemap.xml && echo "XML is well-formed"

# 3. Confirm robots.txt references the sitemap
curl -s "$SITE/robots.txt" | grep -i sitemap

Validation: All three commands must exit 0. If xmllint fails, the most common cause is an unescaped & in a URL query string; encode it as &amp;amp; during serialization.
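If you serialize any part of the sitemap by hand, escaping can be centralized in one helper. The sketch below assumes hand-built XML strings; escapeXml and urlEntry are illustrative names rather than framework APIs.

```typescript
// Escape the five characters that XML reserves in text and attribute values.
// The ampersand must be replaced first so later entities are not double-escaped.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Build one <url> element with an escaped location and an ISO 8601 lastmod.
function urlEntry(loc: string, lastModified: Date): string {
  return `<url><loc>${escapeXml(loc)}</loc><lastmod>${lastModified.toISOString()}</lastmod></url>`;
}
```

For example, `urlEntry('https://yoursite.com/search?a=1&b=2', new Date())` emits a `<loc>` whose query string contains `&amp;` instead of a bare ampersand.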

To control crawl budget in headless deployments, confirm that robots.txt blocks non-indexable paths (preview URLs, API routes, internal search results) before submitting the sitemap. Googlebot allocates crawl quota to every URL in the sitemap; phantom routes drain that budget.

Step 5 — Submit to Google Search Console and deploy a static fallback

Submit the sitemap via the Search Console API immediately after each deployment so Google picks up structural changes without waiting for the next organic crawl:

# Requires gcloud CLI authenticated to a service account with Search Console permissions
gcloud auth print-access-token | xargs -I TOKEN curl -X PUT \
  "https://www.googleapis.com/webmasters/v3/sites/https%3A%2F%2Fyoursite.com%2F/sitemaps/https%3A%2F%2Fyoursite.com%2Fsitemap.xml" \
  -H "Authorization: Bearer TOKEN"

Then deploy a static fallback so that if the dynamic origin goes down, Googlebot still gets a valid sitemap:

# Build and upload the fallback during CI, triggered on every successful main-branch build
# -f makes curl fail on HTTP errors so an error page is never saved as the fallback
curl -sf https://yoursite.com/sitemap.xml -o sitemap-fallback.xml
xmllint --noout sitemap-fallback.xml && echo "Fallback is valid"
# Upload sitemap-fallback.xml to your CDN/object storage here

Wire middleware to route /sitemap.xml to sitemap-fallback.xml whenever the dynamic origin returns 5xx or times out. Target a fallback activation time under 2 seconds.
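The routing rule can be sketched independently of any middleware framework. In the example below, fetchOrigin and fetchFallback are hypothetical callbacks you would wire to the dynamic endpoint and to the static sitemap-fallback.xml; the 2000 ms default is an assumption taken from the activation target above.

```typescript
type SitemapResponse = { status: number; body: string };

// Only 5xx responses count as origin failures; 4xx responses are passed through.
function isOriginFailure(status: number): boolean {
  return status >= 500 && status <= 599;
}

async function serveSitemap(
  fetchOrigin: () => Promise<SitemapResponse>,
  fetchFallback: () => Promise<SitemapResponse>,
  timeoutMs = 2000,
): Promise<SitemapResponse> {
  let cancel = (): void => {};
  // Resolves to null after timeoutMs; the executor runs synchronously, so cancel is set here.
  const timeout = new Promise<null>((resolve) => {
    const timer = setTimeout(() => resolve(null), timeoutMs);
    cancel = () => clearTimeout(timer);
  });
  try {
    // Whichever settles first wins; a rejected origin request is treated as no response.
    const res = await Promise.race([fetchOrigin().catch(() => null), timeout]);
    if (res === null || isOriginFailure(res.status)) {
      return await fetchFallback();
    }
    return res;
  } finally {
    cancel(); // avoid leaving a pending timer behind
  }
}
```

Injecting both fetchers keeps the decision logic testable without a network and lets the same function run in an edge worker or a Node server.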

[Diagram: the complete pipeline, Steps 1 through 5]

SEO impact summary

What improves:

Googlebot discovers new and updated URLs within the ISR revalidation window (≤ 1 hour by default) rather than waiting for the next full crawl cycle.
Edge caching eliminates origin pressure during bot traffic spikes, keeping Time to First Byte under 100 ms for the sitemap response.
Filtering drafts and noindex routes before serialization prevents wasted crawl budget on non-indexable pages.

What breaks if misconfigured:

Missing s-maxage causes every Googlebot request to hit the CMS API, risking rate-limit errors that return partial or empty XML.
Omitting the pagination loop against the CMS API silently truncates URL lists, causing pages to fall out of the index without 404s to diagnose.
No webhook-triggered cache purge means editors cannot urgently remove a mis-published URL from the sitemap for up to 24 hours.

Measurable signals to watch: sitemap URL count in Search Console (should match CMS published count ± 1 %), index coverage for sitemap-submitted URLs (aim for > 95 % indexed within 2 weeks of publish), and edge cache hit ratio for /sitemap.xml (target > 90 %).

Edge cases and gotchas

Preview and draft environments. If your framework runs a preview mode (e.g., Next.js Draft Mode), ensure the sitemap handler explicitly ignores the preview cookie/header and only queries the status=published filter. Preview builds frequently expose draft slugs that would otherwise leak into the serialized output.
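Enforcing the filter in code, rather than trusting the query alone, could look like this sketch. The CmsEntry shape and its status and publishDate field names are assumptions; map them to whatever your CMS actually returns.

```typescript
// Assumed entry shape; real field names depend on your CMS schema.
interface CmsEntry {
  slug: string;
  status: 'draft' | 'scheduled' | 'published';
  publishDate: string; // ISO 8601 timestamp
}

// An entry is indexable only if it is published and its publish date has passed.
// An unparseable date yields NaN, which fails the comparison and excludes the entry.
function isIndexable(entry: CmsEntry, now: Date = new Date()): boolean {
  return (
    entry.status === 'published' &&
    new Date(entry.publishDate).getTime() <= now.getTime()
  );
}

// Apply the filter to a whole result set before serialization.
function indexableEntries(entries: CmsEntry[], now: Date = new Date()): CmsEntry[] {
  return entries.filter((entry) => isIndexable(entry, now));
}
```

Passing `now` as a parameter makes the cut-off deterministic in tests and keeps one timestamp consistent across a whole sitemap build.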

Multi-locale sites. When your CMS stores locale variants, generate separate <url> entries with <xhtml:link rel="alternate"> hreflang annotations, or maintain per-locale sitemap shards referenced from a sitemap-index.xml. Do not mix locale paths in a single flat list—this confuses canonical resolution and can produce the same duplicate content signals addressed in slug standardization.
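A minimal sketch of the hreflang variant follows, assuming hand-serialized XML and a urlset element that declares xmlns:xhtml="http://www.w3.org/1999/xhtml". The LocaleVariant type and localizedUrlEntries helper are hypothetical.

```typescript
interface LocaleVariant {
  hreflang: string; // e.g. "en" or "de"
  url: string; // absolute canonical URL of this locale variant
}

// Minimal escaping for URLs placed in XML text and attribute values.
const escapeAmp = (s: string): string => s.replace(/&/g, '&amp;');

// Emit one <url> element per variant. Every element lists all alternates,
// including itself, so the annotations are reciprocal.
function localizedUrlEntries(variants: LocaleVariant[]): string[] {
  const alternates = variants
    .map((v) => `<xhtml:link rel="alternate" hreflang="${v.hreflang}" href="${escapeAmp(v.url)}"/>`)
    .join('');
  return variants.map((v) => `<url><loc>${escapeAmp(v.url)}</loc>${alternates}</url>`);
}
```

Generating the alternates once and reusing them for every variant keeps the set identical across entries, which avoids mismatched return links.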

Incremental builds and partial rebuilds. Some self-hosted ISR setups only regenerate pages whose cache entries have expired, not the sitemap itself. Have your CMS webhook handler call revalidatePath('/sitemap.xml') explicitly so the sitemap regenerates whenever content changes, independent of individual page revalidations.

Sitemap index splits. Above 50,000 URLs (or 50 MB uncompressed) a single sitemap file exceeds the protocol limit and must be split, with a sitemap index listing the shards. Structure your generator to detect the threshold and automatically produce /sitemap-index.xml referencing /sitemap-articles.xml, /sitemap-products.xml, and so on. Verify the index file is also listed in robots.txt, not just the leaf sitemaps.
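Threshold detection and index generation can be sketched as two small functions. The helper names and the shard URLs passed in are assumptions; the 50,000 limit and the sitemapindex namespace come from the sitemap protocol.

```typescript
const MAX_URLS_PER_SITEMAP = 50000;

// Split a list into consecutive shards of at most `size` items.
function chunk<T>(items: T[], size: number = MAX_URLS_PER_SITEMAP): T[][] {
  if (size < 1) throw new Error('size must be at least 1');
  const shards: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    shards.push(items.slice(i, i + size));
  }
  return shards;
}

// Build the index document that points at each shard file.
// Shard URLs are assumed to be absolute and free of characters that need XML escaping.
function buildSitemapIndex(shardUrls: string[], lastModified: Date): string {
  const entries = shardUrls
    .map((u) => `  <sitemap><loc>${u}</loc><lastmod>${lastModified.toISOString()}</lastmod></sitemap>`)
    .join('\n');
  return (
    '<?xml version="1.0" encoding="UTF-8"?>\n' +
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n' +
    `${entries}\n` +
    '</sitemapindex>'
  );
}
```

Sharding by content type first and by count second keeps shard names stable, so a growing article catalog does not reshuffle the product shards.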

robots.txt must reference the canonical path. A mismatch between the Sitemap: directive in robots.txt and the actual sitemap URL (e.g., http vs https, or /sitemap.xml vs /sitemap) prevents Search Console from picking up the submission. The XML sitemap generation for headless reference covers robots.txt integration and multi-sitemap index patterns.

Frequently Asked Questions

How do I validate a dynamic sitemap without triggering a full site rebuild? Use xmllint --noout to check that the XML is well-formed (add --schema with the sitemap XSD for full schema validation) and curl -I to verify HTTP 200 plus a correct Content-Type: application/xml header at the edge. Run these checks in a pre-deploy CI step against the staging URL; they hit the live endpoint without touching the build process.

What is the maximum URL count per sitemap file for optimal crawling? Limit each file to 50,000 URLs or 50 MB uncompressed. Split larger datasets into a sitemap-index.xml that references individual shard files; Googlebot processes the index and queues the shards independently.

How do I handle draft or scheduled content in a composable CMS sitemap? Filter by status=published and publishDate <= now() in your API query. Exclude draft and future-scheduled records server-side before serialization. Never rely on the CMS UI to prevent leakage—always enforce the filter in code.

What rollback strategy works best if the dynamic sitemap endpoint fails? Deploy a CI-generated sitemap-fallback.xml to your CDN root on every successful build. Route /sitemap.xml to it via health-check middleware that activates automatically when the origin returns 5xx or times out. Validate the fallback with xmllint as part of the CI step that uploads it.
