Pagination SEO Best Practices for Headless APIs

How to prevent index fragmentation and crawl-budget waste when a headless API drives your paginated category or archive routes.

When to Use This Approach

Apply these patterns when any of the following conditions are present:

Your headless CMS returns paginated REST or GraphQL responses and the frontend renders those as discrete /page/{n}/ or ?page={n} URLs.
Crawl budget in headless deployments is already under pressure — GSC shows more than 30 % of category URLs indexed as duplicates, or server logs reveal high bot frequency on ?page= parameters.
Organic traffic is decaying on deep category tiers (page 3 and beyond) while shallow pages perform normally, which signals that link equity is fragmenting across uncontrolled pagination endpoints.

Implementation Steps

Step 1 — Baseline audit before touching routing logic

Quantify the damage before writing any configuration. Changing canonical or indexation directives without a baseline makes it impossible to measure improvement.

# Parse Nginx/Cloudflare access logs for bot hits on ?page= parameters
grep -iE "Googlebot|bingbot" /var/log/nginx/access.log \
  | grep -E "\?page=|/page/[0-9]" \
  | awk '{print $7}' \
  | sort | uniq -c | sort -rn | head -40

Validation command: Export GSC Page indexing (formerly Coverage) → “Duplicate without user-selected canonical” filtered to your category URL pattern. Record the URL count before making changes.
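
The same tally can be produced in TypeScript when logs are already processed in a Node script. This is a minimal sketch that assumes combined-format log lines with a quoted request line; the function name countPaginatedBotHits is illustrative and not part of any library.

```typescript
// Sketch: count crawler hits per paginated request target.
// Assumes each line contains a quoted request such as "GET /path?query HTTP/1.1".
const BOT_PATTERN = /Googlebot|bingbot/i;
const PAGINATION_PATTERN = /[?&]page=\d+|\/page\/\d+/;

function countPaginatedBotHits(lines: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const line of lines) {
    // Skip traffic that does not identify as one of the tracked crawlers.
    if (!BOT_PATTERN.test(line)) continue;
    // Pull the request target out of the quoted request line.
    const request = line.match(/"(?:GET|HEAD) (\S+) HTTP\/[\d.]+"/);
    if (!request) continue;
    const target = request[1];
    // Keep only ?page= and /page/{n} targets.
    if (!PAGINATION_PATTERN.test(target)) continue;
    counts.set(target, (counts.get(target) ?? 0) + 1);
  }
  return counts;
}
```

Sorting the resulting map by count gives the same ranking as the shell pipeline, and the numbers can be stored alongside the GSC export as the pre-change baseline.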

Step 2 — Enforce `noindex, follow` on deep pagination via gateway headers

Apply the directive at the server or SSR layer, not via client-side JavaScript, so Googlebot receives it on the first HTTP response. Setting it at the edge, alongside canonical URL enforcement, also makes it harder for CDN pass-through rules to strip or overwrite the header.

Nginx — apply noindex to page 2 and beyond:

location ~ ^/category/ {
  # Match /page/2/ and above, but not /page/1/ or zero-padded values such as /page/01/
  if ($uri ~ "/page/([2-9]|[1-9][0-9]+)/") {
    add_header X-Robots-Tag "noindex, follow" always;
  }
}

Next.js App Router middleware (edge runtime):

// middleware.ts
import { NextResponse } from 'next/server';
import type { NextRequest } from 'next/server';

export function middleware(request: NextRequest) {
  const url = request.nextUrl;
  const pageMatch = url.pathname.match(/\/page\/(\d+)/);

  if (pageMatch && parseInt(pageMatch[1], 10) > 1) {
    const response = NextResponse.next();
    response.headers.set('X-Robots-Tag', 'noindex, follow');
    return response;
  }
}

export const config = {
  matcher: ['/category/:path*/page/:page*'],
};

Validation command:

curl -I -s "https://example.com/category/news/page/3/" \
  | grep -i "x-robots-tag"
# Expected: x-robots-tag: noindex, follow
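
Both examples above match only the /page/{n}/ path form. If the frontend also exposes ?page={n} URLs, the decision can be factored into one pure function and called from whichever middleware you use. The sketch below is an assumption about how that might look; robotsTagFor is a hypothetical helper name.

```typescript
// Sketch: decide the X-Robots-Tag value for a request URL.
// Handles both /page/{n}/ path segments and ?page={n} query parameters.
// Returns null when no header should be set (page 1 or non-paginated URLs).
function robotsTagFor(requestUrl: string): string | null {
  const url = new URL(requestUrl);
  const fromPath = url.pathname.match(/\/page\/(\d+)(?:\/|$)/);
  // Prefer the path segment; fall back to the query parameter.
  const raw = fromPath ? fromPath[1] : url.searchParams.get('page');
  // Ignore missing or non-numeric values instead of guessing.
  if (raw === null || !/^\d+$/.test(raw)) return null;
  return Number.parseInt(raw, 10) > 1 ? 'noindex, follow' : null;
}
```

Keeping the rule in one function means the Nginx regex, the middleware, and any tests can be checked against the same expected values.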

Step 3 — Inject self-referencing canonical tags via SSR

Each paginated page should declare itself as its own canonical — not point to page 1. Pointing all pages at the root misleads crawlers about which URL holds the intended slice of content. The goal is to consolidate crawl signals on page 1 via noindex + link equity passthrough (follow), not by misrepresenting the page’s identity.

Next.js 15+ App Router (async params):

// app/category/[slug]/page/[page]/page.tsx
import type { Metadata } from 'next';

export async function generateMetadata({
  params,
}: {
  params: Promise<{ slug: string; page: string }>;
}): Promise<Metadata> {
  const { slug, page: pageParam } = await params;
  const page = parseInt(pageParam, 10);
  const base = `https://example.com/category/${slug}`;
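  // Self-referencing canonical for every page. Trailing slashes match the URLs
  // used in the curl checks; drop them if your site serves slash-less paths.
  const canonical = page === 1 ? `${base}/` : `${base}/page/${page}/`;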

  return {
    alternates: { canonical },
    robots: page > 1 ? { index: false, follow: true } : undefined,
  };
}

SvelteKit server load (with setHeaders):

// src/routes/category/[slug]/page/[page]/+page.server.ts
import type { PageServerLoad } from './$types';

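// Use the fetch provided by the load event: the global fetch cannot resolve
// relative URLs such as /api/articles on the server.
export const load: PageServerLoad = async ({ params, setHeaders, fetch }) => {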
  const page = parseInt(params.page, 10);

  if (page > 1) {
    setHeaders({ 'X-Robots-Tag': 'noindex, follow' });
  }

  const res = await fetch(`/api/articles?slug=${params.slug}&page=${page}`);
  const data = await res.json();

  // Return canonical href for <svelte:head> injection.
  // Trailing slashes match the served URLs; drop them if your routes are slash-less.
  const canonical =
    page === 1
      ? `https://example.com/category/${params.slug}/`
      : `https://example.com/category/${params.slug}/page/${page}/`;

  return { data, page, canonical };
};

Validation command:

# Verify the canonical in the server-rendered HTML. curl shows only what SSR emits;
# client-side hydration can still rewrite it, so confirm the rendered DOM separately.
curl -s "https://example.com/category/news/page/2/" \
  | grep -i 'rel="canonical"'
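
To run the same check across many pages, the grep can be replaced by a small function that extracts the canonical and compares it with the requested URL. This is a regex-based smoke test under the assumption that the canonical is emitted as a single <link> element; extractCanonical and isSelfCanonical are illustrative names.

```typescript
// Sketch: pull the canonical href out of server-rendered HTML.
// Regex-based, so treat it as a smoke test rather than a full HTML parser.
function extractCanonical(html: string): string | null {
  // Find the <link> tag carrying rel="canonical", regardless of attribute order.
  const tag = html.match(/<link\b[^>]*\brel=["']canonical["'][^>]*>/i);
  if (!tag) return null;
  const href = tag[0].match(/\bhref=["']([^"']+)["']/i);
  return href ? href[1] : null;
}

// Strict comparison: a trailing-slash difference is a different URL
// and is reported as a mismatch.
function isSelfCanonical(requestedUrl: string, html: string): boolean {
  const canonical = extractCanonical(html);
  return canonical !== null && canonical === requestedUrl;
}
```

The comparison is deliberately strict, since /page/2 and /page/2/ are distinct URLs to a crawler.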

Step 4 — Return hard 404 for out-of-range page numbers

When a request arrives for /page/500/ but totalPages is 12, a 200 OK with an empty results array is a soft 404. These accumulate in GSC as “Soft 404” or “Crawled - currently not indexed” and consume crawl budget on worthless endpoints.

Next.js App Router — notFound() for out-of-range:

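// app/category/[slug]/page/[page]/page.tsx
import { notFound } from 'next/navigation';
// Hypothetical component path; point this at your own list component.
import { ArticleList } from '@/components/ArticleList';

export default async function CategoryPage({
  params,
}: {
  params: Promise<{ slug: string; page: string }>;
}) {
  const { slug, page: pageParam } = await params;

  // Reject anything that is not a plain positive integer ("abc", "0", "2x").
  // parseInt alone turns "2x" into 2 and "abc" into NaN, and NaN > totalPages is false.
  if (!/^[1-9]\d*$/.test(pageParam)) notFound();
  const page = parseInt(pageParam, 10);

  const res = await fetch(
    `${process.env.CMS_URL}/articles?slug=${encodeURIComponent(slug)}&page=${page}`
  );
  if (res.status === 404) notFound(); // The CMS may already answer 404 for out-of-range pages
  if (!res.ok) throw new Error(`CMS request failed with status ${res.status}`);
  const { items, totalPages } = await res.json();

  if (page > totalPages || items.length === 0) {
    // Renders the not-found UI; the status is 404 as long as the response
    // has not already started streaming
    notFound();
  }

  return <ArticleList items={items} />;
}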

Express / API gateway middleware (for headless API layer):

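// routes/articles.js
const express = require('express');
const cms = require('../lib/cms'); // hypothetical CMS client module
const router = express.Router();

router.get('/articles', async (req, res) => {
  const raw = String(req.query.page ?? '1');
  // parseInt('abc') is NaN and NaN > totalPages is false, so validate the shape first
  if (!/^[1-9]\d*$/.test(raw)) {
    return res.status(404).json({ error: 'Page out of range' });
  }
  const page = parseInt(raw, 10);
  const { items, totalPages } = await cms.getArticles({ page });

  if (page > totalPages) {
    return res.status(404).json({ error: 'Page out of range' });
  }

  res.json({ items, totalPages });
});

module.exports = router;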

Validation command:

# Confirm 404 status for an impossible page number
curl -o /dev/null -s -w "%{http_code}" \
  "https://example.com/category/news/page/9999/"
# Expected: 404
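
The range check in this step is easy to get subtly wrong, because parseInt("2abc", 10) returns 2 and parseInt("abc", 10) returns NaN, which compares false against any totalPages. One way to isolate the rule is a pure helper shared by the page component and the API route. The sketch below is an assumption about how that could look; resolvePage and PageResolution are hypothetical names.

```typescript
// Sketch: turn a raw page parameter into either a valid page number or a 404.
type PageResolution =
  | { ok: true; page: number }
  | { ok: false; status: 404 };

function resolvePage(raw: string | undefined, totalPages: number): PageResolution {
  // A missing parameter means the first page.
  const value = raw ?? '1';
  // Accept only plain positive integers without leading zeros;
  // "2abc", "-1", "0", and "02" are all rejected.
  if (!/^[1-9]\d*$/.test(value)) return { ok: false, status: 404 };
  const page = Number.parseInt(value, 10);
  // Page 1 is always allowed so that an empty category still renders.
  if (page > Math.max(totalPages, 1)) return { ok: false, status: 404 };
  return { ok: true, page };
}
```

Allowing page 1 for an empty category is a design choice in this sketch; return a 404 there instead if empty categories should not be served at all.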

Step 5 — Validate canonical consistency and run CI checks

Manual spot-checks miss regressions. Wire pagination-specific assertions into your CI pipeline so a bad merge cannot silently reintroduce soft-404s or missing headers. For broader sitemap correctness, see XML sitemap generation for headless.

Shell validation script — check headers on the first four pages:

#!/usr/bin/env bash
BASE="https://example.com/category/news"

for page in 1 2 3 4; do
  url="${BASE}/page/${page}/"
  echo "=== $url ==="
  curl -I -s "$url" \
    | grep -iE "(http/|x-robots-tag|link:|location:)"
  echo
done

Playwright assertion (add to CI spec):

// tests/pagination.spec.ts
import { test, expect } from '@playwright/test';

const BASE = 'https://example.com/category/news';

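test('page 1 has no noindex directive', async ({ page }) => {
  const response = await page.goto(`${BASE}/page/1/`);
  // Check both delivery channels: the HTTP header and the robots meta tag
  expect(response?.headers()['x-robots-tag'] ?? '').not.toContain('noindex');
  const robots = await page.$eval(
    'meta[name="robots"]',
    (el: HTMLMetaElement) => el.content
  ).catch(() => '');
  expect(robots).not.toContain('noindex');
});

test('page 3 has noindex, follow', async ({ page }) => {
  const response = await page.goto(`${BASE}/page/3/`);
  const header = response?.headers()['x-robots-tag'] ?? '';
  // Split into directives so that "nofollow" cannot satisfy the "follow" check
  const directives = header.split(',').map((d) => d.trim().toLowerCase());
  expect(directives).toContain('noindex');
  expect(directives).toContain('follow');
});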

test('out-of-range page returns 404', async ({ page }) => {
  const response = await page.goto(`${BASE}/page/9999/`);
  expect(response?.status()).toBe(404);
});

Validation command:

npx playwright test tests/pagination.spec.ts --reporter=line

SEO Impact Summary

Configured correctly	Misconfigured
Crawl budget concentrates on page 1 and unique content pages	All paginated pages consume equal crawl budget, diluting coverage of new content
GSC Coverage shows low “Duplicate without user-selected canonical” count	Thousands of paginated URLs appear as duplicates or soft-404s in Coverage
Link equity from inbound links to category root consolidates on page 1	Equity scatters across `/page/2/`, `/page/3/`, returning no ranking benefit
Organic traffic stable or growing on category root	Traffic decay on category root as Googlebot’s preferred URL shifts to an arbitrary paginated slice

Measurable signals to watch (first 4 weeks after deployment):

GSC → Coverage → “Duplicate without user-selected canonical” — should decline within 2 crawl cycles.
Server log bot hits on ?page= or /page/[2-9] — should drop as Googlebot learns not to re-crawl noindex pages.
Category root rankings — should stabilise or improve as consolidated link equity takes effect.

Edge Cases and Gotchas

Preview environments inject wrong canonical domains. If NEXT_PUBLIC_SITE_URL is set to a branch-preview URL in staging, canonical tags will point to https://preview-xyz.vercel.app/... Because NEXT_PUBLIC_ variables are inlined into the bundle at build time, that value leaks into production whenever a preview build is promoted without a rebuild or the variable is not overridden. Always derive the canonical base from a server-side environment variable set per deployment target, not a public browser variable.
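
A guard along these lines can fail fast when the canonical base is missing or points at the wrong host. This is a minimal sketch: SITE_URL is a hypothetical server-only variable name, and canonicalOrigin is an illustrative helper, not a framework API.

```typescript
// Sketch: derive the canonical origin from a server-side variable.
// SITE_URL is a hypothetical name; use whatever your deployment target defines.
function canonicalOrigin(
  env: Record<string, string | undefined>,
  expectedHost?: string
): string {
  const raw = env['SITE_URL'];
  if (!raw) throw new Error('SITE_URL is not set for this deployment target');
  const url = new URL(raw); // throws on malformed values
  if (url.protocol !== 'https:') {
    throw new Error(`SITE_URL must use https: ${raw}`);
  }
  // Optional guard: refuse preview hosts when a production host is expected.
  if (expectedHost !== undefined && url.host !== expectedHost) {
    throw new Error(`SITE_URL host ${url.host} does not match ${expectedHost}`);
  }
  return url.origin; // drops any path, query, or trailing slash
}
```

Calling it as canonicalOrigin(process.env, 'example.com') in the production build step would surface a leaked preview URL before any canonical tag is rendered.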

Multi-locale pagination doubles URL surface area. A site with en, fr, and de locales at /fr/category/page/2/ multiplies paginated URLs by the locale count. Apply the same noindex, follow logic per locale and ensure hreflang tags on page 1 reference only the page-1 equivalents in each locale — not paginated alternates.
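
The hreflang rule can be sketched as a function that only ever emits page-1 alternates. The URL layout /{locale}/category/{slug}/ and the choice to emit no alternates on pages 2 and beyond are assumptions made for this example; hreflangForCategory is a hypothetical helper.

```typescript
interface HreflangLink {
  hreflang: string;
  href: string;
}

// Sketch: build hreflang alternates for a category listing.
// Assumes locale-prefixed URLs of the form /{locale}/category/{slug}/.
function hreflangForCategory(
  origin: string,
  locales: string[],
  slug: string,
  page: number
): HreflangLink[] {
  // Paginated pages are noindex in this scheme, so only page 1 advertises alternates.
  if (page !== 1) return [];
  return locales.map((locale) => ({
    hreflang: locale,
    // Always the page-1 URL of the other locale, never a /page/{n}/ alternate.
    href: `${origin}/${locale}/category/${encodeURIComponent(slug)}/`,
  }));
}
```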

Incremental Static Regeneration (ISR) and stale noindex headers. If you use ISR and regenerate paginated pages on demand, a page number that was previously out of range may become valid as content is added, and a cached response can outlive that change. Determine the noindex directive dynamically at request time (via middleware) rather than baking it into the static payload at build time.

Infinite scroll UX with no fallback routes. Infinite scroll that replaces discrete page URLs is invisible to crawlers. Always maintain crawlable /page/{n}/ fallback paths. Serve the paginated HTML route server-side to every visitor and load the infinite scroll JS on top as a progressive enhancement. Avoid branching on bot user agents, since serving crawlers different content from humans risks being treated as cloaking.

Temporary redirects during URL normalisation. When migrating from ?page= to /page/, do not use 302 redirects. A 302 signals that the move is temporary, so Googlebot may keep treating the original parameterised URL as the canonical, which delays the benefits of slug normalisation. Use single-hop 301 redirects and verify them with curl -IL.
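
The mapping from query-string pagination to path-based pagination can be expressed as one function that yields the final URL in a single hop. This is a sketch under the assumption that the target scheme uses trailing slashes and that page 1 collapses onto the category root; paginationRedirect is an illustrative name.

```typescript
interface Redirect {
  status: 301;
  location: string;
}

// Sketch: map ?page={n} to /page/{n}/ with a single permanent redirect.
// Returns null when the URL has no valid page parameter and needs no redirect.
function paginationRedirect(requestUrl: string): Redirect | null {
  const url = new URL(requestUrl);
  const raw = url.searchParams.get('page');
  if (raw === null || !/^[1-9]\d*$/.test(raw)) return null;
  // Remove the parameter but keep any other query parameters intact.
  url.searchParams.delete('page');
  const basePath = url.pathname.replace(/\/+$/, '');
  // Page 1 collapses onto the category root so it has exactly one URL.
  url.pathname = raw === '1' ? `${basePath}/` : `${basePath}/page/${raw}/`;
  return { status: 301, location: url.toString() };
}
```

Because the function returns the final destination directly, the redirect never passes through an intermediate slash-less or parameterised URL.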

Frequently Asked Questions

Should I use rel="next" and rel="prev" for headless pagination? Google announced in 2019 that it no longer uses them as an indexing signal, but other engines, reportedly including Bing and Yandex, may still parse them for sequence mapping. Implement them alongside self-referencing canonicals for cross-engine compatibility. Inject them via <link> elements in <head> during SSR, not in a client-side useEffect.
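
A minimal sketch of the tag set for one page follows. It assumes trailing-slash URLs, a trusted base value that needs no HTML escaping, and that page 1 lives at the category root; paginationLinkTags is a hypothetical helper, and in a framework you would normally emit these through its metadata or head API rather than as strings.

```typescript
// Sketch: URL for a given page, with page 1 mapped to the category root.
function paginatedUrl(base: string, page: number): string {
  return page === 1 ? `${base}/` : `${base}/page/${page}/`;
}

// Sketch: canonical plus prev/next <link> tags for server-side rendering.
function paginationLinkTags(base: string, page: number, totalPages: number): string[] {
  // Every page declares itself as canonical.
  const tags = [`<link rel="canonical" href="${paginatedUrl(base, page)}">`];
  // The first page has no predecessor.
  if (page > 1) {
    tags.push(`<link rel="prev" href="${paginatedUrl(base, page - 1)}">`);
  }
  // The last page has no successor.
  if (page < totalPages) {
    tags.push(`<link rel="next" href="${paginatedUrl(base, page + 1)}">`);
  }
  return tags;
}
```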

What baseline metrics indicate pagination indexation failure? GSC shows more than 30 % of category URLs indexed as duplicates. Server logs reveal high crawl frequency on ?page= parameters. Organic traffic declines on deep category pages while comparable non-paginated pages hold steady.

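How do I validate canonical consistency across paginated API responses? Run curl -I against multiple page parameters and, if you send the canonical as an HTTP Link header (rel="canonical"), verify that it matches the intended URL. If the canonical is declared only in a <link> element, as in the SSR examples above, curl -I will not show it: fetch the HTML with curl -s and grep for rel="canonical" instead. Cross-reference results with the GSC URL Inspection tool for Googlebot’s rendered view.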

Should deep pagination pages be noindex or excluded from the sitemap? Both. Apply noindex, follow via X-Robots-Tag so crawlers pass link equity through without indexing the page, and omit pages 2+ from the primary XML sitemap to avoid crawl-budget waste on discovery.

Related:

Pagination Handling in Headless — framework implementations, API contract mapping, and URL standardisation for paginated routes
Canonical URL Enforcement — edge-layer and SSR patterns for keeping canonical tags consistent across environments
Crawl Budget Impact in Headless — how crawler allocation is affected by paginated URLs and how to reclaim it
XML Sitemap Generation for Headless — sitemap chunking patterns that complement pagination noindex directives
Slug Normalization Strategies — 301 redirect patterns for migrating from ?page= to clean path-based pagination