Resolving Duplicate Content via Slug Standardization

Fix duplicate-URL proliferation in headless CMS deployments by enforcing a deterministic slug normalization pipeline from content creation through edge delivery.

When to apply this fix

Apply this workflow when you observe any of the following conditions:

  • Google Search Console reports “Duplicate without user-selected canonical” for URLs that differ only by case, separators, or encoding (e.g. /Blog/Post-Title, /blog/post_title, /blog/post%20title all serving identical content).
  • Server logs show crawl budget being split across two or more paths that resolve to the same HTML response.
  • Redirect chain management audits reveal chains that originate from slug case variants rather than intentional page moves.

Step 1 — Capture a crawl baseline

Establish pre-intervention metrics before touching any routing logic. Without a baseline you cannot prove the fix worked.

Crawler settings:

  • Disable JavaScript rendering (you need raw HTTP responses, not rendered state)
  • Set crawl depth to 3
  • Enable extraction for rel="canonical" values, hreflang attributes, and HTTP status codes
  • Run a log file parser in parallel to map Googlebot traffic share per URL variant

Metrics to record:

Metric Collection method Target after fix
HTTP 200/301/404 split Access log grep 301 count drops to zero for slug variants
Duplicate URLs sharing one canonical target Crawler export 0
Average redirect chain depth Curl batch script ≤ 1 hop
Crawl budget share on non-canonical paths Log parser < 2%

Validation command — count 200 responses for variant paths:

grep -E 'GET /(Blog|blog)/[A-Z]' /var/log/nginx/access.log | wc -l

Step 2 — Define deterministic normalization rules

A slug normalization strategy only eliminates duplicates if it is deterministic: the same input always produces the same output, with no runtime exceptions. Implement this transformation order:

  1. Unicode NFD decomposition
  2. Strip combining diacritics (̀–ͯ)
  3. Lowercase the result
  4. Replace runs of whitespace and underscores with a single hyphen
  5. Strip any character that is not a–z, 0–9, /, or -
  6. Collapse consecutive hyphens to one
  7. Remove leading/trailing hyphens from each path segment
function normalizeSlug(raw) {
  return raw
    .normalize('NFD')
    .replace(/[̀-ͯ]/g, '')
    .toLowerCase()
    .replace(/[\s_]+/g, '-')
    .replace(/[^a-z0-9\/-]/g, '')
    .replace(/-{2,}/g, '-')
    .replace(/(^-|-$)/gm, '');
}

Validation command — spot-check the function output:

node -e "
function normalizeSlug(raw){return raw.normalize('NFD').replace(/[̀-ͯ]/g,'').toLowerCase().replace(/[\s_]+/g,'-').replace(/[^a-z0-9\/-]/g,'').replace(/-{2,}/g,'-').replace(/(^-|-$)/gm,'');}
console.log(normalizeSlug('Héllo  World_Post'));
// expected: hello-world-post
"

Step 3 — Add a CMS pre-publish validation hook

Prevent future duplicates at the source by blocking malformed slugs before they reach the database. This is the single highest-leverage control point: every slug that escapes the CMS without normalization becomes a redirect rule or a canonical problem downstream.

// Contentful / custom CMS webhook handler
const SLUG_PATTERN = /^[a-z0-9]+(?:-[a-z0-9]+)*$/;

function validateAndNormalizeSlug(raw) {
  const normalized = normalizeSlug(raw.trim());

  if (!SLUG_PATTERN.test(normalized)) {
    throw new Error(
      `Slug "${raw}" cannot be normalized to a valid format. ` +
      `Got "${normalized}" — ensure the title contains only letters, numbers, or spaces.`
    );
  }

  return normalized;
}

Validation command — confirm the hook rejects a known bad input:

node -e "
const {validateAndNormalizeSlug} = require('./slug-validator');
try { validateAndNormalizeSlug('!!!'); } catch(e) { console.log('REJECTED:', e.message); }
"

Step 4 — Deploy edge 301 redirects for legacy variants

For any slug variant that already has external links or crawl history, a 301 redirect transfers link equity cleanly. Deploy these at the edge so Googlebot never reaches the origin with a non-canonical path — this also eliminates crawl budget drain from repeated fetches of redirected paths.

Next.js middleware — normalize and redirect at the edge:

// middleware.js
import { NextResponse } from 'next/server';

function normalizeSlug(pathname) {
  return pathname
    .normalize('NFD')
    .replace(/[̀-ͯ]/g, '')
    .toLowerCase()
    .replace(/[\s_]+/g, '-')
    .replace(/[^a-z0-9\/-]/g, '')
    .replace(/-{2,}/g, '-');
}

export function middleware(req) {
  const { pathname } = req.nextUrl;
  const normalized = normalizeSlug(pathname);

  if (normalized !== pathname) {
    const url = req.nextUrl.clone();
    url.pathname = normalized;
    return NextResponse.redirect(url, { status: 301 });
  }

  return NextResponse.next();
}

export const config = {
  matcher: ['/((?!_next|api|favicon).*)'],
};

SEO impact: Googlebot encounters the redirect on the first crawl of a variant and immediately follows it to the canonical path. The variant is removed from the crawl queue on the next fetch cycle.

Netlify static redirect rules (netlify.toml):

# Redirect known case-variant legacy URLs
[[redirects]]
  from = "/blog/Post-Title"
  to   = "/blog/post-title"
  status = 301
  force  = true

# Catch-all trailing-slash normalization
[[redirects]]
  from = "/blog/*/"
  to   = "/blog/:splat"
  status = 301
  force  = true

Validation command — confirm single-hop 301 resolution:

curl -I -s -L -o /dev/null \
  -w 'Status: %{http_code}  Final URL: %{url_effective}\n' \
  "https://yoursite.com/Blog/Old-Post-Title"
# expected: Status: 200  Final URL: https://yoursite.com/blog/old-post-title

Emitting a Link: <url>; rel="canonical" HTTP response header reinforces the canonical signal before the HTML <head> is parsed. This matters for crawlers that read headers before body content, and for CDN cached responses where the HTML canonical tag may reflect an outdated path.

This complements the canonical URL enforcement rules applied at the document level.

// middleware.js — add to the middleware from Step 4
export function middleware(req) {
  const { pathname } = req.nextUrl;
  const normalized = normalizeSlug(pathname);

  if (normalized !== pathname) {
    const url = req.nextUrl.clone();
    url.pathname = normalized;
    return NextResponse.redirect(url, { status: 301 });
  }

  const res = NextResponse.next();
  const canonical = `https://${req.headers.get('host')}${normalized}`;
  res.headers.set('Link', `<${canonical}>; rel="canonical"`);
  return res;
}

Validation command — confirm the header is present on a live response:

curl -sI "https://yoursite.com/blog/post-title" | grep -i 'link:'
# expected: link: <https://yoursite.com/blog/post-title>; rel="canonical"

Step 6 — Validate and resubmit the sitemap

After deploying edge redirects and canonical headers, regenerate the XML sitemap from your canonical route manifest. Any legacy variant URL still present in sitemap.xml tells Googlebot those paths are canonical — contradicting the headers you just deployed.

Sitemap generation — build from the canonical manifest only:

// scripts/generate-sitemap.js
import { writeFileSync } from 'fs';
import { canonicalRoutes } from './canonical-routes.js'; // your normalized route list

const sitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
${canonicalRoutes.map(url => `  <url><loc>${url}</loc></url>`).join('\n')}
</urlset>`;

writeFileSync('public/sitemap.xml', sitemap);
console.log(`Wrote ${canonicalRoutes.length} canonical URLs to sitemap.xml`);

Validation checklist:

# 1. Validate sitemap XML structure
xmllint --noout public/sitemap.xml && echo "XML valid"

# 2. Confirm zero legacy variant URLs remain in the sitemap
grep -i '/Blog/' public/sitemap.xml | wc -l
# expected: 0

# 3. Diff old vs new sitemap to review changes before submitting
diff old-sitemap.xml public/sitemap.xml | head -40

After confirming the sitemap is clean, submit it via Google Search Console → Sitemaps, or call the GSC URL Inspection API to request a reindex of the canonical paths.


SVG: slug normalization decision flow

Slug normalization decision flow Flowchart showing a raw CMS slug entering the normalization pipeline, a CMS hook blocking invalid slugs at publish, an edge redirect converting variant URLs to canonical form, and a canonical Link header emitted on all responses. Raw CMS slug e.g. "Blog/Post Title" normalizeSlug() NFD → lower → hyphens CMS pre-publish hook: reject or save Edge middleware 301 + canonical header Canonical URL /blog/post-title Sitemap & GSC resubmit ❌ blocked at save Slug Normalization: End-to-End Decision Flow

SEO impact summary

Signal What improves What breaks if misconfigured
Crawl budget All budget goes to canonical paths; variant fetches stop Overlapping redirect rules create loops, multiplying redirect hops
Link equity Inbound links to variants are consolidated via 301 A missing 301 leaves equity on a non-canonical path permanently
Index coverage GSC ‘Duplicate without user-selected canonical’ count drops to 0 Canonical header pointing to a different URL than the <link rel=canonical> tag confuses crawlers
Click-through rate Consistent URL format in SERPs improves brand trust Case-inconsistent URLs in rich results erode trust signals

Measurable signals to watch (14-day window post-deploy):

  • GSC: “Duplicate without user-selected canonical” count
  • GSC: Crawl stats → Total crawl requests (should drop as variants are resolved)
  • Server logs: 301 response count for slug variant paths (should trend toward zero as Googlebot stops re-fetching them)
  • Sitemap: Confirm only canonical form URLs appear

Edge cases and gotchas

Preview environments share the same CMS slug namespace. If your preview URL pattern (/preview/Blog/Post-Title) bypasses the middleware matcher, Googlebot can crawl preview URLs if a deploy leaks them. Exclude preview paths explicitly:

export const config = {
  matcher: ['/((?!_next|api|preview|favicon).*)'],
};

Multi-locale deployments can produce cross-locale collisions. /en/cafe and /fr/café normalize to the same slug under a naive strip-diacritics rule. Scope normalization to the locale segment: strip diacritics only for ASCII-target locales; preserve characters for locales that use them as meaningful distinctions.

CMS auto-slug regeneration on title edits overwrites manual corrections. Disable the auto-slug behavior in your CMS schema after initial slug creation. In Contentful, this means removing the “slug” field’s entry pattern dependency on the title field after first publish. In Sanity, set readOnly: (ctx) => !ctx.newDocument on the slug field.

Incremental static builds may serve stale pre-normalized HTML until the ISR revalidation window expires. If your ISR TTL is 24 hours, a just-deployed normalization fix may not reach all cached pages immediately. Force revalidation of affected paths via the on-demand revalidation API:

# Next.js on-demand ISR revalidation
curl -X POST "https://yoursite.com/api/revalidate?path=/blog/old-path&secret=TOKEN"

Redirect loops from overlapping wildcard rules. A catch-all [[redirects]] rule in netlify.toml that matches the target URL of another rule creates an infinite loop. Confirm single-hop resolution with:

curl -L -s -o /dev/null -w "%{num_redirects} redirect(s) to %{url_effective}\n" \
  "https://yoursite.com/Blog/Old-Post-Title"
# expected: 1 redirect(s) to https://yoursite.com/blog/old-post-title

Frequently Asked Questions

How do I measure duplicate content reduction after deploying slug standardization?

Compare pre- and post-deployment crawl exports for unique indexed URLs. Monitor the GSC “Duplicate without user-selected canonical” report under Coverage → Excluded. Expect a measurable drop within 14 days as Googlebot recrawls normalized paths and stops queuing variants. A faster signal is the access log: 301 responses for case-variant paths should trend to zero within two to three Googlebot crawl cycles.

Should I use 301 redirects or canonical tags to consolidate slug variants?

Use 301 redirects for any legacy variant you can enumerate in a redirect table — they pass link equity cleanly and stop crawl budget waste immediately. Deploy canonical tags as a fallback for dynamically generated variants (e.g. filtered listing URLs, session-parameterized paths) that cannot safely return a 301 without breaking user-facing functionality.

How do I handle localized slug duplicates in a headless architecture?

Scope normalization rules per locale prefix and assign each locale its own slug namespace (/fr/, /de/). Enforce hreflang annotations so Googlebot maps language variants correctly rather than treating them as duplicates. For locales that use diacritics as meaningful characters (e.g. /de/strasse vs /de/straße), do not strip those characters — apply diacritic stripping only to ASCII-target locales.


Part of: Slug Normalization Strategies

Related: