Soft 404 Detection: When HTTP 200 Actually Means Broken
A soft 404 is a page that returns HTTP 200 OK while displaying a “page not found” message to the user. The server says the resource is fine. The page itself says the resource is gone. Your link checker, looking only at the status code, marks the URL as healthy. Your client opens the report, clicks the link, and gets a 404 page anyway.
This is the single most common failure mode of homegrown URL checkers. The fix is straightforward in concept and surprisingly tricky in practice.
Why soft 404s exist in the first place
Soft 404s come from CMS defaults, careless framework templates, and well-meaning “helpful” redirects:
- CMS “catch-all” routes. A Shopify storefront with no product at /products/discontinued-item often returns 200 plus a styled “Sorry, this product is no longer available” page. The status code matches the template, not the resource.
- SPA fallbacks. Apps built with Next.js, React Router, and friends are often deployed so that index.html is served with a 200 for any path that does not match a known route. The JavaScript then renders “404” client-side. To a non-rendering checker, every URL looks healthy.
- Redirect-to-home behavior. A site redirects 404s to / with a 301, returning 200 at the final URL. Technically the link “works,” but from a user's perspective the article they clicked for is gone.
- Servers that lie. Some legacy systems were configured to return 200 for everything because someone thought 404s “looked bad” in monitoring.
Google's own documentation explicitly warns about soft 404s for SEO reasons: the search engine wastes crawl budget on them and may de-prioritize the site. From a link-checker's perspective, soft 404s are worse than hard 404s because they hide.
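One way to find out which of these behaviors a site exhibits is to request a path that almost certainly does not exist and look at what comes back. The sketch below is an illustration, not part of the detector described in this post: `probe_url`, `classify_missing_page_behavior`, and the `fetch` callable are hypothetical names, and `fetch` is assumed to follow redirects and return the final status code and final URL.

```python
import uuid
from typing import Callable, Tuple
from urllib.parse import urljoin, urlparse

# Assumed shape of a fetcher: takes a URL, follows redirects, and returns
# (final_status_code, final_url). Swap in your own HTTP client.
Fetcher = Callable[[str], Tuple[int, str]]


def probe_url(base_url: str) -> str:
    """Build a URL on the same site that almost certainly does not exist."""
    return urljoin(base_url, f"/soft404-probe-{uuid.uuid4().hex}")


def classify_missing_page_behavior(base_url: str, fetch: Fetcher) -> str:
    """Describe how a site answers a request for a nonexistent path."""
    status, final_url = fetch(probe_url(base_url))
    if status in (404, 410):
        return "hard-404"  # the server reports missing pages honestly
    if 200 <= status < 300:
        # The probe path is never the root, so landing on "/" means a redirect.
        if urlparse(final_url).path in ("", "/"):
            return "redirects-home"
        return "returns-200"  # catch-all route, SPA fallback, or similar
    return "other"  # 403, 5xx, and so on: inconclusive
```

In practice `fetch` could wrap `httpx.get(url, follow_redirects=True)` and return `(r.status_code, str(r.url))`. A site that answers "returns-200" or "redirects-home" is one where status codes alone cannot be trusted, so content-based detection matters most there.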
The detection problem
The reliable signal that a 200 response is actually a 404 is some combination of:
- The <title> contains a not-found phrase.
- The visible body text contains a not-found phrase.
- The page is unusually short (lots of templating, no actual content).
- The final URL after redirects is the homepage or a search page.
- A canonical link points elsewhere.
Any one of these signals on its own produces false positives. A documentation page that legitimately explains 404 status codes will match the body-text signal. A link that really does point at the homepage will match the final-URL signal. A short article will match the length signal.
The reliable approach combines signals. Here is a starting Python implementation:
# soft_404.py
import re
from urllib.parse import urlparse

import httpx

# Multi-language not-found phrases. Order matters slightly: more
# specific patterns first so we exit early on confident matches.
NOT_FOUND_PATTERNS = [
    # English
    r"\bpage not found\b",
    r"\bnot found\b",
    r"404 error",
    r"the page (you are looking for|requested) (does not|doesn't) exist",
    r"this content (has moved|is no longer available)",
    # Spanish
    r"p\u00e1gina no encontrada",
    # French
    r"page introuvable",
    r"page non trouv\u00e9e",
    # German
    r"seite nicht gefunden",
    # Portuguese
    r"p\u00e1gina n\u00e3o encontrada",
    # Italian
    r"pagina non trovata",
]
NOT_FOUND_RE = re.compile("|".join(NOT_FOUND_PATTERNS), re.IGNORECASE)


def is_soft_404(response: httpx.Response, original_url: str) -> bool:
    # Only successful responses can be soft 404s; real errors are handled elsewhere.
    if not (200 <= response.status_code < 300):
        return False

    body = response.text[:20_000]  # cap to the first ~20K characters to bound work

    # Signal 1: title says not-found.
    title_match = re.search(r"<title[^>]*>([^<]+)</title>", body, re.IGNORECASE)
    title = title_match.group(1) if title_match else ""
    title_says_404 = bool(NOT_FOUND_RE.search(title))

    # Signal 2: body says not-found, but only count matches in
    # rendered content (skip <head> + <script> + <style>).
    rendered = _strip_head_and_scripts(body)
    body_says_404 = bool(NOT_FOUND_RE.search(rendered))

    # Signal 3: final URL is the homepage despite a non-root request.
    redirected_home = (
        str(response.url) != original_url
        and urlparse(str(response.url)).path in ("", "/")
        and urlparse(original_url).path not in ("", "/")
    )

    # Signal 4: very little visible text. Drop the remaining tags and
    # collapse whitespace so markup does not count toward the length.
    visible_text = " ".join(re.sub(r"<[^>]+>", " ", rendered).split())
    is_short = len(visible_text) < 800

    # Combine: title is a strong signal. Body alone is medium.
    # Redirected-to-home alone is medium. Two mediums beat one.
    if title_says_404:
        return True
    if body_says_404 and redirected_home:
        return True
    if body_says_404 and is_short:
        return True
    return False


def _strip_head_and_scripts(html: str) -> str:
    # Remove whole elements whose text is never shown to the user.
    html = re.sub(r"<head[^>]*>.*?</head>", "", html, flags=re.DOTALL | re.IGNORECASE)
    html = re.sub(r"<script[^>]*>.*?</script>", "", html, flags=re.DOTALL | re.IGNORECASE)
    html = re.sub(r"<style[^>]*>.*?</style>", "", html, flags=re.DOTALL | re.IGNORECASE)
    return html

This catches most server-rendered soft 404s and avoids the obvious false positives. It does not catch every case, but it is a solid 80% solution that you can run against real workloads.
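The detector above leaves out the canonical-link signal from the earlier list. A minimal sketch of that check, using only the standard library, might look like the following. The names `_CanonicalFinder` and `canonical_points_elsewhere` are hypothetical, and the comparison deliberately ignores query strings and trailing slashes, which is an assumption you may want to change.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin, urlparse


class _CanonicalFinder(HTMLParser):
    """Records the href of the first <link rel="canonical"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.href is not None:
            return
        attr = dict(attrs)
        # rel may hold several space-separated tokens, or no value at all.
        rel_values = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_values and attr.get("href"):
            self.href = attr["href"]


def _normalize_path(path: str) -> str:
    # Treat "/post" and "/post/" as the same page.
    return path.rstrip("/") or "/"


def canonical_points_elsewhere(html: str, requested_url: str) -> bool:
    """True if the page declares a canonical URL on a different host or path."""
    finder = _CanonicalFinder()
    finder.feed(html)
    if finder.href is None:
        return False  # no canonical tag: no evidence either way
    canonical = urlparse(urljoin(requested_url, finder.href))
    requested = urlparse(requested_url)
    return (canonical.netloc.lower(), _normalize_path(canonical.path)) != (
        requested.netloc.lower(),
        _normalize_path(requested.path),
    )
```

Legitimate pages also point their canonical elsewhere (paginated lists, tracking-parameter variants), so this is best treated as another medium signal to combine with a body match, not as a verdict on its own.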
The hard cases
SPAs that render “404” client-side
A React, Vue, or Next.js app serving index.html for every path will return 200 with no “not found” phrase in the initial response. The rendered page in a real browser says “404,” but a non-rendering checker cannot see it.
The fix is to render the page in a headless browser and check the post-render DOM. Playwright works:
# playwright_soft_404.py
from playwright.async_api import async_playwright

from soft_404 import NOT_FOUND_RE  # the compiled pattern from soft_404.py


async def is_soft_404_playwright(url: str) -> bool:
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        try:
            page = await browser.new_page()
            response = await page.goto(url, wait_until="networkidle", timeout=15000)
            if not response or not (200 <= response.status < 300):
                return False
            # Check the text a user would actually see after client-side rendering.
            body_text = await page.inner_text("body")
            return bool(NOT_FOUND_RE.search(body_text))
        finally:
            # Close the browser even if navigation times out or raises.
            await browser.close()

This works, but it's roughly ten times slower than the regex approach and significantly more expensive in compute and memory. For a 75,000-URL job you do not want to render every URL. The pragmatic split: regex-check every URL, fall back to Playwright only on URLs that smell suspicious (very short content, no title, or an unusual response time).
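The triage step described above could be sketched as a small pure function that decides whether a response deserves a browser render. Everything here is an assumption for illustration: the name `needs_browser_render`, the 200-character floor, and the rule that "unusual" means more than three times the typical response time for the site.

```python
import re

TITLE_RE = re.compile(r"<title[^>]*>\s*([^<]*?)\s*</title>", re.IGNORECASE)
SCRIPT_STYLE_RE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1>", re.DOTALL | re.IGNORECASE
)
TAG_RE = re.compile(r"<[^>]+>")


def visible_text_length(html: str) -> int:
    """Approximate how much text a user would see without running JavaScript."""
    text = SCRIPT_STYLE_RE.sub(" ", html)
    text = TAG_RE.sub(" ", text)
    return len(" ".join(text.split()))


def needs_browser_render(
    html: str,
    elapsed_seconds: float,
    typical_seconds: float,
    min_visible_chars: int = 200,  # assumed threshold, tune per workload
    slow_factor: float = 3.0,      # assumed definition of "unusual" timing
) -> bool:
    """Return True if the response looks like it needs a headless-browser check."""
    title_match = TITLE_RE.search(html)
    has_title = bool(title_match and title_match.group(1))
    too_short = visible_text_length(html) < min_visible_chars
    unusual_time = typical_seconds > 0 and elapsed_seconds > slow_factor * typical_seconds
    return bool(too_short or not has_title or unusual_time)
```

A typical SPA shell (a title, an empty root element, and a script tag) has almost no visible text, so it is routed to the browser stage, while an ordinary server-rendered article is not.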
Localized phrases we missed
Our regex list above covers English, Spanish, French, German, Portuguese, and Italian. Polish, Czech, Russian, Japanese, Chinese, Korean, Arabic, and dozens of others are not in there. If you are checking a multinational client's sites, you need to expand the pattern list (or accept the false negative rate).
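One way to keep that expansion manageable is to store extra phrases per language and compile only the languages a given client needs. The sketch below is illustrative: `EXTRA_NOT_FOUND_PATTERNS` and `build_not_found_re` are hypothetical names, and the phrases are common wordings that should be verified against real error pages before you rely on them. The patterns skip `\b` because word boundaries are not meaningful in scripts written without spaces, such as Japanese.

```python
import re
from typing import Dict, Iterable, List, Pattern

# Illustrative phrases only; confirm each against real not-found pages.
EXTRA_NOT_FOUND_PATTERNS: Dict[str, List[str]] = {
    "pl": [r"nie znaleziono strony", r"strona nie została znaleziona"],
    "cs": [r"stránka nenalezena", r"stránka nebyla nalezena"],
    "ru": [r"страница не найдена"],
    "ja": [r"ページが見つかりません"],
}


def build_not_found_re(
    base_patterns: Iterable[str],
    extra_by_language: Dict[str, List[str]],
    languages: Iterable[str],
) -> Pattern[str]:
    """Compile the base patterns plus the extras for the requested languages."""
    patterns = list(base_patterns)
    for lang in languages:
        patterns.extend(extra_by_language.get(lang, []))
    # Wrap each pattern in a non-capturing group so "|" inside one pattern
    # cannot bleed into its neighbors.
    return re.compile("|".join(f"(?:{p})" for p in patterns), re.IGNORECASE)
```

For example, `build_not_found_re(NOT_FOUND_PATTERNS, EXTRA_NOT_FOUND_PATTERNS, ["pl", "ru"])` would extend the list from soft_404.py with Polish and Russian while leaving Czech and Japanese disabled.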
False positives from legitimate “not found” content
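A docs page titled “Handling 404 Errors in Django” will match the phrase patterns in both its title and its body. The mitigation is to require the not-found phrase to be the dominant content, not incidental. A body match should only count when at least one of the following also holds:
- The page is short (under 800 visible chars).
- The not-found phrase appears in the <title>.
- The URL was redirected to a different path that is the site root.
These conditions are the difference between “catches obvious soft 404s” and “catches obvious soft 404s without falsely flagging tutorial articles.” The title rule is the weakest of the three for this example: a tutorial whose own title contains a pattern such as “404 error” still trips it, so title-only matches are worth spot-checking.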
What Bulk URL Checker does
Our soft-404 detector is the regex stage above, extended to roughly two dozen languages, plus a Playwright fallback for SPAs we detect by their response shape (small initial HTML, heavy client-side JS). Every URL gets an is_soft_404 boolean in the result. The full result envelope is documented in the REST API reference.
We tune the detector continuously based on customer reports of false positives and false negatives, which is one of the reasons buying makes sense for this specific check: every customer who finds a missed soft 404 makes the detector better for everyone.
If you want to try it on a real list, the free tier covers 300 URLs and tells you exactly which of them are hard 404s, soft 404s, redirects, or healthy. The next post in this series, How we built per-domain rate limiting for 75K-URL jobs, goes deeper on the queue + scheduler that makes the soft-404 detection affordable to run at scale.