Why Your Python URL Checker Stops Working at 5,000 URLs

Carlos · May 28, 2026 · 10 min read

You wrote thirty lines of Python. It worked for fifty URLs. You ran it on five thousand and got back forty-eight hundred errors.

If you have built a URL checker in Python before, you already know the wall. The first thousand URLs work. Five thousand mostly work but the success rate drops to maybe ninety percent. Twenty-five thousand becomes a coin flip. Seventy-five thousand never finishes.

This post walks through exactly why a homegrown Python URL checker breaks at scale, with code snippets you can recognize from your own repo, and the fixes (or the buy-not-build alternative) for each failure mode.

The starting point: thirty lines of httpx

Almost every Python URL checker starts the same way:

# v1: works fine for 50 URLs, dies at 5,000
import asyncio
import httpx

async def check_one(client, url):
    try:
        r = await client.head(url, follow_redirects=True, timeout=10.0)
        return url, r.status_code
    except Exception as e:
        return url, str(e)

async def check_all(urls):
    async with httpx.AsyncClient() as client:
        return await asyncio.gather(*[check_one(client, u) for u in urls])

if __name__ == "__main__":
    with open("urls.txt") as f:
        urls = [line.strip() for line in f if line.strip()]
    results = asyncio.run(check_all(urls))
    for url, status in results:
        print(status, url)

For a small list this is fine. The script fans out, every request goes in parallel, and you get a flat list back in seconds. The problem starts when the list grows. Here is why, in order of how soon each failure mode bites.

Failure mode 1: your IP gets rate-limited

The first wall is the simplest. You are checking 5,000 URLs from a single IP address. Many of those URLs share a domain (or a CDN, or a hosting provider) that does not appreciate one client opening hundreds of connections in two seconds.

A typical signal pattern looks like this:

# Real output from a script hitting 5,000 URLs from one home IP
200 https://example.com/page-1
200 https://example.com/page-2
200 https://example.com/page-3
...
429 https://example.com/page-417
429 https://example.com/page-418
403 https://example.com/page-419
429 https://example.com/page-420
...

What is happening: the target server has detected too many requests from your IP in too short a window. Cloudflare, AWS WAF, and Akamai all offer this kind of rate limiting, and many sites turn it on. Some return 429 Too Many Requests with a Retry-After header. Some return 403 Forbidden with no explanation. Some quietly time out.
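When a server does send Retry-After, the value is either a number of seconds or an HTTP date, so it needs a little parsing before you can sleep on it. A minimal sketch using only the standard library; the function name `retry_after_seconds` and its `now` parameter are illustrative, not part of httpx:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(value: Optional[str],
                        now: Optional[datetime] = None) -> Optional[float]:
    """Return the wait in seconds, or None if the header is missing or unparseable."""
    if not value:
        return None
    value = value.strip()
    if value.isdigit():
        # Form 1: delay in seconds, e.g. "Retry-After: 120"
        return float(value)
    try:
        # Form 2: HTTP date, e.g. "Retry-After: Wed, 21 Oct 2015 07:28:00 GMT"
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        # Treat a date without a usable offset as UTC (an assumption of this sketch).
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means "retry now", never a negative sleep.
    return max(0.0, (when - now).total_seconds())
```

In the checker you would call it as `retry_after_seconds(r.headers.get("Retry-After"))` and fall back to your own backoff when it returns None.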

The naive fix is to add a semaphore:

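# v2: limit to 50 concurrent requests
import asyncio

sem = asyncio.Semaphore(50)  # global cap on in-flight requests

async def check_one(client, url):
    async with sem:  # wait for a free slot before sending
        try:
            r = await client.head(url, follow_redirects=True, timeout=10.0)
            return url, r.status_code
        except Exception as e:
            return url, str(e)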

This helps. It also slows your script down by 20x or more. And it does not fix the real problem, which is that you are still hitting the target from one IP. The smarter fix is a residential proxy pool, which costs $90+ per month from Webshare or Bright Data and adds a meaningful chunk of code to integrate.

Failure mode 2: per-domain throttling, not per-script

A subtler version of the same problem: the global concurrency limit is set to 50, but 800 of your URLs all live on shopify.com. Nothing stops all 50 slots from being filled with requests to that one domain at the same moment, and Shopify (rightly) rate-limits you immediately.

The fix is per-domain rate limiting. Conceptually simple, fiddly to implement well:

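import asyncio
from collections import defaultdict
from urllib.parse import urlparse

# One semaphore per host, created lazily: at most 5 in-flight requests per domain
domain_locks = defaultdict(lambda: asyncio.Semaphore(5))

async def check_one(client, url):
    # hostname is lowercased and excludes any port or credentials
    domain = urlparse(url).hostname or ""
    async with domain_locks[domain]:
        r = await client.head(url, follow_redirects=True, timeout=10.0)
        return url, r.status_code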

The right per-domain limit varies by host. Shopify wants under 4 requests per second. Most WordPress sites tolerate 10 per second. A Cloudflare-fronted SaaS might block at 60. Note that a semaphore caps concurrent requests, not requests per second, so the number you pick is only a stand-in for the rate. You learn the right value by trial and error, by watching for 429s and 503s, and by remembering which domains misbehaved on previous runs.

The robust version of this code grows to handle: bursting then settling, retry-after-aware backoff per domain, and a “this domain is hopeless right now, skip it and come back later” mode. That is a non-trivial side project on its own.
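To make the "skip it and come back later" idea concrete, here is a minimal sketch of per-domain cooldown bookkeeping. The class name `DomainCooldown`, its methods, and the default delays are all hypothetical; the clock is injectable so the logic can be tested without sleeping:

```python
import time
from typing import Callable, Dict, Optional


class DomainCooldown:
    """Track a 'do not contact before' time per host (hypothetical helper)."""

    def __init__(self, base_delay: float = 1.0, max_delay: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.clock = clock
        self._blocked_until: Dict[str, float] = {}
        self._strikes: Dict[str, int] = {}

    def penalize(self, host: str, retry_after: Optional[float] = None) -> float:
        """Record a 429/503 for host and return the cooldown applied, in seconds."""
        strikes = self._strikes.get(host, 0) + 1
        self._strikes[host] = strikes
        # Exponential backoff per host: base, 2x base, 4x base, ... up to max_delay.
        backoff = min(self.max_delay, self.base_delay * 2 ** (strikes - 1))
        # Prefer the server's own Retry-After hint when it is longer than our backoff.
        delay = max(backoff, retry_after or 0.0)
        self._blocked_until[host] = self.clock() + delay
        return delay

    def reward(self, host: str) -> None:
        """A success clears the strikes so the host can speed up again."""
        self._strikes.pop(host, None)
        self._blocked_until.pop(host, None)

    def wait_time(self, host: str) -> float:
        """Seconds until host may be contacted again (0.0 if it is ready now)."""
        return max(0.0, self._blocked_until.get(host, 0.0) - self.clock())
```

A scheduler built on this would check `wait_time(host)` before each request and push the URL to the back of the queue when the answer is not zero, instead of sleeping while holding a concurrency slot.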

Failure mode 3: timeouts cascade and waste your concurrency budget

Default httpx.AsyncClient() has a 5-second timeout. You override it to 10 seconds because some of your URLs are slow. Now imagine 600 of those 5,000 URLs are actually dead: DNS does not resolve, the server is gone, or the connection hangs.

Each of those 600 requests holds a slot in your concurrency pool for up to ten full seconds. While a request is stuck waiting to time out, its slot does no useful work, and real throughput can drop by half. The fix is more nuanced timeouts:

# v3: separate connect / read timeouts
import httpx

client = httpx.AsyncClient(
    timeout=httpx.Timeout(
        connect=5.0,   # DNS + TCP handshake; should be quick
        read=15.0,     # max wait for each chunk of the response, not the whole download
        write=10.0,    # max wait to send each chunk of the request
        pool=2.0,      # waiting for a connection from the pool
    ),
    limits=httpx.Limits(max_connections=100, max_keepalive_connections=20),
)

Plus retry logic that distinguishes “transient” failures (TCP reset, connection timeout) from “permanent” ones (DNS NXDOMAIN, HTTP 404). Plus a circuit breaker that gives up on hosts after N consecutive failures so you do not spend half your run on a single dead domain.
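The circuit breaker part can be sketched in a few lines. This is a simplified illustration, not a production design: the class name `HostCircuitBreaker` and the default of 5 failures are assumptions, and a fuller version would also reopen hosts after a cooling-off period.

```python
from collections import defaultdict
from typing import DefaultDict


class HostCircuitBreaker:
    """Stop contacting a host after too many consecutive failures (hypothetical helper)."""

    def __init__(self, max_failures: int = 5) -> None:
        self.max_failures = max_failures
        self._consecutive: DefaultDict[str, int] = defaultdict(int)

    def record_failure(self, host: str) -> None:
        self._consecutive[host] += 1

    def record_success(self, host: str) -> None:
        # Any success proves the host is alive, so the count starts over.
        self._consecutive[host] = 0

    def is_open(self, host: str) -> bool:
        """True when the host should be skipped for the rest of the run."""
        return self._consecutive[host] >= self.max_failures
```

Inside `check_one` you would test `is_open(host)` before sending, return something like "skipped_dead_host" when it is true, and call `record_failure` or `record_success` once the request finishes.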

Each layer adds a hundred lines of code and a week of tuning.

Failure mode 4: 200 OK can still mean broken (soft 404s)

Here is the one that catches every URL checker on its first real run: a URL returns 200 OK, but the page itself says “Page Not Found” or “Sorry, this article has moved.” The server is happily serving an error page with a 200 status code instead of a proper 404.

A naive checker reports these as healthy. Your client opens their broken-link report, follows one of the “working” URLs, and sees a 404 page. You look incompetent.

The fix is to actually inspect the body of every response that returns 2xx:

# v4: detect soft 404s
SOFT_404_PATTERNS = [
    "page not found",
    "not found",
    "this page doesn't exist",
    "404 error",
    "the page you are looking for",
    "sorry, this content",
]

async def check_one(client, url):
    r = await client.get(url, follow_redirects=True, timeout=10.0)
    if 200 <= r.status_code < 300:
        body = r.text[:5000].lower()  # only the first 5,000 characters
        if any(p in body for p in SOFT_404_PATTERNS):
            return url, "soft_404"
    return url, r.status_code

This is a starting point. The real version handles localized pages (German “Seite nicht gefunden”, French “Page introuvable”), checks the <title> separately, and ignores false positives on documentation pages that legitimately mention “404”. You are now writing a small NLP system as a side project to your URL checker.
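Checking the <title> separately is one of the cheaper refinements, because a documentation page that mentions "404" in its body rarely puts it in the title. A minimal sketch with the standard library's HTML parser; the names `extract_title`, `title_suggests_soft_404`, and the pattern list are illustrative assumptions:

```python
from html.parser import HTMLParser
from typing import List

# Illustrative patterns only; the German and French phrases are the ones mentioned above.
TITLE_PATTERNS = ("not found", "404", "seite nicht gefunden", "page introuvable")


class _TitleParser(HTMLParser):
    """Collect the text of the first <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self._done = False
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True  # ignore any later <title> (e.g. inside inline SVG)

    def handle_data(self, data):
        if self._in_title:
            self.parts.append(data)


def extract_title(html_text: str) -> str:
    """Return the page title with whitespace collapsed, or '' if there is none."""
    parser = _TitleParser()
    parser.feed(html_text)
    parser.close()
    return " ".join("".join(parser.parts).split())


def title_suggests_soft_404(html_text: str) -> bool:
    title = extract_title(html_text).lower()
    return any(pattern in title for pattern in TITLE_PATTERNS)
```

A title match is a stronger signal than a body match, so one reasonable design is to flag on the title alone and require two or more body patterns before flagging on the body.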

You also just changed every request from HEAD to GET, which means downloading the full HTML for every URL. Your bandwidth bill triples. Your script is slower. The CDN sees more weight from you and gets more annoyed.

Failure mode 5: retry classification is its own can of worms

A request fails. Should you retry it?

429 with Retry-After: yes, after the header value.
429 without Retry-After: yes, with exponential backoff.
503: yes, slowly.
403: maybe. From a CDN, you are being blocked. From an origin, the resource really is forbidden.
500 / 502 / 504: yes, but with a different curve.
404: no, that is a real answer.
ConnectionResetError: maybe. Could be a flaky middlebox, could be the host blocking you.
SSLError: no, the cert is broken, that is a real answer.
httpx.TimeoutException (or asyncio.TimeoutError if you wrap calls in asyncio.wait_for): maybe. Could be transient, could be a hung server.

Each of these decisions involves trade-offs. Retry too aggressively and you waste hours on dead hosts. Retry too conservatively and you mark working URLs as broken because of one flaky moment.
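The status-code half of that list can be written down as a small decision function plus a backoff helper. This is a deliberately simplified sketch: the names `Decision`, `classify_status`, and `backoff_delay` are hypothetical, every retryable status shares one backoff curve (the list above suggests different curves for 503 and 5xx), and exception handling is left out.

```python
import random
from enum import Enum
from typing import Callable, Optional


class Decision(Enum):
    RETRY = "retry"
    MAYBE = "maybe"      # needs more context, e.g. CDN block vs. origin 403
    GIVE_UP = "give_up"  # the response is a real answer


def classify_status(status_code: int) -> Decision:
    """Classify a failed response by status code alone."""
    if status_code in (429, 500, 502, 503, 504):
        return Decision.RETRY
    if status_code == 403:
        return Decision.MAYBE
    # 404 and every other status are treated as final in this sketch.
    return Decision.GIVE_UP


def backoff_delay(attempt: int, retry_after: Optional[float] = None,
                  base: float = 1.0, cap: float = 60.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        # The server told us how long to wait, so honor it.
        return retry_after
    # Exponential backoff with full jitter: a random point in [0, min(cap, base * 2^attempt)).
    ceiling = min(cap, base * 2 ** attempt)
    return rng() * ceiling
```

The jitter matters at scale: without it, every URL that failed in the same second retries in the same second, and you trip the rate limit again.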

Most URL checkers get this wrong for months before they get it right. Every wrong decision costs you a re-run, and at 75,000 URLs a re-run is hours.

The combined effect: a graph that flatlines

If you plot a homegrown URL checker's “URLs successfully checked per minute” over a long run, the graph looks like a sigmoid curve flipped upside down. The first thousand fly through. The second thousand are slower because per-IP rate limits are starting to hit. By the fifth thousand you are dominated by timeouts and retries on a few stuck domains. By twenty-five thousand the script is mostly idle, waiting on backoff windows it cannot escape.

You can fix this. People do. The fix is six to eight weeks of focused work and a few hundred dollars per month of residential proxy budget. If you enjoy infrastructure work and have the time, building it is genuinely interesting.

Or: don't build it

The case for using a managed service is straightforward. Someone else has already done all five of the above. They have a residential proxy pool, per-domain rate limiting tuned across thousands of hosts, retry classification that handles every weird edge case, soft-404 detection that works in multiple languages, and a job queue that resumes after crashes.

We built Bulk URL Checker precisely for this. Our SDK is five lines of Python:

# pip install bulkurlchecker
from bulkurlchecker import Client

client = Client(api_key="uck_live_YOUR_KEY")
results = client.check_urls(my_75000_urls)  # blocks until done

for r in results.broken:
    print(r.status_code, r.url, "->", r.final_url)

The first 300 URLs are free, no credit card. Past that, paid plans start at $9/month. You can read the full REST API reference, browse copy-paste recipes for common patterns, or just try it from the CLI:

pip install 'bulkurlchecker[cli]'
export BULKURLCHECKER_API_KEY=uck_live_...
bulkurlchecker check my_urls.txt --output csv > report.csv

If you would rather keep building your own, that is a legitimate call. The next post in this series, The hidden $500/mo cost of rolling your own bulk URL checker, walks through the actual costs (engineering hours, proxy bills, downtime) so the buy-versus-build decision has real numbers.

Either way: please do not ship a URL checker that breaks silently at 5,000 URLs. Your future self will not forgive you when the next big audit job rolls in.
