How We Built Per-Domain Rate Limiting for 75K-URL Jobs

Carlos · May 28, 2026 · 11 min read

A naive bulk URL checker treats every URL the same: pull it off the queue, fire the request, move on. That breaks the moment one of the customer's lists has 8,000 URLs on the same domain. The worker fans them all out, the target returns 429s for forty-five minutes straight, and the job runs three times slower than it should.

This post is the architecture writeup of how we keep that from happening at Bulk URL Checker. It is a Redis-backed sliding-window rate limiter, a domain-aware scheduler, and a backoff policy tuned across thousands of real hosts. The detail level is high; if you are evaluating whether to build this yourself, the answer is probably “use a managed service,” but if you want to know what is under the hood, here it is.

The problem, precisely

A job at Bulk URL Checker is a list of 1 to 75,000 URLs. The URLs are not uniformly distributed across domains. A typical job's domain histogram looks like:

One or two domains with 1,000+ URLs (the customer's own site).
A long tail of 5 to 50 domains with 20-200 URLs each (sites the customer links to).
Hundreds of domains with single-digit URLs (random external references).

Our SLA is “75,000 URLs finish in under 30 minutes.” To hit that, we need to sustain at least 42 URLs/second (75,000 URLs / 1,800 seconds); at about 50 URLs/second the job finishes in 25 minutes, which leaves some headroom. We also need to do that without getting rate-limited by any single host, because retries are five to one hundred times more expensive than first-time attempts.

The constraint that makes this hard: those two goals point in opposite directions. High throughput wants maximum parallelism. Polite per-domain behavior wants very little parallelism on the busiest domain. We need to be parallel across domains and serial within them.

The architecture

The checker is built around three data structures and a single worker loop: a Postgres queue table, a Redis sorted set per domain, and a Postgres table of per-domain rates.

1. The pending queue (per job)

On job submit, we INSERT every URL into a Postgres url_items table with status = 'pending'. Workers pull from this table directly via:

-- claim N URLs the worker can actually start right now
WITH next AS (
  SELECT id, url, domain
  FROM url_items
  WHERE status = 'pending'
    AND job_id = $1
  ORDER BY id
  LIMIT $2
  FOR UPDATE SKIP LOCKED
)
UPDATE url_items
SET status = 'claimed', claimed_at = NOW()
FROM next
WHERE url_items.id = next.id
RETURNING url_items.id, url_items.url, url_items.domain;

FOR UPDATE SKIP LOCKED is the trick that lets ten worker replicas share the same queue without coordinating. Each worker claims its own batch; nobody waits on anyone else.

2. The per-domain throttle (Redis)

Before actually firing a request, the worker checks a per-domain budget in Redis:

# Throttle check using redis-py; assumes a Redis server reachable with default settings
import time
import uuid

import redis

r = redis.Redis()

def can_fire(domain: str, rate_per_sec: int) -> bool:
    key = f"throttle:{domain}"
    now_ms = int(time.time() * 1000)
    window_start = now_ms - 1000  # last 1 second
    # unique member, so two hits in the same millisecond do not overwrite each other
    member = f"{now_ms}:{uuid.uuid4().hex}"
    pipe = r.pipeline()  # transactional (MULTI/EXEC) by default
    pipe.zremrangebyscore(key, 0, window_start)  # drop old hits
    pipe.zcard(key)                              # count current hits
    pipe.zadd(key, {member: now_ms})             # tentatively add ours
    pipe.expire(key, 5)                          # auto-cleanup
    _, current_count, _, _ = pipe.execute()
    if current_count >= rate_per_sec:
        # roll back the speculative add and signal "wait"
        r.zrem(key, member)
        return False
    return True

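This is a sliding-window rate limiter over a Redis sorted set. It is exact (no leaky-bucket approximation) and works correctly across multiple worker processes, because the window state lives in Redis rather than in any one worker. The cost is one round-trip per admitted request, plus a second one to roll back when the check fails, at about 0.5ms each in practice.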
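
To make the window semantics concrete, here is a minimal single-process model of the same check. It is a sketch for reasoning and unit tests only, not the production limiter: the class name SlidingWindowModel and the injected now_ms clock are assumptions made for this illustration, and it deliberately keeps state in memory, which is exactly what does not work across workers.

```python
from collections import deque


class SlidingWindowModel:
    """In-memory model of the sorted-set throttle (illustration only)."""

    def __init__(self, rate_per_sec: int, window_ms: int = 1000):
        self.rate_per_sec = rate_per_sec
        self.window_ms = window_ms
        self.hits = deque()  # timestamps (ms) of admitted requests, oldest first

    def can_fire(self, now_ms: int) -> bool:
        # Drop hits at or before the window start, like ZREMRANGEBYSCORE 0..window_start.
        window_start = now_ms - self.window_ms
        while self.hits and self.hits[0] <= window_start:
            self.hits.popleft()
        if len(self.hits) >= self.rate_per_sec:
            return False  # budget exhausted: the caller should wait or requeue
        self.hits.append(now_ms)
        return True


# Three requests per second: the fourth and fifth are refused until old hits expire.
limiter = SlidingWindowModel(rate_per_sec=3)
decisions = [limiter.can_fire(t) for t in (0, 100, 200, 300, 999, 1000, 1100)]
print(decisions)
```

The refusals at 300 ms and 999 ms show the difference from a fixed one-second bucket: capacity comes back one hit at a time as each earlier request ages out of the window.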

3. The domain rate table

Where do we get rate_per_sec? From a per-domain table we maintain based on observed behavior:

CREATE TABLE domain_rates (
    domain         VARCHAR(255) PRIMARY KEY,
    rate_per_sec   SMALLINT     NOT NULL DEFAULT 10,
    last_429_at    TIMESTAMPTZ,
    consecutive_429s SMALLINT   NOT NULL DEFAULT 0,
    updated_at     TIMESTAMPTZ  NOT NULL DEFAULT NOW()
);

New domains start at 10 req/sec. Every time we see a 429 from a domain, we halve its rate. After every streak of 100 successful requests, we raise it by one. The table converges to something close to each host's actual tolerance within a few large jobs.
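
The adjustment rule is a form of additive-increase, multiplicative-decrease. A minimal sketch of it follows, assuming an in-memory DomainRate object in place of the domain_rates row. The class name, the success_streak counter, and the floor of 1 req/sec are assumptions for illustration; the text above does not say how the streak is tracked or what the minimum rate is.

```python
from dataclasses import dataclass

GROW_EVERY = 100  # successes needed before the rate grows by one (from the text)
MIN_RATE = 1      # assumed floor, so repeated halving never reaches zero


@dataclass
class DomainRate:
    rate_per_sec: int = 10   # default for a domain we have not seen before
    success_streak: int = 0  # hypothetical counter, not in the schema above

    def on_429(self) -> None:
        # Multiplicative decrease: halve the rate and restart the streak.
        self.rate_per_sec = max(MIN_RATE, self.rate_per_sec // 2)
        self.success_streak = 0

    def on_success(self) -> None:
        # Additive increase: one extra request per second per full streak.
        self.success_streak += 1
        if self.success_streak >= GROW_EVERY:
            self.rate_per_sec += 1
            self.success_streak = 0


state = DomainRate()
state.on_429()            # 10 -> 5
for _ in range(200):      # two full streaks of 100 successes
    state.on_success()    # 5 -> 6 -> 7
print(state.rate_per_sec)
```

Halving is fast and growth is slow on purpose: a host that pushes back gets relief within one response, while the climb back takes hundreds of clean requests.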

4. The worker loop

Each of the ten worker replicas runs:

import time

def run_worker(job_id):
    while True:
        batch = claim_n_urls_from_db(job_id, limit=50)
        if not batch:
            time.sleep(2)  # queue is empty; poll again shortly
            continue

        grouped = group_by_domain(batch)
        for domain, urls in grouped.items():
            for url in urls:
                # re-read the rate each time so a shrink after a 429 takes effect
                rate = lookup_or_default(domain)
                # block until the per-domain window has room
                while not can_fire(domain, rate):
                    time.sleep(0.05)
                result = check_one(url)
                persist_result(url.id, result)
                if result.status == 429:
                    shrink_rate(domain)
                elif result.is_success:
                    maybe_grow_rate(domain)

The inner while not can_fire loop is the bit that keeps us polite. If 8,000 URLs all live on shopify.com, those URLs serialize against each other across all workers, even though every other domain runs in parallel.

The starvation problem (and the fix)

A naive implementation of the above has a bug: workers that grab a Shopify-heavy batch sit waiting on the throttle while workers with diverse batches sprint ahead. The job's overall throughput drops to whatever the worst-throttled domain allows.

Our fix: when a worker's can_fire check fails, it requeues the URL with a retry_at hint and immediately fetches a new batch that excludes the throttled domain. The throttled URLs come back around naturally when the window opens.

# When throttled, skip this URL for now and grab a fresh batch
def claim_batch_excluding(job_id, excluded_domains, limit=50):
    # SQL: WHERE domain NOT IN (...) ORDER BY id LIMIT N FOR UPDATE SKIP LOCKED
    ...
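
One way the claim query could look, as a hedged sketch rather than our exact statement: the helper below only builds the SQL text and its parameters, so it can be inspected without a database. The function name build_claim_excluding_query is hypothetical, the $1..$3 placeholders assume a driver that uses numbered parameters, and `domain <> ALL(...)` is swapped in for NOT IN because it also behaves correctly when the exclusion list is empty.

```python
CLAIM_EXCLUDING_SQL = """
WITH next AS (
  SELECT id, url, domain
  FROM url_items
  WHERE status = 'pending'
    AND job_id = $1
    AND domain <> ALL($2::text[])  -- skip domains that are currently throttled
  ORDER BY id
  LIMIT $3
  FOR UPDATE SKIP LOCKED
)
UPDATE url_items
SET status = 'claimed', claimed_at = NOW()
FROM next
WHERE url_items.id = next.id
RETURNING url_items.id, url_items.url, url_items.domain;
"""


def build_claim_excluding_query(job_id, excluded_domains, limit=50):
    """Return (sql, params) for claiming a batch that avoids throttled domains."""
    # Sort for a stable parameter order; an empty list excludes nothing.
    params = (job_id, sorted(excluded_domains), limit)
    return CLAIM_EXCLUDING_SQL, params


sql, params = build_claim_excluding_query(42, {"b.example", "a.example"})
print(params)
```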

With this in place, the slowest domain on a job no longer caps the throughput of every other domain. Total job time becomes roughly max(URLs_on_domain / domain_rate for the slowest-draining domain, total_URLs / global_throughput), which is the best any scheduler can do while still respecting the per-domain limits.
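
A quick worked example of that bound, using the numbers from this post (8,000 URLs on one domain, 75,000 in total, 10 req/sec default, 50 URLs/second overall). The function name estimated_job_seconds and the synthetic domain names are assumptions for illustration, and the estimate ignores retries and network latency.

```python
def estimated_job_seconds(urls_per_domain, rate_per_domain, global_throughput,
                          default_rate=10):
    """Lower bound on job time: the slower of the worst domain and the global cap."""
    total_urls = sum(urls_per_domain.values())
    slowest_domain_seconds = max(
        count / rate_per_domain.get(domain, default_rate)
        for domain, count in urls_per_domain.items()
    )
    return max(slowest_domain_seconds, total_urls / global_throughput)


# One heavy domain plus 670 smaller ones, 75,000 URLs in total.
urls = {"big.example": 8000}
urls.update({f"site{i}.example": 100 for i in range(670)})

# At the default 10 req/sec the big domain needs 800 s, so the global cap dominates.
polite = estimated_job_seconds(urls, {}, global_throughput=50)

# If the big domain has been throttled down to 2 req/sec, it dominates instead.
throttled = estimated_job_seconds(urls, {"big.example": 2}, global_throughput=50)

print(polite / 60, throttled / 60)  # minutes
```

In the first case the job is bound by overall throughput (1,500 seconds, or 25 minutes); in the second the single throttled domain stretches it to 4,000 seconds, and no amount of extra parallelism elsewhere can shorten that.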

What we got wrong (and fixed)

Three notable bugs we shipped before we got it right:

The throttle was per-pod, not per-cluster. The first version held the sliding-window state in Python memory, so each of the ten workers had its own counter. Total request rate to a single domain was 10x what we thought. Moving the state to Redis fixed it. (This is why the snippets above use Redis pipelines, not Python locks.)
The rate table never shrank fast enough. We had logic to halve the rate on a 429, but no logic to halve again if the next 429 came in within the same window. A site that started 429ing hard would slowly limp back up to its previous rate, only to get 429'd again two minutes later. Adding consecutive_429s to the schema and backing off exponentially on repeated 429s fixed it.
We mis-grouped subdomains. The initial code keyed the throttle on urlparse(url).netloc, treating cdn.example.com and www.example.com as separate domains for rate-limiting purposes. They share infrastructure, so the host saw the sum of both budgets and rate-limited us. The fix: key on the registrable domain (using tldextract) for most cases, with an allowlist for sites where the subdomain genuinely is its own host.
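
The subdomain fix can be sketched with the standard library alone. This is a simplified stand-in, not what runs in production: a tiny hard-coded suffix set replaces the Public Suffix List lookup that tldextract performs, and both MULTI_LABEL_SUFFIXES and SEPARATE_HOSTS are hypothetical example values.

```python
from urllib.parse import urlsplit

# Stand-in for the Public Suffix List; a real implementation would use tldextract.
MULTI_LABEL_SUFFIXES = {"co.uk", "com.au"}

# Hypothetical allowlist of subdomains that are genuinely their own host.
SEPARATE_HOSTS = {"blog.example.org"}


def throttle_key(url: str) -> str:
    """Return the name the per-domain throttle should be keyed on."""
    host = (urlsplit(url).hostname or "").lower()
    if host in SEPARATE_HOSTS:
        return host  # allowlisted: gets its own budget
    labels = host.split(".")
    # Keep one label in front of the public suffix (the registrable domain).
    suffix_len = 2 if ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES else 1
    return ".".join(labels[-(suffix_len + 1):])


keys = [
    throttle_key("https://cdn.example.com/app.js"),
    throttle_key("https://www.example.com/pricing"),
    throttle_key("https://shop.example.co.uk/cart"),
    throttle_key("https://blog.example.org/post"),
]
print(keys)
```

With this keying, cdn.example.com and www.example.com draw from one shared budget, while an allowlisted host keeps a budget of its own.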

What this is worth as a customer

Per-domain rate limiting is one of those features that does not show up in a feature comparison chart but quietly determines whether a 75K-URL job finishes in 25 minutes or 6 hours. It is also one of the hardest pieces of the system to get right, because it touches every layer: queue, scheduler, retry policy, observability.

If you are evaluating whether to build or buy a bulk URL checker, the per-domain rate limiter is the kind of thing where the buy column quietly accumulates value. Our free tier covers 300 URLs if you want to try it on a real list. The REST API exposes everything described above without you having to wire any of it yourself.

The next post in this series, Bulk URL checking in 10 lines of Python (without writing the crawler), is the SDK demo that puts all of this behind five method calls.