How to Check URLs in Bulk: Guide for Developers
If you maintain documentation, knowledge bases, or content databases with thousands of URLs, manual link checking is not an option. A broken link in your API docs costs developer trust. A dead URL in your knowledge base creates a support ticket. At 10,000+ links, you need automation.
This guide covers three approaches: a DIY Python script, cloud-based bulk checking, and CI/CD integration, so you can pick what fits your scale.
Why Bulk URL Checking Matters
Broken links pile up wherever developer-facing content lives, and each one costs trust or support time:
- API documentation with hundreds of endpoint references that go stale after versioning
- Internal knowledge bases (Confluence, Notion, GitBook) with cross-linked articles that break during restructuring
- Content databases with aggregated links from multiple sources
- Third-party integrations referencing external URLs that disappear
When you have 30,000+ URLs across these systems, even a 2% monthly breakage rate means 600 or more newly broken links every month.
Approach 1: DIY Python Script
For developers who want full control, here is a concurrent URL checker with retry logic:
```python
import csv
import time
from concurrent.futures import ThreadPoolExecutor

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# 'error' must be listed, or DictWriter raises ValueError on failed rows
FIELDNAMES = ['url', 'status_code', 'final_url', 'response_time', 'error']

def create_session():
    # Retry transient failures (rate limits, server errors) with backoff
    session = requests.Session()
    retries = Retry(
        total=3,
        backoff_factor=1,
        status_forcelist=[429, 500, 502, 503, 504]
    )
    adapter = HTTPAdapter(max_retries=retries)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session

def check_url(url, session):
    try:
        # HEAD avoids downloading the body; some servers reject it with 405
        response = session.head(url, timeout=10, allow_redirects=True)
        return {
            'url': url,
            'status_code': response.status_code,
            'final_url': response.url,
            'response_time': response.elapsed.total_seconds()
        }
    except requests.exceptions.RequestException as e:
        return {
            'url': url,
            'status_code': 'ERROR',
            'final_url': url,
            'error': str(e)
        }

def check_urls_bulk(urls, max_workers=10):
    session = create_session()
    futures = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        for url in urls:
            futures.append(executor.submit(check_url, url, session))
            time.sleep(0.1)  # basic rate limiting: ~10 new requests per second
        # Collect results in the same order as the input list
        return [future.result() for future in futures]

# Usage: one URL per line, blank lines skipped
with open('urls.txt') as f:
    urls = [line.strip() for line in f if line.strip()]
results = check_urls_bulk(urls)

# Save results; restval fills columns that a row does not have
with open('results.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=FIELDNAMES, restval='')
    writer.writeheader()
    writer.writerows(results)
```
Where this breaks down
- Hits rate limits (429 errors) at 5,000+ URLs with no proxy rotation
- Runs on your machine, ties up local resources for hours on large batches
- No soft 404 detection (pages that return 200 but show error content)
- No redirect chain tracking
- No persistent history or trend tracking
This approach works for one-off checks under 5,000 URLs. Beyond that, you need infrastructure.
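One gap from the list above, soft 404 detection, can be partly closed inside the script itself. The sketch below is a heuristic, not a definitive detector: it flags a 200 response whose title reads like an error page, or whose body is very short and contains an error phrase. The phrase list, the `min_length` threshold, and the function names `extract_title` and `looks_like_soft_404` are assumptions made for illustration, so tune them against pages you actually check.

```python
import re

# Hypothetical phrase list; adjust it to the error pages your targets serve.
SOFT_404_PATTERNS = [
    r"page not found",
    r"\b404\b",
    r"no longer (available|exists)",
    r"does(n't| not) exist",
]
_SOFT_404_RE = re.compile("|".join(SOFT_404_PATTERNS), re.IGNORECASE)

def extract_title(html):
    """Return the text of the first <title> element, or '' if there is none."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else ""

def looks_like_soft_404(status_code, html, min_length=512):
    """Heuristic check for a page that returns 200 but shows error content."""
    if status_code != 200:
        return False  # real error codes are already caught by the status check
    if _SOFT_404_RE.search(extract_title(html)):
        return True
    # A long page that merely mentions "404" is probably real content,
    # so body phrases only count when the page is very short.
    return len(html) < min_length and bool(_SOFT_404_RE.search(html))
```

This check needs the response body, so it only works with a GET request (for example `session.get(url, timeout=10)` followed by `looks_like_soft_404(response.status_code, response.text)`), not with the HEAD request used in the script above. Expect both false positives and false negatives.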
Approach 2: Cloud-Based Bulk URL Checker
For production workloads with 10,000-75,000 URLs, a cloud-based checker solves the scaling problems:
- Upload a CSV and walk away. Cloud infrastructure processes your batch. You get an email when the report is ready.
- Automatic proxy rotation. Rotates IPs to get past 429/403 responses, so large batches complete without rate-limit failures or false positives.
- CSV/JSON export. Get results in developer-friendly formats for programmatic analysis.
- Dashboard with filtering. Search by status code, filter broken links, view redirect chains.
Real Example: Checking a Documentation Site
Here is a practical workflow for checking links in a documentation site with 5,000+ pages.
Step 1: Extract URLs
For static site generators (Next.js, Hugo, Gatsby, MkDocs):
```bash
# Extract all absolute https:// links from built HTML
# -h drops the "file:" prefix that grep -r would add to every match
grep -rhoE "https://[^\"'<> ]+" ./build > urls.txt

# Deduplicate
sort -u urls.txt > unique_urls.txt
echo "Found $(wc -l < unique_urls.txt) unique URLs"
```
Step 2: Convert to CSV
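The shell commands that follow are enough when every extracted line is already a clean URL. As a hedged alternative, the sketch below builds the same single-column CSV in Python while dropping `#fragment` anchors, trailing punctuation, and anything that is not an http(s) URL. The names `clean_url` and `write_url_csv` are illustrative, and stripping a trailing `)` is an assumption that will truncate the rare URL that legitimately ends with one.

```python
import csv
from urllib.parse import urldefrag, urlsplit

def clean_url(raw):
    """Return a normalized http(s) URL, or None if the line is not usable."""
    candidate = raw.strip().rstrip(".,;)")  # punctuation picked up from prose
    candidate, _fragment = urldefrag(candidate)  # "#section" hits the same page
    parts = urlsplit(candidate)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return None
    return candidate

def write_url_csv(lines, out_path):
    """Write unique, cleaned URLs under a 'url' header; return the count."""
    unique = sorted({url for url in map(clean_url, lines) if url})
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["url"])
        writer.writerows([url] for url in unique)
    return len(unique)
```

To use it on the file from Step 1, open `unique_urls.txt` and pass the file object to `write_url_csv(f, 'urls.csv')`. Removing fragments before deduplicating also means several anchors on one page count as a single check.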
```bash
echo "url" > urls.csv
cat unique_urls.txt >> urls.csv
```
Step 3: Upload and check
Upload the CSV to a bulk URL checker. For 5,000 URLs, processing typically takes 1-3 hours depending on target server response times.
Step 4: Filter broken links
```python
import pandas as pd

df = pd.read_csv('results.csv')

# Coerce non-numeric statuses such as 'ERROR' to NaN so the comparison works
status = pd.to_numeric(df['status_code'], errors='coerce')

# Find broken links (4xx, 5xx, and requests that failed outright)
broken = df[(status >= 400) | status.isna()]
print(f"Found {len(broken)} broken links")

# Group by status code
print(broken.groupby('status_code')['url'].count())

# Export for fixing
broken.to_csv('broken_links.csv', index=False)
```
CI/CD Integration
For continuous validation, run URL checks as part of your deployment pipeline:
```yaml
# .github/workflows/check-links.yml
name: Check Documentation Links

on:
  schedule:
    - cron: '0 0 * * 1' # Every Monday at 00:00 UTC
  workflow_dispatch:

jobs:
  check-links:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Assumes ./docs/build is committed or produced by an earlier build step
      - name: Extract URLs from docs
        run: |
          grep -rhoE "https://[^\"'<> ]+" ./docs/build > urls.txt
          sort -u urls.txt > unique_urls.txt
          echo "url" > urls.csv
          cat unique_urls.txt >> urls.csv
          echo "Extracted $(wc -l < unique_urls.txt) URLs"

      - name: Upload to Bulk URL Checker
        run: |
          # Upload CSV to your bulk checker
          # Process results and flag broken links
          echo "Upload urls.csv to app.bulkurlchecker.com"
          echo "Review results in dashboard"
```
Comparison: Which Approach to Use
| Feature | DIY Script | Desktop Tool | Cloud-Based |
|---|---|---|---|
| Max URLs (practical) | ~5,000 | ~10,000 | 75,000 |
| Rate limit handling | Manual | Limited | Automatic (proxy rotation) |
| Soft 404 detection | No | No | No |
| Local resources | High | High | None |
| Babysitting required | Yes | Yes | No |
| Cost | Free | £149/year | From $9.99 |
- Under 5,000 URLs: The Python script works fine.
- 5,000-10,000 URLs: Desktop tools like Screaming Frog can handle it.
- 10,000-75,000 URLs: Cloud-based bulk checkers are the only practical option.
Best Practices
- Check regularly. Monthly checks catch broken links before users find them.
- Prioritize internal links. Broken internal links hurt SEO more than external ones.
- Track redirect chains. Long chains (3+ hops) slow page load and should be shortened.
- Monitor response times. Slow external resources affect your page performance.
- Use proper User-Agent headers. Identify yourself to avoid being blocked as a bot.
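Two of these practices, tracking redirect chains and monitoring response times, can be sketched on top of a `requests` response. The helpers below read `response.history`, which `requests` fills with the intermediate redirect responses when `allow_redirects=True`. The function names and the `slow_seconds=2.0` threshold are assumptions for illustration; the 3-hop limit comes from the list above.

```python
def redirect_chain(response):
    """List every URL visited, in order, ending with the final URL."""
    return [hop.url for hop in response.history] + [response.url]

def review_flags(response, max_hops=3, slow_seconds=2.0):
    """Return human-readable warnings for one checked URL."""
    flags = []
    hops = len(response.history)  # number of redirects that were followed
    if hops >= max_hops:
        flags.append(f"long redirect chain ({hops} hops)")
    # elapsed covers only the final request, not the whole chain
    if response.elapsed.total_seconds() > slow_seconds:
        flags.append("slow response")
    return flags
```

Both functions only read the `history`, `url`, and `elapsed` attributes, so they can be called on the response returned by `session.head(url, allow_redirects=True)` or `session.get(...)`. Storing the output of `redirect_chain` next to each result makes it easier to point old links straight at their final destination.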
For more on keeping documentation links healthy, see our dedicated guide. You can also try our batch URL checker for one-off list checks, or the mass URL checker for large-scale validation.