Scraping is not a dark art; it is an operations problem that lives or dies by measurement. On today’s web, large traffic studies consistently show that roughly half of all requests are automated, which means your crawlers share the road with a heavy mix of bots and defensive systems. Treating quality, cost, and reliability as numbers rather than gut feel is the only sustainable way to scale.
Define success with outcome metrics, not just status codes
A 200 is not success if the DOM never renders the target selectors or if the payload is a soft block. A practical success metric combines three checks: an acceptable status code, content completeness against required selectors, and a checksum or schema validation that guards against honeypot HTML. When teams adopt this composite definition, they usually discover that their apparent success rate drops, then stabilizes at a more truthful baseline that actually predicts downstream data usability.
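The composite check above can be sketched as a small predicate. This is a minimal illustration, not a fixed schema: the selector set is a placeholder, selector matching is assumed to happen upstream, and the checksum guard here simply compares a body hash against known block-page hashes.

```python
import hashlib

# Placeholder selectors; in practice these come from your target's template.
REQUIRED_SELECTORS = {"#title", ".price", ".stock"}

def is_success(status: int, found_selectors: set[str], body: bytes,
               known_block_hashes: set[str]) -> bool:
    """Composite success: acceptable status, content completeness,
    and a fingerprint guard against honeypot or soft-block HTML."""
    if status != 200:
        return False
    # Content completeness: every required selector must have matched.
    if not REQUIRED_SELECTORS <= found_selectors:
        return False
    # Checksum guard: a body identical to a known block page is not success.
    digest = hashlib.sha256(body).hexdigest()
    return digest not in known_block_hashes
```

A page that returns 200 with a known block-page body, or with a missing price selector, fails this check even though a status-only metric would count it as success.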
A useful reality check starts with page complexity. Independent web measurements put the median page near 2 MB with roughly 75 network requests. If your workload captures raw HTML only and disables media, your transfer cost per page tends to land well below that 2 MB figure, yet JavaScript still dominates the budget. Median JavaScript bytes hover around the mid hundreds of kilobytes, which is enough to tilt the economics toward headless rendering for many targets. That weight also explains why a naïve success metric can overstate quality, since scripts that silently fail leave a clean 200 with missing product data or prices.
Error budgets and retry math that protect your wallet
Error budgets give you a clean way to align reliability with spend. Suppose your base failure rate per attempt is 10 percent. With up to two retries, the chance that all attempts fail is 0.1 × 0.1 × 0.1, which is 0.1 percent. Your effective success probability rises to 99.9 percent, while your expected request volume increases by a factor of only 1.11, since most pages succeed on the first attempt. That trade is usually excellent for ingestion pipelines that need near-perfect completeness but can tolerate slightly higher cost. If your base failure rate is 25 percent, two retries still lift you to 98.4 percent success, at about 1.31 times the volume. Numbers like these clarify when to invest in smarter routing rather than throwing more retries at the problem.
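The retry arithmetic generalizes to any failure rate and retry cap. The helper below reproduces the figures above; the function name is illustrative, and it assumes attempts fail independently with a constant per-attempt rate.

```python
def retry_economics(p_fail: float, max_retries: int) -> tuple[float, float]:
    """Return (effective success probability, expected attempts per page)
    for independent attempts with per-attempt failure rate p_fail."""
    attempts = max_retries + 1
    # All attempts fail only if every one of them fails.
    p_all_fail = p_fail ** attempts
    # Attempt k happens only if the first k attempts all failed.
    expected_attempts = sum(p_fail ** k for k in range(attempts))
    return 1 - p_all_fail, expected_attempts
```

At a 10 percent failure rate with two retries this yields 99.9 percent success at 1.11 expected attempts per page; at 25 percent it yields about 98.4 percent at roughly 1.31 attempts.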
Distribution matters as much as averages. Track the 95th percentile time to first byte and time to complete render for target domains. On pages with heavy scripts, the long tail often dominates the day, so a mean under two seconds can hide timeouts that quietly erase several percentage points of your yield. Moving those tail latencies down is often as simple as caching static resources and increasing renderer concurrency, while keeping per host politeness in place.
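Tracking a tail percentile is simple enough to implement inline. The sketch below uses the nearest-rank convention, which is one common choice among several percentile definitions.

```python
import math

def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile; assumes a non-empty sample list."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]
```

The mean-versus-tail gap is easy to demonstrate: with 94 fetches at 0.5 seconds and six timeouts at 20 seconds, the mean is about 1.67 seconds while the p95 is 20 seconds, which is exactly the situation where a sub-two-second average hides yield loss.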
Proxies, geolocation, and measurable health
Proxy selection is a measurable decision, not a preference. For geo-locked catalogs, the delta between success rates by region can exceed ten percentage points even with identical code paths, purely because of IP reputation and local middleware. Measure each pool on three axes that actually predict outcomes: connect latency to first byte, authorization failure rate, and the percentage of pages that meet your content completeness checks. Before a big run, validate at small scale and retire underperforming subnets. To make this routine rather than a one-off, keep a simple daily health check that probes your pools and verifies them with a quick online proxy test, so you catch degradation before it hurts a harvest.
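A daily probe along those three axes can be sketched with the standard library alone. This is an assumption-laden sketch: `PROBE_URL` is a placeholder, proxies are assumed to be reachable via plain HTTP proxying, and content-completeness checks are omitted for brevity.

```python
import statistics
import time
import urllib.error
import urllib.request

PROBE_URL = "https://example.com/"  # placeholder probe target

def probe_pool(proxies: list[str], timeout: float = 10.0) -> dict:
    """Probe each proxy once and summarize pool health."""
    ttfb_samples, auth_failures, ok = [], 0, 0
    for proxy in proxies:
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler({"http": proxy, "https": proxy}))
        start = time.monotonic()
        try:
            with opener.open(PROBE_URL, timeout=timeout) as resp:
                resp.read(1)  # first byte received: record time to first byte
                ttfb_samples.append(time.monotonic() - start)
                ok += 1
        except urllib.error.HTTPError as err:
            if err.code in (401, 403, 407):
                auth_failures += 1
        except OSError:
            pass  # connect and timeout failures simply count against yield
    n = len(proxies)
    return {
        "yield": ok / n if n else 0.0,
        "auth_failure_rate": auth_failures / n if n else 0.0,
        "median_ttfb": statistics.median(ttfb_samples) if ttfb_samples else None,
    }
```

Running this once per day per pool, and alerting when yield drops or median TTFB climbs, turns pool retirement into a routine decision rather than a post-mortem.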
Headless where it counts, lightweight where it does not
Headless browsers are not a religion. Use them where JavaScript gates the data, then drop back to fast HTTP sessions when the HTML is already structured. A clean rule is to promote a target to headless only after a smoke test shows that required selectors appear post-execution or that prices and inventory live behind client-side requests. This keeps your average compute per page low while still covering complex storefronts. Track renderer CPU seconds per successful page as a first-class metric. When that number rises, it usually signals new anti-bot scripts, bloated widgets, or accidental over-rendering in your code.
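The promotion rule reduces to a small decision function. This sketch assumes the smoke test has already fetched the raw, unrendered HTML, and it treats required markers as plain substrings for simplicity; real checks would parse selectors properly.

```python
def needs_headless(raw_html: str, required_markers: list[str]) -> bool:
    """Promote a target to headless rendering only when the raw HTML
    is missing markers that a client-side render would supply."""
    return not all(marker in raw_html for marker in required_markers)
```

A server-rendered page that already contains the price markup stays on the cheap HTTP path; an empty application shell gets promoted to the renderer.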
Throughput planning with simple, defensible math
Capacity planning is easier when you push guesswork aside. If you need 10 million usable pages and your composite success rate is 92 percent, you must budget about 10.87 million attempts. With a median transfer near 2 MB per full page and a leaner 0.6 to 1.0 MB for HTML plus essential XHR, your raw bandwidth budget spans roughly 6.5 to 22 terabytes, before protocol overhead. Holding a one-request-per-second politeness limit per host, you can issue 86,400 requests per host per day. If your catalog spans 200 distinct hosts, a single day at that limit yields about 17 million attempts, which gives you comfortable headroom for retries and slow segments.
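That arithmetic fits in a small sizing helper. The function and field names are illustrative; it reproduces the figures above given the article's inputs.

```python
def plan(usable_pages: int, success_rate: float, mb_per_page: float,
         hosts: int, rps_per_host: float = 1.0) -> dict:
    """Back-of-the-envelope sizing for a crawl campaign."""
    attempts = usable_pages / success_rate
    return {
        "attempts": attempts,
        "bandwidth_tb": attempts * mb_per_page / 1e6,     # decimal TB
        "attempts_per_day": hosts * rps_per_host * 86_400,  # politeness cap
    }
```

Plugging in 10 million usable pages at 92 percent success, 2 MB per full page, and 200 hosts gives about 10.87 million attempts, roughly 21.7 TB of transfer, and a daily ceiling of 17.28 million attempts.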
What to log so your numbers stay honest
Reliable measurement depends on clean observability. Capture, for every attempt, the request template version, IP type and ASN, rendered vs raw fetch, the exact selectors checked, and a compact fingerprint of the normalized content. With that in place, you can quantify duplicates, attribute failures to the right layer, and compute true success at any time window without re crawling. The result is a scraper that behaves like a mature service, with predictable costs and quality grounded in data rather than guesswork.
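One way to pin down that per-attempt record is a frozen dataclass. The field names below are illustrative, not a fixed schema; they simply cover the dimensions listed above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AttemptLog:
    """One record per fetch attempt; immutable so logs stay append-only."""
    url: str
    template_version: str            # request template that produced this attempt
    ip_type: str                     # e.g. "datacenter" or "residential"
    asn: int                         # autonomous system of the egress IP
    rendered: bool                   # headless render vs raw HTTP fetch
    selectors_checked: tuple[str, ...]
    content_fingerprint: str         # compact hash of the normalized content
    status: int
    success: bool                    # composite success, not just the status code
```

With records like this in a columnar store, duplicate rates fall out of grouping by `content_fingerprint`, and success can be recomputed over any time window without re-crawling.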
The takeaway is simple. When your team defines success carefully, uses conservative, widely observed page weight data to size workloads, and treats retries and proxies as measurable levers, scraping shifts from fragile to boring. Boring, in this context, is exactly what you want.