The protocol-level choices that quietly decide scraping success
Most scraping failures are blamed on parsers or proxies. In practice, a large share of reliability and cost comes from network and protocol decisions made long before HTML hits your extractor. The web you connect to is encrypted, multiplexed, proxied, and compressed by default, and each of those layers comes with measurable consequences for throughput, ban risk, and spend.
Over 90% of page loads in common browsers are now served over HTTPS. That makes TLS handshakes, cipher selection, and session reuse first-order concerns for scrapers. With TLS 1.3, a full handshake typically completes in a single round trip, and resumption can cut that to near zero with 0‑RTT where supported. When your median inter-region round trip time sits in the triple digits of milliseconds, saving even one round trip per connection is a concrete capacity gain. Multiply that by millions of connections and the latency reclaimed becomes crawler uptime.
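The arithmetic is worth making explicit. A minimal sketch, with illustrative function names and the round-trip counts stated above (one RTT for a full TLS 1.3 handshake, two for TLS 1.2, on top of the TCP handshake):

```python
# Back-of-the-envelope estimate of latency reclaimed by cheaper handshakes.
# Names and the 120 ms example RTT are illustrative, not from any library.

def handshake_rtts(tls_version: str, resumed: bool = False) -> int:
    """Typical round trips spent on the TLS handshake alone
    (the TCP handshake adds one more on top)."""
    if tls_version == "1.3":
        return 0 if resumed else 1  # 0-RTT resumption vs. one-RTT full handshake
    return 1 if resumed else 2      # TLS 1.2: abbreviated vs. full handshake

def latency_reclaimed_s(connections: int, rtt_ms: float,
                        old_rtts: int, new_rtts: int) -> float:
    """Total seconds saved across a fleet of connections."""
    return connections * (old_rtts - new_rtts) * rtt_ms / 1000.0

# One million connections at a 120 ms inter-region RTT, moving from a full
# TLS 1.2 handshake to a full TLS 1.3 handshake, save one RTT each:
saved = latency_reclaimed_s(1_000_000, 120,
                            handshake_rtts("1.2"),
                            handshake_rtts("1.3"))
print(f"{saved:.0f} seconds reclaimed")  # 120000 seconds
```

With resumption on top, the same fleet drops another round trip per connection, which is the "near zero" case the figures above describe.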
CDNs dominate delivery for popular properties. Cloudflare alone fronts about one-fifth of all websites, and other large networks handle a substantial portion of the rest. This changes how origin reachability, rate limiting, and error codes present in the wild. Many edges enforce per-IP budgets and apply additional rules at L7, so connection style matters as much as connection count. A scraper that opens controlled, long-lived sessions and leans on protocol features to minimize churn will generate fewer telltale spikes than one that thrashes sockets.
HTTP/2 and HTTP/3 alter concurrency math
HTTP/1.1’s de facto browser limit is around six connections per host. Scrapers often exceed that, but the penalty is familiar: more sockets, more handshakes, more cross-talk with bot defenses. HTTP/2 changes the equation with multiplexing, letting many streams share one connection. Roughly half of sites enable HTTP/2, and a meaningful share already offer HTTP/3. On lossy paths, QUIC, the transport beneath HTTP/3, avoids the transport-level head-of-line blocking inherent to TCP and tends to improve tail latency. The net effect is practical: fewer sockets per host for the same parallelism, lower handshake overhead, and smoother request pacing.
This also affects backoff and retry behavior. On HTTP/1.1, an aggressive retry plan can explode connection counts. On HTTP/2 or HTTP/3, it is feasible to retry judiciously without multiplying sockets, because streams are cheap while connections are precious. Designing your executor around stream concurrency, not socket concurrency, usually reduces error amplification.
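A sketch of that executor design, assuming a multiplexed client underneath: the semaphore bounds streams in flight rather than sockets, and a retry is just another stream. The `fetch` function here is a stand-in for a real client call (for example an HTTP/2 request via a library like httpx), wired to fail randomly so the retry path runs:

```python
import asyncio
import random

MAX_STREAMS = 100  # per-connection stream budget, not a socket count

async def fetch(url: str) -> str:
    """Stand-in for a multiplexed request; fails randomly to exercise retry."""
    await asyncio.sleep(0)              # yield to the event loop
    if random.random() < 0.3:
        raise ConnectionError(url)
    return f"body:{url}"

async def fetch_with_retry(url: str, sem: asyncio.Semaphore,
                           attempts: int = 3) -> str:
    for attempt in range(attempts):
        try:
            async with sem:             # bound streams, not sockets
                return await fetch(url)
        except ConnectionError:
            # Back off outside the semaphore so waiting never holds a stream.
            await asyncio.sleep(0.01 * 2 ** attempt)
    raise ConnectionError(url)

async def crawl(urls: list[str]) -> list:
    sem = asyncio.Semaphore(MAX_STREAMS)
    return await asyncio.gather(
        *(fetch_with_retry(u, sem) for u in urls), return_exceptions=True)

results = asyncio.run(crawl([f"https://example.com/{i}" for i in range(20)]))
```

Because retries consume stream slots rather than opening sockets, a burst of transient failures raises stream occupancy briefly instead of multiplying connection establishments.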
Compression and payload mix decide bandwidth budgets
Median mobile pages now exceed 2 MB in transfer size, and JavaScript alone commonly accounts for hundreds of kilobytes. Text assets dominate scraping workloads, so the compression choice is material. Brotli consistently produces 15 to 20 percent smaller text than gzip at comparable levels. If your crawler fetches 1 million HTML and JSON responses per month at a median of 300 KB each after gzip, shifting to Brotli where available trims roughly 45 to 60 GB of transfer. On metered egress or paid proxy networks, that is not a rounding error.
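The savings figure above is simple arithmetic, which is worth encoding so the estimate tracks your own fleet's numbers. A minimal sketch using the values from the text (decimal units, 1 GB = 1,000,000 KB):

```python
# Rough transfer-savings estimate when Brotli replaces gzip for text payloads.
# The inputs below are the figures quoted in the text; the function is just
# arithmetic over them.

def brotli_savings_gb(responses: int, median_kb: float,
                      brotli_gain: float) -> float:
    """GB of transfer trimmed if Brotli bodies are `brotli_gain`
    (a fraction) smaller than their gzip equivalents."""
    return responses * median_kb * brotli_gain / 1_000_000  # KB -> GB

low = brotli_savings_gb(1_000_000, 300, 0.15)    # 45.0 GB
high = brotli_savings_gb(1_000_000, 300, 0.20)   # 60.0 GB
```

Multiply the result by your proxy network's per-GB rate to turn the estimate directly into monthly spend.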
Compression also shapes CPU time on both sides. High Brotli levels can be slow server-side, but many CDNs cache precompressed variants and negotiate efficiently. Client-side, a modern decompressor handles Brotli well, and fewer bytes in flight shorten occupied connection time, freeing capacity for more targets without extra IPs.
TLS behavior affects both speed and fingerprints
Cipher suites, extensions, and ordering create a stable fingerprint unless deliberately managed. CDNs and large properties often score clients by handshake features alongside request headers. TLS 1.3 narrows the surface, but differences remain. Normalizing your TLS stack across languages and runtimes avoids accidental uniqueness, and enabling session resumption reduces the number of full handshakes you perform. Fewer full handshakes mean fewer opportunities to trip rate limits that count connection establishments, not just requests.
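One way to apply the normalization idea in a Python stack is to build a single `ssl.SSLContext` and share it across every session and worker, so all of your clients present the same handshake. This is a sketch of the consistency principle, not a recipe for mimicking any particular browser fingerprint:

```python
import ssl

def shared_tls_context() -> ssl.SSLContext:
    """One context, reused everywhere, so every worker negotiates alike."""
    ctx = ssl.create_default_context()            # verified, sane defaults
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3  # narrow the negotiable surface
    return ctx

# Build it once at startup and hand the same object to every client.
ctx = shared_tls_context()
```

Sharing one context also gives the runtime a single place to cache session state, which is what makes resumed handshakes possible where the server supports them.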
IPv6 is not optional anymore
Around 40 percent of users reach large destinations over IPv6, and many edges route IPv6 traffic differently from IPv4. On some networks, IPv6 paths have lower latency to the same CDN edge. More importantly, address space is abundant, and certain providers apply different reputation controls to IPv6 pools. Dual-stack capability increases reachable surface and gives you more room to distribute load without cranking up per-address request rates.
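Making use of dual-stack capacity starts with knowing which family each address in your pool belongs to. A small helper, sketched with the standard `ipaddress` module and made-up addresses from documentation ranges:

```python
import ipaddress

# Split a mixed address pool into v4 and v6 buckets so a scheduler can
# spread load across both families. Addresses below are documentation-range
# examples, not real endpoints.

def split_by_family(addrs: list[str]) -> tuple[list[str], list[str]]:
    v4, v6 = [], []
    for a in addrs:
        ip = ipaddress.ip_address(a)  # raises ValueError on malformed input
        (v6 if ip.version == 6 else v4).append(a)
    return v4, v6

v4, v6 = split_by_family(["203.0.113.7", "2001:db8::1", "198.51.100.2"])
# v4 == ["203.0.113.7", "198.51.100.2"], v6 == ["2001:db8::1"]
```

With the pool partitioned, per-address rate budgets can be set per family, which is where the extra IPv6 headroom actually pays off.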
Input hygiene reduces avoidable failures
A surprising amount of noise still comes from malformed proxy credentials, inconsistent URI schemes, and mixed IPv4 and IPv6 notations. Standardizing inputs before your orchestrator consumes them is cheap insurance. Even simple steps like validating auth blocks, normalizing schemes, and ensuring uniform host:port formatting cut connection errors and wasted retries. If you maintain large rotating lists, a lightweight proxy formatter helps keep feeds consistent across teams and tools, and that consistency shows up instantly in connection success metrics.
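A minimal normalizer along these lines, handling the two notations that cause most of the noise (bare `host:port` and the `host:port:user:pass` list format) and emitting a uniform URL form. Real feeds have more variants; this is a sketch of the hygiene step, not a complete formatter:

```python
from urllib.parse import urlsplit

def normalize_proxy(entry: str, default_scheme: str = "http") -> str:
    """Normalize a proxy entry to scheme://[user:pass@]host:port."""
    entry = entry.strip()
    if "://" not in entry:
        parts = entry.split(":")
        if len(parts) == 4:                      # host:port:user:pass format
            host, port, user, pwd = parts
            return f"{default_scheme}://{user}:{pwd}@{host}:{port}"
        entry = f"{default_scheme}://{entry}"    # bare host:port
    u = urlsplit(entry)
    if u.port is None:
        raise ValueError(f"missing port: {entry!r}")
    auth = f"{u.username}:{u.password}@" if u.username else ""
    # Re-bracket IPv6 literals so host:port stays unambiguous.
    host = f"[{u.hostname}]" if ":" in u.hostname else u.hostname
    return f"{u.scheme}://{auth}{host}:{u.port}"

print(normalize_proxy("10.0.0.5:8080:alice:s3cret"))
# http://alice:s3cret@10.0.0.5:8080
```

Running every feed through one function like this, before the orchestrator sees it, is what makes the downstream connection metrics comparable across teams.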
What to measure to prove it
If you want evidence, track the basics by protocol: handshake count per successful response, average streams per connection, compression ratio by content-type, bytes-per-successful-HTML, and retry inflation rate under congestion. When teams switch from many short HTTP/1.1 connections to fewer long-lived HTTP/2 or HTTP/3 connections, handshake counts per success typically drop, bytes per success fall with better compression, and retries flatten because multiplexing absorbs transient delays. Those are not cosmetic improvements; they compound into higher success rates and lower spend.
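The scorecard above can be captured in a small structure. Field names here are illustrative; populate them from whatever your client and proxy layers already log. The two example snapshots are made-up numbers shaped like the HTTP/1.1-to-HTTP/2 transition described:

```python
from dataclasses import dataclass

@dataclass
class ProtocolStats:
    handshakes: int            # TLS handshakes performed
    connections: int           # sockets opened
    streams: int               # requests issued
    successes: int             # usable responses
    bytes_uncompressed: int    # payload size after decompression
    bytes_on_wire: int         # payload size as transferred

    @property
    def handshakes_per_success(self) -> float:
        return self.handshakes / self.successes

    @property
    def streams_per_connection(self) -> float:
        return self.streams / self.connections

    @property
    def compression_ratio(self) -> float:
        return self.bytes_on_wire / self.bytes_uncompressed

# Hypothetical before/after: short-lived HTTP/1.1 vs. long-lived HTTP/2.
h1 = ProtocolStats(handshakes=5000, connections=5000, streams=5000,
                   successes=4600, bytes_uncompressed=1_500_000_000,
                   bytes_on_wire=420_000_000)
h2 = ProtocolStats(handshakes=80, connections=80, streams=5000,
                   successes=4900, bytes_uncompressed=1_500_000_000,
                   bytes_on_wire=330_000_000)
```

Tracking these three ratios per protocol, per target, is usually enough to show whether a transport change earned its keep.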
The crawler that treats protocols as levers rather than background detail will run quieter, faster, and cheaper. Most of the gains are mechanical. Pick the newer transport when offered, reuse connections, accept stronger compression, keep fingerprints boring, and keep inputs clean. The statistics of the modern web favor anyone who does these small things reliably.