# Why pyhaul exists
See also: Design · Specification
I wanted something that made resuming HTTP downloads easy. Not just "add a Range header" easy — actually easy, where you call a function, it either finishes or it doesn't, and in both cases the state on disk is correct.
## Bring your own transport
You hack on different codebases — some are async, some are legacy
requests code, some use totally custom in-house HTTP stacks. They
all share the same abstractions, they just expose them differently.
pyhaul should work with any of them.
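One way to picture "bring your own transport" is a duck-typed adapter interface that any HTTP stack can satisfy. This is a hypothetical sketch, not pyhaul's actual API — the names `SyncTransport`, `send`, and `InMemoryTransport` are invented for illustration:

```python
# Hypothetical transport adapter: sync requests code, async httpx code, and
# an in-house stack can all be presented to one engine through a shape like
# this. Not pyhaul's real interface -- an illustration of the idea.
from typing import Iterator, Mapping, Protocol, Tuple


class SyncTransport(Protocol):
    def send(
        self, url: str, headers: Mapping[str, str]
    ) -> Tuple[int, Mapping[str, str], Iterator[bytes]]:
        """Return (status, response headers, iterator of raw body chunks)."""
        ...


class InMemoryTransport:
    """Toy transport for tests: serves a fixed byte string with Range support."""

    def __init__(self, body: bytes):
        self.body = body

    def send(self, url, headers):
        rng = headers.get("Range")
        if rng and rng.startswith("bytes="):
            start = int(rng[len("bytes="):].rstrip("-"))
            part = self.body[start:]
            resp = {"Content-Range": f"bytes {start}-{len(self.body)-1}/{len(self.body)}"}
            return 206, resp, iter([part])
        return 200, {"Content-Length": str(len(self.body))}, iter([self.body])
```

A real adapter for requests or httpx would wrap a caller-supplied session behind the same three-tuple; the engine never has to know which library produced it.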
## Zero dependencies
The base pyhaul package is pure Python with zero required
dependencies. You can't always install a native binary like aria2
(as good as it is). The HTTP client adapters are optional extras you
install if you want them.
## Real sync and async

Not a fake async wrapper around sync code — a real async engine that shares logic with the sync path but uses `async with` / `async for` natively.
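A common way to share logic between real sync and async paths — an assumption about the general technique, not a description of pyhaul's internals — is the sans-IO pattern: a generator owns the state machine, and thin drivers perform the actual I/O. The names here are illustrative:

```python
# Sans-IO sketch: one generator owns the download logic; thin drivers own
# the I/O. Names are illustrative, not pyhaul's actual internals.
def download_logic(total: int):
    """Yield the next offset to fetch, receive chunks, return the body."""
    parts = []
    offset = 0
    while offset < total:
        chunk = yield offset          # driver fetches starting at this offset
        parts.append(chunk)
        offset += len(chunk)
    return b"".join(parts)


def run_sync(fetch, total: int, chunk_size: int = 4) -> bytes:
    """Drive the generator with a synchronous fetch(offset, n) callable."""
    gen = download_logic(total)
    try:
        offset = next(gen)
        while True:
            offset = gen.send(fetch(offset, chunk_size))
    except StopIteration as done:
        return done.value
```

An async driver is the same few lines with `await fetch(...)`; the generator itself never changes, so the two paths cannot drift apart.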
## Not a session manager, not a retry policy

pyhaul shouldn't initiate connections, create or own sessions, or contain its own retry or threading logic. I wanted to use it in contexts where sessions come out of a pool, where they're configured over separate proxies, where app code has added custom authentication headers. One `haul()` = one request. Retries are the caller's concern.
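Because one call equals one request, retry policy lives entirely in the caller. A sketch of what that can look like — `haul` here is a stand-in for any single-attempt download callable, not the real signature:

```python
# Caller-side retry loop. `haul` is a placeholder for any single-attempt
# download function that raises on transient failure; the point is that
# backoff and attempt counting live here, not inside the library.
import time


def haul_with_retries(haul, *, attempts: int = 5, base_delay: float = 0.5):
    last_exc = None
    for attempt in range(attempts):
        try:
            return haul()                             # one call == one request
        except ConnectionError as exc:
            last_exc = exc
            time.sleep(base_delay * (2 ** attempt))   # exponential backoff
    raise last_exc
```

Because on-disk state is correct after every attempt, each retry simply resumes from wherever the last one stopped.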
## Not a weekend project
And it had to correctly handle the weird, real-world corner cases that cause silent corruption. Whether that's a hardware crash, a flaky connection that drops every few seconds, your container getting OOM-killed, or a CDN edge server doing something unexpected with compression or range boundaries.
When you first look at "resumable HTTP downloads," it seems fairly simple. Then you hit the first bug. You fix it. Then files are sometimes truncated. Oh, this particular client library auto-negotiates gzip compression — how does that interact with byte ranges? What about chunked transfer encoding? What about servers that just ignore your Range header entirely? What about a server that was serving version A of a file when you started, and version B when you resume two days later?
And then there's the streaming layer itself. You can't use your HTTP library's convenient high-level body methods — `response.iter_content()` in requests, `response.iter_bytes()` in httpx — because those transparently decompress the body. Even when the server respects `Accept-Encoding: identity`, the decompressed byte count won't match the `Content-Range` or `Content-Length` values the server sent, so your checkpoint cursor drifts and your next resume starts from the wrong offset. Instead you have to drop down to the raw streaming layer: `response.raw.stream(decode_content=False)` for requests/urllib3, `response.iter_raw()` / `response.aiter_raw()` for httpx and niquests. These give you post-transfer-encoding, pre-content-encoding bytes — the exact bytes the server framed.

But now you're responsible for understanding that "raw" doesn't mean "straight off the socket": on a persistent HTTP/1.1 connection, multiple responses share the same TCP stream, delimited by `Content-Length` or chunked transfer encoding. The library still handles that framing for you in the raw layer, but if you get the byte accounting wrong — or if the server lies about `Content-Length`, or a proxy silently re-compresses the stream despite `no-transform` — you can overread into the next response on the connection, corrupting both it and your download. These are the kinds of edges that only show up in production, on somebody else's CDN, at 3 AM.
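The cursor-drift half of this is easy to demonstrate with nothing but the standard library: count decoded bytes and your checkpoint disagrees with the wire-level length the server reported.

```python
# Why counting decompressed bytes corrupts the resume cursor.
import gzip

original = b"A" * 50_000                 # the uncompressed representation
wire_body = gzip.compress(original)      # the bytes actually framed on the wire
content_length = len(wire_body)          # what Content-Length would advertise

decoded_count = len(gzip.decompress(wire_body))   # what iter_content() yields

# The two counts disagree, so a checkpoint built from decoded bytes points
# at an offset the server has never heard of.
assert decoded_count != content_length
drift = decoded_count - content_length
```

Every resumed request would then ask for a range computed from the wrong number — which is exactly why the accounting has to happen at the raw layer.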
## How existing tools handle resume

I looked at curl, wget, and aria2 — mature, battle-tested tools — to see how they approach these problems. The comparisons below are verified against their source code, not their documentation.
| | curl | wget | aria2 | pyhaul |
|---|---|---|---|---|
| `ETag` / `If-Range` on resume | ✗ | ✗ | ✗ | ✓ |
| Graceful 200 recovery | ✗ | ✗ | ✗ | ✓ |
| Range-safe compression | ✗ | partial | ✗ | ✓ |
| `Cache-Control: no-transform` | ✗ | ✗ | ✗ | ✓ |
| Data flushed before checkpoint | ✗ | ✗ | partial | ✓ |
| Atomic file completion | ✗ | ✗ | ✗ | ✓ |
| Pure Python / embeddable | ✗ | ✗ | ✗ | ✓ |
Notes on the table above:

- curl + 200 recovery: returns `CURLE_RANGE_ERROR` and stops.
- wget + 200 recovery: skips leading bytes and appends. Works if the resource hasn't changed; silently corrupts if it has, because there's no ETag check.
- aria2 + 200 recovery: aborts with `CANNOT_RESUME`.
- wget + compression: sends `Accept-Encoding: identity` (good), but doesn't send `Cache-Control: no-transform`, so an intermediate proxy or CDN can still re-compress the response.
- aria2 + flush ordering: calls `fsync` on close/save, but doesn't enforce strict data-before-checkpoint ordering.
None of these tools send If-Range with an ETag on resume. That
means if the resource at a URL changes between your first attempt and
your retry, all three will either fail or silently produce a file
that's half version A and half version B. pyhaul records the ETag
the server sent on the first request and sends it back via If-Range
on every subsequent attempt. If the ETag still matches, the server
appends. If it doesn't, the server returns the full new resource from
byte 0 — no corruption.
## The details
### Range requests that might be ignored
The server can honor the range (206 Partial Content), decide to
resend the whole thing (200 OK), or reject it as out of bounds
(416). All three happen in the wild. pyhaul handles each outcome
without caller intervention — including rewinding the on-disk cursor
when the server signals that prior progress is invalid (e.g. a 200
to a resume request), so fresh bytes are never spliced onto a stale
prefix.
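The three outcomes can be sketched as a small dispatcher — the names and return shape here are illustrative, not pyhaul's internals:

```python
# Deciding what to do with a resume response. Illustrative sketch only.
def resume_action(status: int, offset: int) -> tuple:
    """Map a response status to (action, write_offset)."""
    if status == 206:
        return ("append", offset)     # server honored the Range
    if status == 200:
        return ("restart", 0)         # full body: rewind, never splice
    if status == 416:
        return ("revalidate", 0)      # range out of bounds: progress invalid
    raise ValueError(f"unexpected status {status}")
```

The crucial case is 200: the write offset must be rewound to 0 before any byte is written, or fresh bytes get appended onto a prefix the server no longer recognizes.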
### Mid-download resource changes

Suppose your network drops and you retry 48 hours later. By then the object at that URL may have changed. pyhaul records the `ETag` the server sent — a short fingerprint of that exact representation — and on resume sends it back via `If-Range`. If the current ETag still matches, the server splices the next bytes onto the ones already on disk. If it doesn't, the server returns the new body from byte 0 instead of corrupting the file with a half-old, half-new mix.
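A toy origin server makes the `If-Range` contract concrete. This is purely illustrative — real servers also handle date validators and weak vs. strong ETags, which are omitted here:

```python
# Toy origin demonstrating If-Range semantics (illustration only).
class Origin:
    def __init__(self, body: bytes, etag: str):
        self.body, self.etag = body, etag

    def get(self, headers: dict):
        """Return (status, body) for a conditional range request."""
        rng = headers.get("Range")
        if rng and headers.get("If-Range") == self.etag:
            start = int(rng.removeprefix("bytes=").rstrip("-"))
            return 206, self.body[start:]   # validator matches: send the tail
        return 200, self.body               # changed or unconditional: full body
```

The client never has to compare ETags itself: a matching validator yields a 206 tail, a stale one yields a clean 200 restart.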
### Lengths nobody knows in advance
Some servers send Content-Length: 0, omit the header, or use chunked
transfer encoding because they don't know (or don't want to compute)
the full length up front. Chunked TE is HTTP's intended mechanism for
exactly this: length-prefixed chunks ending in a zero-length sentinel
that means "that's the end." The cost is that neither side knows the
total until the last chunk arrives — easy to mishandle if your code
assumes Content-Length is always present. pyhaul streams to the
sentinel and treats the final byte count as authoritative.
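A minimal decoder for that framing shows how the sentinel works. This is a sketch — real clients must also handle chunk extensions and trailers, omitted here:

```python
# Minimal chunked-transfer decoder: length-prefixed chunks ending in a
# zero-length sentinel. Sketch only; extensions and trailers are ignored.
import io


def decode_chunked(stream) -> bytes:
    body = bytearray()
    while True:
        size = int(stream.readline().strip(), 16)   # chunk size is hex
        if size == 0:                 # zero-length sentinel: "that's the end"
            stream.readline()         # consume the final CRLF
            return bytes(body)
        body += stream.read(size)
        stream.readline()             # consume the CRLF terminating this chunk
```

Note that the total length is only known once the sentinel arrives — which is exactly why the final byte count, not any header, has to be treated as authoritative.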
### Compression that interacts with ranges
HTTP responses can come back compressed (gzip, brotli, zstd), and
compression isn't guaranteed to produce the same output bytes for the
same input — two different servers, or even the same server on
different days, can compress the same asset slightly differently. That
means asking for "bytes 1,000,000 onward" inside a compressed stream
isn't a stable question; what you actually want is bytes 1,000,000
onward of the original, uncompressed file. Worse, some CDN edges
store only a compressed copy of an asset, so a careless resume
request can come back gzipped even when the bytes already saved on
disk are uncompressed — concatenate the two and the file is silently
corrupted from that point forward. pyhaul defends against this with
request-level headers that force an uncompressed byte stream end to end: `Accept-Encoding: identity` asks the origin server not to compress; `Cache-Control: no-transform` (RFC 9111 §5.2.2.6) forbids every proxy and CDN in the chain from re-compressing on the way back; and `Range` / `If-Range` are always set by pyhaul itself even if the caller tried to override them.
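The end-to-end defense reduces to a few request headers. A sketch of building them — a hypothetical helper mirroring the behavior described above, not pyhaul's code:

```python
# Build resume headers that force an identity-encoded byte stream.
# Hypothetical helper illustrating the behavior described above.
def resume_headers(offset: int, etag, caller_headers: dict) -> dict:
    headers = dict(caller_headers)                # caller's auth etc. survive
    headers["Accept-Encoding"] = "identity"       # origin: don't compress
    headers["Cache-Control"] = "no-transform"     # proxies/CDNs: don't re-compress
    headers["Range"] = f"bytes={offset}-"         # always set by the library,
    if etag is not None:                          # even over caller overrides
        headers["If-Range"] = etag
    else:
        headers.pop("If-Range", None)
    return headers
```

Caller-supplied values for these four headers are deliberately clobbered: letting an app-level `Accept-Encoding: gzip` through would reintroduce exactly the corruption the section describes.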
### Crash safety

The order of writes matters. pyhaul flushes the `.part` data with `fdatasync` before updating the checkpoint, writes the checkpoint atomically (write-to-tmp, fsync, rename), and finalizes the download by atomically renaming `.part` onto the destination path. Sidecar files are constructed as siblings of the destination, so every rename is on the same filesystem and truly atomic. Data bytes are never marked committed until they are on disk; the destination path is only ever populated by that final rename.
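The ordering can be sketched with stdlib primitives. This is a simplified model of the sequence described above, not pyhaul's code, and the JSON checkpoint format is invented for illustration:

```python
# Durable write ordering: data first, then an atomically-replaced checkpoint.
# Simplified model; the checkpoint format here is invented for illustration.
import json
import os
import tempfile


def append_and_checkpoint(part_path, ctrl_path, data: bytes, valid_length: int):
    # 1. Append the payload and force it to disk before anything else.
    with open(part_path, "ab") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())          # portable stand-in for fdatasync
    # 2. Write the checkpoint to a sibling tmp file, fsync it, then rename.
    #    os.replace is atomic when source and target share a filesystem.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(ctrl_path)))
    with os.fdopen(fd, "w") as f:
        json.dump({"valid_length": valid_length + len(data)}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, ctrl_path)        # crash before this: old checkpoint wins
```

A crash at any point leaves either the old checkpoint (pointing at bytes already proven durable) or the new one (whose bytes were fsynced first) — never a checkpoint that claims bytes the disk doesn't have.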
The payoff is a set of guarantees you can verify by inspecting a directory with no other context:
- If the destination file exists at the requested path, it is complete and uncorrupted. There is no in-between state where a partially-written file sits at the final name.
- If `.part` and `.part.ctrl` exist instead, the first `valid_length` bytes of `.part` are durable and correct. The rest is junk from preallocation or unflushed writes that gets trimmed on completion. A subsequent `haul()` call against the same destination picks up from byte `valid_length`.
These invariants sound obvious. Enforcing them through process kills,
OS-dependent sparse-allocation semantics, and writes that linger in
kernel buffers is where most home-grown downloaders silently lose
data. pyhaul handles the filesystem primitives correctly on Linux,
macOS, and Windows.
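Those invariants are strong enough that outside tooling can classify a download from the directory alone. A sketch of such an inspector — the `.ctrl` checkpoint format here is invented for illustration, as is the `inspect` helper itself:

```python
# Classify download state from the directory alone, using the invariants
# above. The JSON .ctrl format and this helper are illustrative only.
import json
import os


def inspect(dest: str) -> str:
    part, ctrl = dest + ".part", dest + ".part.ctrl"
    if os.path.exists(dest):
        return "complete"            # final name only ever appears via rename
    if os.path.exists(part) and os.path.exists(ctrl):
        with open(ctrl) as f:
            valid = json.load(f)["valid_length"]
        return f"resumable from byte {valid}"
    return "not started"
```

No lock files, no scanning the payload: the presence of the final name is itself the completion proof.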