Design & Architecture¶
See also: Why pyhaul Exists · Control File Spec
How resume works¶
`haul()` reads `.part.ctrl` (if it exists) to recover the cursor position and stored ETag.

- Sends `Range: bytes=<cursor>-` with `If-Range: <etag>` (omitted when no ETag is stored).
- 206 Partial Content — server honors the range. Stream appends from the cursor.
- 200 OK — server ignores the range (resource changed, or server doesn't support ranges). Cursor resets to 0; stream overwrites from the beginning.
- 416 Range Not Satisfiable — the server's reported total matches the cursor (already complete) or the representation shrank (checkpoint reset, next call restarts).
The engine handles each case without caller intervention.
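For illustration, here is the same three-way branch as standalone code, using `requests` and a hypothetical helper name. This is a sketch of the behavior described above, not pyhaul's implementation:

```python
import requests  # illustration only; pyhaul itself borrows your session

def open_resumable_stream(url: str, cursor: int, etag: str | None):
    """Sketch of the resume handshake described above (not pyhaul's code)."""
    headers: dict[str, str] = {}
    if cursor > 0:
        headers["Range"] = f"bytes={cursor}-"
        if etag is not None:
            headers["If-Range"] = etag
    resp = requests.get(url, headers=headers, stream=True)

    if resp.status_code == 206:   # range honored: append at the cursor
        return resp, cursor
    if resp.status_code == 200:   # range ignored or resource changed: restart
        return resp, 0
    if resp.status_code == 416:   # already complete, or representation shrank
        raise RuntimeError("range not satisfiable; reset checkpoint and retry")
    resp.raise_for_status()
    raise RuntimeError(f"unexpected status {resp.status_code}")
```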
Download lifecycle¶
- In-flight. Two sidecar files: `<dest>.part` (data) and `<dest>.part.ctrl` (binary checkpoint with cursor position, ETag, block-level hashes, etc.).
- Interrupted. Both files remain. Next `haul()` resumes automatically.
- Complete. `.part` is atomically renamed to `dest`; `.ctrl` is deleted. SHA-256 is computed and returned.
- Discard. Delete both `.part` and `.part.ctrl` to force a restart.
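These states are all visible on the filesystem. A minimal sketch that classifies a destination (the helper name is hypothetical; pyhaul tracks this internally):

```python
from pathlib import Path

def download_state(dest: Path) -> str:
    """Classify a destination by which files exist on disk (hypothetical helper)."""
    part = dest.with_suffix(dest.suffix + ".part")
    ctrl = Path(str(part) + ".ctrl")
    if dest.exists():
        return "complete"      # finalized: .part was renamed, .ctrl deleted
    if part.exists() and ctrl.exists():
        return "interrupted"   # next haul() resumes automatically
    return "fresh"             # nothing yet, or the sidecars were discarded
```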
Sidecar file naming and preflight¶
All sidecar files are derived from the caller-provided destination path by appending suffixes:
| File | Derivation | Purpose |
|---|---|---|
| `<dest>.part` | `dest.with_suffix(dest.suffix + ".part")` | Partial data |
| `<dest>.part.ctrl` | `<dest>.part` + `.ctrl` | Binary checkpoint |
| `<dest>.part.ctrl.tmp` | `<dest>.part.ctrl` + `.tmp` | Ephemeral temp for atomic checkpoint writes |
For a destination `data/2024-vol01.csv.gz`, the files are `data/2024-vol01.csv.gz.part`, `data/2024-vol01.csv.gz.part.ctrl`, and (transiently) `data/2024-vol01.csv.gz.part.ctrl.tmp`.
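The derivations above map directly onto `pathlib`. A sketch of the naming rules, mirroring the table:

```python
from pathlib import Path

dest = Path("data/2024-vol01.csv.gz")
part = dest.with_suffix(dest.suffix + ".part")  # data/2024-vol01.csv.gz.part
ctrl = Path(str(part) + ".ctrl")                # data/2024-vol01.csv.gz.part.ctrl
tmp = Path(str(ctrl) + ".tmp")                  # data/2024-vol01.csv.gz.part.ctrl.tmp
```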
Preflight path validation¶
Before making any HTTP request, `haul()` checks that the destination path plus the longest sidecar suffix (`.part.ctrl.tmp`) fits within the filesystem's limits. This catches two independent constraints:
- Filename component length — `NAME_MAX` for the target directory (255 bytes on ext4/APFS/XFS, 255 UTF-16 code units on NTFS). On macOS, filenames are measured in NFD-normalized UTF-8 bytes to match what APFS actually stores.
- Full path length — `PATH_MAX` for the target filesystem. On Windows, paths beyond 250 characters automatically get the `\\?\` extended-length prefix, raising the effective limit to 32,767 characters.
If either check fails, `haul()` raises `DestinationError` immediately — before opening any network connection or creating any files.
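As a rough illustration of the first constraint, here is a POSIX-only sketch with a hypothetical helper name. pyhaul's real preflight also covers `PATH_MAX` and the Windows prefix handling:

```python
import os
import sys
import unicodedata
from pathlib import Path

LONGEST_SUFFIX = ".part.ctrl.tmp"

def check_name_length(dest: Path) -> None:
    """Hypothetical sketch of the NAME_MAX half of the preflight (POSIX only)."""
    name = dest.name + LONGEST_SUFFIX
    if sys.platform == "darwin":
        # APFS stores NFD-normalized UTF-8, so measure what it will store.
        name = unicodedata.normalize("NFD", name)
    name_max = os.pathconf(dest.parent, "PC_NAME_MAX")
    if len(name.encode("utf-8")) > name_max:
        raise ValueError(f"{name!r} exceeds NAME_MAX ({name_max}) in {dest.parent}")
```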
Transport adapter architecture¶
`haul()` and `haul_async()` auto-detect supported client types and wrap them internally via a registry of adapter factories. The adapter protocol is deliberately minimal: a single `stream_get()` context manager.
This design means:
- Your session is borrowed, not owned. pyhaul never creates, configures, or closes sessions. Auth headers, proxy config, connection pools — everything passes through unchanged.
- Transport errors propagate unwrapped. `httpx.ReadTimeout` stays `httpx.ReadTimeout`. You catch the types you already know.
- Custom transports are easy. Implement `TransportSession` (one method) and register it (a minimal sketch follows this list). See Writing a Custom Adapter.
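For a sense of scale, a conforming adapter can be a handful of lines. The `stream_get()` signature below is an assumed shape for illustration; see Writing a Custom Adapter for the authoritative protocol:

```python
import urllib.request
from contextlib import contextmanager

class UrllibSession:
    """Hypothetical adapter over urllib; the real protocol may differ in detail."""

    @contextmanager
    def stream_get(self, url: str, headers: dict[str, str]):
        req = urllib.request.Request(url, headers=headers)
        with urllib.request.urlopen(req) as resp:
            # Transport errors (urllib.error.URLError, ...) propagate unwrapped.
            yield resp
```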
For the supported client types and per-adapter notes, see HTTP Client Adapters.
Exception design¶
pyhaul's exception hierarchy separates retryable from non-retryable errors:
- `PartialHaulError` — the stream ended early, but progress is saved. Retry.
- `UnexpectedStatusError` — server returned a non-download status (429, 503, 404, …). Check `exc.is_transient` to decide whether to retry, and inspect `exc.retry_after` for server-requested backoff.
- `ServerMisconfiguredError` — the server violated HTTP in a way that makes safe resume impossible. Don't retry.
- `ControlFileError` — the checkpoint file is corrupt. Auto-recovers on next attempt.
Transport errors from the underlying HTTP library pass through unwrapped to preserve the caller's existing error-handling code.
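In caller code the split maps naturally onto a retry loop. A sketch, assuming an import path and `haul()` call signature for illustration:

```python
import time
from pyhaul import haul, PartialHaulError, UnexpectedStatusError  # assumed import path

def haul_with_retry(session, url, dest, attempts=10):
    for _ in range(attempts):
        try:
            return haul(session, url, dest)     # call signature assumed
        except PartialHaulError:
            continue                            # progress is checkpointed; call again
        except UnexpectedStatusError as exc:
            if not exc.is_transient:
                raise                           # 404 and friends: retrying won't help
            time.sleep(exc.retry_after or 1.0)  # honor server-requested backoff
    raise RuntimeError("gave up after repeated transient failures")
```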
For the full exception table, see Exceptions Reference.
Crash safety¶
The order of writes matters. pyhaul uses a strict three-phase sequence throughout:
1. Data first. Write bytes to the `.part` file, then call `fdatasync` to flush them to durable storage.
2. Checkpoint second. Serialize the new checkpoint, write it to `<dest>.part.ctrl.tmp`, `fsync` the temp file, then `rename` it over `<dest>.part.ctrl`. Because `rename` is atomic on POSIX (and `replace` on Windows), the checkpoint on disk is always either the old version or the new version — never a half-written mix.
3. Finalization last. On completion, atomically rename `.part` to the destination path, then delete `.part.ctrl`.
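Phase two is the classic write-fsync-rename pattern. A self-contained sketch:

```python
import os

def write_checkpoint(ctrl_path: str, payload: bytes) -> None:
    """Write a checkpoint so the on-disk file is always old or new, never mixed."""
    tmp = ctrl_path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())    # checkpoint bytes are durable before the rename
    os.replace(tmp, ctrl_path)  # atomic on POSIX and Windows
```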
Why atomic rename¶
The alternative — writing directly to the destination file — leaves a
window where a crash produces a partial file at the final path.
Readers of that file have no way to distinguish "complete" from
"half-written." The rename syscall is the standard POSIX mechanism
for atomic file replacement: the kernel updates a single directory
entry, so the operation either fully succeeds or has no effect. There
is no intermediate state visible to other processes.
Why same-filesystem matters for atomicity¶
`rename()` is only atomic when source and destination are on the same filesystem. A cross-filesystem "rename" is actually copy-then-delete, which is neither atomic nor crash-safe. pyhaul ensures this by constructing every sidecar file as a sibling of the destination — the `.part`, `.part.ctrl`, and `.part.ctrl.tmp` files all live in the same directory as the final destination. Since they share a directory, they are guaranteed to share a filesystem, and every rename in the pipeline is truly atomic.

This is also why pyhaul does not use `tempfile.mkstemp()` or the system temp directory (`/tmp`, `%TEMP%`). Those may reside on a different filesystem or partition than the destination, which would break the atomicity guarantee on the final rename.
The guarantees¶
The payoff is a set of invariants you can verify by inspecting a directory with no other context:
- If the destination file exists at the requested path, it is complete and uncorrupted. There is no in-between state where a partially-written file sits at the final name.
- If `.part` and `.part.ctrl` exist instead, the first `valid_length` bytes of `.part` are durable and correct. The rest is junk from preallocation or unflushed writes that gets trimmed on completion. A subsequent `haul()` call against the same destination picks up from byte `valid_length`.
Control file structure¶
The `.part.ctrl` file is a compact binary checkpoint. It packs everything pyhaul needs to resume into a small, flat structure:
- 40-byte fixed header — magic bytes (`HAUL`), format version, cursor position, block size, extent (total download size if known), and start offset.
- CRC-protected TLV extensions — variable-length metadata: the server's ETag (for change detection on resume), server-reported total length (if known), and a tail hash (SHA-256 of the current partial block). Each TLV chunk includes a CRC32 to detect corruption in the checkpoint itself.
- Hash payload — a flat sequence of 32-byte SHA-256 digests, one per completed 8 MiB block.
The full binary format is specified in the Control File Spec.
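To make the shape concrete, here is one way the fixed header could be read. The field order, widths, and endianness below are assumptions chosen only to sum to 40 bytes; the Control File Spec is authoritative:

```python
import struct

# Assumed layout: magic (4s), version (I), cursor (Q), block size (Q),
# extent (Q), start offset (Q) -- little-endian, 40 bytes total.
HEADER = struct.Struct("<4sIQQQQ")
assert HEADER.size == 40

def read_header(buf: bytes):
    magic, version, cursor, block_size, extent, start = HEADER.unpack(buf[:40])
    if magic != b"HAUL":
        raise ValueError("not a pyhaul control file")
    return version, cursor, block_size, extent, start
```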
Why block-level hashes¶
The naive approach to resume validation is: re-read the entire `.part` file, hash it, and compare against a stored hash. For small files that's fine. For a 50 GB file over a flaky satellite link where `haul()` gets called hundreds of times, re-reading and re-hashing 50 GB on every resume attempt would dominate the total download time.
Block-level hashing solves this. pyhaul hashes each 8 MiB block independently as bytes stream in, and stores the completed block hashes in the control file. On resume, only the last partial block (at most 8 MiB) needs to be re-read and verified against its stored tail hash. All previously completed blocks are trusted via their stored digests. This makes resume validation O(block_size), not O(file_size).
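A sketch of that validation step (names are hypothetical; the stored tail hash comes from the control file's TLV section):

```python
import hashlib

BLOCK = 8 * 2**20  # 8 MiB

def tail_is_valid(part_path: str, cursor: int, stored_tail_hash: bytes) -> bool:
    """Re-read only the final partial block -- O(block_size), not O(file_size)."""
    tail_start = (cursor // BLOCK) * BLOCK  # first byte of the partial block
    with open(part_path, "rb") as f:
        f.seek(tail_start)
        tail = f.read(cursor - tail_start)
    return hashlib.sha256(tail).digest() == stored_tail_hash
```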
On completion, the final file hash is computed as `SHA-256(concatenated block hashes)`, formatted as `<hex>-<count>`. This is a tree hash — the digest of digests — not a flat hash of the file content. It can be computed incrementally without ever holding the entire file in memory or re-reading it from disk.
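The final digest is cheap to reproduce from the stored block hashes. A sketch of the computation as described:

```python
import hashlib

def final_hash(block_hashes: list[bytes]) -> str:
    """SHA-256 over the concatenated 32-byte block digests, as <hex>-<count>."""
    digest = hashlib.sha256(b"".join(block_hashes)).hexdigest()
    return f"{digest}-{len(block_hashes)}"
```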
Control file size overhead¶
Even with a hash per block, the control file stays small. The 8 MiB block size means one 32-byte hash per 8 MiB of download data — a ratio of roughly 0.0004%:
| Part file | Blocks | Ctrl file size |
|---|---|---|
| 100 MB | 12 | 474 B |
| 1 GB | 120 | 3.8 KiB |
| 10 GB | 1,193 | 37 KiB |
| 100 GB | 11,921 | 373 KiB |
| 1 TB | 119,210 | 3.6 MiB |
A 100 GB download produces a checkpoint under 400 KiB. The control file is negligible relative to the data it describes.
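The table's figures follow from simple arithmetic: 40 header bytes, a small TLV section, and 32 bytes per block. The ~50-byte TLV allowance below is an assumption that happens to reproduce the table:

```python
import math

BLOCK = 8 * 2**20   # 8 MiB
HEADER_BYTES = 40   # fixed header
TLV_BYTES = 50      # rough allowance for ETag/length/tail-hash TLVs (assumed)

def ctrl_size(download_bytes: int) -> int:
    blocks = math.ceil(download_bytes / BLOCK)
    return HEADER_BYTES + TLV_BYTES + 32 * blocks

ctrl_size(100_000_000)      # 474      -> 474 B
ctrl_size(100_000_000_000)  # 381_562  -> ~373 KiB
```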
Why SHA-256 and not BLAKE3¶
A tree hash like BLAKE3 would be a more natural fit here — BLAKE3 is inherently a Merkle tree, so block-level incremental hashing is built into the algorithm rather than layered on top. It would also be faster (BLAKE3 uses SIMD and parallelism internally).
However, pyhaul is a zero-dependency pure-Python library. `hashlib.sha256` is in the standard library on every Python installation; BLAKE3 would require a compiled C/Rust extension (`blake3` on PyPI). Since the hashing overhead is small relative to network I/O and disk writes, SHA-256's performance is adequate, and the manual block-level tree construction achieves the same incremental-verification goal that BLAKE3 would provide natively.