Bulk Downloads¶
This guide covers the most common real-world use case: downloading many files, possibly in parallel, from a script or service.
The core mental model¶
Each haul() call downloads one file with one HTTP request. pyhaul does not
manage parallelism, retry policy, or queueing — those are yours. What pyhaul
gives you:
- Atomic completion. The destination file does not exist until the download is fully complete. There is no window where a partial file sits at the final path.
- Crash-safe resume. If the process is killed mid-download, sidecar files (.part and .part.ctrl) persist on disk. The next haul() call to the same destination picks up from the last durable byte.
- Independent checkpoints. Each destination has its own checkpoint. Ten downloads in flight means ten independent resume states. A crash partway through leaves each file individually resumable (see the listing below).
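For example, mid-run a data/ directory with two downloads in flight might look like this (illustrative listing):

data/2024-vol01.csv.gz.part         # in-progress bytes for the first file
data/2024-vol01.csv.gz.part.ctrl    # checkpoint for the first file only
data/2024-vol02.csv.gz.part         # in-progress bytes for the second file
data/2024-vol02.csv.gz.part.ctrl    # checkpoint for the second file only

Neither final .csv.gz file exists yet; each appears only when its own download completes.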
When can you safely access the destination file?¶
Only after haul() returns CompleteHaul.
Until that point, the file at the destination path does not exist. In-progress
data lives at <dest>.part. On completion, pyhaul atomically renames .part
to the final path. If haul() raises — for any reason — the destination file
is not created.
result = haul(url, client, dest="data.bin")
# At this point — and only at this point — data.bin exists and is complete.
# result.sha256 provides the integrity hash.
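If the publisher provides a checksum, compare it against result.sha256 once haul() returns. A minimal sketch, assuming result.sha256 is a hex digest and EXPECTED is a hypothetical published value:

EXPECTED = "9f86d081884c7d65..."  # hypothetical published checksum, truncated

result = haul(url, client, dest="data.bin")
if result.sha256 != EXPECTED:
    raise ValueError(f"checksum mismatch for data.bin: got {result.sha256}")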
Warning
Never read <dest>.part directly. Its contents may include unflushed data
or pre-allocated junk past the valid cursor. The .part.ctrl checkpoint
knows the true valid length; the engine handles this on resume.
Parallel downloads with threads (sync)¶
For sync clients (requests, httpx.Client, niquests.Session, urllib3),
use a thread pool. Each thread gets its own session or shares a thread-safe
session:
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
import time

import httpx

from pyhaul import haul, PartialHaulError

FILES = [
    ("https://data.example.edu/census/2024-vol01.csv.gz", Path("data/2024-vol01.csv.gz")),
    ("https://data.example.edu/census/2024-vol02.csv.gz", Path("data/2024-vol02.csv.gz")),
    ("https://data.example.edu/census/2024-vol03.csv.gz", Path("data/2024-vol03.csv.gz")),
]

MAX_RETRIES = 10

def download_with_retry(client: httpx.Client, url: str, dest: Path) -> Path:
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            haul(url, client, dest=dest)
            return dest
        except PartialHaulError:
            if attempt == MAX_RETRIES:
                raise
            # Back off before retrying; each retry resumes from the checkpoint.
            time.sleep(min(2**attempt, 30))
    return dest  # unreachable, but satisfies type checkers
Path("data").mkdir(exist_ok=True)
with httpx.Client() as client:
with ThreadPoolExecutor(max_workers=4) as pool:
futures = {
pool.submit(download_with_retry, client, url, dest): url
for url, dest in FILES
}
for future in as_completed(futures):
url = futures[future]
try:
path = future.result()
print(f"done: {path}")
except Exception as exc:
print(f"failed: {url}: {exc}")
Note
requests.Session is the only sync client that is not thread-safe.
Create one session per thread, or switch to niquests.Session, httpx.Client,
or urllib3.PoolManager — all of which are thread-safe.
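If you'd rather keep requests, a thread-local session gives each worker thread its own Session without changing the pool code. A minimal sketch, assuming the same haul(url, client, dest=dest) call shape as above:

import threading
from pathlib import Path

import requests

from pyhaul import haul

_local = threading.local()

def _session() -> requests.Session:
    # Lazily create one Session per worker thread and reuse it.
    if not hasattr(_local, "session"):
        _local.session = requests.Session()
    return _local.session

def download(url: str, dest: Path) -> Path:
    haul(url, _session(), dest=dest)
    return dest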
Parallel downloads with asyncio¶
For async clients (httpx.AsyncClient, aiohttp.ClientSession,
niquests.AsyncSession), use asyncio.TaskGroup or asyncio.gather:
import asyncio
from pathlib import Path

import httpx

from pyhaul import haul_async, PartialHaulError

FILES = [
    ("https://data.example.edu/census/2024-vol01.csv.gz", Path("data/2024-vol01.csv.gz")),
    ("https://data.example.edu/census/2024-vol02.csv.gz", Path("data/2024-vol02.csv.gz")),
    ("https://data.example.edu/census/2024-vol03.csv.gz", Path("data/2024-vol03.csv.gz")),
]

async def download_one(
    client: httpx.AsyncClient, url: str, dest: Path
) -> Path:
    for attempt in range(1, 11):
        try:
            await haul_async(url, client, dest=dest)
            return dest
        except PartialHaulError:
            if attempt == 10:
                raise
            await asyncio.sleep(min(2**attempt, 30))
    return dest

async def main() -> None:
    Path("data").mkdir(exist_ok=True)
    async with httpx.AsyncClient() as client:
        async with asyncio.TaskGroup() as tg:
            tasks = [
                tg.create_task(download_one(client, url, dest))
                for url, dest in FILES
            ]
        for task in tasks:
            print(f"done: {task.result()}")

asyncio.run(main())
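asyncio.TaskGroup cancels the remaining tasks as soon as one fails. If you would rather let every download run to the end and collect failures afterwards, asyncio.gather with return_exceptions=True does that; a sketch reusing download_one from above:

async def main() -> None:
    Path("data").mkdir(exist_ok=True)
    async with httpx.AsyncClient() as client:
        results = await asyncio.gather(
            *(download_one(client, url, dest) for url, dest in FILES),
            return_exceptions=True,
        )
    # gather preserves input order, so results line up with FILES.
    for (url, _), result in zip(FILES, results):
        if isinstance(result, Exception):
            print(f"failed: {url}: {result}")
        else:
            print(f"done: {result}")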
Limiting concurrency¶
If you're downloading hundreds of files, limit concurrency with a semaphore:
sem = asyncio.Semaphore(8)

async def download_one(client, url, dest):
    async with sem:
        for attempt in range(1, 11):
            try:
                await haul_async(url, client, dest=dest)
                return dest
            except PartialHaulError:
                if attempt == 10:
                    raise
                await asyncio.sleep(min(2**attempt, 30))
        return dest
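The threaded version gets the same limit from max_workers: the pool never runs more than that many haul() calls at once. If you prefer to avoid module-level state in the async version, create the semaphore inside main() and pass it in explicitly; a variant sketch:

async def download_one(
    client: httpx.AsyncClient, sem: asyncio.Semaphore, url: str, dest: Path
) -> Path:
    async with sem:
        for attempt in range(1, 11):
            try:
                await haul_async(url, client, dest=dest)
                return dest
            except PartialHaulError:
                if attempt == 10:
                    raise
                await asyncio.sleep(min(2**attempt, 30))
        return dest

async def main() -> None:
    sem = asyncio.Semaphore(8)  # created inside the running event loop
    async with httpx.AsyncClient() as client:
        async with asyncio.TaskGroup() as tg:
            for url, dest in FILES:
                tg.create_task(download_one(client, sem, url, dest))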
What happens on interruption¶
When your script is killed (SIGINT, SIGTERM, OOM, power loss):
- Any in-flight haul() calls leave their .part and .part.ctrl files on disk.
- Destination files that were not yet complete do not exist at their final paths.
- Destination files from previously completed downloads are unaffected.
- Re-running the same script resumes each incomplete download from its checkpoint.
You do not need to track which files completed. Just iterate your download list
again — haul() will skip already-complete files (if the destination exists)
or resume from the checkpoint.
for url, dest in FILES:
    if dest.exists():
        continue  # already complete from a previous run
    download_with_retry(client, url, dest)
The .part and .part.ctrl relationship¶
Resume only works when both sidecar files are present. The .part.ctrl
file is the checkpoint — it records the cursor position, ETag, and block-level
hashes that pyhaul needs to verify the .part data before appending. Without
it, pyhaul has no way to trust the bytes already on disk.
If the .part file exists without its .part.ctrl:
pyhaul treats this as a fresh download. It overwrites the orphaned .part from
byte 0 — no resume, no error. The data in the old .part is discarded.
This can happen if the .part.ctrl file is accidentally deleted, or if a
.part file was left behind by a different tool or a previous version of
your script.
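The overwrite behavior is easy to see; an illustrative sketch, assuming url and client as in the earlier examples:

from pathlib import Path

# An orphaned .part with no .part.ctrl checkpoint alongside it...
Path("data.bin.part").write_bytes(b"stale bytes from some other tool")

# ...is overwritten from byte 0 on the next call, not resumed.
haul(url, client, dest="data.bin")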
If a .part file is left behind by something other than pyhaul:
pyhaul will overwrite it from the beginning (since there's no matching
checkpoint). This is safe — the .part file is a scratch workspace, not a
final artifact. The no-clobber guarantee applies only to the destination
file, which is never written to until the atomic rename at completion.
Warning
If you are managing downloads across process restarts, do not delete
.part.ctrl files without also deleting their .part companions. An
orphaned .part wastes disk space until pyhaul (or your cleanup code)
overwrites or removes it.
Cleaning up failed downloads¶
To discard a partial download and force a restart, delete both sidecar files:
from pathlib import Path

dest = Path("data.bin")
dest.with_name(dest.name + ".part.ctrl").unlink(missing_ok=True)  # checkpoint first
dest.with_name(dest.name + ".part").unlink(missing_ok=True)
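To sweep an entire directory tree for orphaned .part files (ones whose .part.ctrl checkpoint is gone, so their data cannot be trusted or resumed anyway), a minimal sketch assuming the sidecar naming described above:

from pathlib import Path

def sweep_orphans(root: Path) -> None:
    for part in root.rglob("*.part"):
        ctrl = part.with_name(part.name + ".ctrl")
        if not ctrl.exists():
            # No checkpoint means pyhaul would restart from byte 0 anyway,
            # so the partial data is dead weight.
            part.unlink()

sweep_orphans(Path("data"))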
Summary¶
| Question | Answer |
|---|---|
| When does the destination file exist? | Only after haul() returns CompleteHaul |
| What files are on disk during download? | <dest>.part (data) and <dest>.part.ctrl (checkpoint) |
| What happens on crash/kill? | Sidecar files persist; next haul() resumes |
| .part exists without .part.ctrl? | pyhaul starts over from byte 0 (no resume possible) |
| Is it safe to read .part directly? | No — use the completed destination file only |
| How do I force a fresh download? | Delete both .part and .part.ctrl |