feat: add streaming download support for daily extract files #160

@be-ant

Problem

get_daily_extract_file in core.py:526-534 calls requests.get() without stream=True. The wrap_return decorator's FILE_RESPONSE path (common.py:64-67) then returns response.content, loading the entire file into memory before returning it.

Impact

Daily extract ZIPs can be 50+ MB. In memory-constrained environments (AWS Lambda at 128–512 MB), buffering the whole file wastes memory and risks OOM errors, especially when multiple extracts are processed concurrently.

POC Evidence

A streaming download of a 53.77 MB daily extract file was benchmarked against the current buffered approach:

Approach                                 | Peak Memory
Current (response.content)               | ~108 MB
Streaming (stream=True + iter_content)   | ~55 KB

Both requests (stream=True) and httpx streaming produced identical byte counts. The OFS endpoint includes a Content-Length header, so progress tracking is also possible.
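As a rough illustration of the progress tracking mentioned above, here is a minimal sketch. The helper name iter_with_progress is hypothetical and not part of the library; it only assumes an object that behaves like a requests.Response opened with stream=True.

```python
def iter_with_progress(response, chunk_size=8192):
    """Yield (chunk, bytes_done, total) for a streamed response.

    Hypothetical helper. `response` is assumed to behave like a
    requests.Response opened with stream=True. `total` is None when the
    Content-Length header is missing or not a valid integer.
    """
    raw_total = response.headers.get("Content-Length")
    try:
        total = int(raw_total) if raw_total is not None else None
    except ValueError:
        total = None

    bytes_done = 0
    for chunk in response.iter_content(chunk_size=chunk_size):
        if not chunk:  # skip empty keep-alive chunks
            continue
        bytes_done += len(chunk)
        yield chunk, bytes_done, total
```

A caller could write each chunk to disk and report bytes_done / total whenever total is not None. Only one chunk is held in memory at a time.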

Proposed Solution

Add a new method get_daily_extract_file_stream(date, filename, chunk_size=8192) that:

  1. Calls requests.get(..., stream=True) (bypassing or extending the wrap_return decorator)
  2. Returns an iterator (or the response object as a context manager) yielding chunks via iter_content(chunk_size)
  3. Either uses a new response type (e.g., STREAM_RESPONSE) in the wrap_return decorator, or bypasses the decorator entirely
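To make the "extend the decorator" option concrete, here is a simplified sketch. It is not the actual wrap_return from common.py: the ResponseType enum, the wrap_return_sketch name, and the raise_for_status() calls are assumptions made for illustration. Only the FILE_RESPONSE behavior (returning response.content) is taken from the description above.

```python
from enum import Enum
from functools import wraps


class ResponseType(Enum):
    """Hypothetical stand-in for the library's response-type constants."""
    FILE_RESPONSE = "file"
    STREAM_RESPONSE = "stream"  # proposed addition


def wrap_return_sketch(response_type):
    """Simplified, hypothetical version of a wrap_return-style decorator."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            response = func(*args, **kwargs)
            if response_type is ResponseType.STREAM_RESPONSE:
                try:
                    response.raise_for_status()
                except Exception:
                    response.close()  # do not leak the open connection
                    raise
                # Hand back the open response; the caller iterates over
                # iter_content() and is responsible for closing it.
                return response
            response.raise_for_status()
            # FILE_RESPONSE: reading .content buffers the whole body.
            return response.content
        return wrapper
    return decorator
```

With this shape, the new method would be decorated with the stream type and would call requests.get(..., stream=True) itself. The existing get_daily_extract_file would stay unchanged.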

Example usage (desired API)

with ofsc.get_daily_extract_file_stream(date="2024-01-15", filename="extract.zip") as stream:
    with open("extract.zip", "wb") as f:
        for chunk in stream.iter_content(chunk_size=8192):
            f.write(chunk)
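
The usage above needs the returned object to work as a context manager. A requests.Response already supports the with statement, so returning it directly would be enough. If a status check and guaranteed cleanup are wanted as well, a thin wrapper along these lines could back the method. The name open_extract_stream and its parameters are hypothetical, and the caller is assumed to build the full file URL.

```python
from contextlib import contextmanager


@contextmanager
def open_extract_stream(url, headers=None, session=None, timeout=60):
    """Hypothetical helper: yield a streamed response and always close it.

    `url` is the full daily-extract file URL, built by the caller.
    `session` is anything with a requests-compatible get(); when omitted,
    the requests module itself is used.
    """
    if session is None:
        import requests  # lazy import so a custom session can be injected
        session = requests
    response = session.get(url, headers=headers, stream=True, timeout=timeout)
    try:
        response.raise_for_status()
        yield response
    finally:
        # Releases the connection even if the caller stops reading early.
        response.close()
```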

Bonus: additionalHeaders Bug

While reviewing the implementation, a bug was found in _base.py:220-223:

# line 221 merges headers correctly...
headers = {**self.headers, **additionalHeaders}
# ...but line 223 overwrites `headers` with self.headers, dropping the merge
headers = self.headers  # BUG: discards additionalHeaders

This means additionalHeaders passed by callers is silently ignored. Worth fixing alongside the streaming work.
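
One possible shape for the fix, sketched as a standalone function rather than the actual code in _base.py. The merge_headers name is hypothetical; the intent is that caller-supplied headers override the defaults and the defaults are never mutated.

```python
def merge_headers(base_headers, additional_headers=None):
    """Return a new dict: base headers overridden by caller-supplied ones."""
    merged = dict(base_headers)  # copy so the shared defaults stay untouched
    if additional_headers:
        merged.update(additional_headers)
    return merged
```

In the method itself, the equivalent change would be to delete the second assignment so that the merged dict is the one actually sent.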
