Stream POST request in order to handle large files#161
Conversation
The Multipart encoder helps requests to upload large files without the need to read the entire file in memory
|
Did you try compressing the file? I feel like streaming is okay but I think really the issue is that disk images are really huge. |
|
Even when compressing asset, it is really common to upload very large files. It could be .ova or android disk image for example. Storing large file into memory just to send a HTTP request is not ok. It's a known limitation of the requests module and it's a shame we need to use another module to achieve what should be a normal behavior. (Sorry if I made grammar mistakes it's 2am in France) |
|
My point is more that instead of switching out the behavior in ctfcli, it probably would have been better to just compress your file. While I am roughly okay with the PR and using streaming, I am not sure if ctfcli should support behavior like this. No one really wants to download a 16 GB file. |
|
I understand but sometimes we have simply no choice 😅 For example, this is a CTF we organize each year. The last edition, we got files up to 2.5GB. During deployments this caused big spikes of RAM usage in order to send the file, and sometimes it caused OOM errors. Forensic challenges can be really big.16GB was the largest archive I've ever seen (I was a disk dump from a Windows server if I remember correctly) |
|
This would help me too, I often have compressed forensics artefacts in the 1-5GB range. I upload to CTFd Cloud using beefy CI runners which don't crash, but synchronous reading into memory makes the upload pretty slow. My other workaround was manually uploading to external blob storage, then linking from CTFd |
|
I ran into the same issue. |
|
I think that's a stronger rationale and I think in truth limits need to truly enforced versus how they are ultimately hit. I will fix the merge conflicts and accept. |
There was a problem hiding this comment.
Pull request overview
This PR replaces requests' built-in multipart encoding with requests_toolbelt.MultipartEncoder in ctfcli.core.api.API.request, so that large file uploads (e.g. multi-GB challenge artifacts) are streamed instead of being buffered in memory to compute Content-Length. The new encoder takes fields in a different shape, so callers must now pass each file as (filename, fileobj) tuples, and int form values are coerced to strings since MultipartEncoder rejects non-string values.
Changes:
- Rewrite
API.requestto detect form-data calls and route them throughMultipartEncoder, settingContent-Typefrom the encoder. - Update file-upload call sites (
_create_file,_create_all_files) to the new(filename, fileobj)value shape. - Add
requests-toolbelt==1.0.0dependency and regenerateuv.lock.
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| ctfcli/core/api.py | Builds multipart body via MultipartEncoder for streaming uploads |
| ctfcli/core/challenge.py | Updates _create_file/_create_all_files to the new files= shape |
| pyproject.toml | Adds requests-toolbelt==1.0.0 dependency |
| uv.lock | Regenerated lockfile (revision bump and upload_time fields) |
Note: _create_solution in ctfcli/core/challenge.py (around line 587–594) still passes ("file", open(...)) to files=, which is incompatible with the new MultipartEncoder contract; this is outside the diff regions so a comment could not be attached there, but it should be fixed in the same PR.
Comments suppressed due to low confidence (1)
ctfcli/core/api.py:67
- Only
dictis handled here, but the outer guard on line 58 also admits anyMapping. A non-dictMapping(e.g. anOrderedDictsubclass with a custom type, or another mapping) would skip the value-stringification path andfieldswould be left empty. Useisinstance(data, Mapping)for consistency with the outer check.
if isinstance(data, dict):
# int are not allowed as value in MultipartEncoder
fields = list(map(lambda v: (v[0], str(v[1]) if isinstance(v[1], int) else v[1]), data.items()))
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if isinstance(data, dict): | ||
| # int are not allowed as value in MultipartEncoder | ||
| fields = list(map(lambda v: (v[0], str(v[1]) if isinstance(v[1], int) else v[1]), data.items())) | ||
|
|
||
| if files is not None: | ||
| if isinstance(files, dict): |
| # int are not allowed as value in MultipartEncoder | ||
| fields = list(map(lambda v: (v[0], str(v[1]) if isinstance(v[1], int) else v[1]), data.items())) |
|
|
||
| return super(API, self).request( | ||
| method, | ||
| url, | ||
| data=multipart, | ||
| headers={"Content-Type": multipart.content_type}, |
| fields = list() | ||
| if isinstance(data, dict): | ||
| # int are not allowed as value in MultipartEncoder | ||
| fields = list(map(lambda v: (v[0], str(v[1]) if isinstance(v[1], int) else v[1]), data.items())) | ||
|
|
||
| if files is not None: | ||
| if isinstance(files, dict): | ||
| files = list(files.items()) | ||
| fields.extend(files) # type: ignore | ||
|
|
||
| multipart = MultipartEncoder(fields) | ||
|
|
||
| return super(API, self).request( | ||
| method, | ||
| url, | ||
| data=multipart, | ||
| headers={"Content-Type": multipart.content_type}, |
The bug
I ran into a problem last year: when I tried to create or synchronize a challenge containing a large file (i.e. a forensics challenge with a 15 GB disk image), the entire file was put into memory before starting the request
This causes crashes since I only have 16GB of RAM in my computer.
The cause
Although the
requestsmodule supports body streaming when you pass a file pointer to thedataparameter, it is not capable of streaming form-data.When the
requestsmodule prepares the headers, it tries to calculate theContent-Length. As a result, the entire body will be stored in memory.The fix
One solution would be to switch to another HTTP client, capable of streaming form-data.
I chose to modify as little code as possible. I made the choice to delegate the body encoding to the
MultipartEncoderfrom therequests-toolbeltmodule. This requires a few modifications to theAPIclass, since theMultipartEncodertakes parameters differently fromrequests.As a result files must be sent with a filename hint: