fix(s3): bound concurrent multipart part uploads with a global semaphore#66
Merged
Conversation
object_store's WriteMultipart spawns one upload task per part with no concurrency limit, and glidefs acquired the upload semaphore per *pack* rather than per *part*. A large pack (up to the chunk size) fanned out into ~20 simultaneous part PUTs; across concurrent packs and 64 exports this opened hundreds of connections (observed 450+), saturating the uplink into bufferbloat (~600ms RTT), 12-15% packet loss, and TCP congestion collapse — uploads then exceeded the 300s timeout and failed in a loop. Replace WriteMultipart with BoundedMultipart: every part PUT (and every single-PUT small pack) acquires one permit from the shared upload semaphore and holds it until that request completes. `max_s3_uploads` now means exactly "max simultaneous outbound S3 write requests" — globally, independent of how many exports/packs/parts exist. Bounded memory (<= limit parts buffered/in-flight) and natural backpressure (a slow part holds its permit) fall out for free. Tests: - stream_pack_bounds_global_inflight_part_uploads: 3 concurrent packs sharing a 4-permit semaphore peak at *exactly* 4 concurrent part PUTs (18 unbounded) — verifies the bound is both never exceeded and actually reached. - bounded_multipart_round_trips_pack_bytes: a multipart-streamed pack (tiny parts, real InMemory upload) is byte-identical to the canonical assembled pack, with a parseable index and round-tripping data. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
object_store'sWriteMultipartspawns one upload task per part with no concurrency limit, and glidefs acquired themax_s3_uploadssemaphore per pack, not per part. So a large pack fanned out into ~20 simultaneous part PUTs, and across concurrent packs × 64 exports this opened 450+ connections — saturating the uplink into bufferbloat (~600 ms RTT), 12–15% packet loss, and TCP congestion collapse. Uploads then exceeded the 300 s timeout and failed in a loop.max_s3_uploadslooked like "128" but the real connection count was 128 × ~20.Fix
Replace
WriteMultipartwithBoundedMultipart: every part PUT (and every single-PUT small pack) acquires one permit from the shared upload semaphore and holds it until that request completes.max_s3_uploadsnow means exactly "max simultaneous outbound S3 write requests," globally — independent of how many exports/packs/parts exist. Bounded memory (≤ limit parts in flight) and natural backpressure (a slow part holds its permit) come for free.This is the correct invariant: the limit is enforced at the granularity of the actual network operation (the part PUT), so the configured number maps directly to connections with no multiplicative surprise.
Tests
stream_pack_bounds_global_inflight_part_uploads— 3 concurrent packs (18 parts of demand) sharing a 4-permit semaphore peak at exactly 4 concurrent part PUTs. Asserts== LIMIT: the bound is both never exceeded and actually reached (not over-throttling). Unbounded this would be 18.bounded_multipart_round_trips_pack_bytes— a multipart-streamed pack (256-byte parts, realInMemoryupload) is byte-identical to the canonical assembled pack; index parses and data round-trips.🤖 Generated with Claude Code