Skip to content

fix(s3): bound concurrent multipart part uploads with a global semaphore#66

Merged
jaredLunde merged 1 commit into
mainfrom
jared/s3-fixes
Jun 2, 2026
Merged

fix(s3): bound concurrent multipart part uploads with a global semaphore#66
jaredLunde merged 1 commit into
mainfrom
jared/s3-fixes

Conversation

@jaredLunde
Copy link
Copy Markdown
Contributor

Problem

object_store's WriteMultipart spawns one upload task per part with no concurrency limit, and glidefs acquired the max_s3_uploads semaphore per pack, not per part. So a large pack fanned out into ~20 simultaneous part PUTs, and across concurrent packs × 64 exports this opened 450+ connections — saturating the uplink into bufferbloat (~600 ms RTT), 12–15% packet loss, and TCP congestion collapse. Uploads then exceeded the 300 s timeout and failed in a loop. max_s3_uploads looked like "128" but the real connection count was 128 × ~20.

Fix

Replace WriteMultipart with BoundedMultipart: every part PUT (and every single-PUT small pack) acquires one permit from the shared upload semaphore and holds it until that request completes. max_s3_uploads now means exactly "max simultaneous outbound S3 write requests," globally — independent of how many exports/packs/parts exist. Bounded memory (≤ limit parts in flight) and natural backpressure (a slow part holds its permit) come for free.

This is the correct invariant: the limit is enforced at the granularity of the actual network operation (the part PUT), so the configured number maps directly to connections with no multiplicative surprise.

Tests

  • stream_pack_bounds_global_inflight_part_uploads — 3 concurrent packs (18 parts of demand) sharing a 4-permit semaphore peak at exactly 4 concurrent part PUTs. Asserts == LIMIT: the bound is both never exceeded and actually reached (not over-throttling). Unbounded this would be 18.
  • bounded_multipart_round_trips_pack_bytes — a multipart-streamed pack (256-byte parts, real InMemory upload) is byte-identical to the canonical assembled pack; index parses and data round-trips.
  • 9 existing integration tests (compaction / cooldown / flush / crash-midway / abort-orphan) pass through the new upload path.

🤖 Generated with Claude Code

object_store's WriteMultipart spawns one upload task per part with no
concurrency limit, and glidefs acquired the upload semaphore per *pack* rather
than per *part*. A large pack (up to the chunk size) fanned out into ~20
simultaneous part PUTs; across concurrent packs and 64 exports this opened
hundreds of connections (observed 450+), saturating the uplink into bufferbloat
(~600ms RTT), 12-15% packet loss, and TCP congestion collapse — uploads then
exceeded the 300s timeout and failed in a loop.

Replace WriteMultipart with BoundedMultipart: every part PUT (and every
single-PUT small pack) acquires one permit from the shared upload semaphore and
holds it until that request completes. `max_s3_uploads` now means exactly "max
simultaneous outbound S3 write requests" — globally, independent of how many
exports/packs/parts exist. Bounded memory (<= limit parts buffered/in-flight)
and natural backpressure (a slow part holds its permit) fall out for free.

Tests:
- stream_pack_bounds_global_inflight_part_uploads: 3 concurrent packs sharing a
  4-permit semaphore peak at *exactly* 4 concurrent part PUTs (18 unbounded) —
  verifies the bound is both never exceeded and actually reached.
- bounded_multipart_round_trips_pack_bytes: a multipart-streamed pack (tiny
  parts, real InMemory upload) is byte-identical to the canonical assembled
  pack, with a parseable index and round-tripping data.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jaredLunde jaredLunde merged commit 1d18e7a into main Jun 2, 2026
24 checks passed
@jaredLunde jaredLunde deleted the jared/s3-fixes branch June 2, 2026 04:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant