Fix S3 cache race that breaks restores with "did not match expected ETag" #153

Open
vivster7 wants to merge 1 commit into buildkite-plugins:master from vivster7:fix/s3-cache-etag-race

Problem

With the s3 backend, two agents running the same step in parallel can fail a cache restore with:

download failed: s3://<bucket>/<key> ... Contents of stored object "<key>" in bucket "<bucket>" did not match expected ETag.

Root cause is a race between a restore and a concurrent re-save of the same S3 key:

  • Keys are content-addressed (lib/shared.bash hashes the manifest), so parallel jobs compute an identical key.
  • Save overwrites non-atomically: exists (a list-objects-v2) then aws s3 cp. Two jobs can both see "not exists" and both upload — last write wins.
  • Restore validates against a stale ETag: aws s3 cp (botocore s3transfer) HEADs the object, records its ETag, then issues ranged GETs with IfMatch=<etag>. An overwrite landing mid-download makes S3 return 412 PreconditionFailed, which the CLI surfaces as the error above. This only affects objects past multipart_threshold (8 MB), i.e. large dependency caches.
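The restore-side sequence in the last bullet can be approximated with two `s3api` calls. This is a rough sketch and not what `aws s3 cp` literally runs: the helper name `ranged_get_pinned` and its bucket, key, range, and output arguments are hypothetical, and it fetches one range rather than many in parallel.

```shell
# Sketch: pin the ETag seen at HEAD time, then require it on a ranged GET.
# ranged_get_pinned is a hypothetical helper, not part of the plugin.
ranged_get_pinned() {
  local bucket="$1" key="$2" range="$3" out="$4" etag

  # HEAD the object and record its current ETag.
  etag="$(aws s3api head-object --bucket "$bucket" --key "$key" \
    --query ETag --output text)" || return 1

  # If another job overwrites the key between the two calls, the pinned ETag
  # no longer matches and the GET fails with 412 PreconditionFailed.
  aws s3api get-object --bucket "$bucket" --key "$key" \
    --range "$range" --if-match "$etag" "$out" >/dev/null
}
```

Calling `ranged_get_pinned my-bucket my-key bytes=0-8388607 part.bin` while a second process re-uploads `my-key` is one way to observe the failure window; the bucket and key names here are placeholders.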

I reproduced this end-to-end against MinIO and AWS CLI v2.31.38 in two ways: a deterministic 412 from a stale If-Match, and a throttled cp overwritten mid-download, which produced the exact error string.

Fix

  • save_cache: store single objects with a conditional create — aws s3api put-object --if-none-match '*'. The first writer wins; later writers get PreconditionFailed, treated as success. The object is never overwritten, so its ETag stays stable. Falls back to a normal copy when conditional writes aren't available (older CLI / S3-compatible endpoint) or the object exceeds the single-PUT limit. force still overwrites.
  • restore_cache: retry the download on the transient did not match expected ETag failure, configurable via BUILDKITE_PLUGIN_S3_CACHE_DOWNLOAD_RETRIES (default 3). Content-addressed keys guarantee the retried download returns identical contents.
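The save-side behaviour described above could look roughly like the following. This is a minimal sketch under stated assumptions, not the code in this PR: `save_object` and its arguments are hypothetical names, 5 GiB is assumed as the single-PUT ceiling, and any failure other than `PreconditionFailed` is assumed to mean conditional writes are unavailable.

```shell
# Sketch of a first-writer-wins save. save_object and its arguments are
# hypothetical; the plugin's real save_cache may be structured differently.
SINGLE_PUT_LIMIT=5368709120 # assumed 5 GiB ceiling for one PUT request

save_object() {
  local file="$1" bucket="$2" key="$3" force="${4:-false}" size err

  size="$(wc -c <"$file" | tr -d ' ')"

  # force keeps the overwrite behaviour; oversized files cannot use a single
  # PUT, so both cases go through the ordinary copy.
  if [ "$force" = "true" ] || [ "$size" -gt "$SINGLE_PUT_LIMIT" ]; then
    aws s3 cp "$file" "s3://${bucket}/${key}"
    return
  fi

  # Conditional create: succeeds only if no object exists at the key yet.
  # Only stderr is captured so the error text can be inspected.
  if err="$(aws s3api put-object --bucket "$bucket" --key "$key" \
    --body "$file" --if-none-match '*' 2>&1 >/dev/null)"; then
    return 0
  fi

  case "$err" in
    *PreconditionFailed*)
      # Another job created the object first; same key, same contents.
      return 0 ;;
    *)
      # Assumption: any other failure means conditional writes are not
      # usable here (older CLI, S3-compatible endpoint), so fall back.
      aws s3 cp "$file" "s3://${bucket}/${key}" ;;
  esac
}
```

Treating every non-`PreconditionFailed` error as "unsupported" keeps the sketch short; a genuine failure such as a permissions error would then surface from the fallback copy instead.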

Tests

tests/cache_s3.bats (all 16 pass via buildkite/plugin-tester, shellcheck -x clean):

  • conditional single-file save uses put-object --if-none-match
  • force overwrites instead of conditional create
  • concurrent writer (PreconditionFailed) is treated as success
  • fallback to copy when conditional create is unsupported
  • restore retries then succeeds; restore fails after exhausting retries

Backwards compatibility

No config changes required. Folder caches keep using sync; force behaviour is preserved; the only new (optional) knob is BUILDKITE_PLUGIN_S3_CACHE_DOWNLOAD_RETRIES.
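For illustration, a retry loop that consumes this knob might be shaped like the sketch below. Only the environment variable name and its default of 3 come from this PR; `restore_object` is a hypothetical name, and the sketch assumes the value counts retries after the first attempt.

```shell
# Sketch of a bounded retry around the download. restore_object is a
# hypothetical name; only the environment variable comes from this PR.
restore_object() {
  local src="$1" dest="$2" attempt=0 out
  # Assumption: the knob counts retries after the first attempt.
  local retries="${BUILDKITE_PLUGIN_S3_CACHE_DOWNLOAD_RETRIES:-3}"

  while :; do
    if out="$(aws s3 cp "$src" "$dest" 2>&1)"; then
      return 0
    fi
    case "$out" in
      *"did not match expected ETag"*)
        # Transient: the object was replaced mid-download. Retry if allowed.
        if [ "$attempt" -lt "$retries" ]; then
          attempt=$((attempt + 1))
          continue
        fi ;;
    esac
    # Any other error, or retries exhausted: surface the output and fail.
    printf '%s\n' "$out" >&2
    return 1
  done
}
```

A job that wants more headroom could then run `BUILDKITE_PLUGIN_S3_CACHE_DOWNLOAD_RETRIES=5 restore_object "s3://my-bucket/my-key" cache.tar`, where the bucket, key, and file name are placeholders.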


…Tag"

Cache keys are content-addressed, so parallel jobs running the same step
compute the same S3 key. The save path overwrote that object non-atomically
(`exists` check then `aws s3 cp`), and an overwrite landing during another
job's multipart download changed the object's ETag, aborting the download
with "Contents of stored object ... did not match expected ETag."

- save_cache: store single objects with a conditional create
  (`s3api put-object --if-none-match '*'`) so the first writer wins and the
  object is never overwritten. PreconditionFailed is treated as success.
  Falls back to a normal copy when conditional writes are unavailable or the
  object is too large for a single PUT; `force` still overwrites.
- restore_cache: retry the download on the transient ETag mismatch
  (BUILDKITE_PLUGIN_S3_CACHE_DOWNLOAD_RETRIES, default 3). Content-addressed
  keys mean the retried download returns identical contents.

Adds tests for conditional save, concurrent-writer skip, force overwrite,
fallback to copy, and restore retry/exhaustion.

Co-authored-by: Cursor <cursoragent@cursor.com>
vivster7 requested a review from a team as a code owner on May 29, 2026 at 23:17