Fix S3 cache race that breaks restores with "did not match expected ETag"#153
Open
vivster7 wants to merge 1 commit into
Open
Fix S3 cache race that breaks restores with "did not match expected ETag"#153vivster7 wants to merge 1 commit into
vivster7 wants to merge 1 commit into
Conversation
…Tag" Cache keys are content-addressed, so parallel jobs running the same step compute the same S3 key. The save path overwrote that object non-atomically (`exists` check then `aws s3 cp`), and an overwrite landing during another job's multipart download changed the object's ETag, aborting the download with "Contents of stored object ... did not match expected ETag." - save_cache: store single objects with a conditional create (`s3api put-object --if-none-match '*'`) so the first writer wins and the object is never overwritten. PreconditionFailed is treated as success. Falls back to a normal copy when conditional writes are unavailable or the object is too large for a single PUT; `force` still overwrites. - restore_cache: retry the download on the transient ETag mismatch (BUILDKITE_PLUGIN_S3_CACHE_DOWNLOAD_RETRIES, default 3). Content-addressed keys mean the retried download returns identical contents. Adds tests for conditional save, concurrent-writer skip, force overwrite, fallback to copy, and restore retry/exhaustion. Co-authored-by: Cursor <cursoragent@cursor.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
With the
s3backend, two agents running the same step in parallel can fail a cache restore with:Root cause is a race between a restore and a concurrent re-save of the same S3 key:
lib/shared.bashhashes the manifest), so parallel jobs compute an identical key.exists(alist-objects-v2) thenaws s3 cp. Two jobs can both see "not exists" and both upload — last write wins.aws s3 cp(botocores3transfer) HEADs the object, records its ETag, then issues ranged GETs withIfMatch=<etag>. An overwrite landing mid-download makes S3 return412 PreconditionFailed, which the CLI surfaces as the error above. This only affects objects pastmultipart_threshold(8 MB), i.e. large dependency caches.I reproduced this end-to-end against MinIO + AWS CLI v2.31.38 (deterministic stale-
If-Match→412, and a throttledcpoverwritten mid-download producing the exact error string).Fix
save_cache: store single objects with a conditional create —aws s3api put-object --if-none-match '*'. The first writer wins; later writers getPreconditionFailed, treated as success. The object is never overwritten, so its ETag stays stable. Falls back to a normal copy when conditional writes aren't available (older CLI / S3-compatible endpoint) or the object exceeds the single-PUTlimit.forcestill overwrites.restore_cache: retry the download on the transientdid not match expected ETagfailure, configurable viaBUILDKITE_PLUGIN_S3_CACHE_DOWNLOAD_RETRIES(default3). Content-addressed keys guarantee the retried download returns identical contents.Tests
tests/cache_s3.bats(all 16 pass viabuildkite/plugin-tester,shellcheck -xclean):put-object --if-none-matchforceoverwrites instead of conditional createPreconditionFailed) is treated as successBackwards compatibility
No config changes required. Folder caches keep using
sync;forcebehaviour is preserved; the only new (optional) knob isBUILDKITE_PLUGIN_S3_CACHE_DOWNLOAD_RETRIES.Made with Cursor