fix(sovereign-ci): per-PR CARGO_TARGET_DIR for test/lint/coverage/bench jobs #31
Merged
…ch jobs
Default /__w/<repo>/<repo>/target/ is shared across concurrent PR builds
on the same self-hosted runner; cargo's fingerprint directory is not
concurrent-safe and fails stochastically with "No such file or directory"
errors. 15 consecutive collisions observed on aprender PR #1019 today.
The aprender repo's own workspace-test job (ci.yml, task #134) already
mitigates the same pathology by mounting a per-PR target dir; port the
pattern into the fleet-reusable ci / test / lint / coverage / bench jobs
so every repo consuming sovereign-ci.yml gets the same isolation.
Volume: /mnt/nvme-raid0/targets/sovereign-ci-<repo>/<pr-or-branch>:/workspace/target
Env: CARGO_TARGET_DIR=/workspace/target (per job, not workflow-level,
to avoid leaking into the non-container security/provenance jobs).
Security job unchanged (no cargo build, no target dir needed).
Provenance job unchanged (GitHub-hosted, no container).
Gate job unchanged (no build).
Addresses: paiml/infra#75
Related: aprender task #163, aprender task #134, aprender PR #1019
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit to paiml/aprender that referenced this pull request on May 15, 2026
…ate race) (#1693)

## Root cause (five-whys)

1. CI fails with `couldn't create a temp dir: /workspace/target/debug/deps/...`
2. `deps/` was unlinked mid-build, then cargo tried to create a tempfile inside it
3. A concurrent process on the same host path is racing this run's cargo
4. `concurrency.cancel-in-progress: true` (ci.yml:22) cancels the old run when `update-branch` pushes a new commit, but cancellation is signal-based with a 30s SIGTERM→SIGKILL window. During those 30s the old cargo is still writing/cleaning files in `/workspace/target/...`
5. The mount was per-PR (`aprender-ci/${PR_OR_REF}`), so the new run mounted THE SAME host directory as the dying old run: they shared `/workspace/target/debug/deps/` on the host

## What the prior fix solved vs. what it introduced

The 2026-04-23 fleet fix (paiml/.github#31) moved sovereign-ci from a shared runner-wide target dir to per-PR isolation, solving CROSS-PR same-runner collisions. That worked. But the per-PR isolation INTRODUCED this cancel-corrupt-state collision for SAME-PR sequential runs (every `update-branch` or new push triggers a new run that mounts the SAME host path).

## Fix

Bump the mount path one level deeper:

`aprender-ci/${PR_OR_REF}` → `aprender-ci/${PR_OR_REF}/run-${GITHUB_RUN_ID}`

Now every CI run gets its own isolated target dir; no two cargo invocations ever share a host directory.

## Trade-offs

- **sccache**: unchanged. Lives on its own mount (`/home/noah/data/sccache`) and continues to dedupe across ALL runs of ALL PRs. This is the heavy lifter: typical 80%+ hit rate on warm cache.
- **cargo-incremental**: lost per new run. Cost is small because sccache already covers most of the rebuild surface; cargo-incremental is the per-crate metadata, not the codegen.
- **Disk**: per-run dirs accumulate under `aprender-ci/<PR>/`. Existing disk-guard hook deletes old PR dirs after merge (incl. all run subdirs). No new cleanup logic needed.

## Verified

- `python3 -c "import yaml; yaml.safe_load(open('.github/workflows/ci.yml'))"` → OK
- 4 mount lines updated (workspace-test: 3 docker steps + ownership-fix step)
- Inline comment block documents the cancel-corrupt-state race for future maintainers

## Impact

Once this lands, the recurring "No such file or directory (os error 2)" flakes hitting the queue should stop. Currently 4+ PRs blocked at the same defect simultaneously.

Refs Toyota Way andon: this is the second iteration of the per-PR-target fix; the 2026-04-23 fix was correct in direction (isolate from runner shared state) but didn't account for the cancel-in-progress race.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
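The effect of the deeper mount path can be sketched in a few lines of Python. This is an illustration only, not code from the workflow: the function names, the PR number `1234`, and the run IDs are hypothetical, and the `aprender-ci/...` prefix is copied as written in the commit message above.

```python
HOST_PREFIX = "aprender-ci"  # prefix as written in the commit message


def per_pr_dir(pr_or_ref: str) -> str:
    """Host directory under the earlier per-PR scheme."""
    return f"{HOST_PREFIX}/{pr_or_ref}"


def per_run_dir(pr_or_ref: str, run_id: int) -> str:
    """Host directory under the per-run scheme (one level deeper)."""
    return f"{HOST_PREFIX}/{pr_or_ref}/run-{run_id}"


if __name__ == "__main__":
    pr = "1234"                    # hypothetical PR number
    old_run, new_run = 1001, 1002  # hypothetical GITHUB_RUN_ID values

    # Per-PR: a cancelled run and its replacement resolve to the same path.
    print(per_pr_dir(pr) == per_pr_dir(pr))

    # Per-run: the two runs resolve to different paths.
    print(per_run_dir(pr, old_run), per_run_dir(pr, new_run))
```

Because each per-run directory stays nested under the per-PR directory, a cleanup that removes the PR directory would also remove its run subdirectories, which is consistent with the disk trade-off noted in the commit.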
## Summary

Port the per-PR `CARGO_TARGET_DIR` isolation pattern (already proven in aprender's `workspace-test` job) into the fleet-wide reusable `sovereign-ci.yml`, so every repo that consumes this workflow gets concurrency-safe builds on shared self-hosted runners.

Closes paiml/infra#75.
## Problem

The reusable `sovereign-ci.yml` `ci / test`, `ci / lint`, `ci / coverage`, and `ci / bench` jobs all default `CARGO_TARGET_DIR` to `/__w/<repo>/<repo>/target/`. When two or more PR builds for the same repo land on the same self-hosted runner at overlapping times, they write to the same target directory. Cargo's fingerprint directory is not concurrent-safe, producing cascading `No such file or directory` errors for half-written `.rmeta`/`.rlib` artifacts.

15 consecutive collisions were observed on aprender PR #1019 over 2026-04-22 → 2026-04-23, wedging SHIP-007 PARTIAL-discharge work behind infrastructure flakes.
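For readers unfamiliar with how this error looks from the process side, here is a minimal single-process Python sketch. It assumes the failure mode is a directory disappearing between one build's cleanup and another build's write. It does not reproduce cargo's internals, and the function name and the `shared-target-` temp prefix are hypothetical.

```python
import errno
import shutil
import tempfile
from pathlib import Path


def write_after_concurrent_cleanup() -> int:
    """Return the errno raised when writing into a directory that vanished."""
    root = Path(tempfile.mkdtemp(prefix="shared-target-"))
    deps = root / "debug" / "deps"
    deps.mkdir(parents=True)
    try:
        # Stand-in for build A removing the directory (for example a clean step).
        shutil.rmtree(deps)
        # Build B still assumes the directory exists and creates a temp file in it.
        try:
            tempfile.NamedTemporaryFile(dir=deps).close()
        except FileNotFoundError as exc:
            return exc.errno
        return 0
    finally:
        # Remove the scratch directory regardless of the outcome.
        shutil.rmtree(root, ignore_errors=True)


if __name__ == "__main__":
    code = write_after_concurrent_cleanup()
    print(code == errno.ENOENT)
```

In a real collision the two steps run in different processes, so whether the error appears depends on timing, which would be consistent with the stochastic failures described above.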
## Fix

Mirror aprender's `workspace-test` pattern (already proven in production, aprender task #134) into the 4 container-based jobs of the reusable workflow:

- Volume: `/mnt/nvme-raid0/targets/sovereign-ci-${{ inputs.repo }}/${{ github.event.pull_request.number || github.ref_name }}:/workspace/target`
- `env.CARGO_TARGET_DIR: /workspace/target`

Each PR gets its own fingerprint directory. No cross-PR contention. No change to build product or cache behavior.
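How the volume string resolves for a given event can be sketched in Python. This is a rough model of the `github.event.pull_request.number || github.ref_name` fallback, not the workflow itself: the function name is hypothetical, and the sample PR number `1020` and the ref names are illustrative values.

```python
from typing import Optional

HOST_ROOT = "/mnt/nvme-raid0/targets"   # host prefix from the PR description
CONTAINER_TARGET = "/workspace/target"  # value given to CARGO_TARGET_DIR


def per_pr_volume(repo: str, pr_number: Optional[int], ref_name: str) -> str:
    """Build the `host:container` volume string for one job.

    Models the `||` fallback: the PR number is used when present,
    otherwise the ref name is used.
    """
    pr_or_ref = str(pr_number) if pr_number else ref_name
    return f"{HOST_ROOT}/sovereign-ci-{repo}/{pr_or_ref}:{CONTAINER_TARGET}"


if __name__ == "__main__":
    # Two PRs on the same runner map to different host directories.
    print(per_pr_volume("aprender", 1019, "1019/merge"))
    print(per_pr_volume("aprender", 1020, "1020/merge"))  # hypothetical PR
    # A push to a branch has no PR number, so the ref name is used.
    print(per_pr_volume("aprender", None, "main"))
```

One thing to keep in mind under this model: a ref name that contains a slash would produce a nested host directory rather than a single path component.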
## Jobs touched

- `test`: full rationale comment inline (this is the canonical instance)
- `lint`: short reference comment (`# paiml/infra#75: per-PR CARGO_TARGET_DIR isolation — see test job.`)
- `coverage`: short reference comment
- `bench`: short reference comment

## Jobs NOT touched (intentional)

- `security` (cargo-audit): no build, no `/workspace/target` usage
- `provenance`: runs on `ubuntu-latest` (GitHub-hosted), no self-hosted runner contention
- `gate`: no build

## Validation

- `python3 -c "import yaml; yaml.safe_load(open('.github/workflows/sovereign-ci.yml'))"` → valid.
- The volume path uses only expressions (`${{ inputs.repo }}`, `${{ github.event.pull_request.number }}`, `${{ github.ref_name }}`) which resolve before container start, so they are safe in `container.volumes:`.
- Matches the pattern in aprender's `.github/workflows/ci.yml::workspace-test` (lines 52-56).

## Test plan

- Run `ci / test` with 2+ concurrent PR builds on the same runner.
- Each build uses its own `/mnt/nvme-raid0/targets/sovereign-ci-aprender/<pr>/` directory. No `No such file or directory` errors.

🤖 Generated with Claude Code