fix(sovereign-ci): per-PR CARGO_TARGET_DIR for test/lint/coverage/bench jobs by noahgift · Pull Request #31 · paiml/.github

noahgift · 2026-04-23T07:23:07Z

Summary

Port the per-PR CARGO_TARGET_DIR isolation pattern (already proven in aprender's workspace-test job) into the fleet-wide reusable sovereign-ci.yml, so every repo that consumes this workflow gets concurrency-safe builds on shared self-hosted runners.

Closes paiml/infra#75.

Problem

The ci / test, ci / lint, ci / coverage, and ci / bench jobs of the reusable sovereign-ci.yml all default CARGO_TARGET_DIR to /__w/<repo>/<repo>/target/. When two or more PR builds for the same repo land on the same self-hosted runner at overlapping times, they write to the same target directory. Cargo's fingerprint directory is not concurrency-safe, so the builds fail with cascading "No such file or directory" errors for half-written .rmeta/.rlib artifacts:

error: could not read dependency information for `nix` (/__w/aprender/aprender/target/debug/deps/libnix-....rmeta)
error: could not read dependency information for `wgpu_core` (/__w/aprender/aprender/target/debug/deps/libwgpu_core-....rmeta)
...

15 consecutive collisions were observed on aprender PR #1019 over 2026-04-22 → 2026-04-23, wedging SHIP-007 PARTIAL-discharge work behind infrastructure flakes.

Fix

Mirror aprender's workspace-test pattern (already proven in production — aprender task #134) into the 4 container-based jobs of the reusable workflow:

Mount a per-PR (or per-branch, for push-to-main) target directory: /mnt/nvme-raid0/targets/sovereign-ci-${{ inputs.repo }}/${{ github.event.pull_request.number || github.ref_name }}:/workspace/target
Set job-level env.CARGO_TARGET_DIR: /workspace/target

Each PR gets its own fingerprint directory. Zero contention. No change to build product or cache behavior.
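
To make the path selection concrete, here is a minimal Python sketch that models how the volume expression above resolves. It is an illustration only: the helper name `target_mount` and the sample PR numbers are hypothetical, and the real resolution is done by the GitHub Actions expression engine. The `||` operator there falls back to its right operand when the left is empty, which Python's truthiness check approximates.

```python
from typing import Optional

# Host-side root and container-side path, copied from the PR description.
HOST_ROOT = "/mnt/nvme-raid0/targets"
CONTAINER_TARGET = "/workspace/target"


def target_mount(repo: str, pr_number: Optional[int], ref_name: str) -> str:
    """Model `${{ github.event.pull_request.number || github.ref_name }}`.

    Hypothetical helper for illustration; not part of sovereign-ci.yml.
    """
    # On pull_request events the PR number is set; on push events it is
    # empty, so the expression falls back to the branch name.
    key = str(pr_number) if pr_number else ref_name
    return f"{HOST_ROOT}/sovereign-ci-{repo}/{key}:{CONTAINER_TARGET}"


if __name__ == "__main__":
    # Two concurrent PR builds of the same repo get distinct host directories.
    print(target_mount("aprender", 1019, "1019/merge"))
    print(target_mount("aprender", 1020, "1020/merge"))
    # A push to main has no PR number and uses the branch name instead.
    print(target_mount("aprender", None, "main"))
```

Because the key is the PR number, every run of the same PR resolves to the same host directory; only builds of different PRs (or of a branch push versus a PR) are separated.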

Jobs touched

test — full rationale comment inline (this is the canonical instance)
lint — short reference comment (# paiml/infra#75: per-PR CARGO_TARGET_DIR isolation — see test job.)
coverage — short reference comment
bench — short reference comment

Jobs NOT touched (intentional)

security (cargo-audit) — no build, no /workspace/target usage
provenance — runs on ubuntu-latest (GitHub-hosted), no self-hosted runner contention
gate — no build

Validation

python3 -c "import yaml; yaml.safe_load(open('.github/workflows/sovereign-ci.yml'))" → valid.
Volume path interpolation uses workflow-config-time expressions (${{ inputs.repo }}, ${{ github.event.pull_request.number }}, ${{ github.ref_name }}) which resolve before container start — safe in container.volumes:.
Mirrors an already-production pattern at aprender .github/workflows/ci.yml::workspace-test (lines 52-56).
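
The syntax check above only proves the file parses. A slightly stronger check, sketched below, would confirm that each touched job carries both the mount and the env var. This assumes the job keys are literally `test`, `lint`, `coverage`, and `bench` and that `container` is a mapping with a `volumes` list; neither is confirmed by this PR text, and `missing_isolation` is a hypothetical helper name.

```python
JOBS = ("test", "lint", "coverage", "bench")  # assumed job keys
TARGET = "/workspace/target"


def missing_isolation(workflow: dict) -> list:
    """Return the jobs that lack the target-dir mount or the env var."""
    bad = []
    jobs = workflow.get("jobs") or {}
    for name in JOBS:
        job = jobs.get(name) or {}
        env_ok = (job.get("env") or {}).get("CARGO_TARGET_DIR") == TARGET
        container = job.get("container")
        # `container:` may also be a bare image string, which has no volumes.
        volumes = container.get("volumes") or [] if isinstance(container, dict) else []
        mount_ok = any(str(v).endswith(":" + TARGET) for v in volumes)
        if not (env_ok and mount_ok):
            bad.append(name)
    return bad


if __name__ == "__main__":
    # Inline stand-in for the parsed workflow; in practice this would be the
    # result of yaml.safe_load() on .github/workflows/sovereign-ci.yml.
    volume = (
        "/mnt/nvme-raid0/targets/sovereign-ci-${{ inputs.repo }}/"
        "${{ github.event.pull_request.number || github.ref_name }}:" + TARGET
    )
    sample = {
        "jobs": {
            "test": {"container": {"volumes": [volume]},
                     "env": {"CARGO_TARGET_DIR": TARGET}},
            "lint": {"container": {"volumes": [volume]}},  # env var missing
        }
    }
    print(missing_isolation(sample))  # prints ['lint', 'coverage', 'bench']
```

The check works on the raw, uninterpolated volume string, so it can run locally without a GitHub Actions context.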

Test plan

Merge this PR and let aprender rerun PR #1019 ci / test with 2+ concurrent PR builds on the same runner.
Expect: each PR writes to its own /mnt/nvme-raid0/targets/sovereign-ci-aprender/<pr>/ directory, with no "No such file or directory" errors.
Confirm consumers in the fleet that share paiml/.github still build (spot-check: realizar, trueno, pmat if applicable).

🤖 Generated with Claude Code

…ch jobs

Default /__w/<repo>/<repo>/target/ is shared across concurrent PR builds on the same self-hosted runner; cargo's fingerprint directory is not concurrent-safe and fails stochastically with "No such file or directory" errors. 15 consecutive collisions observed on aprender PR #1019 today.

The aprender repo's own workspace-test job (ci.yml, task #134) already mitigates the same pathology by mounting a per-PR target dir; port the pattern into the fleet-reusable ci / test / lint / coverage / bench jobs so every repo consuming sovereign-ci.yml gets the same isolation.

Volume: /mnt/nvme-raid0/targets/sovereign-ci-<repo>/<pr-or-branch>:/workspace/target
Env: CARGO_TARGET_DIR=/workspace/target (per job, not workflow-level, to avoid leaking into the non-container security/provenance jobs).

Security job unchanged (no cargo build, no target dir needed).
Provenance job unchanged (GitHub-hosted, no container).
Gate job unchanged (no build).

Addresses: paiml/infra#75
Related: aprender task #163, aprender task #134, aprender PR #1019
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ate race) (#1693)

## Root cause (five-whys)

1. CI fails with `couldn't create a temp dir: /workspace/target/debug/deps/...`
2. `deps/` was unlinked mid-build, then cargo tried to create a tempfile inside it
3. A concurrent process on the same host path is racing this run's cargo
4. `concurrency.cancel-in-progress: true` (ci.yml:22) cancels the old run when `update-branch` pushes a new commit, but cancellation is signal-based with a 30s SIGTERM→SIGKILL window. During those 30s the old cargo is still writing/cleaning files in `/workspace/target/...`
5. The mount was per-PR (`aprender-ci/${PR_OR_REF}`), so the new run mounted THE SAME host directory as the dying old run; they shared `/workspace/target/debug/deps/` on the host

## What the prior fix solved vs. what it introduced

The 2026-04-23 fleet fix (paiml/.github#31) moved sovereign-ci from a shared runner-wide target dir to per-PR isolation, solving CROSS-PR same-runner collisions. That worked. But the per-PR isolation INTRODUCED this cancel-corrupt-state collision for SAME-PR sequential runs (every `update-branch` or new push triggers a new run that mounts the SAME host path).

## Fix

Bump the mount path one level deeper: `aprender-ci/${PR_OR_REF}` → `aprender-ci/${PR_OR_REF}/run-${GITHUB_RUN_ID}`

Now every CI run gets its own isolated target dir; no two cargo invocations ever share a host directory.

## Trade-offs

- **sccache**: unchanged. Lives on its own mount (`/home/noah/data/sccache`) and continues to dedupe across ALL runs of ALL PRs. This is the heavy lifter: typical 80%+ hit rate on warm cache.
- **cargo-incremental**: lost per new run. Cost is small because sccache already covers most of the rebuild surface; cargo-incremental is the per-crate metadata, not the codegen.
- **Disk**: per-run dirs accumulate under `aprender-ci/<PR>/`. Existing disk-guard hook deletes old PR dirs after merge (incl. all run subdirs). No new cleanup logic needed.

## Verified

- `python3 -c "import yaml; yaml.safe_load(open('.github/workflows/ci.yml'))"` → OK
- 4 mount lines updated (workspace-test: 3 docker steps + ownership-fix step)
- Inline comment block documents the cancel-corrupt-state race for future maintainers

## Impact

Once this lands, the recurring "No such file or directory (os error 2)" flakes hitting the queue should stop. Currently 4+ PRs blocked at the same defect simultaneously.

Refs Toyota Way andon: this is the second iteration of the per-PR-target fix; the 2026-04-23 fix was correct in direction (isolate from runner shared state) but didn't account for the cancel-in-progress race.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
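
The difference between the per-PR and per-run schemes in the commit message above can be sketched in a few lines of Python. This models only the path suffixes quoted there (`aprender-ci/${PR_OR_REF}` and `aprender-ci/${PR_OR_REF}/run-${GITHUB_RUN_ID}`); the function names and the run IDs are hypothetical sample values.

```python
def per_pr_dir(pr_or_ref: str) -> str:
    """Mount suffix before the change: shared by every run of one PR."""
    return f"aprender-ci/{pr_or_ref}"


def per_run_dir(pr_or_ref: str, run_id: int) -> str:
    """Mount suffix after the change: one directory per workflow run."""
    return f"aprender-ci/{pr_or_ref}/run-{run_id}"


if __name__ == "__main__":
    # Hypothetical run IDs: a cancelled run and its replacement on one PR.
    runs = [1001, 1002]
    # Per-PR: both runs collapse onto one host directory, so the dying run
    # and the new run can race inside it.
    print({per_pr_dir("1019") for _ in runs})
    # prints {'aprender-ci/1019'}
    # Per-run: each run gets its own directory under the PR's directory.
    print(sorted(per_run_dir("1019", r) for r in runs))
    # prints ['aprender-ci/1019/run-1001', 'aprender-ci/1019/run-1002']
```

One caveat worth keeping in mind: GitHub documents the run ID as unchanged when a workflow run is re-run, so a manual re-run would presumably reuse the same directory as the original attempt.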

noahgift merged commit 136863e into main Apr 23, 2026
2 checks passed

noahgift deleted the fix/per-pr-target-dir branch April 23, 2026 07:24

noahgift mentioned this pull request Apr 23, 2026

docs(ship-two-001): v2.30.0 — SHIP-007 merged + fleet CI hardened + session wrap paiml/aprender#1024

Closed

3 tasks

noahgift mentioned this pull request May 15, 2026

fix(ci)!: per-run target-dir — P0 STOP THE LINE on cancel-corrupt-state race paiml/aprender#1693

Merged

noahgift merged 1 commit into main from fix/per-pr-target-dir
