fix(sovereign-ci): per-PR CARGO_TARGET_DIR for test/lint/coverage/bench jobs#31

Merged
noahgift merged 1 commit into main from fix/per-pr-target-dir
Apr 23, 2026

@noahgift
Contributor

Summary

Port the per-PR CARGO_TARGET_DIR isolation pattern (already proven in aprender's workspace-test job) into the fleet-wide reusable sovereign-ci.yml, so every repo that consumes this workflow gets concurrency-safe builds on shared self-hosted runners.

Closes paiml/infra#75.

Problem

The reusable sovereign-ci.yml ci / test, ci / lint, ci / coverage, and ci / bench jobs all default CARGO_TARGET_DIR to /__w/<repo>/<repo>/target/. When two or more PR builds for the same repo land on the same self-hosted runner at overlapping times, they write to the same target directory. Cargo's fingerprint directory is not concurrent-safe, producing cascading No such file or directory errors for half-written .rmeta/.rlib artifacts:

error: could not read dependency information for `nix` (/__w/aprender/aprender/target/debug/deps/libnix-....rmeta)
error: could not read dependency information for `wgpu_core` (/__w/aprender/aprender/target/debug/deps/libwgpu_core-....rmeta)
...

15 consecutive collisions were observed on aprender PR #1019 over 2026-04-22 → 2026-04-23, wedging SHIP-007 PARTIAL-discharge work behind infrastructure flakes.

Fix

Mirror aprender's workspace-test pattern (already proven in production — aprender task #134) into the 4 container-based jobs of the reusable workflow:

  1. Mount a per-PR (or per-branch, for push-to-main) target directory: /mnt/nvme-raid0/targets/sovereign-ci-${{ inputs.repo }}/${{ github.event.pull_request.number || github.ref_name }}:/workspace/target
  2. Set job-level env.CARGO_TARGET_DIR: /workspace/target

Each PR gets its own fingerprint directory. Zero contention. No change to build product or cache behavior.
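A minimal sketch of how the two path schemes behave, assuming the expression `github.event.pull_request.number || github.ref_name` falls back to the ref name when there is no PR number. The helper names `default_target_dir` and `per_pr_volume` are hypothetical; in the real workflow GitHub Actions performs the interpolation.

```python
from typing import Optional

# Host root taken from the volume spec in the Fix section.
TARGETS_ROOT = "/mnt/nvme-raid0/targets"


def default_target_dir(repo: str) -> str:
    """Default target dir in the runner workspace; it depends only on the repo."""
    return f"/__w/{repo}/{repo}/target/"


def per_pr_volume(repo: str, pr_number: Optional[int], ref_name: str) -> str:
    """Hypothetical mirror of the volume expression used in the workflow.

    Uses the PR number when present, otherwise the ref name (push-to-main).
    """
    key = str(pr_number) if pr_number else ref_name
    return f"{TARGETS_ROOT}/sovereign-ci-{repo}/{key}:/workspace/target"


if __name__ == "__main__":
    # Two PRs for the same repo share the default directory...
    print(default_target_dir("aprender"))
    # ...but get distinct host directories under the per-PR scheme.
    print(per_pr_volume("aprender", 1019, "1019/merge"))
    print(per_pr_volume("aprender", 1020, "1020/merge"))
    # A push to main has no PR number, so the ref name is used.
    print(per_pr_volume("aprender", None, "main"))
```

The container-side path is the same for every job (`/workspace/target`); only the host side of the mount varies, which is why `CARGO_TARGET_DIR` can be a constant.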

Jobs touched

  • test — full rationale comment inline (this is the canonical instance)
  • lint — short reference comment (# paiml/infra#75: per-PR CARGO_TARGET_DIR isolation — see test job.)
  • coverage — short reference comment
  • bench — short reference comment

Jobs NOT touched (intentional)

  • security (cargo-audit) — no build, no /workspace/target usage
  • provenance — runs on ubuntu-latest (GitHub-hosted), no self-hosted runner contention
  • gate — no build

Validation

  • python3 -c "import yaml; yaml.safe_load(open('.github/workflows/sovereign-ci.yml'))" → valid.
  • Volume path interpolation uses workflow-config-time expressions (${{ inputs.repo }}, ${{ github.event.pull_request.number }}, ${{ github.ref_name }}) which resolve before container start — safe in container.volumes:.
  • Mirrors an already-production pattern at aprender .github/workflows/ci.yml::workspace-test (lines 52-56).
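The `yaml.safe_load` check above only proves the file parses. A slightly stronger check, sketched here under assumptions, would confirm that each of the four jobs carries both the env var and a mount ending in `:/workspace/target`. The job ids (`test`, `lint`, `coverage`, `bench`) and the function name `jobs_missing_isolation` are assumptions for illustration; the function takes an already-parsed workflow dict so it has no dependency beyond the standard library.

```python
from typing import Any, Dict, List

# Assumed job ids, taken from the "Jobs touched" list.
ISOLATED_JOBS = ("test", "lint", "coverage", "bench")
TARGET_DIR = "/workspace/target"


def jobs_missing_isolation(workflow: Dict[str, Any]) -> List[str]:
    """Return the jobs that lack the env var or the target-dir volume mount."""
    missing = []
    jobs = workflow.get("jobs") or {}
    for name in ISOLATED_JOBS:
        job = jobs.get(name) or {}
        env = job.get("env") or {}
        env_ok = env.get("CARGO_TARGET_DIR") == TARGET_DIR
        container = job.get("container")
        # `container` may also be a plain image string, which has no volumes.
        volumes = (container.get("volumes") or []) if isinstance(container, dict) else []
        mount_ok = any(str(v).endswith(":" + TARGET_DIR) for v in volumes)
        if not (env_ok and mount_ok):
            missing.append(name)
    return missing
```

With PyYAML installed, the parsed dict could come from `yaml.safe_load(open('.github/workflows/sovereign-ci.yml'))`, and an empty return value would indicate all four jobs are covered.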

Test plan

  • Merge this PR and let aprender rerun PR #1019 ci / test with 2+ concurrent PR builds on the same runner.
  • Expect: each PR writes to its own /mnt/nvme-raid0/targets/sovereign-ci-aprender/<pr>/ directory. No No such file or directory errors.
  • Confirm consumers in the fleet that share paiml/.github still build (spot-check: realizar, trueno, pmat if applicable).

🤖 Generated with Claude Code

…ch jobs

Default /__w/<repo>/<repo>/target/ is shared across concurrent PR builds
on the same self-hosted runner; cargo's fingerprint directory is not
concurrent-safe and fails stochastically with "No such file or directory"
errors. 15 consecutive collisions observed on aprender PR #1019 today.

The aprender repo's own workspace-test job (ci.yml, task #134) already
mitigates the same pathology by mounting a per-PR target dir; port the
pattern into the fleet-reusable ci / test / lint / coverage / bench jobs
so every repo consuming sovereign-ci.yml gets the same isolation.

Volume: /mnt/nvme-raid0/targets/sovereign-ci-<repo>/<pr-or-branch>:/workspace/target
Env:    CARGO_TARGET_DIR=/workspace/target (per job, not workflow-level,
        to avoid leaking into the non-container security/provenance jobs).

Security job unchanged (no cargo build, no target dir needed).
Provenance job unchanged (GitHub-hosted, no container).
Gate job unchanged (no build).

Addresses: paiml/infra#75
Related: aprender task #163, aprender task #134, aprender PR #1019

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift merged commit 136863e into main Apr 23, 2026
2 checks passed
@noahgift noahgift deleted the fix/per-pr-target-dir branch April 23, 2026 07:24
noahgift added a commit to paiml/aprender that referenced this pull request May 15, 2026
…ate race) (#1693)

## Root cause (five-whys)

1. CI fails with `couldn't create a temp dir: /workspace/target/debug/deps/...`
2. `deps/` was unlinked mid-build, then cargo tried to create a tempfile inside it
3. A concurrent process on the same host path is racing this run's cargo
4. `concurrency.cancel-in-progress: true` (ci.yml:22) cancels the old run when
   `update-branch` pushes a new commit — but cancellation is signal-based with
   a 30s SIGTERM→SIGKILL window. During those 30s the old cargo is still
   writing/cleaning files in `/workspace/target/...`
5. The mount was per-PR (`aprender-ci/${PR_OR_REF}`), so the new run mounted
   THE SAME host directory as the dying old run — they shared
   `/workspace/target/debug/deps/` on the host

## What the prior fix solved vs. what it introduced

The 2026-04-23 fleet fix (paiml/.github#31) moved sovereign-ci from a shared
runner-wide target dir to per-PR isolation, solving CROSS-PR same-runner
collisions. That worked. But the per-PR isolation INTRODUCED this
cancel-corrupt-state collision for SAME-PR sequential runs (every `update-branch`
or new push triggers a new run that mounts the SAME host path).

## Fix

Bump the mount path one level deeper:
  `aprender-ci/${PR_OR_REF}` → `aprender-ci/${PR_OR_REF}/run-${GITHUB_RUN_ID}`

Now every CI run gets its own isolated target dir; no two cargo invocations
ever share a host directory.
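A minimal sketch of why the extra path level removes the race, assuming a cancelled run and its replacement for the same PR have different run IDs. The helper names and the run IDs below are hypothetical; the real path is built from `${PR_OR_REF}` and `${GITHUB_RUN_ID}` in the workflow.

```python
def per_pr_mount(pr_or_ref: str) -> str:
    """Old scheme: one host directory per PR (or ref)."""
    return f"aprender-ci/{pr_or_ref}"


def per_run_mount(pr_or_ref: str, run_id: int) -> str:
    """New scheme: one host directory per CI run of that PR (or ref)."""
    return f"aprender-ci/{pr_or_ref}/run-{run_id}"


if __name__ == "__main__":
    pr = "1693"
    dying_run, new_run = 111, 112  # hypothetical run IDs

    # Old scheme: the dying run and the new run resolve to the same path.
    print(per_pr_mount(pr) == per_pr_mount(pr))

    # New scheme: each run resolves to its own subdirectory of the PR dir.
    print(per_run_mount(pr, dying_run))
    print(per_run_mount(pr, new_run))
```

Because the per-run directories are nested under the per-PR directory, anything that removes the PR directory also removes every run subdirectory.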

## Trade-offs

- **sccache**: unchanged. Lives on its own mount (`/home/noah/data/sccache`)
  and continues to dedupe across ALL runs of ALL PRs. This is the heavy
  lifter — typical 80%+ hit rate on warm cache.
- **cargo-incremental**: lost per new run. Cost is small because sccache
  already covers most of the rebuild surface; cargo-incremental is the
  per-crate metadata, not the codegen.
- **Disk**: per-run dirs accumulate under `aprender-ci/<PR>/`. Existing
  disk-guard hook deletes old PR dirs after merge (incl. all run subdirs).
  No new cleanup logic needed.

## Verified

- `python3 -c "import yaml; yaml.safe_load(open('.github/workflows/ci.yml'))"` → OK
- 4 mount lines updated (workspace-test: 3 docker steps + ownership-fix step)
- Inline comment block documents the cancel-corrupt-state race for future maintainers

## Impact

Once this lands, the recurring "No such file or directory (os error 2)"
flakes hitting the queue should stop. Currently 4+ PRs blocked at the
same defect simultaneously.

Refs Toyota Way andon: this is the second iteration of the per-PR-target
fix; the 2026-04-23 fix was correct in direction (isolate from runner
shared state) but didn't account for the cancel-in-progress race.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>