Symptom
The Rust Tests (agent-relay-broker) (macos-latest) job in .github/workflows/test.yml:133 intermittently fails on PRs that touch zero Rust code. The Ubuntu matrix branch of the same job passes; the separate Rust Tests (macos-latest) job (rust-ci.yml) also passes. The failures pre-date this PR — see e.g. dd907ac chore(ci): retrigger flaky macOS Rust Tests, which retriggered exactly this job and called out the timing-sensitive delivery_retry_transient_blip_emits_failed_event_for_present_worker.
Most recent flake: #973 commit 138d8e4, job 77543979753. PR had only TypeScript-side changes; same job on Ubuntu passed.
Root causes (two independent classes)
1. Tests spawn real subprocesses with timing assumptions.
crates/broker/src/runtime/tests.rs uses tokio::process::Command::new("cat") (around line 60) and several tests assert on bounded timing windows — e.g. "after SIGKILL, the next write to the child's stdin must surface as an error within X ms." macOS's process scheduling, signal delivery, and pipe-EPIPE timing are noticeably slower and less deterministic than Linux, so the X-ms window blows on macOS runners under load.
2. Tests mutate the process env under an in-process mutex.
Multiple tests do std::env::set_var / remove_var against shared env (AGENT_RELAY_STARTUP_ERROR_CODE, AGENT_RELAY_DELIVERY_RETRY_MS, AGENT_RELAY_HTTP_API_*_TIMEOUT_MS, etc.) inside env_test_lock (crates/broker/src/runtime/tests.rs:43). That mutex only serializes tests that take it — any other test reading env at the same time still races. std::env::set_var is unsafe fn in Rust 2024 precisely because libc::setenv/getenv are not thread-safe on macOS.
Ranked fixes
|
Fix |
Effort |
What it solves |
| A |
Wrap cargo test --release in nick-fields/retry@v2 (or gh run rerun --failed) on the macOS broker job. |
~10 lines of YAML. |
Hides the noise. Doesn't fix the underlying tests. |
| B |
Run cargo test --release -- --test-threads=1 on macOS only (matrix branch). |
~5 lines of YAML. |
Kills the env-mutation races. Doesn't help subprocess-timing flakes. ~2-3× slower. |
| C |
Refactor: replace real-subprocess tests with a fake WorkerHandle that emits deterministic write-error events; replace std::env::set_var with a BrokerConfig struct threaded through call sites. |
~1 week of broker work. |
Fully eliminates both flake classes. |
| D |
Add serial_test::serial to env-mutating tests + temp-env for scoped env writes (no production-code change). |
~half a day. |
Fixes env races safely. Subprocess-timing flakes remain. |
Recommendation
Land A as a stopgap immediately so red CI stops blocking PRs that don't touch Rust. Then take C in stages as the long-term fix — the subprocess-timing class and the env-mutation class can be tackled independently, in either order.
If C is too much, D is a clean intermediate that removes the env class without any production code touched.
Out of scope here
This issue captures the diagnosis only; the actual fix is its own work and shouldn't ride on unrelated PRs.
Symptom
The
Rust Tests (agent-relay-broker) (macos-latest)job in.github/workflows/test.yml:133intermittently fails on PRs that touch zero Rust code. The Ubuntu matrix branch of the same job passes; the separateRust Tests (macos-latest)job (rust-ci.yml) also passes. The failures pre-date this PR — see e.g.dd907ac chore(ci): retrigger flaky macOS Rust Tests, which retriggered exactly this job and called out the timing-sensitivedelivery_retry_transient_blip_emits_failed_event_for_present_worker.Most recent flake: #973 commit
138d8e4, job 77543979753. PR had only TypeScript-side changes; same job on Ubuntu passed.Root causes (two independent classes)
1. Tests spawn real subprocesses with timing assumptions.
crates/broker/src/runtime/tests.rsusestokio::process::Command::new("cat")(around line 60) and several tests assert on bounded timing windows — e.g. "afterSIGKILL, the next write to the child's stdin must surface as an error within X ms." macOS's process scheduling, signal delivery, and pipe-EPIPE timing are noticeably slower and less deterministic than Linux, so the X-ms window blows on macOS runners under load.2. Tests mutate the process env under an in-process mutex.
Multiple tests do
std::env::set_var/remove_varagainst shared env (AGENT_RELAY_STARTUP_ERROR_CODE,AGENT_RELAY_DELIVERY_RETRY_MS,AGENT_RELAY_HTTP_API_*_TIMEOUT_MS, etc.) insideenv_test_lock(crates/broker/src/runtime/tests.rs:43). That mutex only serializes tests that take it — any other test reading env at the same time still races.std::env::set_varisunsafe fnin Rust 2024 precisely becauselibc::setenv/getenvare not thread-safe on macOS.Ranked fixes
cargo test --releaseinnick-fields/retry@v2(orgh run rerun --failed) on the macOS broker job.cargo test --release -- --test-threads=1on macOS only (matrix branch).WorkerHandlethat emits deterministic write-error events; replacestd::env::set_varwith aBrokerConfigstruct threaded through call sites.serial_test::serialto env-mutating tests +temp-envfor scoped env writes (no production-code change).Recommendation
Land A as a stopgap immediately so red CI stops blocking PRs that don't touch Rust. Then take C in stages as the long-term fix — the subprocess-timing class and the env-mutation class can be tackled independently, in either order.
If C is too much, D is a clean intermediate that removes the env class without any production code touched.
Out of scope here
This issue captures the diagnosis only; the actual fix is its own work and shouldn't ride on unrelated PRs.