Skip to content

Flaky Rust Tests (agent-relay-broker) (macos-latest) — systemic fix #980

@willwashburn

Description

@willwashburn

Symptom

The Rust Tests (agent-relay-broker) (macos-latest) job in .github/workflows/test.yml:133 intermittently fails on PRs that touch zero Rust code. The Ubuntu matrix branch of the same job passes; the separate Rust Tests (macos-latest) job (rust-ci.yml) also passes. The failures pre-date this PR — see e.g. dd907ac chore(ci): retrigger flaky macOS Rust Tests, which retriggered exactly this job and called out the timing-sensitive delivery_retry_transient_blip_emits_failed_event_for_present_worker.

Most recent flake: #973 commit 138d8e4, job 77543979753. PR had only TypeScript-side changes; same job on Ubuntu passed.

Root causes (two independent classes)

1. Tests spawn real subprocesses with timing assumptions.

crates/broker/src/runtime/tests.rs uses tokio::process::Command::new("cat") (around line 60) and several tests assert on bounded timing windows — e.g. "after SIGKILL, the next write to the child's stdin must surface as an error within X ms." macOS's process scheduling, signal delivery, and pipe-EPIPE timing are noticeably slower and less deterministic than Linux, so the X-ms window blows on macOS runners under load.

2. Tests mutate the process env under an in-process mutex.

Multiple tests do std::env::set_var / remove_var against shared env (AGENT_RELAY_STARTUP_ERROR_CODE, AGENT_RELAY_DELIVERY_RETRY_MS, AGENT_RELAY_HTTP_API_*_TIMEOUT_MS, etc.) inside env_test_lock (crates/broker/src/runtime/tests.rs:43). That mutex only serializes tests that take it — any other test reading env at the same time still races. std::env::set_var is unsafe fn in Rust 2024 precisely because libc::setenv/getenv are not thread-safe on macOS.

Ranked fixes

Fix Effort What it solves
A Wrap cargo test --release in nick-fields/retry@v2 (or gh run rerun --failed) on the macOS broker job. ~10 lines of YAML. Hides the noise. Doesn't fix the underlying tests.
B Run cargo test --release -- --test-threads=1 on macOS only (matrix branch). ~5 lines of YAML. Kills the env-mutation races. Doesn't help subprocess-timing flakes. ~2-3× slower.
C Refactor: replace real-subprocess tests with a fake WorkerHandle that emits deterministic write-error events; replace std::env::set_var with a BrokerConfig struct threaded through call sites. ~1 week of broker work. Fully eliminates both flake classes.
D Add serial_test::serial to env-mutating tests + temp-env for scoped env writes (no production-code change). ~half a day. Fixes env races safely. Subprocess-timing flakes remain.

Recommendation

Land A as a stopgap immediately so red CI stops blocking PRs that don't touch Rust. Then take C in stages as the long-term fix — the subprocess-timing class and the env-mutation class can be tackled independently, in either order.

If C is too much, D is a clean intermediate that removes the env class without any production code touched.

Out of scope here

This issue captures the diagnosis only; the actual fix is its own work and shouldn't ride on unrelated PRs.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions