
test: de-flake watchdog + pheromone-scaling assertions; reconcile TEST_INVENTORY #16

Merged
Swately merged 4 commits into main from test/deflake-and-inventory-reconcile
May 21, 2026

@Swately (Owner) commented May 21, 2026

Summary

Found via an empirical flake loop (full suite ×10 under ctest -j4 on WSL2/Linux,
matching the CI runner), then fixed deterministically:

  • orchestration_test: watchdog §4 false-negative (the 100 ms detection budget was
    too tight under -j, so the monitor thread could be starved); §3 false-positive
    (500 ms node timeout vs stretched inter-heartbeat sleeps). Both now use a generous
    2 s limit; both still assert the watchdog's real behaviour.
  • stigmergy_pheromone_test §10 — a hard scaling-shape floor (r4 >= 2.0× r1)
    is a perf claim, not correctness; it flaked under ctest -j even off-CI. Now
    asserts functional throughput only; scaling shape stays in bench/.

Verified 0/18 flaky under ctest -j4 after the fixes (orchestration_test was 1/10
and stigmergy_pheromone_test 1/12 before); both pass on WSL gcc-13 + Windows MinGW.

Also reconciles TEST_INVENTORY.md with the actual code: 4 microbenches were
already migrated to the v2 harness (warmup + escape(); the 0xDEADBEEF hacks are
gone) but still tagged Tier C — bumped C→B; the comparison verdicts now defer to
BENCHMARK_FAIRNESS.md (the SoT) instead of describing the old broken state.

Test plan

  • WSL gcc-13 + Windows MinGW: both tests pass
  • full suite 0/18 flaky under ctest -j4
  • CI (gcc-13 + clang-18, 2-vCPU) green
  • lint-docs green; no include/ change so doc-sync is unaffected

🤖 Generated with Claude Code

Swately and others added 4 commits May 21, 2026 12:58
… under ctest -j

Found by an empirical flake loop (full suite x10 under `ctest -j4` on WSL2/Linux,
matching CI): two tests fail rarely under -j oversubscription.

orchestration_test §4 (watchdog timeout): a missed heartbeat must be detected by
the monitor thread, which the test waited only 100 ms for — under -j contention
on a 2-vCPU box that thread can be descheduled longer, a false-negative flake.
Use a generous 2 s deadline (loop still exits the instant the fault fires, ~12 ms
normally). §3 (heartbeat keeps node alive): bump the node timeout 500 ms -> 2 s so
a stretched inter-heartbeat sleep can't trip a false-positive fault. Neither masks
anything — both still assert the watchdog's actual behaviour.
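The deadline pattern described above can be sketched roughly as follows. This is a minimal illustration, not the test's actual code: `wait_for_flag` and `fault_detected_within` are hypothetical names, and a plain `std::atomic<bool>` stands in for whatever fault signal the real watchdog exposes. The point is that a generous deadline only bounds the worst case; the loop still returns as soon as the flag is observed.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical helper (not from the PR): poll `fired` until it becomes true
// or `deadline` elapses. Returns true as soon as the flag is observed, so a
// large deadline does not slow down the normal, fast path.
inline bool wait_for_flag(const std::atomic<bool>& fired,
                          std::chrono::milliseconds deadline,
                          std::chrono::milliseconds poll = std::chrono::milliseconds(1)) {
    const auto stop = std::chrono::steady_clock::now() + deadline;
    while (std::chrono::steady_clock::now() < stop) {
        if (fired.load(std::memory_order_acquire)) return true;
        std::this_thread::sleep_for(poll);
    }
    // One last check in case the flag was raised during the final sleep.
    return fired.load(std::memory_order_acquire);
}

// Stand-in for the monitor thread (an assumption for illustration): it raises
// the flag after `delay`, simulating a fault detected late under contention.
inline bool fault_detected_within(std::chrono::milliseconds delay,
                                  std::chrono::milliseconds deadline) {
    std::atomic<bool> fired{false};
    std::thread monitor([&] {
        std::this_thread::sleep_for(delay);
        fired.store(true, std::memory_order_release);
    });
    const bool ok = wait_for_flag(fired, deadline);
    monitor.join();
    return ok;
}
```

With a 2 s deadline, a detection that normally takes around 12 ms still returns in about that time; only a genuinely missing fault pays the full 2 s before the assertion fails.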

stigmergy_pheromone_test §10: asserted a hard scaling-SHAPE floor (r4 >= 2.0x r1)
on non-CI machines. That is a PERFORMANCE claim, inherently contention-sensitive,
and it flaked under `ctest -j` even on a dev box (it was only skipped when CI=true).
The unit test now asserts FUNCTIONAL throughput only (each thread count produces
work); the scaling shape is measured properly, with affinity pinning, in bench/ —
its correct home. Drops the CI-vs-dev branch entirely → deterministic everywhere.
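A functional-only check of the kind described here might look like the sketch below. It is an assumption-laden illustration: `run_workload` is a hypothetical stand-in that counts atomic increments rather than exercising the real pheromone code, and the thread counts 1, 2 and 4 are chosen only to echo the r1/r4 naming above. What matters is that the assertion is about work being produced at every thread count, with no ratio between runs.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <initializer_list>
#include <thread>
#include <vector>

// Hypothetical stand-in workload: each of `threads` threads performs
// `ops_per_thread` atomic increments. Returns the total completed operations.
inline std::uint64_t run_workload(unsigned threads, std::uint64_t ops_per_thread) {
    std::atomic<std::uint64_t> done{0};
    std::vector<std::thread> pool;
    pool.reserve(threads);
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&done, ops_per_thread] {
            for (std::uint64_t i = 0; i < ops_per_thread; ++i)
                done.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& th : pool) th.join();
    return done.load();
}

// Functional assertion only: every thread count completes all of its work.
// No claim is made about how throughput scales from one count to the next.
inline bool every_thread_count_produces_work() {
    for (unsigned n : {1u, 2u, 4u}) {
        if (run_workload(n, 1000) != static_cast<std::uint64_t>(n) * 1000u) return false;
    }
    return true;
}
```

Because nothing here depends on timing, the result is the same whether the suite runs alone or oversubscribed under `ctest -j`.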

Verified: full suite 0/18 flaky under `ctest -j4` after the fixes (orchestration
was 1/10, stigmergy_pheromone 1/12 before); both pass on WSL gcc-13 + Windows MinGW.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…proved) code

The §3.1 matrix still tagged bench_ring_channel / bench_circuit_breaker /
bench_frame_arena / bench_hal_primitives as Tier C with "add warmup / replace
0xDEADBEEF DCE hack" — but the V1 harness migration already did that (verified in
source: all use bench::measure_repeated + bench::escape(); the hacks are gone).
Bump them C->B to match the code and BENCHMARK_FAIRNESS.md.
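For readers unfamiliar with the warmup-plus-escape pattern mentioned above, here is a generic sketch. It is not the repository's `bench::measure_repeated` or `bench::escape()`, whose signatures are not shown in this PR; the `sketch::` names are hypothetical. The barrier uses GNU-style inline assembly, so it assumes GCC, Clang or MinGW.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

namespace sketch {

// Generic optimisation barrier: tells the compiler that the object behind
// `p` may be read or written, so computations feeding it cannot be removed
// as dead code. Assumes a GNU-compatible compiler.
inline void escape(const void* p) {
    asm volatile("" : : "g"(p) : "memory");
}

// Runs `fn` `warmup` times untimed (to warm caches and branch predictors),
// then times `iters` calls and returns the mean nanoseconds per call.
template <class F>
double measure_ns_per_op(F&& fn, std::size_t warmup, std::size_t iters) {
    for (std::size_t i = 0; i < warmup; ++i) fn();
    const auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iters; ++i) fn();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count() /
           static_cast<double>(iters);
}

}  // namespace sketch
```

Calling `sketch::escape(&result)` inside the measured lambda serves the purpose the old sentinel-value hacks were aiming at: keeping the measured work observable to the compiler without adding real work to the timed loop.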

Likewise the §3.2 / §4 comparison rows still described the OLD broken state
(14x MPMC import, 178x submit-only, gRPC inproc-transport lie) — contradicting
BENCHMARK_FAIRNESS.md, which records these as resolved (D-1/D-2/D-3). Defer the
ratios to that SoT and mark the remaining open item honestly: independent re-run
of the comparison numbers on this machine (Boost/Taskflow/gRPC available on WSL;
concurrencpp pending). Also: bench::escape() promotion to BenchHarness.hpp is
done (D2 resolved); refresh the §7 summary and the cross-cutting issues list.

No code change — documentation reconciliation only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…on benches

Rebuilt and re-ran ring_vs_boost_lockfree, pool_vs_taskflow, pool_vs_concurrencpp,
and pn1_vs_grpc on the WSL env (g++-13, Release, pinned to CCD0). All four
third-party libs are present (concurrencpp included — the earlier "pending" note
was wrong). Each bench completes losslessly — confirming the D-1 livelock,
D-2 pool task-loss, and D-3 loopback fixes — and wins in the direction
BENCHMARK_FAIRNESS.md records: SPSC both lossless, MPMC ~4×, pool submit→completion
~2.1×/~3.8× (both 200000/200000), concurrencpp large (genuine coroutine overhead),
pn1_vs_grpc 3.6× p50 with the not-like-for-like caveat printed first. Exact
magnitudes vary with the machine/V-Cache pinning, so the SoT medians stay
authoritative for the ratios; this run confirms direction + lossless completion.

Raw capture added under docs/perf-history/. Updates footnote 8, the §4 map rows,
and the §7 summary accordingly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
COMPARISONS.md was reconciled to BENCHMARK_FAIRNESS.md on 2026-05-21 (honest
summary deferred to the SoT; stale section tables carry ⛔ Superseded banners
with the old 5.5×/14×/18×/178×/1.41× numbers retired in-place). Update the
inventory's "documentation inconsistencies" list to reflect that.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Swately Swately merged commit 3dcf10d into main May 21, 2026
10 checks passed
@Swately Swately deleted the test/deflake-and-inventory-reconcile branch May 21, 2026 19:11
