Swately · May 21, 2026
@@ -68,9 +68,11 @@ jobs:
         # 4 tests excluded: V-Cache via CPUID + cache hierarchy walk and
         # /proc/$pid/task in WSL2 are Windows-specific probe paths; the
         # framework side is correct on Linux for all other tests.
-        # --repeat until-pass:3 absorbs flaky failures from runner -j
-        # contention. The stigmergy_pheromone_test scaling assertions
-        # auto-relax when CI=true (see test §10).
+        # --repeat until-pass:3 absorbs the residual jitter of timing-bounded
+        # tests under runner -j contention. (Timing/scaling assertions that
+        # were genuinely non-deterministic have been made deterministic in the
+        # tests themselves — e.g. stigmergy_pheromone_test §10 now asserts only
+        # functional throughput; scaling shape lives in bench/.)
         run: |
           ctest --test-dir build -j 4 --output-on-failure --timeout 60 \
             --repeat until-pass:3 \

@@ -0,0 +1,131 @@
+# Comparison benches — independent re-verification
+# Date: 2026-05-21  Host: AMD Ryzen 9 7950X3D (pinned CCD0/V-Cache via PHYRIAD_BENCH_AFFINITY=0x...FFFF)
+# Toolchain: WSL2 Ubuntu-24.04, g++-13.3.0, Release -O3, no LTO. Libs: Boost, Taskflow, concurrencpp, gRPC (all ENABLED).
+# Purpose: confirm D-1/D-2/D-3 fixes (both sides complete) + win DIRECTION matches BENCHMARK_FAIRNESS.md. Single-run; SoT medians are authoritative for ratios.
+
+########## ring_vs_boost_lockfree ##########
+
+=== ring_vs_boost_lockfree ===
+  Phase P4.4 — Phyriad transport rings vs Boost.Lockfree, apples-to-apples
+  TSC frequency: 4.316 GHz (invariant ref-clock)
+  Compiler:      gcc 13.3.0
+
+  -- §1.A  Phyriad Ring<uint64_t, 1024> SPSC throughput --
+    Phyriad Ring SPSC: sent=75732828 received=75731805 in 2.000s → 37.86 M op/s [lossless: sent-received=1023 (<= Cap=1024 in-flight at stop)]
+
+  -- §1.B  boost::lockfree::spsc_queue<uint64_t, capacity<1024>> throughput --
+    Boost spsc_queue:  sent=63928209 received=63927185 in 2.000s → 31.96 M op/s [lossless: sent-received=1024 (<= Cap=1024 in-flight at stop)]
+
+  -- §2.A  Phyriad RingChannel<uint64_t, 4096> 2P+1C count-based --
+    Phyriad RingChannel 2P+1C: 2000000 received in 0.060s → 33.18 M op/s
+    [finding] RingChannel 2P+1C now completes reliably every run (CAS-claim fix); Boost also completes. Phyriad WINS MPMC (~4x).
+
+  -- §2.B  boost::lockfree::queue<uint64_t, capacity<4096>> 2P+1C count-based --
+    Boost queue 2P+1C:        2000000 sent, 2000000 received in 0.246s → 8.13 M op/s
+
+  [DONE] ring_vs_boost_lockfree
+
+########## pool_vs_taskflow ##########
+
+=== pool_vs_taskflow ===
+  Phyriad pool vs Taskflow tf::Executor, apples-to-apples (V2-P2 honest)
+  TSC frequency: 4.256 GHz (invariant ref-clock)
+  Compiler:      gcc 13.3.0
+
+  --- environment (record this with any published number) ---
+  compiler:     gcc 13.3.0
+  c++ std:      202100
+  os:           Linux
+  compiled ISA: AVX512F AVX2 FMA BMI2 SSE4.2 
+  build:        Release (NDEBUG)
+  cpu:          32 logical / 32 physical cores, 1 CCD(s), V-Cache cores=32, max 0 MHz
+  TSC:          nominal 0.000 GHz (CPUID) / calibrated 4.256 GHz — invariant ref-clock (NOT core cycles)
+  affinity:     0x000000000000FFFF ; current core=12
+
+  -- §1  Phyriad pool — submit + wait_result round-trip (1 submitter) --
+    Phyriad pool round-trip                      1.02 M op/s   983.12 ns/op   4184.3 ref-cyc/op  (0.049 s total)
+
+  -- §1  Taskflow tf::Executor — async() + future.wait() (1 submitter) --
+    Taskflow async+wait round-trip               0.08 M op/s   12155.88 ns/op   51686.5 ref-cyc/op  (0.608 s total)
+
+  -- §2  Phyriad pool — submit→complete throughput (1 submitter) --
+    Phyriad:  200000 tasks submit→complete in 0.025s → 8.12 M tasks/s
+
+  -- §2  Taskflow tf::Executor — submit→complete throughput (1 submitter) --
+    Taskflow: 200000 tasks submit→complete in 0.054s → 3.68 M tasks/s
+
+  -- §3  Phyriad pool — 4 concurrent submitters × 50K (fire-and-forget) --
+    Phyriad:  4×50000 submit→complete in 0.019s → 10.44 M tasks/s aggregate
+
+  -- §3  Taskflow tf::Executor — 4 concurrent submitters × 50K → complete --
+    Taskflow: 4×50000 submitted=200000 wall=0.084s → 2.37 M op/s aggregate
+
+  [DONE] pool_vs_taskflow
+
+########## pool_vs_concurrencpp ##########
+
+=== pool_vs_concurrencpp ===
+  Phyriad pool vs concurrencpp thread_pool_executor, apples-to-apples (V2-P3 honest)
+  TSC frequency: 4.245 GHz (invariant ref-clock)
+  Compiler:      gcc 13.3.0
+
+  --- environment (record this with any published number) ---
+  compiler:     gcc 13.3.0
+  c++ std:      202100
+  os:           Linux
+  compiled ISA: AVX512F AVX2 FMA BMI2 SSE4.2 
+  build:        Release (NDEBUG)
+  cpu:          32 logical / 32 physical cores, 1 CCD(s), V-Cache cores=32, max 0 MHz
+  TSC:          nominal 0.000 GHz (CPUID) / calibrated 4.245 GHz — invariant ref-clock (NOT core cycles)
+  affinity:     0x000000000000FFFF ; current core=6
+
+  -- §1  Phyriad pool — submit + wait_result round-trip (1 submitter) --
+    Phyriad pool round-trip (try_submit)         0.90 M op/s   1108.12 ns/op   4704.6 ref-cyc/op  (0.055 s total)
+
+  -- §1  concurrencpp thread_pool_executor — submit + .get() (1 submitter) --
+    concurrencpp submit+.get() round-trip        0.06 M op/s   16031.34 ns/op   68000.6 ref-cyc/op  (0.802 s total)
+
+  -- §2  Phyriad pool — submit→complete throughput (1 submitter) --
+    Phyriad:     200000 tasks submit→complete in 0.015s → 13.24 M tasks/s
+
+  -- §2  concurrencpp thread_pool_executor — submit→complete (1 submitter) --
+    concurrencpp: 200000 tasks submit→complete in 1.654s → 0.12 M tasks/s
+
+  -- §3  Phyriad pool — 4 concurrent submitters × 50K (fire-and-forget) --
+    Phyriad:     4×50000 submit→complete in 0.016s → 12.22 M tasks/s aggregate
+
+  -- §3  concurrencpp — 4 concurrent submitters × 50K (post) --
+    concurrencpp: 4×50000 submit→complete in 0.367s → 0.54 M tasks/s aggregate
+
+  [DONE] pool_vs_concurrencpp
+
+########## pn1_vs_grpc ##########
+
+=== bench_pn1_vs_grpc ===
+  PhyriadNet/1 vs gRPC unary RPC (loopback echo)
+  TSC frequency: 4.221 GHz (invariant ref-clock)
+  Compiler:      gcc 13.3.0
+
+  --- environment (record this with any published number) ---
+  compiler:     gcc 13.3.0
+  c++ std:      202100
+  os:           Linux
+  compiled ISA: AVX512F AVX2 FMA BMI2 SSE4.2 
+  build:        Release (NDEBUG)
+  cpu:          32 logical / 32 physical cores, 1 CCD(s), V-Cache cores=32, max 0 MHz
+  TSC:          nominal 0.000 GHz (CPUID) / calibrated 4.221 GHz — invariant ref-clock (NOT core cycles)
+  affinity:     0x000000000000FFFF ; current core=0
+  NOTE: NOT like-for-like — a thin UDP framing protocol vs a full TCP/HTTP2/protobuf RPC stack; PN1 wins partly because it does less.
+
+--- §1  Latency (RTT, 2000 iters, 32 byte payload) ---
+    gRPC (TCP+HTTP/2+pb)   min= 111932  p50= 157757  p99=  200985  max=   280441  mean=  165450  (ns)
+    PhyriadNet/1 (UDP)     min=  29774  p50=  48577  p99=   56807  max=   104257  mean=   48690  (ns)
+    PN1 vs gRPC:           3.25x faster at p50
+
+--- §2  Throughput (5000 iters) ---
+    gRPC                                         0.01 M op/s   173915.53 ns/op   733057.8 ref-cyc/op  (0.870 s total)
+    PhyriadNet/1 (UDP)                           0.02 M op/s   50005.05 ns/op   210522.7 ref-cyc/op  (0.250 s total)
+    PN1 ratio:             3.48x of gRPC throughput
+
+  [DONE] bench_pn1_vs_grpc
+
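The ratios quoted in the log (roughly 4x for RingChannel against the Boost queue, 3.25x at p50 and 3.48x throughput for PN1 against gRPC) can be re-derived from the raw figures on the same lines. The sketch below is a minimal cross-check, not part of the bench harness: the `speedup` helper and the `rows` layout are hypothetical names introduced here, and the numbers are copied from this single run, so they carry the same caveat as the log itself (SoT medians remain authoritative for ratios).

```python
# Figures copied verbatim from the single-run log above.
# speedup() and the rows layout are illustrative, not part of the bench harness.

def speedup(phyriad: float, other: float, higher_is_better: bool) -> float:
    """Return how many times better the Phyriad figure is than the other one."""
    if higher_is_better:          # throughput-style metric (M op/s)
        return phyriad / other
    return other / phyriad        # latency-style metric (ns, ns/op)

rows = [
    # (label, Phyriad value, comparison value, higher_is_better)
    ("RingChannel vs Boost queue, 2P+1C (M op/s)", 33.18, 8.13, True),
    ("PN1 vs gRPC, p50 RTT (ns)", 48577.0, 157757.0, False),
    ("PN1 vs gRPC, throughput (ns/op)", 50005.05, 173915.53, False),
]

ratios = {label: speedup(p, o, hib) for label, p, o, hib in rows}

for label, value in ratios.items():
    print(f"{label}: {value:.2f}x")
```

Running it against other rows of the log (for example the pool round-trip ns/op pairs) only needs extra entries in `rows`; the direction flag keeps latency and throughput metrics from being mixed up.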