Merged
8 changes: 5 additions & 3 deletions .github/workflows/ci-linux.yml
@@ -68,9 +68,11 @@ jobs:
# 4 tests excluded: V-Cache via CPUID + cache hierarchy walk and
# /proc/$pid/task in WSL2 are Windows-specific probe paths; the
# framework side is correct on Linux for all other tests.
- # --repeat until-pass:3 absorbs flaky failures from runner -j
- # contention. The stigmergy_pheromone_test scaling assertions
- # auto-relax when CI=true (see test §10).
+ # --repeat until-pass:3 absorbs the residual jitter of timing-bounded
+ # tests under runner -j contention. (Timing/scaling assertions that
+ # were genuinely non-deterministic have been made deterministic in the
+ # tests themselves — e.g. stigmergy_pheromone_test §10 now asserts only
+ # functional throughput; scaling shape lives in bench/.)
run: |
ctest --test-dir build -j 4 --output-on-failure --timeout 60 \
--repeat until-pass:3 \
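
The retry behaviour that the comment above attributes to `--repeat until-pass:3` can be modelled in a few lines. This is only an illustrative sketch of the semantics (rerun a failing test until it passes, up to a fixed number of attempts); `run_until_pass` and `flaky` are hypothetical names and not part of ctest or of this repository.

```python
def run_until_pass(test, max_attempts=3):
    """Run `test` up to `max_attempts` times and stop at the first pass.

    Returns (passed, attempts_used).
    """
    for attempt in range(1, max_attempts + 1):
        if test():
            return True, attempt
    return False, max_attempts


# Hypothetical flaky test: fails on its first call, passes from the second on.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    return calls["n"] >= 2


flaky_result = run_until_pass(flaky)
broken_result = run_until_pass(lambda: False)
print(flaky_result, broken_result)
```

A test that fails on every attempt is still reported as failed, so the retry only absorbs intermittent failures; it cannot hide a deterministic regression.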
131 changes: 131 additions & 0 deletions docs/perf-history/v2_comparisons_reverify_2026-05-21_wsl.txt
@@ -0,0 +1,131 @@
# Comparison benches — independent re-verification
# Date: 2026-05-21 Host: AMD Ryzen 9 7950X3D (pinned CCD0/V-Cache via PHYRIAD_BENCH_AFFINITY=0x...FFFF)
# Toolchain: WSL2 Ubuntu-24.04, g++-13.3.0, Release -O3, no LTO. Libs: Boost, Taskflow, concurrencpp, gRPC (all ENABLED).
# Purpose: confirm D-1/D-2/D-3 fixes (both sides complete) + win DIRECTION matches BENCHMARK_FAIRNESS.md. Single-run; SoT medians are authoritative for ratios.

########## ring_vs_boost_lockfree ##########

=== ring_vs_boost_lockfree ===
Phase P4.4 — Phyriad transport rings vs Boost.Lockfree, apples-to-apples
TSC frequency: 4.316 GHz (invariant ref-clock)
Compiler: gcc 13.3.0

-- §1.A Phyriad Ring<uint64_t, 1024> SPSC throughput --
Phyriad Ring SPSC: sent=75732828 received=75731805 in 2.000s → 37.86 M op/s [lossless: sent-received=1023 (<= Cap=1024 in-flight at stop)]

-- §1.B boost::lockfree::spsc_queue<uint64_t, capacity<1024>> throughput --
Boost spsc_queue: sent=63928209 received=63927185 in 2.000s → 31.96 M op/s [lossless: sent-received=1024 (<= Cap=1024 in-flight at stop)]

-- §2.A Phyriad RingChannel<uint64_t, 4096> 2P+1C count-based --
Phyriad RingChannel 2P+1C: 2000000 received in 0.060s → 33.18 M op/s
[finding] RingChannel 2P+1C now completes reliably every run (CAS-claim fix); Boost also completes. Phyriad WINS MPMC (~4x).

-- §2.B boost::lockfree::queue<uint64_t, capacity<4096>> 2P+1C count-based --
Boost queue 2P+1C: 2000000 sent, 2000000 received in 0.246s → 8.13 M op/s

[DONE] ring_vs_boost_lockfree
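
The "lossless" annotation on the SPSC lines and the "~4x" MPMC finding can be re-derived from the counts printed above. The sketch below assumes the check is simply `0 <= sent - received <= capacity`, as the log annotation suggests; `in_flight_check` is a hypothetical helper, not the benchmark's own code.

```python
def in_flight_check(sent, received, capacity):
    """Return (in_flight, lossless) for a producer/consumer run stopped mid-stream.

    Items still sitting in the ring when the run stops are not losses,
    so the gap may be anything from 0 up to the ring capacity.
    """
    in_flight = sent - received
    return in_flight, 0 <= in_flight <= capacity


# Counts copied from the §1.A and §1.B lines of the log.
phyriad_spsc = in_flight_check(75_732_828, 75_731_805, 1024)
boost_spsc = in_flight_check(63_928_209, 63_927_185, 1024)

# Throughput ratios from the reported M op/s figures.
spsc_ratio = 37.86 / 31.96   # §1.A vs §1.B
mpmc_ratio = 33.18 / 8.13    # §2.A vs §2.B

print(phyriad_spsc, boost_spsc)
print(round(spsc_ratio, 2), round(mpmc_ratio, 2))
```

On these single-run figures the SPSC advantage is about 1.18x and the 2P+1C advantage about 4.08x, which matches the "~4x" in the finding line.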

########## pool_vs_taskflow ##########

=== pool_vs_taskflow ===
Phyriad pool vs Taskflow tf::Executor, apples-to-apples (V2-P2 honest)
TSC frequency: 4.256 GHz (invariant ref-clock)
Compiler: gcc 13.3.0

--- environment (record this with any published number) ---
compiler: gcc 13.3.0
c++ std: 202100
os: Linux
compiled ISA: AVX512F AVX2 FMA BMI2 SSE4.2
build: Release (NDEBUG)
cpu: 32 logical / 32 physical cores, 1 CCD(s), V-Cache cores=32, max 0 MHz
TSC: nominal 0.000 GHz (CPUID) / calibrated 4.256 GHz — invariant ref-clock (NOT core cycles)
affinity: 0x000000000000FFFF ; current core=12

-- §1 Phyriad pool — submit + wait_result round-trip (1 submitter) --
Phyriad pool round-trip 1.02 M op/s 983.12 ns/op 4184.3 ref-cyc/op (0.049 s total)

-- §1 Taskflow tf::Executor — async() + future.wait() (1 submitter) --
Taskflow async+wait round-trip 0.08 M op/s 12155.88 ns/op 51686.5 ref-cyc/op (0.608 s total)

-- §2 Phyriad pool — submit→complete throughput (1 submitter) --
Phyriad: 200000 tasks submit→complete in 0.025s → 8.12 M tasks/s

-- §2 Taskflow tf::Executor — submit→complete throughput (1 submitter) --
Taskflow: 200000 tasks submit→complete in 0.054s → 3.68 M tasks/s

-- §3 Phyriad pool — 4 concurrent submitters × 50K (fire-and-forget) --
Phyriad: 4×50000 submit→complete in 0.019s → 10.44 M tasks/s aggregate

-- §3 Taskflow tf::Executor — 4 concurrent submitters × 50K → complete --
Taskflow: 4×50000 submitted=200000 wall=0.084s → 2.37 M op/s aggregate

[DONE] pool_vs_taskflow
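
The three columns on each §1 round-trip line are related by simple arithmetic: M op/s is 1000 divided by ns/op, and ref-cyc/op is roughly ns/op times the calibrated TSC frequency in GHz. The sketch below re-derives them from the values printed above. The conversion formula is an assumption about how the bench reports its columns, and the small mismatch in ref-cyc/op presumably comes from the frequency being rounded in the header.

```python
TSC_GHZ = 4.256  # calibrated frequency from the pool_vs_taskflow header


def mops_from_ns(ns_per_op):
    """Throughput in millions of operations per second."""
    return 1000.0 / ns_per_op


def ref_cycles_from_ns(ns_per_op, tsc_ghz=TSC_GHZ):
    """Reference-clock cycles per operation (not core cycles)."""
    return ns_per_op * tsc_ghz


# (ns/op, reported ref-cyc/op) copied from the §1 lines of the log.
rows = {
    "phyriad": (983.12, 4184.3),
    "taskflow": (12155.88, 51686.5),
}

derived = {}
for name, (ns, reported_cyc) in rows.items():
    cyc = ref_cycles_from_ns(ns)
    derived[name] = {
        "mops": round(mops_from_ns(ns), 2),
        "rel_err": abs(cyc - reported_cyc) / reported_cyc,
    }

round_trip_ratio = rows["taskflow"][0] / rows["phyriad"][0]
print(derived, round(round_trip_ratio, 2))
```

Both derived ref-cyc values land within about 0.1% of the reported ones, and the round-trip latency ratio comes out at roughly 12x in Phyriad's favour for this single run.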

########## pool_vs_concurrencpp ##########

=== pool_vs_concurrencpp ===
Phyriad pool vs concurrencpp thread_pool_executor, apples-to-apples (V2-P3 honest)
TSC frequency: 4.245 GHz (invariant ref-clock)
Compiler: gcc 13.3.0

--- environment (record this with any published number) ---
compiler: gcc 13.3.0
c++ std: 202100
os: Linux
compiled ISA: AVX512F AVX2 FMA BMI2 SSE4.2
build: Release (NDEBUG)
cpu: 32 logical / 32 physical cores, 1 CCD(s), V-Cache cores=32, max 0 MHz
TSC: nominal 0.000 GHz (CPUID) / calibrated 4.245 GHz — invariant ref-clock (NOT core cycles)
affinity: 0x000000000000FFFF ; current core=6

-- §1 Phyriad pool — submit + wait_result round-trip (1 submitter) --
Phyriad pool round-trip (try_submit) 0.90 M op/s 1108.12 ns/op 4704.6 ref-cyc/op (0.055 s total)

-- §1 concurrencpp thread_pool_executor — submit + .get() (1 submitter) --
concurrencpp submit+.get() round-trip 0.06 M op/s 16031.34 ns/op 68000.6 ref-cyc/op (0.802 s total)

-- §2 Phyriad pool — submit→complete throughput (1 submitter) --
Phyriad: 200000 tasks submit→complete in 0.015s → 13.24 M tasks/s

-- §2 concurrencpp thread_pool_executor — submit→complete (1 submitter) --
concurrencpp: 200000 tasks submit→complete in 1.654s → 0.12 M tasks/s

-- §3 Phyriad pool — 4 concurrent submitters × 50K (fire-and-forget) --
Phyriad: 4×50000 submit→complete in 0.016s → 12.22 M tasks/s aggregate

-- §3 concurrencpp — 4 concurrent submitters × 50K (post) --
concurrencpp: 4×50000 submit→complete in 0.367s → 0.54 M tasks/s aggregate

[DONE] pool_vs_concurrencpp

########## pn1_vs_grpc ##########

=== bench_pn1_vs_grpc ===
PhyriadNet/1 vs gRPC unary RPC (loopback echo)
TSC frequency: 4.221 GHz (invariant ref-clock)
Compiler: gcc 13.3.0

--- environment (record this with any published number) ---
compiler: gcc 13.3.0
c++ std: 202100
os: Linux
compiled ISA: AVX512F AVX2 FMA BMI2 SSE4.2
build: Release (NDEBUG)
cpu: 32 logical / 32 physical cores, 1 CCD(s), V-Cache cores=32, max 0 MHz
TSC: nominal 0.000 GHz (CPUID) / calibrated 4.221 GHz — invariant ref-clock (NOT core cycles)
affinity: 0x000000000000FFFF ; current core=0
NOTE: NOT like-for-like — a thin UDP framing protocol vs a full TCP/HTTP2/protobuf RPC stack; PN1 wins partly because it does less.

--- §1 Latency (RTT, 2000 iters, 32 byte payload) ---
gRPC (TCP+HTTP/2+pb) min= 111932 p50= 157757 p99= 200985 max= 280441 mean= 165450 (ns)
PhyriadNet/1 (UDP) min= 29774 p50= 48577 p99= 56807 max= 104257 mean= 48690 (ns)
PN1 vs gRPC: 3.25x faster at p50

--- §2 Throughput (5000 iters) ---
gRPC 0.01 M op/s 173915.53 ns/op 733057.8 ref-cyc/op (0.870 s total)
PhyriadNet/1 (UDP) 0.02 M op/s 50005.05 ns/op 210522.7 ref-cyc/op (0.250 s total)
PN1 ratio: 3.48x of gRPC throughput

[DONE] bench_pn1_vs_grpc
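
The two summary ratios in this section can be checked directly against the raw figures. The sketch below assumes "faster at p50" means gRPC p50 divided by PN1 p50, and that the throughput ratio is the inverse ratio of ns/op; the p99 ratio is an extra figure computed here and is not printed in the log.

```python
# Latency percentiles in nanoseconds, copied from the §1 lines of the log.
grpc = {"p50": 157_757, "p99": 200_985}
pn1 = {"p50": 48_577, "p99": 56_807}

p50_speedup = grpc["p50"] / pn1["p50"]
p99_speedup = grpc["p99"] / pn1["p99"]

# §2: ns/op over 5000 iterations.
ITERS = 5000
grpc_ns_per_op = 173_915.53
pn1_ns_per_op = 50_005.05

throughput_ratio = grpc_ns_per_op / pn1_ns_per_op
grpc_total_s = ITERS * grpc_ns_per_op * 1e-9
pn1_total_s = ITERS * pn1_ns_per_op * 1e-9

print(round(p50_speedup, 2), round(p99_speedup, 2), round(throughput_ratio, 2))
print(round(grpc_total_s, 3), round(pn1_total_s, 3))
```

The recomputed values agree with the printed 3.25x and 3.48x, and the totals match the 0.870 s and 0.250 s in the throughput lines. As the NOTE in the log says, the comparison is not like-for-like, so these ratios describe this loopback echo setup only.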
