Skip to content

[Draft] feat(vardiff): add in-process simulation framework + baseline regression tests#2154

Open
gimballock wants to merge 15 commits into
stratum-mining:mainfrom
marafoundation:vardiff/simulation-framework
Open

[Draft] feat(vardiff): add in-process simulation framework + baseline regression tests#2154
gimballock wants to merge 15 commits into
stratum-mining:mainfrom
marafoundation:vardiff/simulation-framework

Conversation

@gimballock
Copy link
Copy Markdown

Adds a deterministic in-process simulation framework that characterizes
any Vardiff implementation across the operational rate range, and
commits the current algorithm's measurements as a baseline for automated
regression testing.

The "vardiff fires too often on noise" / "vardiff doesn't react fast
enough" conversations have circled the same issues (#396 and adjacent)
without a way to settle questions empirically. This PR adds the missing
infrastructure: any future proposal can now produce a quantitative
delta against a fixed reference.

The finding that motivates this

Before the framework, "is the algorithm too noisy or too sluggish?" was
a matter of opinion. With it, the question is a table:

Reaction sensitivity to a -50% step change (probability of firing
within 5 minutes after the change):

share/min sensitivity
6 0.70
12 0.55
30 0.33
60 0.16
120 0.09

Higher share rates produce less responsive algorithms — counter to
expectations. The mechanism is mechanical (post-convergence delta_time
grows indefinitely, diluting the post-step signal in the cumulative
window) and surfaced clearly in the data despite taking months of
deployment observation to half-articulate. Full numbers and analysis
in vardiff_baseline.md.

This isn't a fix. It's the measurement that lets the fix be evaluated.

What's in the PR

3 commits, ~2000 LOC plus baseline data:

  1. feat(vardiff): inject Clock trait + add_shares trait method
    Minimum API additions to channels_sv2 for testability and
    simulation performance. Production behavior unchanged — existing
    constructors default to SystemClock, the new trait method has a
    default implementation that keeps existing impls compiling.

  2. feat(vardiff_sim): in-process simulation framework
    New crate at sv2/channels-sv2/sim/. Per-tick Poisson share
    sampling, five behavioral metrics with percentile distributions,
    50-cell parameterized sweep (5 share rates × 10 scenarios), CLI
    binary for baseline generation, regression test asserting against
    a committed baseline.

  3. data(vardiff_sim): design doc + baseline characterization
    The design proposal documenting metric definitions and tolerance
    policy, plus the measured baseline as both TOML (consumed by the
    regression test) and Markdown (for human review).

What the framework measures

Five behavioral attributes, each as a distribution across 1000
independent trials per cell:

Metric Better is What it tells you
Convergence time Smaller How fast the algorithm settles after cold start
Settled accuracy Smaller How close to truth the algorithm lands
Steady-state jitter Smaller How often it fires on noise post-settle
Reaction time Smaller How fast it responds to genuine load changes
Reaction sensitivity ≈ 1 for real Δ, ≈ 0 for noise Whether it distinguishes signal from noise

Per-metric tolerances are asserted automatically against the checked-in
baseline. Failed assertions identify the cell and metric with specific
baseline-vs-current numbers. Mid-range Δ values (10-25%) are reported
but not asserted — that's where legitimate algorithmic tradeoffs live
and a reviewer should judge by looking at the full delta.

How to run

From sv2/channels-sv2/sim/:

# Fast unit tests (~1 second)
cargo test

# Generate a fresh baseline (~5-15 seconds)
cargo run --release --bin generate-baseline

# Run the slow regression test (~5-15 seconds; #[ignore]-d by default)
cargo test --release --lib -- --ignored

What this enables

For any future vardiff proposal:

  1. Implement the new algorithm as a Vardiff impl
  2. cargo run --release --bin generate-baseline to produce comparable
    measurements
  3. Diff against the committed baseline
  4. Make the case with numbers

No more "I think this is better." Instead "this changes p50 jitter from
X to Y at 12 spm, at the cost of p90 reaction time going from A to B."

Where to look in this PR

What this PR is NOT

  • Not an algorithm change. VardiffState behavior is unchanged.
    The only public-API additions are Vardiff::add_shares (with a
    default impl) and the Clock trait. Production code defaults to
    SystemClock and behaves identically to before.
  • Not a recommendation about share rate defaults. The baseline
    data suggests 12-30 spm is the operational sweet spot, but this
    PR doesn't touch any defaults.
  • Not a CI workflow. The regression test works locally but needs
    a GitHub Action to be a true CI gate. Follow-up.

Open follow-ups

  • Wire cargo test --release --lib -- --ignored into CI on PRs
    touching vardiff/* or the sim crate.
  • Bump channels_sv2 5.0.0 → 5.1.0 once the workspace lockfile
    situation allows (the trait-method addition is technically a
    minor-version semver change). TODO comment in Cargo.toml tracks
    this.
  • Investigate the reactivity-degrades-with-rate finding. The framework
    surfaces the problem; fixing it is a separate proposal that this
    framework will be the right tool to evaluate.

Test plan

  • cargo test -p channels_sv2 --lib vardiff — 17 tests, all pass
  • cargo test from sv2/channels-sv2/sim/ — 53 fast unit tests
  • cargo test --release --lib -- --ignored from sim/ — slow
    regression test passes against committed baseline
  • cargo run --release --bin generate-baseline — reproduces the
    committed vardiff_baseline.toml byte-for-byte at the same seed

@gimballock gimballock force-pushed the vardiff/simulation-framework branch from 11b2560 to 88d8d1d Compare May 13, 2026 22:06
@gimballock
Copy link
Copy Markdown
Author

gimballock commented May 13, 2026

The code is cheap and only meant to demonstrate the feasibility, but the concept ack revolves around these points imo:

  • We can play dice with share-received events to simulate running the vardiff algorithms over arbitrary ranges of time. But we need to mock SystemTime::now() and add a way to bulk add new shares.
  • With fake time simulations we can do large scale vardiff trials of whatever metrics we want and contrast against correlated attributes like target shares-per-minute.
    • I was interested in convergence time, stable-state jitter, and convergence accuracy
    • But responsiveness to external change is also a key capability, (how fast to adjust to a 50% spike/dip in hashrate)
  • With this compilation of reproducible test results compiled into a profile we can use integration tests to lock in established performance thresholds and ratchet up the expectations if we find better algorithms.

Comment on lines +7 to +13
| share/min | rate | p10 | p50 | p90 | p99 |
| --- | --- | --- | --- | --- | --- |
| 6 | 83.3% | 10m | 12m | 21m | 25m |
| 12 | 95.4% | 10m | 10m | 20m | 25m |
| 30 | 99.5% | 10m | 10m | 15m | 25m |
| 60 | 100.0% | 10m | 10m | 10m | 20m |
| 120 | 100.0% | 10m | 10m | 10m | 15m |
Copy link
Copy Markdown
Author

@gimballock gimballock May 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first row here shows results of the convergence time test for the the default case (6 spm).
The convergence times are between 10 and 25 minutes, with total failures to converge (w/in quiet_window_secs of simulated time) occurring 17% of the time!

the next few rows describe the results for faster share rates. the most extreme times (25m reduces to 15m) and the total failure cases generally disappear around 30 spm.

Comment on lines +15 to +37
## Settled accuracy (stable load, post-convergence)

`|final_hashrate / true_hashrate - 1|` at trial end. Smaller is better.

| share/min | p10 | p50 | p90 | p99 |
| --- | --- | --- | --- | --- |
| 6 | 0.0% | 4.9% | 23.6% | 70.3% |
| 12 | 0.0% | 0.0% | 12.3% | 26.9% |
| 30 | 0.0% | 0.0% | 0.8% | 15.6% |
| 60 | 0.0% | 0.0% | 0.0% | 3.1% |
| 120 | 0.0% | 0.0% | 0.0% | 0.0% |

## Steady-state jitter (fires per minute)

Post-convergence rate of vardiff fires. Smaller is better — ideal is zero under stable load.

| share/min | p50 | p90 | p99 | mean |
| --- | --- | --- | --- | --- |
| 6 | 0.000 | 0.200 | 0.385 | 0.059 |
| 12 | 0.000 | 0.077 | 0.217 | 0.019 |
| 30 | 0.000 | 0.000 | 0.067 | 0.002 |
| 60 | 0.000 | 0.000 | 0.000 | 0.000 |
| 120 | 0.000 | 0.000 | 0.000 | 0.000 |
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two metrics (proximity to true hashrate and post-converged adjustments) show a similar trend,
Lots of undesired behavior in the extreme cases (top 10%, top 1%) of 6 shares/min case that is alleviated at higher share rates.

Comment on lines +39 to +60
## Reaction time to a 50% drop (step at 15 min)

| share/min | reacted | p10 | p50 | p90 | p99 |
| --- | --- | --- | --- | --- | --- |
| 6 | 69.7% | 1m | 3m | 5m | 5m |
| 12 | 54.8% | 1m | 3m | 5m | 5m |
| 30 | 32.6% | 2m | 4m | 5m | 5m |
| 60 | 16.3% | 3m | 5m | 5m | 5m |
| 120 | 8.6% | 4m | 5m | 5m | 5m |

## Reaction sensitivity (P[fire within 5 min of step change])

| Δ% | 6 | 12 | 30 | 60 | 120 |
| --- | --- | --- | --- | --- | --- |
| -50% | 0.70 | 0.55 | 0.33 | 0.16 | 0.09 |
| -25% | 0.44 | 0.23 | 0.08 | 0.00 | 0.00 |
| -10% | 0.39 | 0.15 | 0.02 | 0.00 | 0.00 |
| -5% | 0.40 | 0.15 | 0.02 | 0.00 | 0.00 |
| +5% | 0.39 | 0.13 | 0.02 | 0.00 | 0.00 |
| +10% | 0.42 | 0.17 | 0.03 | 0.00 | 0.00 |
| +25% | 0.48 | 0.23 | 0.07 | 0.01 | 0.00 |
| +50% | 0.64 | 0.47 | 0.32 | 0.22 | 0.29 |
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tables show how long it takes for vardiff to respond to an unexpected change in hashrate. Where the changes are to either increase or decrease by proportional amounts anywhere from 5% to 50%.

The first table specifically looks at a 50% draw down showing that a full 30% of the time vardiff fails to adjust after 5 min. The next few rows show that the situation worsens at higher share rates, at 120 spm 91% of the trials failed to adjust after 5m.

The second table shows that this effect is basically the same for hashrate changes in the opposite direction and also that changes of lesser magnitude respond much more quickly.

Comment thread sv2/channels-sv2/sim/src/regression.rs Outdated
Comment on lines +16 to +24
//! - **Convergence rate**: `current >= baseline - 0.01`
//! - **Convergence p90**: `current <= baseline * 1.10`
//! - **Settled accuracy p50 / p90**: `current <= baseline * 1.15`
//! - **Jitter p50**: `current <= baseline + 0.02` (absolute; baseline can be near zero)
//! - **Jitter p95**: `current <= baseline * 1.25`
//! - **Reaction rate**: `current >= baseline - 0.02`
//! - **Reaction p50**: `current <= baseline * 1.20`
//! - **Sensitivity at large |Δ| (|Δ| >= 50%)**: `current >= baseline - 0.02`
//! - **Sensitivity at small |Δ| (|Δ| <= 5%)**: `current <= baseline + 0.05`
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Convergence rate: Must be no more than 1% slower than the baseline convergence time
Convergence p90: The slowest 10% convergence times must be within 10% of the baseline's convergence time
Settled accuracy: must be within 15% of baseline's accuracy for the slowest 50% / 10%
Jitter p50/p95: must be within 2% and 25% of baseline
...etc.

You see the pattern, there are lots of magic thresholds in this portion of the code that are arbitrarily chosen at this point and fair game for analysis.

@gimballock gimballock force-pushed the vardiff/simulation-framework branch 2 times, most recently from 5cbed7c to 85d6f8b Compare May 17, 2026 14:48
@gimballock gimballock changed the title feat(vardiff): add in-process simulation framework + baseline regression tests [Draft] feat(vardiff): add in-process simulation framework + baseline regression tests May 17, 2026
@adammwest
Copy link
Copy Markdown

after some optimization I got
https://github.com/adammwest/stratum/blob/feat/vardiff_kalman/sv2/channels-sv2/sim/vardiff_best.md
with the Bayesian model

Method Result
Bayesian model Good, Best result
Kalman filters Good
Jurik moving averages Good but artifacts on hashrate changes
Thompson sampling Bad

@gimballock
Copy link
Copy Markdown
Author

gimballock commented May 18, 2026

after some optimization I got https://github.com/adammwest/stratum/blob/feat/vardiff_kalman/sv2/channels-sv2/sim/vardiff_best.md with the Bayesian model
Method Result
Bayesian model Good, Best result
Kalman filters Good
Jurik moving averages Good but artifacts on hashrate changes
Thompson sampling Bad

I'm so excited to see people other people nerding out on vardiff with me! Thank you!

A couple things I noticed in your results, the 2m convergence time is impressive but your response to a 50% hashrate drop only succeeds in readjusting 4.4% of the time. I'm not sure how best to balance those two metrics but probably not one at the expense of the other.

@gimballock gimballock force-pushed the vardiff/simulation-framework branch from 2d10f57 to 414afbb Compare May 19, 2026 14:25
@gimballock gimballock force-pushed the vardiff/simulation-framework branch 4 times, most recently from 211bc98 to 2a88fde Compare May 20, 2026 15:09
@adammwest
Copy link
Copy Markdown

Some learning's I had @gimballock

  • This task is hard, I think the current implementation is optimized to a degree.
  • the most critical thing is the fitness, currently there are many metrics, how you combine all of them into a final value is what determines the goodness for any algorithm, as its a summary there are ways to game it.
  • Use every value in the toml file, if you dont those values will naturally will degrade.
  • There are many ways to combine many numbers which lead to slightly better and slightly worse performance.
    one thing I did which helped was to separate fitness into 2 categories improvement and regression, even better separating these per group e.g stable,coldstart ,... then you can decompose the value.
  • Normalize each metric/group otherwise the more numerous or larger numbers will be the focus.
  • There are many cases where you can get a good score, but the fitness prefers optimizing 1 variable or a set of variables at the expense of others.
  • For grid parameter sweeps, they usually discretize the domain so you are bounded in improvement only by dimension range and amount of queries. so you need to constantly increase queries or shrink ranges. usually you are limited due to time. For this reason I prefer random restart hill climbing I find is generally pretty good when you don't make assumptions about the data.
  • If you have too many parameters to optimize you can over fit, and end up just gaming the test

Eric Price and others added 10 commits May 21, 2026 10:12
Two API additions to enable mockable time and bulk share-count operations
in the Vardiff trait, prerequisites for the in-process simulation
framework added in subsequent commits.

Clock injection:
- New vardiff/clock.rs with Clock trait, SystemClock, and MockClock.
- VardiffState gains an Arc<dyn Clock> field and a new_with_clock
  constructor. reset_counter and try_vardiff read time via the clock
  rather than calling SystemTime::now() directly.
- Existing constructors (new, new_with_min) default to SystemClock;
  production behavior is unchanged.

Bulk share addition:
- Vardiff trait gains add_shares(n: u32) with a default implementation
  calling increment_shares_since_last_update n times.
- VardiffState overrides with a single saturating add. Required for
  simulation performance — the harness can bulk-add millions of shares
  per tick during cold-start scenarios where the default's loop would
  dominate trial runtime.

VardiffError::TimeError is now unreachable but retained with a doc
comment marking it for removal at the next major version bump; removing
it now would break downstream exhaustive matches.

Semver note: channels_sv2 should bump from 5.0.0 to 5.1.0 to surface the
new add_shares method to downstream consumers, but the project's pinned
Rust 1.75 toolchain cannot write the v4 Cargo.lock format that a version
change requires. TODO comment in Cargo.toml flags the deferred bump.

Tests: 17 vardiff tests pass (12 existing unchanged, 3 new clock-module
unit tests, 2 new tests verifying clock injection propagates through
VardiffState).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…terization

New vardiff_sim crate at sv2/channels-sv2/sim/ providing deterministic
behavioral characterization of any Vardiff implementation, plus a
regression test that asserts the current algorithm against a
checked-in baseline.

Purpose: surface the operationally-important attributes of the vardiff
algorithm — convergence time, settled accuracy, steady-state jitter,
reaction time, reaction sensitivity — in concrete measurable terms so
that any future algorithmic improvement (parametric thresholds, EWMA,
SPRT, etc.) can be evaluated against a fixed harness and produce a
clean delta report.

Components:

- rng.rs: XorShift64 RNG plus exponential and Poisson samplers
  (Knuth for λ<30, normal approximation for ≥30). Hand-rolled for
  cross-version reproducibility without depending on the rand crate's
  RNG-stability guarantees.

- schedule.rs: HashrateSchedule for parameterizing the miner's true
  hashrate over time. Convenience constructors for stable, step-change,
  and throttle scenarios.

- trial.rs: run_trial drives any Vardiff implementation through
  duration_secs of simulated time. Per-tick Poisson sampling: at each
  60s tick, samples (true_h / estimated_h) * shares_per_minute, bulk-
  adds via Vardiff::add_shares, calls try_vardiff. Rate-independent —
  handles λ from near-zero to millions.

- metrics.rs: Distribution helper (sorted f64s, p10-p99 percentiles,
  mean, count) plus the five metric functions. Where a metric can fail
  (non-converging trials, missing reactions) the rate is reported
  alongside the distribution.

- baseline.rs: Scenario / Cell / CellResult types and run_baseline
  generic over Vardiff. Default grid is 5 share rates × 10 scenarios =
  50 cells. Hand-written TOML and Markdown serialization (avoiding
  serde + toml dependencies to keep the lockfile minimal).

- bin/generate-baseline.rs: CLI entry point. Configurable via
  VARDIFF_BASELINE_TRIALS, VARDIFF_BASELINE_SEED, VARDIFF_BASELINE_OUT_DIR.

- regression.rs: baseline-parsing + per-metric tolerance assertions.
  The classic_algorithm_no_regression test loads the committed baseline
  via include_str! and asserts current measurements. Marked #[ignore]
  because it runs the full ~5s baseline; CI should invoke via
  cargo test --release --lib -- --ignored.

- README.md covering usage, output interpretation, baseline-update
  workflow, and project-specific notes including the Cargo.lock
  copy-from-parent rationale.

The crate is declared as its own Cargo workspace (its Cargo.toml has a
top-level [workspace] section) so its lockfile is independent of the
parent stratum workspace. Required because the parent's pinned 1.75
toolchain cannot write v4 lockfiles, and adding the sim crate as a
workspace member would force such a write. The committed Cargo.lock is
a copy of the parent's.

Tests: 53 fast unit tests + 1 #[ignore]-d slow regression test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tate

- VARDIFF_SIMULATION_FRAMEWORK.md: design proposal documenting the
  framework's five metrics, assertion policy, simulation mechanism,
  and architectural rationale. Co-located with the crate it describes.

- vardiff_baseline.toml: machine-readable baseline measurements of
  the classic VardiffState algorithm across the default 50-cell grid
  (5 share rates × 10 scenarios, 1000 trials each, base seed
  0xDEADBEEFCAFEF00D). Consumed by the regression test in the sim
  crate.

- vardiff_baseline.md: human-readable summary of the same data,
  organized by metric type for PR review.

Notable findings surfaced by the baseline:

- Convergence: solid across rates (100% at 30+ spm, 95% at 12 spm,
  83% at 6 spm). p50 is ~10 minutes everywhere, dominated by the
  Phase 1 ×3/min ramp clamp.

- Settled accuracy: follows 1/sqrt(N) cleanly. p99 error is 70% at
  6 spm, 27% at 12, 15% at 30, 3% at 60, 0% at 120. Low-rate
  operation is statistically threadbare.

- Steady-state jitter: small everywhere and ~0 above 30 spm. The
  algorithm's growing delta_time post-convergence narrows the
  effective noise band as 1/sqrt(N), producing accidental self-
  stabilization at high rates.

- Reaction sensitivity DEGRADES with share rate — counterintuitive
  but mechanistic. The same property that produces low jitter at
  high rates (growing delta_time after a Phase 1 fire) produces
  sluggish response to step changes (post-step shares diluted by
  long pre-step history). At 60+ spm only 9-16% of trials react to
  a 50% drop within 5 minutes.

This baseline is the reference point for evaluating any future
algorithmic proposal. The regression test in the sim crate asserts
each metric is within tolerance of these recorded values.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reframes the sim crate around a four-axis algorithm decomposition
(Estimator / Statistic / Boundary / UpdateRule) and produces a
production-ready algorithm recommendation via systematic
characterization of the design space.

## What the framework now provides

- Four-axis decomposition (sim/src/composed/): four orthogonal
  extension traits + Composed<E, S, B, U> adapter that carries a
  blanket impl Vardiff, so any composition is a drop-in production
  algorithm. The classic algorithm composed as
  Composed<CumulativeCounter, AbsoluteRatio, StepFunction(classic),
  FullRetargetWithClamp(classic)> is asserted fire-for-fire
  equivalent to channels_sv2::VardiffState.

- Algorithm registry (AlgorithmSpec factories in sim/src/grid.rs):
  VardiffState, ClassicComposed, Parametric (PoissonCI boundary),
  ParametricStrict (z=3.0), EWMA-60s / EWMA(tau), SlidingWindow,
  ClassicPartialRetarget, and FullRemedy (EWMA-120 + PoissonCI +
  PartialRetarget(0.3)) -- the recommended production composition.

- Grid layer (sim/src/grid.rs): Cartesian product over (algorithm,
  share_rate, scenario) with first-class algorithm axis. Grid::run
  characterizes every algorithm in one binary; Grid::run_paired uses
  algo-stripped seeds for clean A/B comparison.

- Metric registry (sim/src/metrics.rs): Metric trait + 7 per-cell
  metrics (convergence, settled_accuracy, jitter, reaction_time,
  bias, variance, ramp_target_overshoot) + DerivedMetric trait + 2
  cross-cell metrics (decoupling_score, reaction_asymmetry). Each
  metric owns its computation, applicability, tolerance, and
  Markdown rendering -- adding a new metric is one new impl.

- CI-aware regression (sim/src/regression.rs): Tolerance::WithinCi
  with Direction::{HigherIsBetter, LowerIsBetter, Either} checks the
  current value against the baseline's bootstrap CI envelope plus a
  per-metric absolute/multiplicative slack. Statistical noise is the
  floor; engineering tolerance is the deliberate budget on top.
  Baseline TOML carries <key>_ci_low / <key>_ci_high for every
  percentile, computed with a deterministic bootstrap.

- Trial recording (sim/src/trial.rs): dense per-tick TickRecord with
  optional delta/theta/H~ fields populated for Observable
  algorithms. Enables bias / variance / overshoot metrics.

- Scenario DSL (sim/src/baseline.rs): Phase::{Hold, Ramp, Stall}
  primitives compose into scenarios. Named scenarios kept for
  ergonomics; new ones add as Scenario::Custom { ... }.

- Investigation binaries:
  - compare-algorithms: 5x10 grid x N trials x every algorithm
  - sweep-ewma-tau: tau Pareto sweep with ramp-overshoot tables
  - trace-trial: single-trial tick-by-tick inspection, with
    --scan-overshoot N to surface worst-case seeds

## Empirical results

docs/FINDINGS.md records the cross-algorithm characterization.
Headline:

- VardiffState's variance-vs-detection paradox: reaction rate to a
  -50% step is 70% at SPM=6 but only 9.8% at SPM=120. The share-
  rate-blind threshold ladder is fundamentally miscalibrated at
  high SPM.

- The SPM=6 cascade: at cold-start, a single Poisson(5.2)->15
  outlier at tick 11 (after the Phase 1 ramp lands current_h near
  truth) produces a 187% target overshoot under VardiffState /
  Parametric. No single-axis change (stricter z, PartialRetarget
  alone, EWMA alone) closes both the Phase-1 cascade AND keeps
  settled accuracy in check -- only the three-axis composition
  FullRemedy does.

- FullRemedy validation: full 5x10x1000 grid shows it dominates
  every other algorithm on convergence rate, convergence speed,
  reaction rate, reaction sensitivity, ramp overshoot p99, and
  decoupling score -- at every share rate. Two well-bounded
  trade-offs: ~2.7 stable-load fires/hour at SPM=6 (active
  tracking) and a mild negative cold-start bias (EWMA lag during
  ramp).

## Production fix

channels_sv2/src/target.rs: U256 precision in hash_rate_from_target.
The intermediate '60 / share_per_min * 100' truncated to integer
before the U256 scale, losing ~5 digits of precision at SPM=30 (49%
inflation in the round-trip estimate at low hashrate). Fix: scale
factor 100 -> 100_000 at target.rs:184 and the matching
'from(100) -> from(100_000)' at line 197. Regression-tested by
hash_rate_round_trip_is_precise_after_u256_fix in the sim crate.

## Documentation

- docs/DESIGN.md: architectural reference (the four axes,
  alternative decompositions considered, trait surface, algorithm
  registry, composition argument, production migration plan).
- docs/FINDINGS.md: characterization results (cross-algorithm
  Pareto, SPM=6 cascade mechanism, axis-isolation experiments,
  FullRemedy validation, EWMA tau Pareto, asymmetric step
  response).
- README.md: entry point with the algorithm registry and running
  instructions.

## Migration path

Production Vardiff trait unchanged. Migrating to FullRemedy is:
1. Promote composed/ types from sim into channels_sv2::vardiff.
2. Add a VardiffState::production_default() returning the
   FullRemedy composition.
3. Update production tests that depended on Classic's fire-for-fire
   trajectory.
4. Wire 'cargo test --release -- --ignored' into CI.

See docs/DESIGN.md sect "Production migration" for the full plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lumns + CI workflow

Extends the framework with operational polish — readability,
statistical context, and CI gating — on top of the four-axis
decomposition shipped in the prior commit.

## Pareto report (compare-algorithms now emits pareto.md)

Single side-by-side cross-algorithm Markdown report. One section per
summary spec, rows = SPMs, columns = algorithms in canonical order
(VardiffState → ClassicComposed → Parametric → ParametricStrict →
ClassicPartialRetarget → EWMA → SlidingWindow → FullRemedy). Winner
bolded per row with a 5%-relative tie window. The reviewer can see
which algorithm wins on each metric × share rate without diff-ing 8
files.

## fundamental_limit floors in TL;DR

The per-algorithm baseline_*.md Summary section now shows the
Poisson-noise lower bound alongside observed values where applicable:

  | settled accuracy p50 (stable) | down | 0.0% @ SPM=120 (floor: 1.9%) | 9.3% @ SPM=6 (floor: 8.7%) |

Floors come from each metric's `fundamental_limit(cell, key)`
override (SettledAccuracy, Jitter, RampTargetOvershoot).
Order-of-magnitude reference for whether an algorithm is near-optimal
or has structural room for improvement.

## CI bounds on rate metrics + derived-metric propagation

ConvergenceTime and ReactionTime now record bootstrap-style 95% CI
bounds on the rate (normal-approximation Wilson interval on the
proportion). DecouplingScore and ReactionAsymmetry propagate these
through their derived computations via worst-case substitution:

- decoupling_score CI: `r_lo × (1 − j_hi/J_max)` … `r_hi × (1 − j_lo/J_max)`
- asymmetry_at_X CI: `r(+X)_lo − r(−X)_hi` … `r(+X)_hi − r(−X)_lo`

The comparator already understands CI bounds (Wave 2's
Tolerance::WithinCi), so derived-metric regression checks are now
statistically aware end-to-end.

## reaction_asymmetry: compress to single TL;DR row

Added a `max_abs_asymmetry` aggregate (largest |asymmetry| across
δ ∈ {5, 10, 25, 50} per share rate). One headline row in the TL;DR
replaces what would otherwise be 4 separate rows. Per-magnitude data
still emitted to TOML for deep-dive analysis.

## MD column normalization

Every percentile-emitting metric now uniformly emits
p10 / p50 / p90 / p99 / mean in both TOML and MD:

  - Convergence: added mean
  - SettledAccuracy: added mean
  - Jitter: added p10
  - ReactionTime: added mean
  - Bias / Variance: added p99, reordered to p10/p50/p90/p99/mean
  - RampTargetOvershoot: added p10 + mean

Consistent column shape across the report; reviewers stop
recalibrating column structure per section.

## CI workflow

New .github/workflows/vardiff-sim.yaml runs on PRs touching the sim
crate (path filter so unrelated PRs don't pay the cost):

  - test-fast: cargo test --lib (~1s)
  - test-regression: cargo test --release --lib -- --ignored
    (~15-20s, asserts current algorithm against the checked-in
    baseline_VardiffState.toml with CI-aware tolerance budgets)
  - build-binaries: cargo build --release --bins

## Documentation

- docs/DESIGN.md: cross-links to pareto.md as the auto-generated
  cross-algorithm comparison reference.
- docs/FINDINGS.md: section 1 references pareto.md as the
  authoritative comparison data rather than carrying redundant
  tables.

## Baselines regenerated

All baseline_<Algorithm>.{md,toml} regenerated with the new keys
(rate CIs, max_abs_asymmetry, p10/mean on RampTargetOvershoot,
mean on Convergence/Settled/Reaction). pareto.md also regenerated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Moves the four-axis algorithm decomposition (Estimator, Statistic,
Boundary, UpdateRule + Composed adapter + DecisionRecord) from
`vardiff_sim::composed` to `channels_sv2::vardiff::composed`. The
production crate now ships the building blocks directly; the sim
crate becomes a thin characterization layer over them.

Additive change to channels_sv2's public API; no breaking change to
existing consumers. The `Vardiff` trait, `Clock` injection, and
`VardiffState` API are unchanged.

## What moves where

- `channels_sv2::vardiff::composed::{Estimator, Statistic, Boundary,
  UpdateRule}` -- the four extension traits
- `channels_sv2::vardiff::composed::Composed<E, S, B, U>` -- the
  adapter that carries `impl Vardiff` for any four-axis composition
- `channels_sv2::vardiff::composed::{StepFunction, PoissonCI,
  CumulativeCounter, EwmaEstimator, SlidingWindowEstimator,
  AbsoluteRatio, FullRetargetWithClamp, FullRetargetNoClamp,
  PartialRetarget}` -- the shipped concrete impls
- `channels_sv2::vardiff::composed::DecisionRecord` -- the per-tick
  introspection struct (`pub` so sim can extend with `Observable`)
- `channels_sv2::vardiff::composed::{classic_composed,
  ClassicComposed}` -- the specific composition asserted fire-for-fire
  equivalent to `VardiffState`

## New production factory

`VardiffState::production_default(min_hashrate, clock) -> Box<dyn Vardiff>`
returns the recommended `FullRemedy` composition:

  EwmaEstimator(120s) + AbsoluteRatio + PoissonCI(z=2.576, margin=0.05)
  + PartialRetarget(eta=0.3)

Empirically dominates the classic `VardiffState` on every
operationally meaningful metric across the canonical 5 x 10 grid.
See `sim/docs/FINDINGS.md` sect 1 + 4 for the validation case and
`sim/docs/DESIGN.md` for the architectural rationale.

Trade-offs (`FINDINGS.md` sect 5): ~2.7 stable-load fires/hour at
SPM=6 (active tracking, not flicker) and a mild negative cold-start
bias (EWMA lag during ramp -- harmless or beneficial since it
accelerates share arrival). Both well-bounded.

## Sim crate becomes a re-export shim

- `sim/src/composed/` directory replaced by a flat `sim/src/composed.rs`
  that re-exports `channels_sv2::vardiff::composed::*` plus the
  sim-only `impl Observable for Composed<E, S, B, U>` extension
  (orphan-rule compatible: `Observable` is local to sim) and the
  fire-for-fire equivalence-test suite (which still lives in sim
  because it depends on the sim trial driver).
- `sim/src/trial.rs` re-exports `DecisionRecord` from production
  rather than defining it locally.

All existing sim-internal callers (`use crate::composed::*`) keep
working unchanged.

## Migration path for production consumers

Existing call sites holding `Box<dyn Vardiff>` or `impl Vardiff`
need no source changes. To opt into the new algorithm:

  // Before
  let v = VardiffState::new_with_clock(min_h, clock)?;

  // After (recommended)
  let v = VardiffState::production_default(min_h, clock);

Both produce a valid `Vardiff` implementation; the latter is the
behaviorally improved composition.

## Baselines regenerated

All `baseline_<Algorithm>.{md,toml}` and `pareto.md` regenerated
against the post-migration code paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Polish-only changes after the production migration:
- vardiff-sim.yaml: rustup override per job so the sim crate compiles
  on stable (workspace root pins 1.75; sim needs newer).
- clippy: address type_complexity (MetricEntry alias), doc_lazy_continuation
  in grid.rs, unnecessary_get_then_check in metrics.rs, and silence
  unnecessary_map_or in baseline.rs (Option::is_none_or requires 1.82+).
- rustfmt across the touched sim and production composed files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… sweep binaries

- Add AlgorithmSpec::full_remedy_with(tau, eta, z) parameterized constructor.
- Add bin/sweep-eta, bin/sweep-z, bin/sweep-eta-z for per-axis and joint
  Pareto characterization.
- Retune FullRemedy default from η=0.3 to η=0.2 based on joint (η, z) sweep:
  η=0.2 captures the overshoot-tail reduction (ramp p99 at SPM=6: 31% → 12%)
  and decoupling gains (0.79 → 0.87 at SPM=6) while preserving cold-start
  convergence at ≥99% across every SPM. Smaller η would break convergence
  catastrophically (η=0.1 produces 48% convergence at SPM=120).
- Regenerate all per-algorithm baselines, pareto.md, z_sweep.md at the new
  default.
- Update FINDINGS.md §§ 1, 3, 4, 5 with the new numbers and characterization
  narrative. DESIGN.md, production crate docs, and sweep-binary docstrings
  updated for consistency.
…te::new

Make FullRemedy the recommended production vardiff via free factory
functions in vardiff/mod.rs (default, default_with_min, default_with_clock).
Deprecate the implicit-pick VardiffState constructors (new, new_with_min)
with a migration note pointing at the new factories; new_with_clock stays
non-deprecated as the explicit opt-in to the classic threshold-ladder
algorithm for simulation, characterization, and testing.

This gives downstream consumers a clear, semver-safe migration path:
existing code keeps compiling with a deprecation warning that links to
the new factory; new code is steered to FullRemedy by default.
The *100 -> *100_000 scaling fix in hash_rate_from_target lowered the
safe target ceiling from ~2^246.4 to ~2^236.4, because the intermediate
product `(t+1) * shares_occurrency_frequence` is computed in U256.
Routine vardiff-driven targets at low realized share rates push above
the new boundary, and broadcast SetTarget messages carrying a channel's
`requested_max_target` (~2^253) trip it every time.

In production, this surfaced on a slot running translator_sv2 with
vardiff disabled: every upstream SetTarget logs

  WARN: Failed to derive hashrate from SetTarget target:
        ArithmeticOverflow (channel_id=4294967295)

and the translator's SV1 hashrate gauge stops updating.

The fix widens the multiply step to U512 so the intermediate product
fits regardless of target magnitude. The numerator stays in U256 (it is
2^256 - t, which always fits), and the final result narrows back to
u128 via low_u128() exactly as before. The precision improvement from
the *100_000 scaling is preserved.

Two regression tests:

- target.rs: pin three real `maximum_target` values captured from the
  affected slot's translator log, including the channel's
  `requested_max_target` (0x1745d174_5d1745d1...). All previously
  errored ArithmeticOverflow; they now return finite hashrates.

- vardiff/test/mod.rs: update the
  test_try_vardiff_with_less_spm_than_expected_classic expected values
  for the 240s and 300s checkpoints. Upstream's 74.2 / 62.327995
  values came from the `try_vardiff` Err-fallback path
  (`hashrate * realized_spm / shares_per_minute`) which only ran
  because hash_rate_from_target overflowed at those high targets.
  With overflow eliminated, the main path returns the integer-
  truncated 74.0 / 62.0 from low_u128().
Eric Price and others added 3 commits May 21, 2026 10:12
…metrics

Complete redesign of the vardiff framework and a new production algorithm
that dominates the legacy implementation on every operational metric.

## New Algorithm: AdaCUSUM

VardiffState::new() now returns the AdaCUSUM composition internally:
  EwmaEstimator(120s) + AdaptiveCusumBoundary(s=1.5, f=0.05) + PartialRetarget(0.5)

Zero downstream API changes — sv2-apps continues calling VardiffState::new().

Performance vs legacy VardiffState at SPM=12:
  - Convergence: 6 min vs 10 min (40% faster)
  - React to -10% decline: 62% vs 14% (4.4x better)
  - React to -50% decline: 99% vs 55% (near-perfect)
  - Jitter: 0.107/min vs 0.018/min (acceptable trade-off)
  - Overshoot: 26% vs 87% (3.3x less)

## Architecture: Three-Stage Pipeline

  Estimator → Boundary → UpdateRule

- Statistic trait removed (deviation is inline arithmetic)
- Composed<E, S, B, U> → Composed<E, B, U>
- Estimator::reset() → on_fire(new_hashrate, old_hashrate)
- EstimatorSnapshot gains uncertainty: Option<Uncertainty>
- Boundary receives &EstimatorSnapshot (uncertainty-aware)
- UpdateRule receives threshold (margin-aware)
- EwmaEstimator::on_fire rescales instead of zeroing

## operational_fitness Metric

  fitness = 0.25 × reaction_rate(-10%)
          + 0.20 × reaction_rate(-50%)
          + 0.20 × clamp(1 - jitter/0.30, 0, 1)
          + 0.25 × convergence_rate × clamp(1 - conv_p50/600s, 0, 1)
          + 0.10 × clamp(1 - overshoot_p99, 0, 1)

## Components Explored

New estimators: BayesianEstimator, KalmanEstimator
New boundaries: CredibleIntervalBoundary, CusumBoundary, AdaptiveCusumBoundary
New update rules: AdaptivePartialRetarget
New metrics: FireDecisiveness, StepCorrection, OperationalFitness

All 104 channels_sv2 + 141 sim tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds AsymmetricCusumBoundary that uses different thresholds based on
whether firing would tighten or ease difficulty:

- Easing (miner slowing): uses base threshold (fire quickly, free action)
- Tightening (miner speeding): uses base × tighten_multiplier (cautious,
  because tightening rejects in-flight shares)

Rationale: SetTarget that makes difficulty harder invalidates shares
already being computed by miners. SetTarget that makes difficulty easier
has zero cost — old harder work is still valid under the new easier target.

Results with tighten_multiplier=3.0 (AsymCUSUM-t30):
  Mean fitness: 0.751 (vs symmetric AdaCUSUM 0.676, FullRemedy 0.565)
  Jitter: 0.045/min (vs 0.175 symmetric, 58% reduction)
  Convergence: 4 min (vs 6 min symmetric, 33% faster)
  Overshoot: 16.6% (vs 26.4% symmetric, 37% less)
  Detection -50%: 99.0% (unchanged)
  Detection -10%: 44.3% (vs 61.9% — the cost of asymmetry)

The jitter reduction directly translates to fewer share rejections in
production: ~3 costly tighten-fires per hour instead of ~6.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Ran cargo fmt to fix line-length violations introduced during rebase.
Regenerated baseline_VardiffState.toml to reflect the current
AsymmetricCusumBoundary behavior (directional cost awareness produces
expected negative asymmetry values at higher SPM rates).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@gimballock gimballock force-pushed the vardiff/simulation-framework branch from 63a19d0 to a18c3a3 Compare May 21, 2026 14:28
@gimballock
Copy link
Copy Markdown
Author

gimballock commented May 21, 2026

Thanks for these insights @adammwest — especially the point about fitness decomposition and normalization. A lot of what you're describing matches the evolution I've gone through on this PR, so let me give a timeline of
how the approach has matured:

Phase 1: Basic metrics + simulation harness

Initially I focused on three metrics I thought were important: convergence time, jitter, and accuracy. These were evaluated via a time-compressed simulation that replays a synthetic share stream through the vardiff
algorithm. This gave us reproducible, large-scale trials (50 cells × 1000 trials) against correlated attributes like target shares-per-minute.

Phase 2: Decomposed pipeline model

I wanted to make algorithm search more systematic, so I decomposed "a vardiff algorithm" into four independent, replaceable components: estimator, statistic, boundary, and decision rule. The idea was to mix-and-match
implementations at each slot for the best composite.

This model worked well for the classic algorithm, the parametric variant, and the EWMA approach. But when I tried to embed a Bayesian model, it broke down — the components aren't truly independent. There's a sequential
data flow: the estimator needs to communicate its belief to the boundary ("should we respond?") and to the update rule. Additionally, since vardiff triggers on a timer rather than on share arrival, the decision rule needs
to call back to the estimator to update state when adjustments occur. I also dropped the "statistic" component as it wasn't pulling its weight.

The resulting three-stage pipeline (Estimator → Boundary → UpdateRule) is what's in this PR. It successfully hosts the classic algorithm, EWMA, AdaCUSUM, and could host a Bayesian approach.

Phase 3: Aggregate fitness metric

To your point about "how you combine all metrics into a final value" — we now have a configurable aggregate metric that allows weighting across the underlying measurements. This addresses exactly the gaming concern you
raised: rather than optimizing one metric at the expense of others, we can define a weighted composite that represents our desired tradeoff. The regression baseline locks in the full vector of metrics so we catch
regressions in any dimension, not just the aggregate.

Your suggestion to separate fitness into improvement vs. regression categories per scenario group (stable, coldstart, reaction) is a good one. Currently the regression test does compare per-cell, so a coldstart regression
can't hide behind a stable-state improvement, but making this more explicit in the scoring would help.

Phase 4: Realistic operating conditions

After discussions with hardware engineers, I retuned the test scenarios to realistic share rates (2–30 spm instead of the earlier 6–120 range). The engineers confirmed that responding to partial hardware failures and
network slowdowns on an established channel is valuable functionality — even though in practice many hashrate changes currently cause miner reconnections (which resets vardiff anyway). We've been doing live testing with
physical miners on testnet4 and confirmed this pattern: when vardiff ramps difficulty too aggressively, it can interact with firmware timeout behaviors in ways that force reconnections, making reactivity testing harder
than expected.

Current direction

I've backed off from prioritizing convergence speed after seeing overcorrection in practice. The current focus is on:

  • Stability under steady-state (minimal oscillation once converged)
  • Reasonable reactivity (detect genuine changes within 2–3 retarget windows, not 1)
  • Asymmetric cost awareness — difficulty increases are more disruptive than decreases. An overshoot upward causes difficulty-too-low share rejections (wasted miner work), while an undershoot downward just means slightly
    more shares than optimal (cheap). The AsymmetricCusumBoundary encodes this: it requires stronger evidence before raising difficulty than lowering it. We can now actually measure the impact via the share-rejection metrics
    (shares_rejected_total{reason="difficulty-too-low"}) that were recently added to the pool's monitoring (sv2-apps PR Docs: Channel Factory #491).

On your point about normalization: agreed, and the per-metric tolerance budgets in the regression test (absolute slack + optional multiplicative slack) are our current mechanism for this. Open to suggestions on better
normalization approaches

Eric Price and others added 2 commits May 21, 2026 14:30
…magnitude metric

Update OperationalFitness weights to penalize algorithms that raise
difficulty as aggressively as they lower it:

  fitness = 0.25 × reaction(-10%) + 0.20 × reaction(-50%)
          + 0.20 × jitter + 0.15 × convergence
          + 0.10 × asymmetry_preference + 0.10 × overshoot

The asymmetry term rewards algorithms where reaction_rate(Step -10%)
exceeds reaction_rate(Step +10%) — i.e. faster detection of hashrate
drops than spikes. This encodes the operational insight that upward
difficulty adjustments are more disruptive (causing difficulty-too-low
share rejections) than downward ones.

New metric: upward_step_magnitude — tracks the ratio (new/old) of all
upward difficulty adjustments during steady state. The p95 value serves
as a deterministic proxy for difficulty-too-low rejection risk without
depending on stochastic share-rejection simulation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rity

Restructure OperationalFitness to reflect operational reality:
"never surprise the miner" matters more than "detect changes fast."

New weights (harm-avoidance 60%, reactivity 25%, convergence 15%):
  0.15 × reaction(-10%) + 0.10 × reaction(-50%)
  0.25 × jitter_control + 0.25 × step_magnitude_safety
  0.15 × convergence + 0.10 × overshoot_safety

The step_magnitude_safety term directly penalizes large upward
difficulty jumps: p95 step of 1.0× scores 1.0, 1.5× scores 0.0.

Result: FullRemedy now wins at 6-12 spm (realistic miner rates)
where its cautious approach correctly avoids the timeout death
spiral observed with physical miners. VardiffState still wins at
15-30 spm where aggressive reactivity is less harmful.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants