[Draft] feat(vardiff): add in-process simulation framework + baseline regression tests #2154
gimballock wants to merge 15 commits into
Conversation
11b2560 to
88d8d1d
Compare
The code is cheap and only meant to demonstrate the feasibility, but the concept ack revolves around these points imo:
| share/min | rate | p10 | p50 | p90 | p99 |
| --- | --- | --- | --- | --- | --- |
| 6 | 83.3% | 10m | 12m | 21m | 25m |
| 12 | 95.4% | 10m | 10m | 20m | 25m |
| 30 | 99.5% | 10m | 10m | 15m | 25m |
| 60 | 100.0% | 10m | 10m | 10m | 20m |
| 120 | 100.0% | 10m | 10m | 10m | 15m |
The first row shows the results of the convergence-time test for the default case (6 spm).
Convergence times range from 10 to 25 minutes, and trials fail to converge at all (within quiet_window_secs of simulated time) 17% of the time!
The remaining rows show the results for faster share rates. The most extreme times shrink (p99 drops from 25m to 15m), and the total-failure cases largely disappear at around 30 spm.
## Settled accuracy (stable load, post-convergence)

`|final_hashrate / true_hashrate - 1|` at trial end. Smaller is better.

| share/min | p10 | p50 | p90 | p99 |
| --- | --- | --- | --- | --- |
| 6 | 0.0% | 4.9% | 23.6% | 70.3% |
| 12 | 0.0% | 0.0% | 12.3% | 26.9% |
| 30 | 0.0% | 0.0% | 0.8% | 15.6% |
| 60 | 0.0% | 0.0% | 0.0% | 3.1% |
| 120 | 0.0% | 0.0% | 0.0% | 0.0% |

## Steady-state jitter (fires per minute)

Post-convergence rate of vardiff fires. Smaller is better; ideal is zero under stable load.

| share/min | p50 | p90 | p99 | mean |
| --- | --- | --- | --- | --- |
| 6 | 0.000 | 0.200 | 0.385 | 0.059 |
| 12 | 0.000 | 0.077 | 0.217 | 0.019 |
| 30 | 0.000 | 0.000 | 0.067 | 0.002 |
| 60 | 0.000 | 0.000 | 0.000 | 0.000 |
| 120 | 0.000 | 0.000 | 0.000 | 0.000 |
These two metrics (proximity to the true hashrate and post-convergence adjustments) show a similar trend: there is a lot of undesired behavior in the extreme cases (top 10%, top 1%) of the 6 shares/min case, and it is alleviated at higher share rates.
## Reaction time to a 50% drop (step at 15 min)

| share/min | reacted | p10 | p50 | p90 | p99 |
| --- | --- | --- | --- | --- | --- |
| 6 | 69.7% | 1m | 3m | 5m | 5m |
| 12 | 54.8% | 1m | 3m | 5m | 5m |
| 30 | 32.6% | 2m | 4m | 5m | 5m |
| 60 | 16.3% | 3m | 5m | 5m | 5m |
| 120 | 8.6% | 4m | 5m | 5m | 5m |

## Reaction sensitivity (P[fire within 5 min of step change])

| Δ% | 6 | 12 | 30 | 60 | 120 |
| --- | --- | --- | --- | --- | --- |
| -50% | 0.70 | 0.55 | 0.33 | 0.16 | 0.09 |
| -25% | 0.44 | 0.23 | 0.08 | 0.00 | 0.00 |
| -10% | 0.39 | 0.15 | 0.02 | 0.00 | 0.00 |
| -5% | 0.40 | 0.15 | 0.02 | 0.00 | 0.00 |
| +5% | 0.39 | 0.13 | 0.02 | 0.00 | 0.00 |
| +10% | 0.42 | 0.17 | 0.03 | 0.00 | 0.00 |
| +25% | 0.48 | 0.23 | 0.07 | 0.01 | 0.00 |
| +50% | 0.64 | 0.47 | 0.32 | 0.22 | 0.29 |
These tables show how long vardiff takes to respond to an unexpected change in hashrate, where the change is a proportional increase or decrease of anywhere from 5% to 50%.
The first table looks specifically at a 50% drop. Its first row shows that at 6 spm vardiff fails to adjust within 5 minutes a full 30% of the time. The following rows show that the situation worsens at higher share rates: at 120 spm, 91% of the trials failed to adjust within 5 minutes.
The second table shows that the effect is basically the same for hashrate changes in the opposite direction, and that changes of lesser magnitude are picked up within 5 minutes much less often.
//! - **Convergence rate**: `current >= baseline - 0.01`
//! - **Convergence p90**: `current <= baseline * 1.10`
//! - **Settled accuracy p50 / p90**: `current <= baseline * 1.15`
//! - **Jitter p50**: `current <= baseline + 0.02` (absolute; baseline can be near zero)
//! - **Jitter p95**: `current <= baseline * 1.25`
//! - **Reaction rate**: `current >= baseline - 0.02`
//! - **Reaction p50**: `current <= baseline * 1.20`
//! - **Sensitivity at large |Δ| (|Δ| >= 50%)**: `current >= baseline - 0.02`
//! - **Sensitivity at small |Δ| (|Δ| <= 5%)**: `current <= baseline + 0.05`
Convergence rate: the fraction of trials that converge may fall at most 1 percentage point below the baseline.
Convergence p90: the p90 convergence time (the slowest 10% of trials) may be at most 10% above the baseline's.
Settled accuracy: the p50 and p90 errors may each be at most 15% above the baseline's.
Jitter p50/p95: p50 may exceed the baseline by at most 0.02 (absolute), and p95 by at most 25% (relative).
...etc.
You see the pattern: there are lots of magic thresholds in this portion of the code. They are arbitrarily chosen at this point and fair game for analysis.
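The thresholds listed above fall into three comparison shapes: an absolute floor (`current >= baseline - slack`), a multiplicative ceiling (`current <= baseline * factor`), and an absolute ceiling (`current <= baseline + slack`). The following is a minimal sketch of that idea, not the code in this PR; the `Tolerance` enum and its `passes` method are hypothetical names introduced only for illustration.

```rust
/// How a current measurement is compared with its recorded baseline.
/// Hypothetical type for illustration; not the sim crate's actual API.
#[derive(Clone, Copy, Debug)]
pub enum Tolerance {
    /// Higher is better: pass if `current >= baseline - slack`.
    AtLeastMinus(f64),
    /// Lower is better: pass if `current <= baseline * factor`.
    AtMostTimes(f64),
    /// Lower is better: pass if `current <= baseline + slack`.
    AtMostPlus(f64),
}

impl Tolerance {
    /// Returns true when `current` is within tolerance of `baseline`.
    pub fn passes(self, current: f64, baseline: f64) -> bool {
        match self {
            Tolerance::AtLeastMinus(slack) => current >= baseline - slack,
            Tolerance::AtMostTimes(factor) => current <= baseline * factor,
            Tolerance::AtMostPlus(slack) => current <= baseline + slack,
        }
    }
}
```

The absolute ceiling matters when the baseline is zero or close to it: a multiplicative check against a baseline of 0.0 rejects any positive value, which is why the jitter p50 rule uses `+ 0.02` rather than a factor.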
5cbed7c to
85d6f8b
Compare
after some optimization I got
I'm so excited to see other people nerding out on vardiff with me! Thank you! A couple of things I noticed in your results: the 2m convergence time is impressive, but your response to a 50% hashrate drop only succeeds in readjusting 4.4% of the time. I'm not sure how best to balance those two metrics, but probably not one at the expense of the other.
2d10f57 to
414afbb
Compare
211bc98 to
2a88fde
Compare
Some learnings I had, @gimballock
Two API additions to enable mockable time and bulk share-count
operations in the Vardiff trait, prerequisites for the in-process
simulation framework added in subsequent commits.
Clock injection:
- New vardiff/clock.rs with Clock trait, SystemClock, and MockClock.
- VardiffState gains an Arc<dyn Clock> field and a new_with_clock
constructor. reset_counter and try_vardiff read time via the clock
rather than calling SystemTime::now() directly.
- Existing constructors (new, new_with_min) default to SystemClock;
production behavior is unchanged.
Bulk share addition:
- Vardiff trait gains add_shares(n: u32) with a default
implementation calling increment_shares_since_last_update n times.
- VardiffState overrides with a single saturating add. Required for
simulation performance: the harness can bulk-add millions of shares
per tick during cold-start scenarios where the default's loop would
dominate trial runtime.
VardiffError::TimeError is now unreachable but retained with a doc
comment marking it for removal at the next major version bump;
removing it now would break downstream exhaustive matches.
Semver note: channels_sv2 should bump from 5.0.0 to 5.1.0 to surface
the new add_shares method to downstream consumers, but the project's
pinned Rust 1.75 toolchain cannot write the v4 Cargo.lock format that
a version change requires. TODO comment in Cargo.toml flags the
deferred bump.
Tests: 17 vardiff tests pass (12 existing unchanged, 3 new
clock-module unit tests, 2 new tests verifying clock injection
propagates through VardiffState).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
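To make the two additions in this commit message concrete, here is a minimal sketch of clock injection and a bulk `add_shares` override. It reuses the names the message mentions (`Clock`, `SystemClock`, `MockClock`, `add_shares`, `increment_shares_since_last_update`, `new_with_clock`), but every signature is an assumption, and `ShareCounter`, `Counter`, `now_secs`, `advance`, and `elapsed_secs` are hypothetical stand-ins rather than the crate's real API.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::time::{SystemTime, UNIX_EPOCH};

/// Source of "now" in whole seconds since the Unix epoch (assumed shape).
pub trait Clock: Send + Sync {
    fn now_secs(&self) -> u64;
}

/// Production clock: reads the wall clock.
pub struct SystemClock;
impl Clock for SystemClock {
    fn now_secs(&self) -> u64 {
        SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.as_secs())
            .unwrap_or(0)
    }
}

/// Test clock: time only moves when the test advances it.
#[derive(Default)]
pub struct MockClock(AtomicU64);
impl MockClock {
    pub fn advance(&self, secs: u64) {
        self.0.fetch_add(secs, Ordering::SeqCst);
    }
}
impl Clock for MockClock {
    fn now_secs(&self) -> u64 {
        self.0.load(Ordering::SeqCst)
    }
}

/// Hypothetical stand-in for the share-counting part of the Vardiff trait.
pub trait ShareCounter {
    fn increment_shares_since_last_update(&mut self);
    /// Default implementation: n single increments.
    fn add_shares(&mut self, n: u32) {
        for _ in 0..n {
            self.increment_shares_since_last_update();
        }
    }
}

/// Hypothetical stand-in for VardiffState: a counter plus an injected clock.
pub struct Counter {
    shares: u32,
    clock: Arc<dyn Clock>,
    last_reset: u64,
}
impl Counter {
    pub fn new_with_clock(clock: Arc<dyn Clock>) -> Self {
        let last_reset = clock.now_secs();
        Self { shares: 0, clock, last_reset }
    }
    pub fn elapsed_secs(&self) -> u64 {
        self.clock.now_secs().saturating_sub(self.last_reset)
    }
    pub fn shares(&self) -> u32 {
        self.shares
    }
}
impl ShareCounter for Counter {
    fn increment_shares_since_last_update(&mut self) {
        self.shares = self.shares.saturating_add(1);
    }
    /// Override: one saturating add instead of a loop of n increments.
    fn add_shares(&mut self, n: u32) {
        self.shares = self.shares.saturating_add(n);
    }
}
```

With this shape a test can share one `Arc<MockClock>` with the counter, call `advance(60)` to simulate a tick, and add a whole tick's worth of shares in one call.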
…terization
New vardiff_sim crate at sv2/channels-sv2/sim/ providing
deterministic behavioral characterization of any Vardiff
implementation, plus a regression test that asserts the current
algorithm against a checked-in baseline.
Purpose: surface the operationally-important attributes of the
vardiff algorithm (convergence time, settled accuracy, steady-state
jitter, reaction time, reaction sensitivity) in concrete measurable
terms so that any future algorithmic improvement (parametric
thresholds, EWMA, SPRT, etc.) can be evaluated against a fixed
harness and produce a clean delta report.
Components:
- rng.rs: XorShift64 RNG plus exponential and Poisson samplers (Knuth
for λ<30, normal approximation for ≥30). Hand-rolled for
cross-version reproducibility without depending on the rand crate's
RNG-stability guarantees.
- schedule.rs: HashrateSchedule for parameterizing the miner's true
hashrate over time. Convenience constructors for stable,
step-change, and throttle scenarios.
- trial.rs: run_trial drives any Vardiff implementation through
duration_secs of simulated time. Per-tick Poisson sampling: at each
60s tick, samples (true_h / estimated_h) * shares_per_minute,
bulk-adds via Vardiff::add_shares, calls try_vardiff.
Rate-independent: handles λ from near-zero to millions.
- metrics.rs: Distribution helper (sorted f64s, p10-p99 percentiles,
mean, count) plus the five metric functions. Where a metric can fail
(non-converging trials, missing reactions) the rate is reported
alongside the distribution.
- baseline.rs: Scenario / Cell / CellResult types and run_baseline
generic over Vardiff. Default grid is 5 share rates × 10 scenarios =
50 cells. Hand-written TOML and Markdown serialization (avoiding
serde + toml dependencies to keep the lockfile minimal).
- bin/generate-baseline.rs: CLI entry point. Configurable via
VARDIFF_BASELINE_TRIALS, VARDIFF_BASELINE_SEED,
VARDIFF_BASELINE_OUT_DIR.
- regression.rs: baseline-parsing + per-metric tolerance assertions.
The classic_algorithm_no_regression test loads the committed
baseline via include_str! and asserts current measurements. Marked
#[ignore] because it runs the full ~5s baseline; CI should invoke
via cargo test --release --lib -- --ignored.
- README.md covering usage, output interpretation, baseline-update
workflow, and project-specific notes including the Cargo.lock
copy-from-parent rationale.
The crate is declared as its own Cargo workspace (its Cargo.toml has
a top-level [workspace] section) so its lockfile is independent of
the parent stratum workspace. Required because the parent's pinned
1.75 toolchain cannot write v4 lockfiles, and adding the sim crate as
a workspace member would force such a write. The committed Cargo.lock
is a copy of the parent's.
Tests: 53 fast unit tests + 1 #[ignore]-d slow regression test.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
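The rng.rs component described above pairs a hand-rolled XorShift64 generator with a Knuth-style Poisson sampler for small λ. The sketch below shows the textbook versions of both, as an illustration of the technique rather than the crate's code: the shift triple (13, 7, 17), the zero-seed fallback constant, and the method names are assumptions, and the normal-approximation branch for λ ≥ 30 is left out.

```rust
/// XorShift64 generator (Marsaglia's 13/7/17 shift triple, assumed here).
pub struct XorShift64(u64);

impl XorShift64 {
    pub fn new(seed: u64) -> Self {
        // The all-zero state is a fixed point, so substitute a nonzero constant.
        Self(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Uniform sample in [0, 1) built from the top 53 bits.
    pub fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Knuth's multiplication method: multiply uniforms until the running
/// product drops to exp(-lambda) or below; the number of factors used
/// before the last one is the Poisson sample. Cost grows with lambda,
/// so this is only suitable for small rates.
pub fn poisson_knuth(rng: &mut XorShift64, lambda: f64) -> u64 {
    let limit = (-lambda).exp();
    let mut product = 1.0;
    let mut count = 0u64;
    loop {
        product *= rng.next_f64();
        if product <= limit {
            return count;
        }
        count += 1;
    }
}
```

In the per-tick scheme the commit describes, λ for a 60s tick would be `(true_h / estimated_h) * shares_per_minute`, and the sampled count would be handed to `add_shares` in one call.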
…tate
- VARDIFF_SIMULATION_FRAMEWORK.md: design proposal documenting the
framework's five metrics, assertion policy, simulation mechanism,
and architectural rationale. Co-located with the crate it describes.
- vardiff_baseline.toml: machine-readable baseline measurements of
the classic VardiffState algorithm across the default 50-cell grid
(5 share rates × 10 scenarios, 1000 trials each, base seed
0xDEADBEEFCAFEF00D). Consumed by the regression test in the sim
crate.
- vardiff_baseline.md: human-readable summary of the same data,
organized by metric type for PR review.
Notable findings surfaced by the baseline:
- Convergence: solid across rates (100% at 30+ spm, 95% at 12 spm,
83% at 6 spm). p50 is ~10 minutes everywhere, dominated by the
Phase 1 ×3/min ramp clamp.
- Settled accuracy: follows 1/sqrt(N) cleanly. p99 error is 70% at 6
spm, 27% at 12, 15% at 30, 3% at 60, 0% at 120. Low-rate operation
is statistically threadbare.
- Steady-state jitter: small everywhere and ~0 above 30 spm. The
algorithm's growing delta_time post-convergence narrows the
effective noise band as 1/sqrt(N), producing accidental
self-stabilization at high rates.
- Reaction sensitivity DEGRADES with share rate; counterintuitive but
mechanistic. The same property that produces low jitter at high
rates (growing delta_time after a Phase 1 fire) produces sluggish
response to step changes (post-step shares diluted by long pre-step
history). At 60+ spm only 9-16% of trials react to a 50% drop
within 5 minutes.
This baseline is the reference point for evaluating any future
algorithmic proposal. The regression test in the sim crate asserts
each metric is within tolerance of these recorded values.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reframes the sim crate around a four-axis algorithm decomposition
(Estimator / Statistic / Boundary / UpdateRule) and produces a
production-ready algorithm recommendation via systematic
characterization of the design space.
## What the framework now provides
- Four-axis decomposition (sim/src/composed/): four orthogonal
extension traits + Composed<E, S, B, U> adapter that carries a
blanket impl Vardiff, so any composition is a drop-in production
algorithm. The classic algorithm composed as
Composed<CumulativeCounter, AbsoluteRatio, StepFunction(classic),
FullRetargetWithClamp(classic)> is asserted fire-for-fire
equivalent to channels_sv2::VardiffState.
- Algorithm registry (AlgorithmSpec factories in sim/src/grid.rs):
VardiffState, ClassicComposed, Parametric (PoissonCI boundary),
ParametricStrict (z=3.0), EWMA-60s / EWMA(tau), SlidingWindow,
ClassicPartialRetarget, and FullRemedy (EWMA-120 + PoissonCI +
PartialRetarget(0.3)) -- the recommended production composition.
- Grid layer (sim/src/grid.rs): Cartesian product over (algorithm,
share_rate, scenario) with first-class algorithm axis. Grid::run
characterizes every algorithm in one binary; Grid::run_paired uses
algo-stripped seeds for clean A/B comparison.
- Metric registry (sim/src/metrics.rs): Metric trait + 7 per-cell
metrics (convergence, settled_accuracy, jitter, reaction_time,
bias, variance, ramp_target_overshoot) + DerivedMetric trait + 2
cross-cell metrics (decoupling_score, reaction_asymmetry). Each
metric owns its computation, applicability, tolerance, and
Markdown rendering -- adding a new metric is one new impl.
- CI-aware regression (sim/src/regression.rs): Tolerance::WithinCi
with Direction::{HigherIsBetter, LowerIsBetter, Either} checks the
current value against the baseline's bootstrap CI envelope plus a
per-metric absolute/multiplicative slack. Statistical noise is the
floor; engineering tolerance is the deliberate budget on top.
Baseline TOML carries <key>_ci_low / <key>_ci_high for every
percentile, computed with a deterministic bootstrap.
- Trial recording (sim/src/trial.rs): dense per-tick TickRecord with
optional delta/theta/H~ fields populated for Observable
algorithms. Enables bias / variance / overshoot metrics.
- Scenario DSL (sim/src/baseline.rs): Phase::{Hold, Ramp, Stall}
primitives compose into scenarios. Named scenarios kept for
ergonomics; new ones add as Scenario::Custom { ... }.
- Investigation binaries:
- compare-algorithms: 5x10 grid x N trials x every algorithm
- sweep-ewma-tau: tau Pareto sweep with ramp-overshoot tables
- trace-trial: single-trial tick-by-tick inspection, with
--scan-overshoot N to surface worst-case seeds
## Empirical results
docs/FINDINGS.md records the cross-algorithm characterization.
Headline:
- VardiffState's variance-vs-detection paradox: reaction rate to a
-50% step is 70% at SPM=6 but only 9.8% at SPM=120. The share-
rate-blind threshold ladder is fundamentally miscalibrated at
high SPM.
- The SPM=6 cascade: at cold-start, a single Poisson(5.2)->15
outlier at tick 11 (after the Phase 1 ramp lands current_h near
truth) produces a 187% target overshoot under VardiffState /
Parametric. No single-axis change (stricter z, PartialRetarget
alone, EWMA alone) closes both the Phase-1 cascade AND keeps
settled accuracy in check -- only the three-axis composition
FullRemedy does.
- FullRemedy validation: full 5x10x1000 grid shows it dominates
every other algorithm on convergence rate, convergence speed,
reaction rate, reaction sensitivity, ramp overshoot p99, and
decoupling score -- at every share rate. Two well-bounded
trade-offs: ~2.7 stable-load fires/hour at SPM=6 (active
tracking) and a mild negative cold-start bias (EWMA lag during
ramp).
## Production fix
channels_sv2/src/target.rs: U256 precision in hash_rate_from_target.
The intermediate '60 / share_per_min * 100' truncated to integer
before the U256 scale, losing ~5 digits of precision at SPM=30 (49%
inflation in the round-trip estimate at low hashrate). Fix: scale
factor 100 -> 100_000 at target.rs:184 and the matching
'from(100) -> from(100_000)' at line 197. Regression-tested by
hash_rate_round_trip_is_precise_after_u256_fix in the sim crate.
## Documentation
- docs/DESIGN.md: architectural reference (the four axes,
alternative decompositions considered, trait surface, algorithm
registry, composition argument, production migration plan).
- docs/FINDINGS.md: characterization results (cross-algorithm
Pareto, SPM=6 cascade mechanism, axis-isolation experiments,
FullRemedy validation, EWMA tau Pareto, asymmetric step
response).
- README.md: entry point with the algorithm registry and running
instructions.
## Migration path
Production Vardiff trait unchanged. Migrating to FullRemedy is:
1. Promote composed/ types from sim into channels_sv2::vardiff.
2. Add a VardiffState::production_default() returning the
FullRemedy composition.
3. Update production tests that depended on Classic's fire-for-fire
trajectory.
4. Wire 'cargo test --release -- --ignored' into CI.
See docs/DESIGN.md sect "Production migration" for the full plan.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lumns + CI workflow
Extends the framework with operational polish — readability,
statistical context, and CI gating — on top of the four-axis
decomposition shipped in the prior commit.
## Pareto report (compare-algorithms now emits pareto.md)
Single side-by-side cross-algorithm Markdown report. One section per
summary spec, rows = SPMs, columns = algorithms in canonical order
(VardiffState → ClassicComposed → Parametric → ParametricStrict →
ClassicPartialRetarget → EWMA → SlidingWindow → FullRemedy). Winner
bolded per row with a 5%-relative tie window. The reviewer can see
which algorithm wins on each metric × share rate without diff-ing 8
files.
## fundamental_limit floors in TL;DR
The per-algorithm baseline_*.md Summary section now shows the
Poisson-noise lower bound alongside observed values where applicable:
| settled accuracy p50 (stable) | down | 0.0% @ SPM=120 (floor: 1.9%) | 9.3% @ SPM=6 (floor: 8.7%) |
Floors come from each metric's `fundamental_limit(cell, key)`
override (SettledAccuracy, Jitter, RampTargetOvershoot).
Order-of-magnitude reference for whether an algorithm is near-optimal
or has structural room for improvement.
## CI bounds on rate metrics + derived-metric propagation
ConvergenceTime and ReactionTime now record bootstrap-style 95% CI
bounds on the rate (normal-approximation Wilson interval on the
proportion). DecouplingScore and ReactionAsymmetry propagate these
through their derived computations via worst-case substitution:
- decoupling_score CI: `r_lo × (1 − j_hi/J_max)` … `r_hi × (1 − j_lo/J_max)`
- asymmetry_at_X CI: `r(+X)_lo − r(−X)_hi` … `r(+X)_hi − r(−X)_lo`
The comparator already understands CI bounds (Wave 2's
Tolerance::WithinCi), so derived-metric regression checks are now
statistically aware end-to-end.
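The CI propagation described in this section can be sketched in a few lines. The Wilson score interval below is the standard formula; the two propagation helpers follow the worst-case substitution written out in the bullets above. The function names are hypothetical, and the point form `r × (1 − j/J_max)` for the decoupling score is inferred from the CI expression rather than quoted from the crate.

```rust
/// Wilson score interval for a binomial proportion
/// (`successes` out of `trials`, normal quantile `z`, e.g. 1.96 for 95%).
pub fn wilson_interval(successes: u64, trials: u64, z: f64) -> (f64, f64) {
    if trials == 0 {
        return (0.0, 1.0); // no data: the proportion is unconstrained
    }
    let n = trials as f64;
    let p = successes as f64 / n;
    let z2 = z * z;
    let denom = 1.0 + z2 / n;
    let center = (p + z2 / (2.0 * n)) / denom;
    let half = z * (p * (1.0 - p) / n + z2 / (4.0 * n * n)).sqrt() / denom;
    ((center - half).max(0.0), (center + half).min(1.0))
}

/// Worst-case bounds for a score of the assumed form r * (1 - j / j_max):
/// the low end pairs the lowest rate with the highest jitter, and vice versa.
pub fn decoupling_ci(r: (f64, f64), j: (f64, f64), j_max: f64) -> (f64, f64) {
    (r.0 * (1.0 - j.1 / j_max), r.1 * (1.0 - j.0 / j_max))
}

/// Worst-case bounds for asymmetry = r(+X) - r(-X).
pub fn asymmetry_ci(up: (f64, f64), down: (f64, f64)) -> (f64, f64) {
    (up.0 - down.1, up.1 - down.0)
}
```

Worst-case substitution is conservative: it ignores any correlation between the inputs, so the derived interval can only be wider than a jointly bootstrapped one, which suits a regression gate that should not fail on noise.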
## reaction_asymmetry: compress to single TL;DR row
Added a `max_abs_asymmetry` aggregate (largest |asymmetry| across
δ ∈ {5, 10, 25, 50} per share rate). One headline row in the TL;DR
replaces what would otherwise be 4 separate rows. Per-magnitude data
still emitted to TOML for deep-dive analysis.
## MD column normalization
Every percentile-emitting metric now uniformly emits
p10 / p50 / p90 / p99 / mean in both TOML and MD:
- Convergence: added mean
- SettledAccuracy: added mean
- Jitter: added p10
- ReactionTime: added mean
- Bias / Variance: added p99, reordered to p10/p50/p90/p99/mean
- RampTargetOvershoot: added p10 + mean
Consistent column shape across the report; reviewers stop
recalibrating column structure per section.
## CI workflow
New .github/workflows/vardiff-sim.yaml runs on PRs touching the sim
crate (path filter so unrelated PRs don't pay the cost):
- test-fast: cargo test --lib (~1s)
- test-regression: cargo test --release --lib -- --ignored
(~15-20s, asserts current algorithm against the checked-in
baseline_VardiffState.toml with CI-aware tolerance budgets)
- build-binaries: cargo build --release --bins
## Documentation
- docs/DESIGN.md: cross-links to pareto.md as the auto-generated
cross-algorithm comparison reference.
- docs/FINDINGS.md: section 1 references pareto.md as the
authoritative comparison data rather than carrying redundant
tables.
## Baselines regenerated
All baseline_<Algorithm>.{md,toml} regenerated with the new keys
(rate CIs, max_abs_asymmetry, p10/mean on RampTargetOvershoot,
mean on Convergence/Settled/Reaction). pareto.md also regenerated.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Moves the four-axis algorithm decomposition (Estimator, Statistic,
Boundary, UpdateRule + Composed adapter + DecisionRecord) from
`vardiff_sim::composed` to `channels_sv2::vardiff::composed`. The
production crate now ships the building blocks directly; the sim
crate becomes a thin characterization layer over them.
Additive change to channels_sv2's public API; no breaking change to
existing consumers. The `Vardiff` trait, `Clock` injection, and
`VardiffState` API are unchanged.
## What moves where
- `channels_sv2::vardiff::composed::{Estimator, Statistic, Boundary,
UpdateRule}` -- the four extension traits
- `channels_sv2::vardiff::composed::Composed<E, S, B, U>` -- the
adapter that carries `impl Vardiff` for any four-axis composition
- `channels_sv2::vardiff::composed::{StepFunction, PoissonCI,
CumulativeCounter, EwmaEstimator, SlidingWindowEstimator,
AbsoluteRatio, FullRetargetWithClamp, FullRetargetNoClamp,
PartialRetarget}` -- the shipped concrete impls
- `channels_sv2::vardiff::composed::DecisionRecord` -- the per-tick
introspection struct (`pub` so sim can extend with `Observable`)
- `channels_sv2::vardiff::composed::{classic_composed,
ClassicComposed}` -- the specific composition asserted fire-for-fire
equivalent to `VardiffState`
## New production factory
`VardiffState::production_default(min_hashrate, clock) -> Box<dyn Vardiff>`
returns the recommended `FullRemedy` composition:
EwmaEstimator(120s) + AbsoluteRatio + PoissonCI(z=2.576, margin=0.05)
+ PartialRetarget(eta=0.3)
Empirically dominates the classic `VardiffState` on every
operationally meaningful metric across the canonical 5 x 10 grid.
See `sim/docs/FINDINGS.md` sect 1 + 4 for the validation case and
`sim/docs/DESIGN.md` for the architectural rationale.
Trade-offs (`FINDINGS.md` sect 5): ~2.7 stable-load fires/hour at
SPM=6 (active tracking, not flicker) and a mild negative cold-start
bias (EWMA lag during ramp -- harmless or beneficial since it
accelerates share arrival). Both well-bounded.
## Sim crate becomes a re-export shim
- `sim/src/composed/` directory replaced by a flat `sim/src/composed.rs`
that re-exports `channels_sv2::vardiff::composed::*` plus the
sim-only `impl Observable for Composed<E, S, B, U>` extension
(orphan-rule compatible: `Observable` is local to sim) and the
fire-for-fire equivalence-test suite (which still lives in sim
because it depends on the sim trial driver).
- `sim/src/trial.rs` re-exports `DecisionRecord` from production
rather than defining it locally.
All existing sim-internal callers (`use crate::composed::*`) keep
working unchanged.
## Migration path for production consumers
Existing call sites holding `Box<dyn Vardiff>` or `impl Vardiff`
need no source changes. To opt into the new algorithm:
// Before
let v = VardiffState::new_with_clock(min_h, clock)?;
// After (recommended)
let v = VardiffState::production_default(min_h, clock);
Both produce a valid `Vardiff` implementation; the latter is the
behaviorally improved composition.
## Baselines regenerated
All `baseline_<Algorithm>.{md,toml}` and `pareto.md` regenerated
against the post-migration code paths.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Polish-only changes after the production migration:
- vardiff-sim.yaml: rustup override per job so the sim crate compiles
on stable (workspace root pins 1.75; sim needs newer).
- clippy: address type_complexity (MetricEntry alias),
doc_lazy_continuation in grid.rs, unnecessary_get_then_check in
metrics.rs, and silence unnecessary_map_or in baseline.rs
(Option::is_none_or requires 1.82+).
- rustfmt across the touched sim and production composed files.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… sweep binaries
- Add AlgorithmSpec::full_remedy_with(tau, eta, z) parameterized
constructor.
- Add bin/sweep-eta, bin/sweep-z, bin/sweep-eta-z for per-axis and
joint Pareto characterization.
- Retune FullRemedy default from η=0.3 to η=0.2 based on joint (η, z)
sweep: η=0.2 captures the overshoot-tail reduction (ramp p99 at
SPM=6: 31% → 12%) and decoupling gains (0.79 → 0.87 at SPM=6) while
preserving cold-start convergence at ≥99% across every SPM. Smaller
η would break convergence catastrophically (η=0.1 produces 48%
convergence at SPM=120).
- Regenerate all per-algorithm baselines, pareto.md, z_sweep.md at
the new default.
- Update FINDINGS.md §§ 1, 3, 4, 5 with the new numbers and
characterization narrative. DESIGN.md, production crate docs, and
sweep-binary docstrings updated for consistency.
…te::new
Make FullRemedy the recommended production vardiff via free factory
functions in vardiff/mod.rs (default, default_with_min,
default_with_clock). Deprecate the implicit-pick VardiffState
constructors (new, new_with_min) with a migration note pointing at
the new factories; new_with_clock stays non-deprecated as the
explicit opt-in to the classic threshold-ladder algorithm for
simulation, characterization, and testing.
This gives downstream consumers a clear, semver-safe migration path:
existing code keeps compiling with a deprecation warning that links
to the new factory; new code is steered to FullRemedy by default.
The *100 -> *100_000 scaling fix in hash_rate_from_target lowered the
safe target ceiling from ~2^246.4 to ~2^236.4, because the intermediate
product `(t+1) * shares_occurrency_frequence` is computed in U256.
Routine vardiff-driven targets at low realized share rates push above
the new boundary, and broadcast SetTarget messages carrying a channel's
`requested_max_target` (~2^253) trip it every time.
In production, this surfaced on a slot running translator_sv2 with
vardiff disabled: every upstream SetTarget logs
WARN: Failed to derive hashrate from SetTarget target:
ArithmeticOverflow (channel_id=4294967295)
and the translator's SV1 hashrate gauge stops updating.
The fix widens the multiply step to U512 so the intermediate product
fits regardless of target magnitude. The numerator stays in U256 (it is
2^256 - t, which always fits), and the final result narrows back to
u128 via low_u128() exactly as before. The precision improvement from
the *100_000 scaling is preserved.
Two regression tests:
- target.rs: pin three real `maximum_target` values captured from the
affected slot's translator log, including the channel's
`requested_max_target` (0x1745d174_5d1745d1...). All previously
errored ArithmeticOverflow; they now return finite hashrates.
- vardiff/test/mod.rs: update the
test_try_vardiff_with_less_spm_than_expected_classic expected values
for the 240s and 300s checkpoints. Upstream's 74.2 / 62.327995
values came from the `try_vardiff` Err-fallback path
(`hashrate * realized_spm / shares_per_minute`) which only ran
because hash_rate_from_target overflowed at those high targets.
With overflow eliminated, the main path returns the integer-
truncated 74.0 / 62.0 from low_u128().
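The overflow and its fix can be illustrated with Rust's built-in integers standing in for the big-integer types. This is an analogy under stated assumptions, not the code in target.rs: `u64` plays the role of U256, `u128` the role of U512, and both function names are hypothetical.

```rust
/// Narrow multiply: the product is computed at the same width as the
/// inputs, so a large target times the scale factor overflows.
/// `None` is the analogue of the ArithmeticOverflow error.
pub fn scaled_product_narrow(t: u64, scale: u64) -> Option<u64> {
    t.checked_add(1)?.checked_mul(scale)
}

/// Widened multiply: promote both operands to twice the width first.
/// (2^64) * (2^64 - 1) is below 2^128, so this can never overflow.
pub fn scaled_product_wide(t: u64, scale: u64) -> u128 {
    (t as u128 + 1) * scale as u128
}
```

Raising the scale factor from 100 to 100_000 costs about ten bits of headroom in the narrow version (log2 of 1000 is roughly 9.97), which matches the drop in the safe ceiling from ~2^246.4 to ~2^236.4 described above; the widened version removes the ceiling entirely.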
…metrics
Complete redesign of the vardiff framework and a new production algorithm
that dominates the legacy implementation on every operational metric.
## New Algorithm: AdaCUSUM
VardiffState::new() now returns the AdaCUSUM composition internally:
EwmaEstimator(120s) + AdaptiveCusumBoundary(s=1.5, f=0.05) + PartialRetarget(0.5)
Zero downstream API changes — sv2-apps continues calling VardiffState::new().
Performance vs legacy VardiffState at SPM=12:
- Convergence: 6 min vs 10 min (40% faster)
- React to -10% decline: 62% vs 14% (4.4x better)
- React to -50% decline: 99% vs 55% (near-perfect)
- Jitter: 0.107/min vs 0.018/min (acceptable trade-off)
- Overshoot: 26% vs 87% (3.3x less)
## Architecture: Three-Stage Pipeline
Estimator → Boundary → UpdateRule
- Statistic trait removed (deviation is inline arithmetic)
- Composed<E, S, B, U> → Composed<E, B, U>
- Estimator::reset() → on_fire(new_hashrate, old_hashrate)
- EstimatorSnapshot gains uncertainty: Option<Uncertainty>
- Boundary receives &EstimatorSnapshot (uncertainty-aware)
- UpdateRule receives threshold (margin-aware)
- EwmaEstimator::on_fire rescales instead of zeroing
## operational_fitness Metric
fitness = 0.25 × reaction_rate(-10%)
+ 0.20 × reaction_rate(-50%)
+ 0.20 × clamp(1 - jitter/0.30, 0, 1)
+ 0.25 × convergence_rate × clamp(1 - conv_p50/600s, 0, 1)
+ 0.10 × clamp(1 - overshoot_p99, 0, 1)
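The fitness formula above translates directly into code. The sketch below is a literal transcription, assuming that rates and `overshoot_p99` are expressed as fractions (0.26 for 26%), jitter is in fires per minute, and `conv_p50` is in seconds; the `FitnessInputs` struct and its field names are hypothetical.

```rust
/// Inputs to the aggregate score (hypothetical container type).
pub struct FitnessInputs {
    pub reaction_rate_minus_10: f64, // fraction of trials reacting to a -10% step
    pub reaction_rate_minus_50: f64, // fraction of trials reacting to a -50% step
    pub jitter_per_min: f64,         // steady-state fires per minute
    pub convergence_rate: f64,       // fraction of trials that converge
    pub conv_p50_secs: f64,          // median convergence time in seconds
    pub overshoot_p99: f64,          // p99 overshoot as a fraction
}

fn clamp01(x: f64) -> f64 {
    x.clamp(0.0, 1.0)
}

/// Weighted sum of normalized terms; the weights add up to 1.0,
/// so the result lies in [0, 1] when the rate inputs do.
pub fn operational_fitness(m: &FitnessInputs) -> f64 {
    0.25 * m.reaction_rate_minus_10
        + 0.20 * m.reaction_rate_minus_50
        + 0.20 * clamp01(1.0 - m.jitter_per_min / 0.30)
        + 0.25 * m.convergence_rate * clamp01(1.0 - m.conv_p50_secs / 600.0)
        + 0.10 * clamp01(1.0 - m.overshoot_p99)
}
```

Each clamp sets a saturation point: jitter at or above 0.30 fires/min and a median convergence time at or above 600s both contribute nothing, so improvements beyond those limits are invisible to the score.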
## Components Explored
New estimators: BayesianEstimator, KalmanEstimator
New boundaries: CredibleIntervalBoundary, CusumBoundary, AdaptiveCusumBoundary
New update rules: AdaptivePartialRetarget
New metrics: FireDecisiveness, StepCorrection, OperationalFitness
All 104 channels_sv2 + 141 sim tests pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds AsymmetricCusumBoundary that uses different thresholds based on
whether firing would tighten or ease difficulty:
- Easing (miner slowing): uses base threshold (fire quickly, free
action)
- Tightening (miner speeding): uses base × tighten_multiplier
(cautious, because tightening rejects in-flight shares)
Rationale: SetTarget that makes difficulty harder invalidates shares
already being computed by miners. SetTarget that makes difficulty
easier has zero cost: old harder work is still valid under the new
easier target.
Results with tighten_multiplier=3.0 (AsymCUSUM-t30):
- Mean fitness: 0.751 (vs symmetric AdaCUSUM 0.676, FullRemedy 0.565)
- Jitter: 0.045/min (vs 0.175 symmetric, 58% reduction)
- Convergence: 4 min (vs 6 min symmetric, 33% faster)
- Overshoot: 16.6% (vs 26.4% symmetric, 37% less)
- Detection -50%: 99.0% (unchanged)
- Detection -10%: 44.3% (vs 61.9%, the cost of asymmetry)
The jitter reduction directly translates to fewer share rejections in
production: ~3 costly tighten-fires per hour instead of ~6.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
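The directional-threshold idea in this commit can be sketched with a textbook two-sided CUSUM. This is an illustration only and makes no claim about how `AsymmetricCusumBoundary` is implemented: the struct, the `slack` parameter, the standardized-deviation input, and the reset-on-fire behavior are all assumptions. Only the rule "ease at the base threshold, tighten at base × tighten_multiplier" comes from the message above.

```rust
/// Which way a fire would move difficulty.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Fire {
    Ease,    // miner looks slower than estimated: lower difficulty
    Tighten, // miner looks faster than estimated: raise difficulty
}

/// Two-sided CUSUM with a higher bar for tightening (hypothetical type).
pub struct AsymmetricCusum {
    slack: f64,              // per-observation allowance subtracted from each deviation
    base_threshold: f64,     // bar for easing
    tighten_multiplier: f64, // tightening bar = base_threshold * multiplier
    high: f64,               // accumulated evidence of a faster miner
    low: f64,                // accumulated evidence of a slower miner
}

impl AsymmetricCusum {
    pub fn new(slack: f64, base_threshold: f64, tighten_multiplier: f64) -> Self {
        Self { slack, base_threshold, tighten_multiplier, high: 0.0, low: 0.0 }
    }

    /// Feed one standardized deviation, e.g. (observed - expected) / sqrt(expected).
    /// Returns a fire decision when either accumulator crosses its bar.
    pub fn observe(&mut self, z: f64) -> Option<Fire> {
        self.high = (self.high + z - self.slack).max(0.0);
        self.low = (self.low - z - self.slack).max(0.0);
        if self.low > self.base_threshold {
            self.reset();
            return Some(Fire::Ease);
        }
        if self.high > self.base_threshold * self.tighten_multiplier {
            self.reset();
            return Some(Fire::Tighten);
        }
        None
    }

    fn reset(&mut self) {
        self.high = 0.0;
        self.low = 0.0;
    }
}
```

With a multiplier of 3.0, a sustained upward deviation has to accumulate three times as much evidence before a fire, which is the mechanism by which this kind of boundary trades slower tightening for fewer costly upward adjustments.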
Ran cargo fmt to fix line-length violations introduced during rebase.
Regenerated baseline_VardiffState.toml to reflect the current
AsymmetricCusumBoundary behavior (directional cost awareness produces
expected negative asymmetry values at higher SPM rates).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
63a19d0 to
a18c3a3
Compare
|
Thanks for these insights @adammwest — especially the point about fitness decomposition and normalization. A lot of what you're describing matches the evolution I've gone through on this PR, so let me give a timeline of Phase 1: Basic metrics + simulation harness Initially I focused on three metrics I thought were important: convergence time, jitter, and accuracy. These were evaluated via a time-compressed simulation that replays a synthetic share stream through the vardiff Phase 2: Decomposed pipeline model I wanted to make algorithm search more systematic, so I decomposed "a vardiff algorithm" into four independent, replaceable components: estimator, statistic, boundary, and decision rule. The idea was to mix-and-match This model worked well for the classic algorithm, the parametric variant, and the EWMA approach. But when I tried to embed a Bayesian model, it broke down — the components aren't truly independent. There's a sequential The resulting three-stage pipeline (Estimator → Boundary → UpdateRule) is what's in this PR. It successfully hosts the classic algorithm, EWMA, AdaCUSUM, and could host a Bayesian approach. Phase 3: Aggregate fitness metric To your point about "how you combine all metrics into a final value" — we now have a configurable aggregate metric that allows weighting across the underlying measurements. This addresses exactly the gaming concern you Your suggestion to separate fitness into improvement vs. regression categories per scenario group (stable, coldstart, reaction) is a good one. Currently the regression test does compare per-cell, so a coldstart regression Phase 4: Realistic operating conditions After discussions with hardware engineers, I retuned the test scenarios to realistic share rates (2–30 spm instead of the earlier 6–120 range). The engineers confirmed that responding to partial hardware failures and Current direction I've backed off from prioritizing convergence speed after seeing overcorrection in practice. 
The current focus is on:
On your point about normalization: agreed, and the per-metric tolerance budgets in the regression test (absolute slack + optional multiplicative slack) are our current mechanism for this. Open to suggestions on better |
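As a rough illustration of the three-stage decomposition described in the comment above, the sketch below composes an estimator, a boundary, and an update rule behind small traits. All trait names, method signatures, and the concrete `Ewma`, `RelativeBand`, and `Proportional` types are hypothetical and chosen for this example; the PR's actual interfaces may look quite different.

```rust
/// Stage 1 (hypothetical): turns share observations into a shares-per-minute estimate.
trait Estimator {
    fn observe(&mut self, shares: u32, elapsed_secs: f64) -> f64;
}

/// Stage 2 (hypothetical): decides whether the estimate is far enough from target to act.
trait Boundary {
    fn should_fire(&self, estimate_spm: f64, target_spm: f64) -> bool;
}

/// Stage 3 (hypothetical): computes the new difficulty once the boundary is crossed.
trait UpdateRule {
    fn next_difficulty(&self, current: f64, estimate_spm: f64, target_spm: f64) -> f64;
}

/// Exponentially weighted moving average of the observed share rate.
struct Ewma {
    alpha: f64,
    value: Option<f64>,
}

impl Estimator for Ewma {
    fn observe(&mut self, shares: u32, elapsed_secs: f64) -> f64 {
        let rate = shares as f64 * 60.0 / elapsed_secs;
        let next = match self.value {
            Some(prev) => self.alpha * rate + (1.0 - self.alpha) * prev,
            None => rate, // first observation seeds the average
        };
        self.value = Some(next);
        next
    }
}

/// Fires when the relative deviation from target exceeds `tolerance`.
struct RelativeBand {
    tolerance: f64,
}

impl Boundary for RelativeBand {
    fn should_fire(&self, estimate_spm: f64, target_spm: f64) -> bool {
        ((estimate_spm - target_spm) / target_spm).abs() > self.tolerance
    }
}

/// Scales difficulty by the ratio of estimated to target share rate.
struct Proportional;

impl UpdateRule for Proportional {
    fn next_difficulty(&self, current: f64, estimate_spm: f64, target_spm: f64) -> f64 {
        current * estimate_spm / target_spm
    }
}

struct Pipeline<E, B, U> {
    estimator: E,
    boundary: B,
    update: U,
    target_spm: f64,
    difficulty: f64,
}

impl<E: Estimator, B: Boundary, U: UpdateRule> Pipeline<E, B, U> {
    /// Feeds one observation window; returns the new difficulty when the pipeline fires.
    fn step(&mut self, shares: u32, elapsed_secs: f64) -> Option<f64> {
        let estimate = self.estimator.observe(shares, elapsed_secs);
        if !self.boundary.should_fire(estimate, self.target_spm) {
            return None;
        }
        self.difficulty = self
            .update
            .next_difficulty(self.difficulty, estimate, self.target_spm);
        Some(self.difficulty)
    }
}
```

Because each stage sits behind its own trait, swapping the estimator (for example, a cumulative counter instead of `Ewma`) leaves the boundary and update rule untouched, which is the mix-and-match property the comment is after.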
…magnitude metric
Update OperationalFitness weights to penalize algorithms that raise
difficulty as aggressively as they lower it:
fitness = 0.25 × reaction(-10%) + 0.20 × reaction(-50%)
+ 0.20 × jitter + 0.15 × convergence
+ 0.10 × asymmetry_preference + 0.10 × overshoot
The asymmetry term rewards algorithms where reaction_rate(Step -10%)
exceeds reaction_rate(Step +10%) — i.e. faster detection of hashrate
drops than spikes. This encodes the operational insight that upward
difficulty adjustments are more disruptive (causing difficulty-too-low
share rejections) than downward ones.
New metric: upward_step_magnitude — tracks the ratio (new/old) of all
upward difficulty adjustments during steady state. The p95 value serves
as a deterministic proxy for difficulty-too-low rejection risk without
depending on stochastic share-rejection simulation.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
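The `upward_step_magnitude` metric described in this commit message can be sketched as follows. This is a minimal illustration, assuming a trace of difficulty values (one per vardiff fire) is available and that a nearest-rank percentile is acceptable; the function names and the percentile method are assumptions, not taken from the PR.

```rust
/// Ratios (new / old) for every upward difficulty adjustment in a trace.
/// `difficulties` is a hypothetical steady-state sequence, one entry per fire.
fn upward_step_ratios(difficulties: &[f64]) -> Vec<f64> {
    difficulties
        .windows(2)
        .filter(|w| w[0] > 0.0 && w[1] > w[0]) // keep only upward moves
        .map(|w| w[1] / w[0])
        .collect()
}

/// Nearest-rank percentile for `p` in 0..=100; `None` for an empty sample.
/// The nearest-rank definition is an assumption made for this sketch.
fn percentile(samples: &[f64], p: f64) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).expect("samples must not contain NaN"));
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.clamp(1, sorted.len()) - 1])
}
```

Calling `percentile(&upward_step_ratios(&trace), 95.0)` on a steady-state trace yields the p95 step ratio; downward and zero-size adjustments are ignored, so the result depends only on the recorded difficulty sequence.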
…rity
Restructure OperationalFitness to reflect operational reality:
"never surprise the miner" matters more than "detect changes fast."
New weights (harm-avoidance 60%, reactivity 25%, convergence 15%):
0.15 × reaction(-10%) + 0.10 × reaction(-50%)
+ 0.25 × jitter_control + 0.25 × step_magnitude_safety
+ 0.15 × convergence + 0.10 × overshoot_safety
The step_magnitude_safety term directly penalizes large upward
difficulty jumps: p95 step of 1.0× scores 1.0, 1.5× scores 0.0.
Result: FullRemedy now wins at 6-12 spm (realistic miner rates) where
its cautious approach correctly avoids the timeout death spiral observed
with physical miners. VardiffState still wins at 15-30 spm where
aggressive reactivity is less harmful.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
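The weighting in this commit message can be written out as a short sketch. The `Scores` struct and function names are hypothetical, each component score is assumed to be normalized to the 0.0 to 1.0 range, and the linear interpolation between the two stated anchor points (1.0× scores 1.0, 1.5× scores 0.0) is an assumption; the commit message only gives the endpoints.

```rust
/// Component scores, each assumed normalized to 0.0..=1.0 (higher is better).
struct Scores {
    reaction_minus_10: f64,
    reaction_minus_50: f64,
    jitter_control: f64,
    step_magnitude_safety: f64,
    convergence: f64,
    overshoot_safety: f64,
}

/// Maps the p95 upward step ratio to a score: 1.0x -> 1.0, 1.5x -> 0.0.
/// Linear interpolation between the anchors is an assumption of this sketch.
fn step_magnitude_safety(p95_ratio: f64) -> f64 {
    (1.0 - (p95_ratio - 1.0) / 0.5).clamp(0.0, 1.0)
}

/// Weighted sum using the weights quoted in the commit message.
fn operational_fitness(s: &Scores) -> f64 {
    // Reactivity: 25%
    0.15 * s.reaction_minus_10
        + 0.10 * s.reaction_minus_50
        // Harm avoidance: 60%
        + 0.25 * s.jitter_control
        + 0.25 * s.step_magnitude_safety
        + 0.10 * s.overshoot_safety
        // Convergence: 15%
        + 0.15 * s.convergence
}
```

With these weights an algorithm that is perfectly safe but never reacts or converges still scores 0.60, which matches the stated priority of harm avoidance over reactivity.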
Adds a deterministic in-process simulation framework that characterizes
any `Vardiff` implementation across the operational rate range, and
commits the current algorithm's measurements as a baseline for automated
regression testing.
The "vardiff fires too often on noise" / "vardiff doesn't react fast
enough" conversations have circled the same issues (#396 and adjacent)
without a way to settle questions empirically. This PR adds the missing
infrastructure: any future proposal can now produce a quantitative
delta against a fixed reference.
The finding that motivates this
Before the framework, "is the algorithm too noisy or too sluggish?" was
a matter of opinion. With it, the question is a table:
Reaction sensitivity to a -50% step change (probability of firing
within 5 minutes after the change):
Higher share rates produce less responsive algorithms, counter to
expectations. The mechanism is mechanical (post-convergence
`delta_time` grows indefinitely, diluting the post-step signal in the
cumulative window) and surfaced clearly in the data despite taking months
of deployment observation to half-articulate. Full numbers and analysis
in `vardiff_baseline.md`.
This isn't a fix. It's the measurement that lets the fix be evaluated.
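The dilution mechanism can be illustrated with simple arithmetic. The sketch below assumes a deliberately simplified model in which the estimate is the average share rate over one cumulative window that spans both the pre-step and post-step periods; the function name and the model are assumptions for illustration, not the algorithm's actual estimator.

```rust
/// Ratio of the average rate seen over a cumulative window to the pre-step rate.
///
/// * `pre_minutes`  - window time accumulated before the hashrate step
/// * `post_minutes` - window time accumulated after the step
/// * `step`         - relative hashrate change, e.g. -0.5 for a -50% step
///
/// Simplified model (an assumption): share rate is proportional to hashrate
/// and the estimator averages over the whole window.
fn observed_rate_ratio(pre_minutes: f64, post_minutes: f64, step: f64) -> f64 {
    let total = pre_minutes + post_minutes;
    if total <= 0.0 {
        return 1.0; // empty window: nothing observed yet
    }
    (pre_minutes + (1.0 + step) * post_minutes) / total
}
```

Under this model, five minutes after a -50% step the window shows a ratio of 0.75 if only five minutes had accumulated beforehand, but roughly 0.96 if sixty minutes had, so the same step looks like a deviation of under 4% once `delta_time` has grown large.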
What's in the PR
3 commits, ~2000 LOC plus baseline data:
`feat(vardiff): inject Clock trait + add_shares trait method`
Minimum API additions to `channels_sv2` for testability and
simulation performance. Production behavior unchanged: existing
constructors default to `SystemClock`, and the new trait method has a
default implementation that keeps existing impls compiling.
`feat(vardiff_sim): in-process simulation framework`
New crate at `sv2/channels-sv2/sim/`. Per-tick Poisson share
sampling, five behavioral metrics with percentile distributions,
50-cell parameterized sweep (5 share rates × 10 scenarios), CLI
binary for baseline generation, regression test asserting against
a committed baseline.
`data(vardiff_sim): design doc + baseline characterization`
The design proposal documenting metric definitions and tolerance
policy, plus the measured baseline as both TOML (consumed by the
regression test) and Markdown (for human review).
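The clock-injection pattern and the default trait method mentioned in the first commit might look roughly like the sketch below. Every name and signature here (`now_secs`, `MockClock`, `ShareCounter`, `Counter`) is hypothetical and invented for illustration; the real `Clock` trait and `Vardiff::add_shares` in `channels_sv2` may be shaped differently.

```rust
use std::cell::Cell;
use std::time::{SystemTime, UNIX_EPOCH};

/// Hypothetical clock abstraction; the PR's `Clock` trait may differ.
trait Clock {
    fn now_secs(&self) -> u64;
}

/// Wall-clock implementation used by production constructors.
struct SystemClock;

impl Clock for SystemClock {
    fn now_secs(&self) -> u64 {
        SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .map(|d| d.as_secs())
            .unwrap_or(0)
    }
}

/// Simulation clock that only advances when told to.
struct MockClock {
    now: Cell<u64>,
}

impl MockClock {
    fn new(start: u64) -> Self {
        Self { now: Cell::new(start) }
    }
    fn advance(&self, secs: u64) {
        self.now.set(self.now.get() + secs);
    }
}

impl Clock for MockClock {
    fn now_secs(&self) -> u64 {
        self.now.get()
    }
}

/// Hypothetical stand-in for a share-counting trait.
trait ShareCounter {
    fn increment_share(&mut self);

    /// Default impl: existing implementors keep compiling unchanged,
    /// and may override this with a constant-time version.
    fn add_shares(&mut self, n: u32) {
        for _ in 0..n {
            self.increment_share();
        }
    }
}

struct Counter<C: Clock> {
    clock: C,
    shares: u32,
    window_start: u64,
}

impl Counter<SystemClock> {
    /// Existing-style constructor defaults to the system clock.
    fn new() -> Self {
        Self::with_clock(SystemClock)
    }
}

impl<C: Clock> Counter<C> {
    fn with_clock(clock: C) -> Self {
        let window_start = clock.now_secs();
        Self { clock, shares: 0, window_start }
    }

    fn shares_per_minute(&self) -> f64 {
        let elapsed = self.clock.now_secs().saturating_sub(self.window_start);
        if elapsed == 0 {
            0.0
        } else {
            self.shares as f64 * 60.0 / elapsed as f64
        }
    }
}

impl<C: Clock> ShareCounter for Counter<C> {
    fn increment_share(&mut self) {
        self.shares += 1;
    }
}
```

A simulation can then drive hours of simulated time in microseconds by calling `advance` on the mock clock, while callers of `Counter::new()` see no change in behavior.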
What the framework measures
Five behavioral attributes, each as a distribution across 1000
independent trials per cell:
Per-metric tolerances are asserted automatically against the checked-in
baseline. Failed assertions identify the cell and metric with specific
baseline-vs-current numbers. Mid-range Δ values (10-25%) are reported
but not asserted — that's where legitimate algorithmic tradeoffs live
and a reviewer should judge by looking at the full delta.
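A per-metric tolerance check of the kind described above could be sketched like this. The `Tolerance` struct, the way absolute and multiplicative slack are combined, and the error message format are all assumptions made for illustration; the PR's design doc defines the actual policy.

```rust
/// Hypothetical tolerance budget: absolute slack plus optional multiplicative slack.
struct Tolerance {
    absolute: f64,
    multiplicative: Option<f64>,
}

/// Checks a smaller-is-better metric against its baseline.
/// Assumed rule: current <= baseline * (1 + multiplicative) + absolute.
/// On failure, the message names the cell and metric with both numbers.
fn check_metric(
    cell: &str,
    metric: &str,
    baseline: f64,
    current: f64,
    tol: &Tolerance,
) -> Result<(), String> {
    let limit = baseline * (1.0 + tol.multiplicative.unwrap_or(0.0)) + tol.absolute;
    if current <= limit {
        Ok(())
    } else {
        Err(format!(
            "{} / {}: baseline {:.3}, current {:.3}, limit {:.3}",
            cell, metric, baseline, current, limit
        ))
    }
}
```

Returning a `Result` with a descriptive message, rather than panicking at the first mismatch, lets a regression test collect every failing cell before reporting.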
How to run
From `sv2/channels-sv2/sim/`:
What this enables
For any future vardiff proposal:
`Vardiff` impl
`cargo run --release --bin generate-baseline` to produce comparable measurements
No more "I think this is better." Instead "this changes p50 jitter from
X to Y at 12 spm, at the cost of p90 reaction time going from A to B."
Where to look in this PR
rationale): `sv2/channels-sv2/sim/VARDIFF_SIMULATION_FRAMEWORK.md`
workflow): `sv2/channels-sv2/sim/README.md`
`sv2/channels-sv2/sim/vardiff_baseline.md`
What this PR is NOT
`VardiffState` behavior is unchanged. The only public-API additions are
`Vardiff::add_shares` (with a default impl) and the `Clock` trait.
Production code defaults to `SystemClock` and behaves identically to before.
data suggests 12-30 spm is the operational sweet spot, but this
PR doesn't touch any defaults.
a GitHub Action to be a true CI gate. Follow-up.
Open follow-ups
`cargo test --release --lib -- --ignored` into CI on PRs
touching `vardiff/*` or the sim crate.
`channels_sv2` 5.0.0 → 5.1.0 once the workspace lockfile
situation allows (the trait-method addition is technically a
minor-version semver change). TODO comment in `Cargo.toml` tracks
this.
surfaces the problem; fixing it is a separate proposal that this
framework will be the right tool to evaluate.
Test plan
`cargo test -p channels_sv2 --lib vardiff`: 17 tests, all pass
`cargo test` from `sv2/channels-sv2/sim/`: 53 fast unit tests
`cargo test --release --lib -- --ignored` from sim/: slow regression test passes against committed baseline
`cargo run --release --bin generate-baseline`: reproduces the committed `vardiff_baseline.toml` byte-for-byte at the same seed
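Byte-for-byte reproducibility at a fixed seed depends on the share stream being fully deterministic. The sketch below shows one way per-tick Poisson share sampling can be made seed-reproducible without external crates, using a SplitMix64 generator and Knuth's multiplication method. The choice of generator, the sampling method, and all names are assumptions for illustration; the sim crate may use a different RNG and sampler.

```rust
/// SplitMix64: a small deterministic PRNG (chosen here for illustration only).
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform sample in [0, 1) built from the top 53 bits.
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Knuth's multiplication method for Poisson sampling.
/// Suitable for the small per-tick means used here; slow for large `lambda`.
fn sample_poisson(rng: &mut SplitMix64, lambda: f64) -> u32 {
    let threshold = (-lambda).exp();
    let mut count = 0u32;
    let mut product = rng.next_f64();
    while product > threshold {
        count += 1;
        product *= rng.next_f64();
    }
    count
}

/// Share counts per tick for a miner submitting `shares_per_minute` on average.
fn simulate_shares(seed: u64, shares_per_minute: f64, tick_secs: f64, ticks: usize) -> Vec<u32> {
    let mut rng = SplitMix64(seed);
    let lambda = shares_per_minute * tick_secs / 60.0; // expected shares per tick
    (0..ticks).map(|_| sample_poisson(&mut rng, lambda)).collect()
}
```

Because all randomness flows from the single `seed` argument, two runs with the same seed and parameters produce identical share sequences, which is the property a byte-for-byte baseline comparison needs.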