feat(knee-point): noise-estimated ε helper #72
Merged
Address review feedback on c9680f7:
- Add a docstring sentence explaining the 0.5/sqrt(n_val) fallback (worst-case binomial SE at p=0.5) so the magic 0.5 is justified.
- Move the n_val computation into the fallback branch; it was dead on the main path.
Summary
Adds a pure-function helper `_estimate_val_noise(...)` to `evolution/skills/knee_point.py` that estimates a noise-floor ε for the Pareto knee-point band via a paired bootstrap on per-example, per-candidate val scores (DSPy's `DspyGEPAResult.val_subscores`).
This PR introduces no behavior change. The helper is unused in production code, as confirmed by `grep -rn '_estimate_val_noise' evolution/` returning only the definition. A follow-up PR will plumb it through `select_knee_point` behind a `--knee-point-epsilon noise-estimated` sentinel, gated on empirical evidence from a regenerated calibration campaign.
Motivation
The current knee-point selector uses a fixed `ε = 1.0 / n_val` band. Investigation showed this is statistically incoherent: `1/n_val` is ~0.02–0.125, an order of magnitude below the real noise.
The GP-parsimony literature (Poli & McPhee, "Parsimony Pressure Made Easy," GECCO 2008) treats the ε-band as a hyperparameter that should be set from a noise estimate. The new helper computes that estimate from data DSPy already returns.
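For orientation, here is a minimal sketch of what a paired-bootstrap noise estimate over per-example, per-candidate val scores could look like. It is not the code in this PR: the name `estimate_val_noise_sketch`, the `n_boot` and `seed` parameters, and the choice to average the per-pair standard errors are all assumptions made for illustration. It does follow the points stated elsewhere in this description (stdlib only, `min(len(best), len(other))` truncation, a `0.5 / sqrt(n_val)` fallback, and `n_val` computed only on the fallback path).

```python
import math
import random
from typing import List, Sequence


def estimate_val_noise_sketch(
    val_subscores: Sequence[Sequence[float]],
    n_boot: int = 500,   # hypothetical default, not taken from the PR
    seed: int = 0,       # hypothetical: fixed seed for determinism
) -> float:
    """Illustrative paired-bootstrap SE of the score gap to the val-best candidate."""
    if not val_subscores:
        return 0.0  # assumption: no candidates means no band

    rng = random.Random(seed)

    def mean(xs: Sequence[float]) -> float:
        return sum(xs) / len(xs) if xs else float("-inf")

    best_idx = max(range(len(val_subscores)), key=lambda i: mean(val_subscores[i]))
    best = val_subscores[best_idx]

    pair_ses: List[float] = []
    for idx, other in enumerate(val_subscores):
        if idx == best_idx:
            continue
        # Partial coverage: compare only the examples both candidates have.
        m = min(len(best), len(other))
        if m < 2:
            continue
        diffs = [best[i] - other[i] for i in range(m)]
        # Resample the paired differences with replacement and record each mean.
        boot_means = [sum(rng.choices(diffs, k=m)) / m for _ in range(n_boot)]
        mu = sum(boot_means) / n_boot
        var = sum((b - mu) ** 2 for b in boot_means) / (n_boot - 1)
        pair_ses.append(math.sqrt(var))

    if not pair_ses:
        # Degenerate path (e.g. a single candidate): worst-case binomial SE at p=0.5.
        n_val = len(best)
        return 0.5 / math.sqrt(n_val) if n_val > 0 else 0.0

    # Assumption: aggregate the per-pair SEs by their mean.
    return sum(pair_ses) / len(pair_ses)
```

In this sketch, identical (saturated) score rows give paired differences that are all zero, so the estimate is exactly `0.0`, and a single candidate with 64 examples returns `0.5 / sqrt(64) = 0.0625`. For two independent Bernoulli(0.5) candidates at n=50, the analytical paired-difference SE is `sqrt(2 * 0.25 / 50) = 0.1`, so the bootstrap value should land at that order of magnitude.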
Helper contract
- `0.5 / sqrt(n_val)` (worst-case binomial SE at p=0.5) as a degenerate-path floor.
- `0.0` (no useful signal → no band).
- `min(len(best), len(other))` truncation.
- Stdlib only (`math`, `random`). No numpy/scipy.
Test plan
`tests/skills/test_knee_point_noise_estimation.py`:
- saturated → 0
- Bernoulli analytical-SE reference (order-of-magnitude pin at p=0.5, n=50; paired-diff SE, not single-Bernoulli SE)
- monotone-in-variance
- single-candidate fallback (exact value at n=64)
- determinism
- partial-coverage non-crash
What's next
A regenerated calibration campaign (~$140, out-of-band) will produce `gate_decision.json` + `band_holdout.json` artifacts. An extended `study_b_pick_epsilon.py` will replay the selector with three modes (the current `1/n_val`, the original calibration recommendation `0.5/n_val`, and this PR's noise-estimated mode) to decide empirically whether the noise-estimated default beats the geometric one on real per-skill holdout pick quality. If yes, a follow-up PR plumbs the sentinel through. If no, the right ship is to drop the selector for val-best and defer to GEPA's `best_idx`.
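The three replay modes can be pictured as one small dispatch function. This is only a sketch under stated assumptions: `knee_point_epsilon` and the mode names `"inverse-n"` and `"half-inverse-n"` are hypothetical labels invented here. Only the `noise-estimated` sentinel and the three formulas come from this description, and the actual `study_b_pick_epsilon.py` may be organized quite differently.

```python
from typing import Optional


def knee_point_epsilon(
    mode: str,
    n_val: int,
    noise_estimate: Optional[float] = None,
) -> float:
    """Return the ε band width for one of the three replay modes (illustrative)."""
    if n_val <= 0:
        raise ValueError("n_val must be positive")
    if mode == "inverse-n":          # hypothetical label for the current 1/n_val band
        return 1.0 / n_val
    if mode == "half-inverse-n":     # hypothetical label for the 0.5/n_val recommendation
        return 0.5 / n_val
    if mode == "noise-estimated":    # sentinel named in this PR description
        if noise_estimate is None:
            raise ValueError("noise-estimated mode needs a precomputed estimate")
        return noise_estimate
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    # Compare the two geometric bands at the ends of the 0.02-0.125 range
    # quoted in the Motivation section (n_val = 50 and n_val = 8).
    for n_val in (8, 50):
        print(
            n_val,
            knee_point_epsilon("inverse-n", n_val),
            knee_point_epsilon("half-inverse-n", n_val),
        )
```

Keeping the noise estimate as an explicit argument means the replay can compute it once per run from the val subscores and pass it in, so the selector itself stays a pure function of its inputs.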