feat(knee-point): noise-estimated ε helper #72

Merged
jramos merged 2 commits into main from feat/knee-point-noise-estimated-epsilon on May 24, 2026

@jramos jramos commented May 24, 2026

Summary

Adds a pure-function helper _estimate_val_noise(...) to evolution/skills/knee_point.py that estimates a noise-floor ε for the Pareto knee-point band via a paired bootstrap on per-example per-candidate val scores (DSPy's DspyGEPAResult.val_subscores).

This PR introduces no behavior change. The helper is unused in production code — confirmed by grep -rn '_estimate_val_noise' evolution/ returning only the definition. A follow-up PR will plumb it through select_knee_point behind a --knee-point-epsilon noise-estimated sentinel, gated on empirical evidence from a regenerated calibration campaign.

Motivation

The current knee-point selector uses a fixed ε = 1.0 / n_val band. Investigation showed this is statistically incoherent:

  • At our typical n_val (8–50), the actual paired-bootstrap noise floor on val scores is ~0.04 empirically.
  • 1/n_val spans roughly 0.02 (n_val = 50) to 0.125 (n_val = 8), so at the larger val sets the band sits below the measured noise floor.
  • The May 2026 calibration data shows our knee-point selector picks the same candidate as GEPA's val-argmax in 7 of 9 runs at the current ε — because the band rarely separates from val-argmax when ε is sub-noise.
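To make the scaling argument concrete, here is a small arithmetic sketch comparing the fixed band 1/n_val with the worst-case binomial standard error 0.5 / sqrt(n_val) (the same quantity the helper uses as its single-candidate fallback). The function names are illustrative only and do not exist in the repo; the n_val values are sample points from the 8–50 range quoted above.

```python
import math


def fixed_band(n_val):
    """Current fixed knee-point band: epsilon = 1 / n_val."""
    return 1.0 / n_val


def worst_case_binomial_se(n_val):
    """Standard error of a Bernoulli mean at p = 0.5: sqrt(0.25 / n_val)."""
    return 0.5 / math.sqrt(n_val)


if __name__ == "__main__":
    # Sample points from the typical n_val range (8-50).
    for n_val in (8, 16, 25, 50):
        band = fixed_band(n_val)
        se = worst_case_binomial_se(n_val)
        # band / se simplifies to 2 / sqrt(n_val), so it shrinks as n_val grows.
        print(f"n_val={n_val:>2}  1/n_val={band:.4f}  "
              f"0.5/sqrt(n_val)={se:.4f}  ratio={band / se:.2f}")
```

The fixed band decays as 1/n while sampling noise decays as 1/sqrt(n), so the band falls further behind the noise as the val set grows.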

The GP-parsimony literature (Poli & McPhee, "Parsimony Pressure Made Easy," GECCO 2008) treats the ε-band as a hyperparameter that should be set from a noise estimate. The new helper computes that estimate from data DSPy already returns.

Helper contract

  • Single candidate: returns 0.5 / sqrt(n_val) (worst-case binomial SE at p=0.5) as a degenerate-path floor.
  • All-saturated scores: returns 0.0 (no useful signal → no band).
  • Otherwise: paired bootstrap on the concatenated pairwise diff vector between best and each competitor; returns half the width of the CI on the mean diff at the requested confidence level (defaults: 90% CI, 1000 resamples, seeded for determinism).
  • Coverage policy: positional alignment with min(len(best), len(other)) truncation.
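For reviewers who want the contract in executable form, a minimal stdlib-only sketch follows. It is not the merged `_estimate_val_noise`: the function name, the parameter names, the percentile indexing, and the reading of "all-saturated" as "every paired diff is zero" are assumptions made for illustration, and `val_subscores` is assumed to be a list of per-candidate score lists.

```python
import math
import random


def estimate_val_noise_sketch(val_subscores, best_idx, confidence=0.90,
                              n_resamples=1000, seed=0):
    """Illustrative sketch of the helper contract; not the merged code."""
    best = val_subscores[best_idx]
    others = [s for i, s in enumerate(val_subscores) if i != best_idx]

    # Single candidate: fall back to the worst-case binomial SE at p = 0.5.
    if not others:
        n_val = len(best)
        return 0.5 / math.sqrt(n_val)

    # Positional alignment: zip truncates to min(len(best), len(other)).
    # Diffs against every competitor are concatenated into one vector.
    diffs = [b - o for other in others for b, o in zip(best, other)]

    # Assumption: "all-saturated" means no paired diff carries signal.
    if not diffs or all(d == 0 for d in diffs):
        return 0.0

    # Bootstrap the mean of the paired diffs with a seeded RNG.
    rng = random.Random(seed)
    n = len(diffs)
    means = sorted(
        sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples)
    )

    # Percentile interval; half its width is the noise-floor estimate.
    alpha = (1.0 - confidence) / 2.0
    lo = means[int(alpha * (n_resamples - 1))]
    hi = means[int((1.0 - alpha) * (n_resamples - 1))]
    return (hi - lo) / 2.0
```

With one candidate of 64 examples this returns 0.5 / sqrt(64) = 0.0625, which is the exact value pinned by the single-candidate fallback case in the test plan.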

Stdlib only (math, random). No numpy/scipy.

Test plan

  • 6 unit tests in tests/skills/test_knee_point_noise_estimation.py: saturated → 0; Bernoulli analytical-SE reference (order-of-magnitude pin at p=0.5, n=50, checked against the paired-diff SE rather than the single-Bernoulli SE); monotone-in-variance; single-candidate fallback (exact value at n=64); determinism; partial-coverage non-crash.
  • Full suite green locally (1150 passed).
  • CI green across all Python versions.

What's next

A regenerated calibration campaign (~$140, out-of-band) will produce gate_decision.json + band_holdout.json artifacts. An extended study_b_pick_epsilon.py will replay the selector with three modes — current 1/n_val, the original calibration recommendation 0.5/n_val, and this PR's noise-estimated mode — to decide empirically whether the noise-estimated default beats the geometric one on real per-skill holdout pick quality. If yes, a follow-up PR plumbs the sentinel through. If no, the right ship is to drop the selector for val-best and defer to GEPA's best_idx.
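As a rough picture of what the three-mode replay compares, the sketch below resolves ε for each mode. Only the `noise-estimated` name comes from this PR; the other two mode strings, the function name, and the signature are hypothetical placeholders, and nothing here reflects the actual study_b_pick_epsilon.py.

```python
def resolve_epsilon(mode, n_val, noise_estimate=None):
    """Map a replay mode to an epsilon band width (illustrative only)."""
    if mode == "inverse-n":
        # Current default: 1 / n_val.
        return 1.0 / n_val
    if mode == "half-inverse-n":
        # Original calibration recommendation: 0.5 / n_val.
        return 0.5 / n_val
    if mode == "noise-estimated":
        # This PR's mode: the caller supplies the bootstrap estimate.
        if noise_estimate is None:
            raise ValueError("noise-estimated mode needs a noise estimate")
        return noise_estimate
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    # 0.04 is the empirical noise floor quoted in the Motivation section.
    for mode in ("inverse-n", "half-inverse-n", "noise-estimated"):
        print(mode, resolve_epsilon(mode, n_val=50, noise_estimate=0.04))
```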

jramos added 2 commits May 24, 2026 09:10
Address review feedback on c9680f7:
- Add a docstring sentence explaining the 0.5/sqrt(n_val) fallback
  (worst-case binomial SE at p=0.5) so the magic 0.5 is justified.
- Move the n_val computation into the fallback branch; it was dead
  on the main path.
@jramos jramos merged commit d94571f into main May 24, 2026
4 checks passed
@jramos jramos deleted the feat/knee-point-noise-estimated-epsilon branch May 24, 2026 15:39