feat(knee-point): noise-estimated ε helper by jramos · Pull Request #72 · jramos/agent-self-evolution

jramos · 2026-05-24T15:17:53Z

Summary

Adds a pure-function helper _estimate_val_noise(...) to evolution/skills/knee_point.py that estimates a noise-floor ε for the Pareto knee-point band via a paired bootstrap on per-example per-candidate val scores (DSPy's DspyGEPAResult.val_subscores).

This PR introduces no behavior change. The helper is unused in production code — confirmed by grep -rn '_estimate_val_noise' evolution/ returning only the definition. A follow-up PR will plumb it through select_knee_point behind a --knee-point-epsilon noise-estimated sentinel, gated on empirical evidence from a regenerated calibration campaign.

Motivation

The current knee-point selector uses a fixed ε = 1.0 / n_val band. Investigation showed this is statistically incoherent:

At our typical n_val (8–50), the actual paired-bootstrap noise floor on val scores is ~0.04 empirically.
1/n_val is ~0.02–0.125 over that range, so at the larger n_val the band sits below the real noise.
The May 2026 calibration data shows our knee-point selector picks the same candidate as GEPA's val-argmax in 7 of 9 runs at the current ε — because the band rarely separates from val-argmax when ε is sub-noise.
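The scaling mismatch can be seen with a few lines of arithmetic. The sketch below tabulates the fixed band 1/n_val next to the worst-case binomial standard error 0.5/sqrt(n_val), which is the quantity the helper contract uses as its fallback. The function names are hypothetical and the n_val values are simply points in the 8–50 range quoted above; this is not the measured paired-bootstrap floor, only a reference curve.

```python
import math


def geometric_epsilon(n_val: int) -> float:
    """Fixed band used by the current selector: 1 / n_val."""
    return 1.0 / n_val


def worst_case_binomial_se(n_val: int) -> float:
    """Standard error of the mean of n_val Bernoulli(0.5) scores."""
    return 0.5 / math.sqrt(n_val)


# Illustrative n_val values only; they are not taken from the calibration data.
for n_val in (8, 16, 32, 50):
    eps = geometric_epsilon(n_val)
    se = worst_case_binomial_se(n_val)
    print(f"n_val={n_val:>2}  1/n_val={eps:.4f}  0.5/sqrt(n_val)={se:.4f}  ratio={se / eps:.2f}")
```

The ratio column equals 0.5 * sqrt(n_val), so the fixed band shrinks faster than sampling noise as the validation set grows.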

The GP-parsimony literature (Poli & McPhee, "Parsimony Pressure Made Easy," GECCO 2008) treats the ε-band as a hyperparameter that should be set from a noise estimate. The new helper computes that estimate from data DSPy already returns.

Helper contract

Single candidate: returns 0.5 / sqrt(n_val) (worst-case binomial SE at p=0.5) as a degenerate-path floor.
All-saturated scores: returns 0.0 (no useful signal → no band).
Otherwise: paired bootstrap on the concatenated pairwise diff vector between best and each competitor; returns half the width of the confidence interval on the mean diff (default 90% CI, 1000 resamples, seeded for determinism).
Coverage policy: positional alignment with min(len(best), len(other)) truncation.

Stdlib only (math, random). No numpy/scipy.
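A minimal sketch of a function satisfying the contract above, for readers who want to see the shape of the computation. This is not the code from evolution/skills/knee_point.py: the name estimate_val_noise_sketch, its parameters, the percentile-interval construction, and the reading of "all-saturated" as "every paired diff is zero" are all assumptions made for illustration.

```python
import math
import random
from typing import List, Sequence


def estimate_val_noise_sketch(
    val_subscores: Sequence[Sequence[float]],
    best_idx: int,
    confidence: float = 0.90,
    n_resamples: int = 1000,
    seed: int = 0,
) -> float:
    """Illustrative noise-floor estimate; hypothetical signature."""
    best = val_subscores[best_idx]

    # Single candidate: worst-case binomial SE at p=0.5.
    # max(..., 1) guards an empty score list (an assumption, not in the PR).
    if len(val_subscores) == 1:
        n_val = max(len(best), 1)
        return 0.5 / math.sqrt(n_val)

    # Concatenate paired diffs between best and each competitor,
    # aligned by position and truncated to the shorter of the two.
    diffs: List[float] = []
    for idx, other in enumerate(val_subscores):
        if idx == best_idx:
            continue
        n = min(len(best), len(other))
        diffs.extend(best[i] - other[i] for i in range(n))

    # Assumed meaning of "saturated": no paired diff carries any signal.
    if not diffs or all(d == 0.0 for d in diffs):
        return 0.0

    # Bootstrap the mean diff with a seeded RNG for determinism.
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(diffs, k=len(diffs))) / len(diffs)
        for _ in range(n_resamples)
    )

    # Percentile interval (an assumption); return half its width.
    alpha = 1.0 - confidence
    lo = means[int((alpha / 2) * (n_resamples - 1))]
    hi = means[int((1 - alpha / 2) * (n_resamples - 1))]
    return (hi - lo) / 2.0
```

Under these assumptions, estimate_val_noise_sketch([[1.0] * 64], 0) takes the single-candidate path and returns 0.5 / sqrt(64), and repeated calls with the same seed return the same value.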

Test plan

6 unit tests in tests/skills/test_knee_point_noise_estimation.py: saturated → 0, Bernoulli analytical-SE reference (order-of-magnitude pin at p=0.5, n=50; paired-diff SE not single-Bernoulli SE), monotone-in-variance, single-candidate fallback (exact value at n=64), determinism, partial-coverage non-crash.
Full suite green locally (1150 passed).
CI green across all Python versions.
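The reference values behind the Bernoulli pin can be worked out by hand. The sketch below assumes two candidates whose per-example scores are independent Bernoulli(p) draws, which is a simplification chosen for illustration and not necessarily the synthetic front used in the test file; the function names are hypothetical. It shows why the paired-diff standard error, not the single-Bernoulli one, is the right yardstick.

```python
import math
from statistics import NormalDist


def bernoulli_se(p: float, n: int) -> float:
    """SE of the mean of n Bernoulli(p) scores."""
    return math.sqrt(p * (1.0 - p) / n)


def paired_diff_se(p: float, n: int) -> float:
    """SE of the mean diff of two independent Bernoulli(p) score vectors.

    Independence is an assumption; the variance of each diff is 2 * p * (1 - p).
    """
    return math.sqrt(2.0 * p * (1.0 - p) / n)


def ci_half_width(se: float, confidence: float = 0.90) -> float:
    """Normal-approximation half-width of a two-sided CI on the mean."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return z * se


print(f"single-Bernoulli SE (p=0.5, n=50): {bernoulli_se(0.5, 50):.4f}")
print(f"paired-diff SE      (p=0.5, n=50): {paired_diff_se(0.5, 50):.4f}")
print(f"90% CI half-width on the diff:     {ci_half_width(paired_diff_se(0.5, 50)):.4f}")
print(f"single-candidate fallback (n=64):  {0.5 / math.sqrt(64):.4f}")
```

The paired-diff SE is sqrt(2) times the single-Bernoulli SE under this independence assumption, which is why an order-of-magnitude pin is a reasonable tolerance for a bootstrap estimate.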

What's next

A regenerated calibration campaign (~$140, out-of-band) will produce gate_decision.json + band_holdout.json artifacts. An extended study_b_pick_epsilon.py will replay the selector in three modes (the current 1/n_val, the original calibration recommendation 0.5/n_val, and this PR's noise-estimated mode) to decide empirically whether the noise-estimated default beats the geometric one on real per-skill holdout pick quality. If yes, a follow-up PR plumbs the sentinel through. If no, the right call is to drop the selector in favor of val-best and defer to GEPA's best_idx.

…tests

Address review feedback on c9680f7:
- Add a docstring sentence explaining the 0.5/sqrt(n_val) fallback (worst-case binomial SE at p=0.5) so the magic 0.5 is justified.
- Move the n_val computation into the fallback branch; it was dead on the main path.

jramos added 2 commits May 24, 2026 09:10

feat(knee_point): noise-estimated ε helper with synthetic-front unit …

c9680f7

…tests

jramos merged commit d94571f into main May 24, 2026
4 checks passed

jramos deleted the feat/knee-point-noise-estimated-epsilon branch May 24, 2026 15:39

jramos merged 2 commits into main from feat/knee-point-noise-estimated-epsilon
