
feat(knee-point): drop selector for val-best after empirical no-op verification#73

Merged. jramos merged 4 commits into `main` from `feat/knee-point-drop-selector-for-val-best` on May 24, 2026.


@jramos commented May 24, 2026

Summary

Drops the knee-point ε-band selector for the val-best strategy (the default). The val-best path now defers directly to GEPA's `details.best_idx` — matching the GEPA paper's prescribed termination behavior. `--knee-point-strategy smallest` is unchanged and still routes through `select_knee_point` for compression-bias users.

Reverts the noise-estimated ε helper introduced earlier this cycle — investigation showed it would have been a no-op too, and its type signature was incorrect relative to DSPy's actual `val_subscores` shape (`list[dict[int, float]]`, not `list[list[float]]`).
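The shape mismatch can be illustrated with a minimal sketch. The two type shapes come from the description above; the helper names `mean_score_dense` and `mean_score_sparse` and the sample scores are hypothetical and are not taken from this repository or from DSPy.

```python
from statistics import mean

# Shape the reverted helper assumed: one dense list of scores per candidate.
DenseSubscores = list[list[float]]
# Shape reported for DSPy: one sparse {val_example_idx: score} dict per candidate.
SparseSubscores = list[dict[int, float]]


def mean_score_dense(val_subscores: DenseSubscores, cand: int) -> float:
    """Hypothetical helper written against the assumed dense shape."""
    row = val_subscores[cand]
    return mean(row[i] for i in range(len(row)))  # positional indexing


def mean_score_sparse(val_subscores: SparseSubscores, cand: int) -> float:
    """Hypothetical helper that averages only the examples a candidate covers."""
    return mean(val_subscores[cand].values())


# Illustrative data: candidate 1 was only scored on validation examples 0 and 3.
sparse = [{0: 1.0, 1: 0.0, 2: 1.0, 3: 1.0}, {0: 1.0, 3: 0.5}]

try:
    mean_score_dense(sparse, 1)  # looks up keys 0 and 1; key 1 is absent
    dense_ok = True
except KeyError:
    dense_ok = False

sparse_mean = mean_score_sparse(sparse, 1)
```

With sparse coverage, positional indexing either raises or silently reads the wrong example, which is why a helper typed for `list[list[float]]` cannot be reused as-is.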

Why

A regenerated calibration campaign (10 runs across nano-pdf, apple-notes, polymarket, huggingface-hub at N*=250, ratio*=0.65) replayed the selector with 5 ε modes:

| Mode | Mean transfer error | Deploy rate | n_runs |
|---|---|---|---|
| 1.0/n_val (status quo) | 0.0466 | 70% | 10 |
| 0.5/n_val (prior recommendation) | 0.0466 | 70% | 10 |
| 2.0/n_val | 0.0466 | 70% | 10 |
| 3.0/n_val | 0.0466 | 70% | 10 |
| noise-estimated (paired-bootstrap) | 0.0466 | 70% | 10 |

Every mode picks the same candidate in every run. 10/10 agreement with GEPA's val-argmax confirms the selector is a no-op on this corpus. Details in `reports/calibration_findings.md` Finding 3.
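As a rough illustration of why every fixed ε mode can agree, the sketch below builds the ε-band for the four `k/n_val` multipliers on a toy score list. It assumes the band is "all candidates within ε of the best val score", which may differ from what `select_knee_point` actually does; `epsilon_band`, `val_argmax`, and all numbers are hypothetical and are not campaign data.

```python
def epsilon_band(val_scores: list[float], eps: float) -> list[int]:
    """Indices of candidates whose val score is within eps of the best (assumed rule)."""
    best = max(val_scores)
    return [i for i, s in enumerate(val_scores) if s >= best - eps]


def val_argmax(val_scores: list[float]) -> int:
    """Index of the highest val score (first one on ties)."""
    return max(range(len(val_scores)), key=val_scores.__getitem__)


# Illustrative values only, not taken from the calibration campaign.
n_val = 20
val_scores = [0.60, 0.85, 0.65]
multipliers = [0.5, 1.0, 2.0, 3.0]

# One band per epsilon mode: eps = multiplier / n_val.
bands = {m: epsilon_band(val_scores, m / n_val) for m in multipliers}
```

Here the gap between the top two scores (0.20) exceeds the widest ε (3.0/20 = 0.15), so each band contains only the argmax and the choice of ε cannot change the pick.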

Behavior change

Intentional and narrow: when `band_size >= 2`, the candidate at `best_idx` fails static validation, and another band member would pass, the old code picked the non-best candidate; the new val-best path rejects instead of falling back. In the calibration campaign the pick equaled `best_idx` in 10/10 runs, so the band walk was never invoked on this corpus. Users who need the band-walk recovery can switch to `--knee-point-strategy smallest`.
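The difference can be sketched on a toy case. This is an assumption-laden illustration: `band_walk_pick` and `val_best_pick` are hypothetical names, the walk order (descending val score) is assumed, and "reject" is modeled as returning `None`.

```python
from typing import Optional


def band_walk_pick(
    val_scores: list[float], passes_static: list[bool], eps: float
) -> Optional[int]:
    """Old-style behavior (assumed): first band member that passes static validation."""
    best = max(val_scores)
    band = [i for i, s in enumerate(val_scores) if s >= best - eps]
    for i in sorted(band, key=lambda idx: -val_scores[idx]):
        if passes_static[i]:
            return i
    return None


def val_best_pick(best_idx: int, passes_static: list[bool]) -> Optional[int]:
    """New-style behavior (assumed): use best_idx or reject; no fallback."""
    return best_idx if passes_static[best_idx] else None


# Illustrative values: the val-best candidate (index 1) fails static validation.
val_scores = [0.80, 0.82, 0.70]
passes_static = [True, False, True]
best_idx = 1

old_pick = band_walk_pick(val_scores, passes_static, eps=0.05)
new_pick = val_best_pick(best_idx, passes_static)
```

In this toy case the band walk recovers candidate 0, while the val-best path rejects, which is the only scenario in which the two paths diverge.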

Commits

  1. `revert: remove _estimate_val_noise helper + tests`
  2. `refactor(evolve_skill,evolve_tool): defer to GEPA best_idx for val-best knee-point`
  3. `docs(calibration_findings): record empirical confirmation that ε is a no-op`
  4. `polish: clean up dead select_knee_point references + stale CLI help text`

Test plan

  • Full suite: 1144 passed locally (the baseline count minus the 6 deleted noise-estimation tests; otherwise net delta zero, and no replacement tests were needed because the campaign confirmed the helper would have been a no-op anyway)
  • `tests/skills/test_knee_point.py` (21) — smallest-strategy path intact
  • `tests/skills/test_evolve_skill_validation_flow.py` + `tests/tools/test_evolve_tool_validation_flow.py` — `gate_decision.json::knee_point` schema preserved via `_deferred_knee_point_payload` (`band_roster: []`)
  • `tests/{skills,tools}/test_evolve_*_cl_aware_gate.py` — 61 tests confirm gate behavior unchanged
  • `tests/{skills,tools}/test_evolve_*_saturation_preflight.py` — 16 tests, refactored 6 sites to use real GEPA-shaped mocks instead of patched `select_knee_point`
  • Empirical: 10-run campaign artifacts referenced in `calibration_findings.md` Finding 3
  • CI green across all Python versions
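The schema-preservation check above can be pictured with a short sketch. Only the `fallback="gepa_default"` and `band_roster: []` values come from this PR; the function below is a hypothetical stand-in for `_deferred_knee_point_payload`, and the `picked_idx` key and the `band_size` consumer are assumptions added for illustration.

```python
import json


def deferred_knee_point_payload(best_idx: int) -> dict:
    """Hypothetical stand-in for _deferred_knee_point_payload.

    `fallback` and `band_roster` follow the PR text; `picked_idx` is an
    assumed key, not confirmed by the source.
    """
    return {"fallback": "gepa_default", "band_roster": [], "picked_idx": best_idx}


def band_size(gate_decision: dict) -> int:
    """Example downstream consumer that indexes keys directly."""
    return len(gate_decision["knee_point"]["band_roster"])


# Round-trip through JSON, as a consumer reading gate_decision.json would.
gate_decision = json.loads(
    json.dumps({"knee_point": deferred_knee_point_payload(best_idx=3)})
)
```

Keeping `band_roster` present but empty means a consumer that does direct key access sees a band of size zero rather than raising `KeyError`.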

jramos added 4 commits May 24, 2026 12:50
The helper was added on the assumption of list[list[float]] for
val_subscores, but DSPy returns list[dict[int, float]] (sparse-coverage
native). With the empirical finding that the knee-point epsilon band
is a no-op on this corpus, the helper has no production call site and
the wrong type signature. Cleanest action is to revert.
…st knee-point

Regenerated calibration across nano-pdf, apple-notes, polymarket, and
huggingface-hub at N*=250, ratio*=0.65 showed the epsilon-band knee-point
selector picks GEPA's val-argmax 10/10 across five epsilon modes (1/n_val,
0.5/n_val, 2/n_val, 3/n_val, noise-estimated paired-bootstrap). The
selector is a no-op for the default val-best path on this corpus.

evolve_skill now branches the call site on --knee-point-strategy:
val-best (default) skips select_knee_point entirely and uses
details.candidates[best_idx] directly; smallest keeps the existing
band-walk path for compression-bias users. evolve_tool has no strategy
flag, so the call site reduces to the val-best short-circuit.

The gate_decision.json knee_point block now carries either a full
CandidatePick payload (smallest path) or a minimal deferred payload
(fallback="gepa_default", band_roster=[]) so downstream calibration
consumers don't crash on key access.

Two saturation-preflight tests previously relied on a patched
select_knee_point to inject a real CandidatePick; updated to configure
the fake GEPA's compile() output with a real-shaped detailed_results
namespace instead.
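A GEPA-shaped mock of this kind might look roughly like the following. The attribute names `detailed_results`, `candidates`, and `best_idx` appear in this PR; `val_aggregate_scores`, the candidate strings, and the `compile()` keyword arguments are assumptions for illustration, and the real tests may be structured differently.

```python
from types import SimpleNamespace
from unittest.mock import MagicMock

# Real-shaped result namespace instead of a patched select_knee_point.
details = SimpleNamespace(
    candidates=["candidate-0", "candidate-1"],  # placeholder candidates
    val_aggregate_scores=[0.6, 0.8],            # assumed field, illustrative values
    best_idx=1,
)

# Fake optimizer whose compile() returns an object carrying detailed_results.
fake_gepa = MagicMock()
fake_gepa.compile.return_value = SimpleNamespace(detailed_results=details)

# What the val-best short-circuit would read from the compiled program.
optimized = fake_gepa.compile(student=None, trainset=[], valset=[])
picked = optimized.detailed_results.candidates[optimized.detailed_results.best_idx]
```

Because the mock supplies the data the short-circuit reads, the test exercises the real routing rather than a symbol the default path no longer calls.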
… no-op

Add a 2026-05-24 update to Finding 3 documenting the regenerated
four-skill (nano-pdf, apple-notes, polymarket, huggingface-hub) replay
at N*=250, ratio*=0.65: 10 runs × 5 ε modes (1/n_val, 0.5/n_val,
2/n_val, 3/n_val, paired-bootstrap noise-estimated) all produced
identical mean transfer error 0.0466 and 70% deploy rate. The selector
is a no-op on val-best for this corpus; --knee-point-strategy smallest
is preserved for compression-bias users.

Address review feedback on the val-best short-circuit:
- Drop unused select_knee_point / CandidatePick / _knee_point_payload
  from evolve_tool.py (no smallest branch exists there).
- Remove dead select_knee_point patches from 6 test sites where the
  default val-best routing no longer invokes the patched symbol.
- Refresh --knee-point-epsilon and --knee-point-strategy help text
  so the CLI documentation matches the post-commit semantics.
- Add a single line near the val-best short-circuit pointing at the
  smallest strategy as the recovery path for static-failure regressions.
@jramos merged commit 089f391 into main on May 24, 2026
4 checks passed
@jramos deleted the feat/knee-point-drop-selector-for-val-best branch on May 24, 2026 19:47