
feat(evolve): default to gepa improvement-or-equal acceptance criterion (Path D)#74

Merged
jramos merged 6 commits into main from feat/path-d-gepa-acceptance-criterion
May 25, 2026


@jramos jramos commented May 25, 2026

Summary

Defaults acceptance_criterion to improvement_or_equal (was implicit strict in gepa<0.1.2) for GEPA's minibatch acceptance test. Adds --gepa-acceptance {strict-improvement,improvement-or-equal} CLI flag on evolve_skill and evolve_tool, plumbed through dspy.GEPA's gepa_kwargs passthrough to gepa.optimize. Closes the last remaining downstream item from the deploy-gate arc.

Why

GEPA's minibatch acceptance test at gepa/core/engine.py:493 historically hard-coded strict improvement (new_sum > old_sum). Under LM-judge noise on small minibatches (3-8 examples), this rejects "true zero-difference" candidates roughly half the time, narrowing the search and reducing downstream Pareto-frontier diversity.
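To make the rejection rate concrete, here is a minimal sketch of the two acceptance operators and the probability that each accepts a candidate whose true quality equals its parent's. It assumes a simplified noise model that is not taken from gepa: each minibatch example scores 0 or 1 independently with the same success probability for both candidates. The function names are illustrative only.

```python
from math import comb


def accepts(new_sum: float, old_sum: float, criterion: str) -> bool:
    """Illustrative acceptance test; mirrors the two criteria named in this PR."""
    if criterion == "strict_improvement":
        return new_sum > old_sum
    if criterion == "improvement_or_equal":
        return new_sum >= old_sum
    raise ValueError(f"Unknown acceptance_criterion: {criterion}")


def acceptance_rates(n: int, p: float) -> tuple[float, float]:
    """Return (P(new > old), P(new >= old)) for two independent Binomial(n, p) sums.

    Assumption: binary per-example scores, identical true quality for both candidates.
    """
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    p_tie = sum(q * q for q in pmf)   # both candidates land on the same sum
    p_strict = (1.0 - p_tie) / 2.0    # by symmetry, new > old and new < old are equally likely
    return p_strict, p_strict + p_tie


if __name__ == "__main__":
    for n in (3, 5, 8):
        strict, non_strict = acceptance_rates(n, 0.5)
        print(f"n={n}: strict accepts {strict:.3f}, improvement-or-equal accepts {non_strict:.3f}")
```

Under this model the gap between the two criteria is exactly the tie probability, which is largest on small minibatches with coarse scores.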

The strict-vs-non-strict choice has no motivation in the GEPA paper (arXiv:2507.19457): Algorithm 1 just says "if σ′ improved" — undefined operator, no ablation, no discussion of minibatch noise. The paper's first author (Lakshya Agrawal) shipped this as a configurable choice in gepa-ai/gepa#304 with the explicit rationale that improvement-or-equal "allow[s] lateral moves that don't improve the score but may explore different regions of the solution space."

Adjacent literature (Beyer 2000, Aizawa & Wah 1994, Rakshit et al. 2017) treats strict-elitist acceptance under noisy fitness as a known anti-pattern. Improvement-or-equal is the lightest-touch mainstream fix.

Empirical validation (A/B smoke)

Same skill (nano-pdf), same seed (42), same everything else — only --gepa-acceptance varies:

| Metric | strict-improvement | improvement-or-equal | Delta |
| --- | --- | --- | --- |
| run_inputs.gepa_acceptance (sanity) | strict_improvement | improvement_or_equal | kwarg reached gepa |
| Cost | $1.24 | $1.26 | basically free |
| Candidates accepted | 5 | 10 | +5 (2x as many) |
| Picked val score | 0.860 | 0.860 | tied |
| Picked body chars | 1631 | 1797 | +166 (+10%) |
| Holdout improvement Δ | +0.446 | +0.487 | +0.041 (~9% better) |
| Bootstrap lower bound | 0.388 | 0.425 | +0.037 (tighter CI) |
| Time | 294.7s | 328.9s | +34s (more candidates evaluated) |

Improvement-or-equal accepted 2x as many candidates, in line with the theoretical prediction. In this single-seed run, the larger candidate pool coincided with a +0.041 holdout improvement and a higher bootstrap lower bound, for +$0.02 in cost and about 34s of extra wall time.
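The deltas reported for the A/B smoke can be rechecked from the two runs' raw values. The sketch below only recomputes the differences and relative changes; the dictionary keys are illustrative names, not fields of any real report file.

```python
# Raw values copied from the A/B smoke results above; key names are illustrative.
strict = {"cost_usd": 1.24, "accepted": 5, "body_chars": 1631,
          "holdout_delta": 0.446, "bootstrap_lb": 0.388, "seconds": 294.7}
improvement_or_equal = {"cost_usd": 1.26, "accepted": 10, "body_chars": 1797,
                        "holdout_delta": 0.487, "bootstrap_lb": 0.425, "seconds": 328.9}


def absolute_deltas(a: dict, b: dict) -> dict:
    """Per-metric difference b - a, rounded to three decimals."""
    return {key: round(b[key] - a[key], 3) for key in a}


def relative_change(a: dict, b: dict, key: str) -> float:
    """Change in one metric as a fraction of the baseline value."""
    return (b[key] - a[key]) / a[key]


if __name__ == "__main__":
    for key, delta in absolute_deltas(strict, improvement_or_equal).items():
        rel = relative_change(strict, improvement_or_equal, key)
        print(f"{key}: {delta:+} ({rel:+.1%})")
```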

Dependency setup

The gepa PR landed 2026-04-06 but no PyPI release contains it yet (latest gepa==0.1.1 was uploaded 2026-03-16; DSPy 3.2.1 still pins gepa==0.0.27). Bridged by:

  • Git-pinning gepa to PR #304's merge SHA 5e24ee5c8e1857a62a1ba19731de9da45ffb6f1b
  • [tool.uv] override-dependencies to bypass DSPy 3.2.0's hard-pin on gepa[dspy]==0.0.27

Documented inline in pyproject.toml. When gepa 0.1.2 ships (or DSPy bumps via stanfordnlp/dspy#9673, which is merged but unreleased), the git pin + override can be swapped to a version pin in a one-line change.
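For orientation, one possible shape of that bridge is sketched below. The actual pyproject.toml is not reproduced in this description, so treat the layout as an assumption: the git pin could equally be expressed through `[tool.uv.sources]`, and the repository URL is inferred from the gepa-ai/gepa#304 reference.

```toml
[project]
dependencies = [
    # Git pin to the merge commit of gepa-ai/gepa#304 (assumed placement).
    "gepa @ git+https://github.com/gepa-ai/gepa@5e24ee5c8e1857a62a1ba19731de9da45ffb6f1b",
]

[tool.uv]
# Replaces DSPy's transitive hard pin on gepa during resolution.
override-dependencies = [
    "gepa @ git+https://github.com/gepa-ai/gepa@5e24ee5c8e1857a62a1ba19731de9da45ffb6f1b",
]
```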

Bug caught during validation

Initial CLI choice was ["strict", "improvement-or-equal"]. The hyphen-to-underscore conversion produced "strict" for the first option, but gepa rejects unknown criteria — only "strict_improvement" and "improvement_or_equal" are valid. Caught by the A/B smoke (first attempt's strict run fell back to MIPROv2 → optuna ImportError), fixed by renaming the CLI value to strict-improvement.
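A minimal sketch of the corrected flag handling follows, assuming the flag is declared with argparse and converted by a plain hyphen-to-underscore replace. The helper names `build_parser` and `to_gepa_kwargs` are hypothetical; only the flag name, its two choices, and the two criterion strings come from this PR.

```python
import argparse

# The two strings gepa accepts, per this PR description.
VALID_CRITERIA = {"strict_improvement", "improvement_or_equal"}


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser carrying only the new flag."""
    parser = argparse.ArgumentParser(prog="evolve_skill")
    parser.add_argument(
        "--gepa-acceptance",
        choices=["strict-improvement", "improvement-or-equal"],
        default="improvement-or-equal",
        help="Minibatch acceptance criterion forwarded to gepa.",
    )
    return parser


def to_gepa_kwargs(cli_value: str) -> dict:
    """Convert the CLI spelling to gepa's spelling and fail fast on unknown values."""
    criterion = cli_value.replace("-", "_")
    if criterion not in VALID_CRITERIA:
        raise ValueError(f"Unknown acceptance_criterion: {criterion}")
    return {"acceptance_criterion": criterion}


if __name__ == "__main__":
    args = build_parser().parse_args(["--gepa-acceptance", "strict-improvement"])
    print(to_gepa_kwargs(args.gepa_acceptance))
```

Validating against the accepted set at the CLI boundary would have surfaced the original "strict" value before it reached gepa and triggered the fallback path.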

Test plan

  • Unit tests pin the gepa_kwargs passthrough at the DSPy constructor (skill + tool sides, both criteria)
  • run_inputs records the gepa_acceptance value for forensic replay
  • Full suite green locally (1148 passed)
  • A/B smoke validates end-to-end behavior + kwarg reaches gepa correctly
  • CLI --help shows both choices with explanation
  • CI green across all Python versions
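The passthrough tests in the first item could look roughly like the sketch below. It uses a `MagicMock` in place of the real optimizer constructor so nothing is imported from DSPy; `make_optimizer` is a hypothetical stand-in for the repository's construction code, and only the `gepa_kwargs` keyword and the `acceptance_criterion` key come from this PR.

```python
from unittest.mock import MagicMock


def make_optimizer(gepa_cls, acceptance: str, **other_kwargs):
    """Hypothetical factory: forwards the criterion through gepa_kwargs."""
    return gepa_cls(gepa_kwargs={"acceptance_criterion": acceptance}, **other_kwargs)


def captured_gepa_kwargs(acceptance: str) -> dict:
    """Build the optimizer with a fake constructor and return what it received."""
    fake_gepa = MagicMock(name="GEPA")
    make_optimizer(fake_gepa, acceptance)
    _, kwargs = fake_gepa.call_args  # (positional args, keyword args) of the last call
    return kwargs["gepa_kwargs"]


def test_passthrough_both_criteria():
    for criterion in ("strict_improvement", "improvement_or_equal"):
        assert captured_gepa_kwargs(criterion) == {"acceptance_criterion": criterion}


if __name__ == "__main__":
    test_passthrough_both_criteria()
    print("passthrough ok")
```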

Files

  • pyproject.toml — git pin gepa to PR #304 merge SHA + uv override-dependencies
  • evolution/core/config.py — gepa_acceptance: str = "improvement_or_equal" field
  • evolution/skills/evolve_skill.py + evolution/tools/evolve_tool.py — CLI flag, plumbing
  • evolution/core/run_inputs.py — record gepa_acceptance in the run_inputs payload
  • tests/{skills,tools}/test_evolve_*_validation_flow.py — passthrough regression tests
  • tests/core/test_run_inputs.py — new field assertion
  • reports/calibration_findings.md — Path D section with the rationale
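As a rough picture of how the config field and the run_inputs record fit together, consider the sketch below. The class name `EvolutionConfig` and the `build_run_inputs` signature are assumptions made for illustration; only the field name, its default, and the `gepa_acceptance` payload key are taken from this PR.

```python
from dataclasses import dataclass


@dataclass
class EvolutionConfig:
    """Hypothetical config class; only the field below is described in this PR."""
    gepa_acceptance: str = "improvement_or_equal"


def build_run_inputs(config: EvolutionConfig, **other_fields) -> dict:
    """Assumed signature: returns a payload that records the acceptance criterion."""
    payload = dict(other_fields)
    payload["gepa_acceptance"] = config.gepa_acceptance
    return payload


if __name__ == "__main__":
    print(build_run_inputs(EvolutionConfig(), seed=42))
```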

jramos added 6 commits May 24, 2026 14:32
Adds the gepa_acceptance string ("strict" or "improvement_or_equal")
to the run_inputs payload so a third party holding only
gate_decision.json can tell which acceptance criterion produced the
run. Threaded through all 5 build_run_inputs call sites; updated
schema tests for the new key.

The shipped --gepa-acceptance flag offered "strict" as a choice, but
gepa.optimize rejects that string and only accepts "strict_improvement"
or "improvement_or_equal". The smoke surfaced this: a --gepa-acceptance
strict run raised ValueError("Unknown acceptance_criterion: strict")
inside gepa, triggering the MIPROv2 fallback path.

Rename the CLI choice to "strict-improvement" so the hyphen→underscore
conversion produces gepa's canonical "strict_improvement". Update tests,
config docstring, help text, calibration_findings reference.
@jramos jramos merged commit eb496c3 into main May 25, 2026
4 checks passed
@jramos jramos deleted the feat/path-d-gepa-acceptance-criterion branch May 25, 2026 00:35