feat(evolve): default to gepa improvement-or-equal acceptance criterion (Path D) by jramos · Pull Request #74 · jramos/agent-self-evolution

jramos · 2026-05-25T00:28:44Z

Summary

Defaults acceptance_criterion to improvement_or_equal (was implicit strict in gepa<0.1.2) for GEPA's minibatch acceptance test. Adds --gepa-acceptance {strict-improvement,improvement-or-equal} CLI flag on evolve_skill and evolve_tool, plumbed through dspy.GEPA's gepa_kwargs passthrough to gepa.optimize. Closes the last remaining downstream item from the deploy-gate arc.

Why

GEPA's minibatch acceptance test at gepa/core/engine.py:493 historically hard-coded strict improvement (new_sum > old_sum). Under LM-judge noise on small minibatches (3-8 examples), this rejects "true zero-difference" candidates roughly half the time, narrowing the search and reducing downstream Pareto-frontier diversity.
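The effect can be illustrated with a small simulation. This is a sketch under stated assumptions, not gepa's engine code: the `accepts` helper and the three-valued per-example judge score are hypothetical, and parent and candidate are given identical true quality so that any difference in minibatch sums is pure noise.

```python
import random


def accepts(new_sum: float, old_sum: float, criterion: str) -> bool:
    """Minibatch acceptance test under the two criteria named in this PR."""
    if criterion == "strict_improvement":
        return new_sum > old_sum
    if criterion == "improvement_or_equal":
        return new_sum >= old_sum
    raise ValueError(f"Unknown acceptance_criterion: {criterion}")


def acceptance_rate(criterion: str, batch_size: int = 5,
                    trials: int = 20_000, seed: int = 42) -> float:
    """Fraction of zero-difference candidates accepted under noisy scoring.

    Assumption: each example is scored 0.0, 0.5, or 1.0 uniformly at random
    for both parent and candidate, so neither is truly better.
    """
    rng = random.Random(seed)
    accepted = 0
    for _ in range(trials):
        old_sum = sum(rng.choice((0.0, 0.5, 1.0)) for _ in range(batch_size))
        new_sum = sum(rng.choice((0.0, 0.5, 1.0)) for _ in range(batch_size))
        accepted += accepts(new_sum, old_sum, criterion)
    return accepted / trials


for name in ("strict_improvement", "improvement_or_equal"):
    print(name, round(acceptance_rate(name), 3))
```

With discrete scores, ties are common on small minibatches. Strict improvement rejects every tie, so it accepts fewer than half of the equal-quality candidates, while improvement-or-equal accepts more than half.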

The strict-vs-non-strict choice has no motivation in the GEPA paper (arXiv:2507.19457): Algorithm 1 just says "if σ′ improved" — undefined operator, no ablation, no discussion of minibatch noise. The paper's first author (Lakshya Agrawal) shipped this as a configurable choice in gepa-ai/gepa#304 with the explicit rationale that improvement-or-equal "allow[s] lateral moves that don't improve the score but may explore different regions of the solution space."

Adjacent literature (Beyer 2000, Aizawa & Wah 1994, Rakshit et al. 2017) treats strict-elitist acceptance under noisy fitness as a known anti-pattern. Improvement-or-equal is the lightest-touch mainstream fix.

Empirical validation (A/B smoke)

Same skill (nano-pdf), same seed (42), same everything else — only --gepa-acceptance varies:

| Metric | strict-improvement | improvement-or-equal | Delta |
|---|---|---|---|
| `run_inputs.gepa_acceptance` (sanity) | `strict_improvement` ✓ | `improvement_or_equal` ✓ | kwarg reached gepa |
| Cost | $1.24 | $1.26 | +$0.02 (basically free) |
| Candidates accepted | 5 | 10 | +5 (2x as many) |
| Picked val score | 0.860 | 0.860 | tied |
| Picked body chars | 1631 | 1797 | +166 (+10%) |
| Holdout improvement Δ | +0.446 | +0.487 | +0.041 (~9% better) |
| Bootstrap lower bound | 0.388 | 0.425 | +0.037 (higher lower bound) |
| Time | 294.7s | 328.9s | +34s (more candidates evaluated) |

Improvement-or-equal accepted twice as many candidates (10 vs. 5), consistent with the prediction above. In this single-seed run, the larger accepted pool coincided with a +0.041 holdout improvement and a higher bootstrap lower bound, for an extra $0.02 and about 34 s of wall time.
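The deltas in the table can be re-derived from the raw values. The snippet below is only an arithmetic check on the numbers reported above; the dictionary keys are illustrative names, not fields from the project.

```python
# Values copied from the A/B smoke table (seed 42, skill nano-pdf).
strict = {"accepted": 5, "holdout_delta": 0.446,
          "bootstrap_lower": 0.388, "body_chars": 1631}
ioe = {"accepted": 10, "holdout_delta": 0.487,
       "bootstrap_lower": 0.425, "body_chars": 1797}

# Ratio of accepted candidates between the two runs.
accept_ratio = ioe["accepted"] / strict["accepted"]

# Absolute and relative change in holdout improvement.
holdout_gain = ioe["holdout_delta"] - strict["holdout_delta"]
holdout_rel_pct = 100 * holdout_gain / strict["holdout_delta"]

# Change in the bootstrap lower bound and in picked body length.
lower_bound_gain = ioe["bootstrap_lower"] - strict["bootstrap_lower"]
body_rel_pct = 100 * (ioe["body_chars"] - strict["body_chars"]) / strict["body_chars"]

print(f"accepted ratio: {accept_ratio:.1f}x")
print(f"holdout gain: {holdout_gain:+.3f} ({holdout_rel_pct:.1f}%)")
print(f"lower-bound gain: {lower_bound_gain:+.3f}")
print(f"body length change: {body_rel_pct:+.1f}%")
```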

Dependency setup

The gepa PR landed 2026-04-06 but no PyPI release contains it yet (latest gepa==0.1.1 was uploaded 2026-03-16; DSPy 3.2.1 still pins gepa==0.0.27). Bridged by:

Git-pinning gepa to PR #304's merge SHA 5e24ee5c8e1857a62a1ba19731de9da45ffb6f1b
[tool.uv] override-dependencies to bypass DSPy 3.2.0's hard-pin on gepa[dspy]==0.0.27

Documented inline in pyproject.toml. When gepa 0.1.2 ships (or DSPy bumps via stanfordnlp/dspy#9673, which is merged but unreleased), the git pin + override can be swapped to a version pin in a one-line change.
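A rough sketch of what such a bridge might look like in pyproject.toml follows. The repository URL is inferred from the gepa-ai/gepa#304 reference, and the exact layout of the project's real pyproject.toml (for example, whether it uses `[tool.uv.sources]` instead of a direct reference) is an assumption.

```toml
[project]
dependencies = [
    # Direct git reference to the PR #304 merge commit (assumed form).
    "gepa @ git+https://github.com/gepa-ai/gepa@5e24ee5c8e1857a62a1ba19731de9da45ffb6f1b",
]

[tool.uv]
# Replaces DSPy's transitive gepa pin during resolution.
override-dependencies = [
    "gepa @ git+https://github.com/gepa-ai/gepa@5e24ee5c8e1857a62a1ba19731de9da45ffb6f1b",
]
```

Once a release containing the change exists, both entries would presumably collapse to an ordinary version specifier.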

Bug caught during validation

Initial CLI choice was ["strict", "improvement-or-equal"]. The hyphen-to-underscore conversion produced "strict" for the first option, but gepa rejects unknown criteria — only "strict_improvement" and "improvement_or_equal" are valid. Caught by the A/B smoke (first attempt's strict run fell back to MIPROv2 → optuna ImportError), fixed by renaming the CLI value to strict-improvement.
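The conversion and the failure mode can be sketched as follows. This assumes an argparse-based CLI; the project's actual parser library, the `build_parser` and `to_gepa_criterion` helpers, and the help text are hypothetical. Deriving the CLI choices from the canonical names keeps the two from drifting apart.

```python
import argparse

# gepa's canonical criterion names, as stated in this PR description.
VALID_CRITERIA = ("strict_improvement", "improvement_or_equal")


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser exposing the flag with hyphenated choices."""
    parser = argparse.ArgumentParser(prog="evolve_skill")
    parser.add_argument(
        "--gepa-acceptance",
        # Choices are generated from the canonical names, so every CLI
        # value maps back to a string gepa accepts.
        choices=[name.replace("_", "-") for name in VALID_CRITERIA],
        default="improvement-or-equal",
        help="GEPA minibatch acceptance criterion.",
    )
    return parser


def to_gepa_criterion(cli_value: str) -> str:
    """Convert a hyphenated CLI value to gepa's underscore form."""
    criterion = cli_value.replace("-", "_")
    if criterion not in VALID_CRITERIA:
        # Fail at the CLI boundary instead of deep inside the optimizer.
        raise ValueError(f"Unknown acceptance_criterion: {criterion}")
    return criterion


args = build_parser().parse_args(["--gepa-acceptance", "strict-improvement"])
print(to_gepa_criterion(args.gepa_acceptance))
```

Under this scheme the original `strict` value would be rejected before any optimizer runs, rather than surfacing as a fallback to a different optimizer.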

Test plan

Unit tests pin the gepa_kwargs passthrough at the DSPy constructor (skill + tool sides, both criteria)
run_inputs records the gepa_acceptance value for forensic replay
Full suite green locally (1148 passed)
A/B smoke validates end-to-end behavior + kwarg reaches gepa correctly
CLI --help shows both choices with explanation
CI green across all Python versions
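A passthrough pin of the kind listed above might look like the sketch below. It uses a recording test double in place of dspy.GEPA so that no optimizer runs. `EvolveConfig`, `make_optimizer`, and `RecordingGEPA` are hypothetical names, and the real tests may patch the constructor differently.

```python
from dataclasses import dataclass


@dataclass
class EvolveConfig:
    """Hypothetical stand-in for the project's config object."""
    gepa_acceptance: str = "improvement_or_equal"


class RecordingGEPA:
    """Test double that records constructor kwargs instead of optimizing."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs


def make_optimizer(config: EvolveConfig, optimizer_cls):
    # Hypothetical factory: production code would pass dspy.GEPA here,
    # which forwards gepa_kwargs on to gepa.optimize.
    return optimizer_cls(
        gepa_kwargs={"acceptance_criterion": config.gepa_acceptance}
    )


def test_acceptance_criterion_passthrough():
    """Both criteria must reach the optimizer constructor unchanged."""
    for criterion in ("strict_improvement", "improvement_or_equal"):
        optimizer = make_optimizer(
            EvolveConfig(gepa_acceptance=criterion), RecordingGEPA
        )
        assert optimizer.kwargs["gepa_kwargs"] == {
            "acceptance_criterion": criterion
        }


test_acceptance_criterion_passthrough()
```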

Files

pyproject.toml — git pin gepa to PR #304 merge SHA + uv override-dependencies
evolution/core/config.py — gepa_acceptance: str = "improvement_or_equal" field
evolution/skills/evolve_skill.py + evolution/tools/evolve_tool.py — CLI flag, plumbing
evolution/core/run_inputs.py — record gepa_acceptance in the run_inputs payload
tests/{skills,tools}/test_evolve_*_validation_flow.py — passthrough regression tests
tests/core/test_run_inputs.py — new field assertion
reports/calibration_findings.md — Path D section with the rationale
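The config field and the run_inputs recording could fit together roughly as sketched here. Only the field name and its default come from this PR. The `EvolutionConfig` class name, the `build_run_inputs` signature, and the other payload keys are assumptions made for illustration.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class EvolutionConfig:
    """Hypothetical config carrying the new field with its default."""
    gepa_acceptance: str = "improvement_or_equal"


def build_run_inputs(config: EvolutionConfig, skill: str, seed: int) -> dict:
    """Assemble a payload that records which criterion produced a run.

    The real function takes more inputs; this signature is assumed.
    """
    return {
        "skill": skill,
        "seed": seed,
        "gepa_acceptance": config.gepa_acceptance,
    }


payload = build_run_inputs(EvolutionConfig(), skill="nano-pdf", seed=42)
print(json.dumps(payload, indent=2))
```

Recording the value in the payload means the acceptance criterion can be read back from the run artifacts without access to the original command line.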

Adds the gepa_acceptance string ("strict" or "improvement_or_equal") to the run_inputs payload so a third party holding only gate_decision.json can tell which acceptance criterion produced the run. Threaded through all 5 build_run_inputs call sites; updated schema tests for the new key.

The shipped --gepa-acceptance flag offered "strict" as a choice, but gepa.optimize rejects that string and only accepts "strict_improvement" or "improvement_or_equal". The smoke surfaced this: a --gepa-acceptance strict run raised ValueError("Unknown acceptance_criterion: strict") inside gepa, triggering the MIPROv2 fallback path. Rename the CLI choice to "strict-improvement" so the hyphen→underscore conversion produces gepa's canonical "strict_improvement". Update tests, config docstring, help text, calibration_findings reference.

jramos added 6 commits May 24, 2026 14:32

chore(deps): pin gepa to PR-304 merge SHA for acceptance-criterion API

314b76f

feat(evolve): --gepa-acceptance CLI flag with improvement-or-equal default

80e6831

test(evolve): pin gepa_kwargs passthrough for acceptance_criterion

0d793cf

docs: mark Path D resolved (gepa improvement-or-equal default)

68e89af

jramos merged commit eb496c3 into main May 25, 2026
4 checks passed

jramos deleted the feat/path-d-gepa-acceptance-criterion branch May 25, 2026 00:35
