feat: --gepa-minibatch-size CLI flag (Path E) #65
Merged
Exposes GEPA's existing reflection_minibatch_size kwarg as a CLI flag so users hitting the saturation pre-flight's weak_signal band can widen the sampling window without an upstream PR.

Background: GEPA's acceptance gate is sum(subsample_scores) over a small random minibatch (gepa/core/engine.py:491-493). At the default minibatch=3, on a saturated baseline with ~15% failure rate, the discriminating examples appear in only ~34% of minibatches (spike #2 in reports/pareto_frontier_feasibility.md: 40 proposals rejected over 116 GEPA iterations with "sum(N.0) not better than N.0" patterns). Bumping minibatch to 8 raises that probability to ~68%, giving the acceptance gate the contrast it needs.

Changes:
- evolution/core/config.py: new EvolutionConfig.reflection_minibatch_size field (default 3, matches GEPA's own default).
- evolution/skills/evolve_skill.py + evolution/tools/evolve_tool.py: new --gepa-minibatch-size click option with IntRange(min=1) validation. Threaded through main → evolve → EvolutionConfig → dspy.GEPA(reflection_minibatch_size=...). Help text is pipeline-aware: tool side recommends an --iterations bump (uses max_full_evals); skill side recommends --budget heavy (uses auto).
- Both pipelines: post-dataset-build guard that aborts at startup if --gepa-minibatch-size exceeds the trainset size, with an actionable message. Without the guard, GEPA's EpochShuffledBatchSampler asserts mid-optimization at gepa/strategies/batch_sampler.py:71.
- evolution/core/saturation_check.py: weak_signal band suggestions now recommend the specific flag (replaces the "Path E follow-up would help once landed" placeholder).
- Tests: new TestGepaMinibatchSizeFlag classes in both pipelines. Each patches dspy.GEPA.__init__ to verify the kwarg reaches self.reflection_minibatch_size post-construction (catches future DSPy renames), plus a test that the trainset-ceiling guard fires with the expected message and exit code 1.

Default unchanged at 3: no behavior change for existing scripts / CI. Users hitting weak_signal get the panel telling them to bump to 8 + pipeline-specific budget compensation.

Full suite: 1082 passed (was 1078 → +4 new tests). Verified locally with env -i ... OPENAI_API_KEY=sk-fake-test-key uv run pytest to match CI conditions.

Implements Path E from reports/pareto_frontier_feasibility.md. Path D (Pareto-dominance acceptance) and Path C (stratified sampling) remain future work; Path C is shippable without an upstream PR via dspy.GEPA(gepa_kwargs={"batch_sampler": ...}) if Path E proves insufficient on harder cases.
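The ~34% and ~68% figures quoted in the background can be reproduced with a short hypergeometric calculation. This is a sketch, not code from the repository: it assumes the K=7 discriminating examples out of N=56 that the PR description cites, and the helper name `p_discriminating` is hypothetical.

```python
from math import comb


def p_discriminating(trainset_size: int, n_discriminating: int, minibatch_size: int) -> float:
    """Probability that a minibatch drawn without replacement contains
    at least one discriminating example (hypergeometric tail)."""
    # Ways to draw a minibatch that misses every discriminating example.
    misses = comb(trainset_size - n_discriminating, minibatch_size)
    # All possible minibatches of this size.
    total = comb(trainset_size, minibatch_size)
    return 1.0 - misses / total


if __name__ == "__main__":
    # Assumed values from the PR description: N=56 examples, K=7 discriminating.
    for size in (3, 8):
        print(f"minibatch={size}: {p_discriminating(56, 7, size):.1%}")
```

Under these assumptions the hit probability comes out at roughly one in three for a minibatch of 3 and roughly two in three for a minibatch of 8, which is the contrast the acceptance gate needs.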
Summary
- Exposes GEPA's `reflection_minibatch_size` kwarg as `--gepa-minibatch-size` on both `evolve_skill` and `evolve_tool`. Default unchanged at 3 (matches GEPA's own default; no behavior change for existing scripts/CI). Users hitting the saturation pre-flight's `weak_signal` band now get the panel telling them to bump to 8 + pipeline-specific budget compensation.
- New `EvolutionConfig.reflection_minibatch_size` field (canonical home, auto-serialized into `metrics.json`).
- Post-dataset-build guard in both pipelines aborts at startup if `--gepa-minibatch-size` exceeds the trainset size. Without the guard, GEPA's `EpochShuffledBatchSampler` asserts mid-optimization at `gepa/strategies/batch_sampler.py:71`.
- Updates `saturation_check.py`'s `weak_signal` suggestions to recommend the new flag concretely (replaces the "Path E follow-up would help once landed" placeholder).
- Help text is pipeline-aware: tool side recommends an `--iterations` bump (uses `max_full_evals`); skill side recommends `--budget heavy` (uses `auto`).

Background
Implements Path E from `reports/pareto_frontier_feasibility.md`. Spike #2 showed that on a saturated baseline with ~15% behavioral failure rate, GEPA's `sum(subsample_scores)` acceptance gate rejected all 40 candidate proposals over 116 iterations, because a random 3-example minibatch contains a discriminating example only ~34% of the time (hypergeometric, K=7 / N=56). Bumping to 8 raises that to ~68%. Path E ships the user-facing knob; Path F (saturation pre-flight, already merged) is the user-facing surface that recommends it.

Test plan
- `uv run pytest -q`: expect 1082 passed (up from 1078 pre-branch, +4 new tests across `TestGepaMinibatchSizeFlag` in both pipelines).
- `uv run python -m evolution.tools.evolve_tool --help | grep -A 12 gepa-minibatch` shows the tool-side help mentioning `--iterations`; `uv run python -m evolution.skills.evolve_skill --help | grep -A 12 gepa-minibatch` shows the skill-side help mentioning `--budget heavy`.
- Run with `--gepa-minibatch-size 1000` against a tiny synthetic dataset; expect a clean exit 1 with `"exceeds trainset size"` in stderr instead of a deep GEPA assertion.
- 3 runs at `--gepa-minibatch-size 8` (success: ≥2 of 3 produce at least one accepted proposal) + 1 reverse-control at `--gepa-minibatch-size 3` (success: reproduces the spike #2 all-rejected pattern). Verifies the mechanism actually moves selection.

Scope notes
- Path D (Pareto-dominance acceptance) and Path C (stratified sampling) remain future work; Path C is shippable without an upstream PR via `dspy.GEPA(gepa_kwargs={"batch_sampler": StratifiedBatchSampler(...)})` if Path E proves insufficient on harder cases.
- … (`closed_loop_*` triple, `DEFAULT_THRESHOLDS` → frozen dataclass) is a separate follow-up, not in this PR.
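
As an illustration of the trainset-ceiling guard described in the summary and test plan, the sketch below shows one way such a check could look. It is not the code merged in this PR: the function name `check_minibatch_fits` and the full message wording are hypothetical. Only the `"exceeds trainset size"` substring and exit code 1 come from the test plan.

```python
import sys


def check_minibatch_fits(minibatch_size: int, trainset_size: int) -> None:
    """Exit with status 1 if the requested minibatch cannot be drawn from the trainset.

    Hypothetical helper: intended to run right after the dataset is built,
    before any optimizer is constructed.
    """
    if minibatch_size > trainset_size:
        # Actionable message on stderr instead of a deep assertion later on.
        print(
            f"error: --gepa-minibatch-size {minibatch_size} exceeds trainset size "
            f"{trainset_size}; lower the flag or provide more training examples.",
            file=sys.stderr,
        )
        raise SystemExit(1)
```

With this sketch, `check_minibatch_fits(1000, 12)` exits with status 1 before optimization starts, while `check_minibatch_fits(8, 56)` returns normally.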