feat: skill-side deploy-gate CL-awareness (lockstep with tool side) #70
Merged
Symmetric to evolve_tool's CL-aware gate work. Today only sat_report.holdout_per_example survives past the preflight call site; the CL-aware deploy gate work needs band + cl_per_example + the two trigger scores too. Bind four new locals; all default to None on the --no-saturation-check path so the deploy gate can branch safely. No behavior change; existing tests pass unchanged.
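A minimal sketch of the plumbing described above, assuming a dataclass stand-in for the saturation report. Only holdout_per_example, band, and cl_per_example are named in this commit; the two trigger-score attribute names and the helper bind_preflight_locals are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SaturationReport:
    # Stand-in for the real report type (assumption, not the project's class).
    holdout_per_example: List[float]
    band: str
    cl_per_example: List[float]
    synthetic_trigger_score: float  # hypothetical attribute name
    cl_trigger_score: float  # hypothetical attribute name


def bind_preflight_locals(sat_report: Optional[SaturationReport]):
    """Return the four values the deploy gate needs, or all None."""
    if sat_report is None:  # --no-saturation-check path: preflight never ran
        return None, None, None, None
    return (
        sat_report.band,
        sat_report.cl_per_example,
        sat_report.synthetic_trigger_score,
        sat_report.cl_trigger_score,
    )


band, cl_per_example, synth_score, cl_score = bind_preflight_locals(None)
# Comparing None against a band name is simply False, so the gate can branch safely.
use_cl_primary = band == "weak_signal"
```

Defaulting every local to None keeps the later branch a plain equality check, with no separate "did preflight run" flag.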
Symmetric to evolve_tool's CL-aware gate (already on main). When
preflight reports weak_signal AND closed-loop is configured, run
force_run(evolved_body) post-GEPA and gate on _check_cl_primary_gate.
Three abort paths writing schema-v5 gate_decision.json:
- cl_eval_failed: force_run raised
- cl_eval_incomplete: evolved CL task abstained (runner errored —
distinguished from genuine task failure via TaskResult.abstained)
- cl_primary_gate reject: returned by the gate helper itself
_check_absolute_char_ceiling preserved in CL-primary path; all other bands
fall through to today's synthetic path unchanged.
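The branching above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in evolve_skill.py: gate_outcome, its parameters, and the returned labels "synthetic_path", "cl_primary_gate_reject", and "deploy" are hypothetical, the gate helper is passed in as a plain callable, and the absolute character ceiling check is left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class TaskResult:
    passed: bool
    abstained: bool = False  # runner errored, as opposed to a genuine task failure


def gate_outcome(
    band: Optional[str],
    cl_configured: bool,
    force_run: Callable[[str], List[TaskResult]],
    evolved_body: str,
    cl_primary_gate: Callable[[List[TaskResult]], bool],
) -> str:
    """Return an illustrative label for the path the deploy gate takes."""
    if not (band == "weak_signal" and cl_configured):
        return "synthetic_path"  # other bands, or preflight skipped (band is None)
    try:
        results = force_run(evolved_body)
    except Exception:
        return "cl_eval_failed"  # force_run raised
    if any(r.abstained for r in results):
        return "cl_eval_incomplete"  # at least one evolved CL task abstained
    if not cl_primary_gate(results):
        return "cl_primary_gate_reject"  # rejection returned by the gate helper
    return "deploy"
```

Checking abstained before calling the gate helper keeps a runner error from being counted as a task the evolved skill genuinely failed.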
Key skill-specific differences from tool-side:
- force_run called with evolved_body (no YAML frontmatter) to match
the cache key the preflight set up with baseline body.
- Abort path writes evolved_FAILED.md (matching existing skill reject
convention), not evolved_FAILED.json.
- growth_pct uses evolved_full vs skill["raw"] (existing convention).
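To make the body-versus-full distinction concrete, here is a small sketch. The frontmatter splitter is a toy written for this example, the skill text is made up, and the growth formula (relative change in character count) is an assumption about how growth_pct is computed; only the choice of operands (evolved_full vs skill["raw"]) comes from this commit.

```python
from typing import Tuple


def split_frontmatter(raw: str) -> Tuple[str, str]:
    """Split leading '---' delimited YAML frontmatter from the body (toy parser)."""
    if raw.startswith("---\n"):
        end = raw.find("\n---\n", 4)
        if end != -1:
            return raw[: end + 5], raw[end + 5 :]
    return "", raw


raw = "---\nname: demo-skill\n---\nAlways reproduce the bug first.\n"
frontmatter, body = split_frontmatter(raw)
skill = {"raw": raw, "body": body}

evolved_body = skill["body"] + "Then write a failing test.\n"
evolved_full = frontmatter + evolved_body

# The closed-loop call receives the body only, matching what the preflight used.
cl_call_arg = evolved_body

# Growth is measured on the full text, frontmatter included (assumed formula).
growth_pct = (len(evolved_full) - len(skill["raw"])) / len(skill["raw"]) * 100
```

Passing evolved_full instead would hand the closed-loop runner a different string than the one the cache was keyed on, which is the mismatch the first bullet guards against.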
run_inputs literal hoisted to a local since the same dict now appears
in 3 places (success + 2 abort paths).
Schema bumps from v4 to v5 (additive). New fields: decision_signal (always), baseline_closed_loop_per_example, evolved_closed_loop_per_example, evolved_closed_loop_errored_tasks, cl_tasks_gained, cl_required_gain, synthetic_sanity_check, evolved_cl_eval_cost_usd, band_trigger_score, validator_agent_model (only when use_cl_primary fired).
reason_synthetic is 'preflight_skipped' when --no-saturation-check was passed, so downstream consumers can distinguish 'preflight saw no weak_signal' from 'preflight didn't run'.
Existing v4 consumers see byte-identical output for synthetic-mode skill runs aside from the new decision_signal string. The v4-specific skill fields (bap_max_growth, bap_safety_margin, eval_source, fitness_profile, proposer_mode, knee_point.band_roster[*].holdout_score) are all preserved. Lockstep with evolve_tool: both are at schema v5 after this lands.
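A rough sketch of what "additive" means for the payload, assuming the file is built from a plain dict. build_v5_payload is a hypothetical helper, the "schema_version" key name is an assumption, and all values are illustrative.

```python
from typing import Optional


def build_v5_payload(
    v4_payload: dict,
    decision_signal: str,
    cl_fields: Optional[dict] = None,
    preflight_skipped: bool = False,
) -> dict:
    """Layer v5 fields on top of a v4 payload without touching v4 keys."""
    payload = dict(v4_payload)  # copy so every v4 key survives unchanged
    payload["schema_version"] = "5"  # key name assumed for this sketch
    payload["decision_signal"] = decision_signal  # always written in v5
    if preflight_skipped:
        payload["reason_synthetic"] = "preflight_skipped"
    if cl_fields is not None:  # only when the CL-primary path fired
        payload.update(cl_fields)
    return payload


# Illustrative values only.
v4 = {"schema_version": "4", "decision": "deploy", "proposer_mode": "example"}
synthetic = build_v5_payload(v4, "synthetic", preflight_skipped=True)
cl_primary = build_v5_payload(
    v4, "closed_loop", cl_fields={"cl_tasks_gained": 2, "cl_required_gain": 1}
)
```

A v4 consumer reading either result still finds every key it knew about; the synthetic-mode payload grows by decision_signal (and reason_synthetic when preflight was skipped), while the CL-only fields appear only in the closed-loop one.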
13 tests: 10 symmetric to tests/tools/test_evolve_tool_cl_aware_gate.py
plus 3 skill-specific guards:
- force_run called with body, not full (frontmatter key bug guard)
- evolved_FAILED.md (not .json) on abort
- v4 skill payload fields (bap_*, eval_source, fitness_profile,
proposer_mode, knee_point.band_roster) preserved in v5
Mocks via the existing test_evolve_skill_saturation_preflight pattern.
Calls evolve() directly rather than via CliRunner (matches the pattern
PR #69 settled on after its own CliRunner deviation).
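The first guard can be sketched with unittest.mock as below. This is a simplified stand-in, not the test in this PR: run_cl_primary_eval is a hypothetical wrapper for the code under test, and the skill text is made up. The mock calls themselves (assert_called_once_with, call_args.args) are standard library API.

```python
from unittest.mock import MagicMock


def run_cl_primary_eval(cache, evolved_body: str):
    """Hypothetical stand-in for the deploy-gate step that evaluates the evolved skill."""
    return cache.force_run(evolved_body)


def test_force_run_receives_body_only():
    frontmatter = "---\nname: demo-skill\n---\n"
    evolved_body = "Always reproduce the bug first.\n"
    evolved_full = frontmatter + evolved_body  # what must NOT be passed

    cache = MagicMock()
    run_cl_primary_eval(cache, evolved_body)

    # Positive check: exactly one call, with the body.
    cache.force_run.assert_called_once_with(evolved_body)

    # Negative checks: no frontmatter markers leaked into the argument.
    call_arg = cache.force_run.call_args.args[0]
    assert "---" not in call_arg
    assert "name:" not in call_arg
    assert call_arg != evolved_full
```

The substring checks are redundant with the exact-argument assertion on purpose, so a later refactor that loosens one still trips the other.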
Code review flagged test 10 as misnaming/misasserting the path under test (claiming static gate at evolve_skill.py:1034 fires first with decision_signal='synthetic'). Empirical verification shows validate_static at line 1034 only runs size_limit/non_empty/skill_structure — the absolute_char_ceiling check lives at line 1271 inside the use_cl_primary branch and runs AFTER force_run. Rejection carries decision_signal='closed_loop'. Keep the original (accurate) test name, expand the docstring to spell out the verified production flow, and add two assertions that pin the real CL-primary path: decision_signal == 'closed_loop' and force_run.assert_called_once_with(body). Also fix a minor miscount in the module docstring frontmatter math (was '43 chars', actually 42).
Lands in weak_signal band against evolution/validation/suites/systematic_debugging.jsonl
at seed 42 with openai/gpt-5-mini validator. Mirrors
tests/fixtures/tool_manifests/weakened_write_file_manifest.json from
the tool-side work — checked in so the post-merge manual smoke for the
CL-aware skill deploy gate is reproducible.
The fixture is a misdirection variant ("Python Bug Diagnostician") that
looks plausible to the LLM judge (synthetic holdout ~0.95) while
instructing the agent to write a diagnostic report instead of editing
the file, so the planted-bug closed-loop suite scores in the
0.15-0.95 band rather than saturating.
Empirical probe results (seed 42, eval-model gpt-4.1-mini,
closed-loop-agent-model gpt-5-mini, --eval-dataset-size 50):
Probe 1: Band=weak_signal, Holdout=0.957, CL=0.800 (4/5)
Probe 2: Band=weak_signal, Holdout=0.957, CL=0.200 (1/5)
Probe 3: Band=weak_signal, Holdout=0.969, CL=0.600 (3/5)
CL pass-rate varies run-to-run because the 5-task suite has 0.2
granularity and gpt-5-mini's adherence to the "don't edit" framing is
non-deterministic, but every probe landed in the weak_signal band,
which is what the manual smoke needs to exercise the CL-primary gate
path.
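The granularity point can be checked with a few lines of arithmetic. The task count and the 0.15 and 0.95 edges are the figures quoted above; whether the real band check treats those edges as inclusive is not stated here, and it does not matter for a 5-task suite because no achievable rate equals either edge.

```python
from fractions import Fraction

TASKS = 5
LOW, HIGH = Fraction(15, 100), Fraction(95, 100)  # range quoted for the fixture

# Every pass rate a 5-task suite can produce: 0/5, 1/5, ..., 5/5.
rates = [Fraction(k, TASKS) for k in range(TASKS + 1)]
in_band = [float(r) for r in rates if LOW < r < HIGH]

print(in_band)  # [0.2, 0.4, 0.6, 0.8]
```

All three probes (4/5, 1/5, 3/5) fall on these four values, while 0/5 and 5/5 are the two outcomes that would leave the range.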
Earlier probes confirmed the binary-tier saturation trap from prior
tool-side work:
- Passive weakenings (v1: stripped methodology, v3: one-sentence
skill) both landed in no_headroom with gpt-5-mini (CL=1.0) and
uniform_failure with gpt-5-nano (CL=0.0). The validator model
capacity, not the skill content, drives the closed-loop signal on
these planted bugs.
- Only active misdirection (this fixture) creates the synthetic /
closed-loop disagreement the weak_signal band requires.
Summary
Ports the closed-loop-aware deploy gate to evolve_skill.py. Tool-side already shipped (CL-aware gate is on main); this PR brings skill side to schema v5 lockstep with the same trigger condition (band == weak_signal AND CL configured), the same _check_cl_primary_gate helper, the same v5 additive contract, and the same abort-path discipline.

Behavior
- weak_signal band: runs closed_loop_cache.force_run(evolved_body) post-GEPA, gates on _check_cl_primary_gate (CL gain ≥ growth-scaled required + synthetic regression within tolerance) plus the preserved _check_absolute_char_ceiling (wallpaper protection).
- Other bands (healthy, no_headroom, uniform_failure): today's synthetic-only gate path runs unchanged.
- --no-saturation-check: falls through to synthetic gate; gate_decision.json records reason_synthetic: "preflight_skipped".

Skill-specific differences from the tool-side equivalent
- force_run(evolved_body), NOT force_run(evolved_full). The cache was constructed with baseline_artifact_text=skill["body"], so the deploy-time call must use the body (no YAML frontmatter) for the cache key to match the preflight baseline. Otherwise we silently double-spend on the evolved eval. Test 11 (test_force_run_called_with_skill_body_not_full) pins this contract with belt-and-suspenders substring negative checks (--- and name: absent from the call arg).
- evolved_FAILED.md (not .json): matches the existing skill-side rejection convention. Test 12 pins this.
- v4-specific skill fields (bap_max_growth, bap_safety_margin, eval_source, fitness_profile, proposer_mode, knee_point.band_roster[*].holdout_score) all preserved in v5. Test 13 pins this.
- growth_pct computed on evolved_full vs skill["raw"] (matches existing skill-side validate_growth_with_quality convention).

Schema v5
Additive over v4. All 4 gate_decision.json write sites in evolve_skill.py now emit v5: static-fail, cl_eval_failed abort, cl_eval_incomplete abort, main success/reject. The write_cost_ceiling_abort() call is also bumped via the schema_version="5" kwarg added in PR #69. New fields surface only when CL-primary fired (same shape as tool side).

Commit sequence
- refactor(evolve_skill): preserve SaturationReport fields for deploy gate (plumbing, no behavior change)
- feat(evolve_skill): branch deploy gate on saturation band (main behavior change + run_inputs hoist)
- feat(evolve_skill): gate_decision.json schema v5 with CL-primary fields (payload extension)
- test(evolve_skill): integration + schema regression for CL-aware gate (13 tests)
- test(evolve_skill): pin CL-primary path in absolute-char-ceiling test (review feedback fix: the absolute_char_ceiling check turned out to live in the CL-primary branch, not in the static gate, so the test was corrected to pin the verified production flow)
- test: weakened systematic-debugging fixture for CL-aware-gate smoke (empirically probed fixture that lands in the weak_signal band)

Empirical fixture-design finding
Task 5 surfaced a non-obvious empirical insight worth documenting: passive weakening (stripping methodology sections) of a skill does NOT move validator behavior on systematic_debugging.jsonl. gpt-5-mini gets 5/5 regardless of skill content because the user_message is explicit; gpt-5-nano gets 0/5 with any weakening. The only path to weak_signal was active misdirection: a skill that LOOKS plausible to the LLM judge (synthetic ≥ 0.95) but redirects the agent toward different behavior (write a report instead of editing). This is the same binary-tier issue that PR #68 worked around for the tool side; the active-misdirection technique is the skill-side analog.

Test plan
- uv run pytest -q: 1130 passed locally (+13 from this PR: 10 tool-side analogs + 3 skill-specific guards)
- Manual smoke: uv run python -m evolution.skills.evolve_skill --skill weakened-systematic-debugging --skill-source-dir tests/fixtures/skills --closed-loop-during-evolution evolution/validation/suites/systematic_debugging.jsonl --closed-loop-hermes-repo …/hermes-agent --closed-loop-agent-model openai/gpt-5-mini --iterations 10 --seed 42 --max-total-cost-usd 15 --closed-loop-mode trainset --closed-loop-in-valset --force-saturation-check --gepa-minibatch-size 8. Expected: decision: "deploy", decision_signal: "closed_loop".

Scope notes
- run_inputs literal is duplicated across evolve_skill.py and evolve_tool.py. After this PR there are 6 sites across 2 files. Worth extracting to evolution/core/run_inputs.py in a separate cleanup PR; not in scope here.
- test_command-based skill suites. Add a project memory entry so future skill-side empirical work doesn't waste time on passive weakening that won't move validator behavior.