feat: skill-side deploy-gate CL-awareness (lockstep with tool side) by jramos · Pull Request #70 · jramos/agent-self-evolution

jramos · 2026-05-23T21:56:53Z

Summary

Ports the closed-loop-aware deploy gate to evolve_skill.py. Tool-side already shipped (CL-aware gate is on main); this PR brings skill side to schema v5 lockstep with the same trigger condition (band == weak_signal AND CL configured), the same _check_cl_primary_gate helper, the same v5 additive contract, and the same abort-path discipline.

Behavior

weak_signal band: runs closed_loop_cache.force_run(evolved_body) post-GEPA, gates on _check_cl_primary_gate (CL gain ≥ growth-scaled required + synthetic regression within tolerance) plus the preserved _check_absolute_char_ceiling (wallpaper protection).
All other bands (healthy, no_headroom, uniform_failure): today's synthetic-only gate path runs unchanged.
--no-saturation-check: falls through to synthetic gate; gate_decision.json records reason_synthetic: "preflight_skipped".
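The branching described above can be sketched as a small pure function. This is an illustrative sketch only: `select_gate_path` and `reason_synthetic_for` are hypothetical names, not helpers from evolve_skill.py, and the sketch assumes that a skipped preflight shows up as `band is None`.

```python
from typing import Optional


def select_gate_path(band: Optional[str], cl_configured: bool) -> str:
    """Pick the deploy-gate path (hypothetical helper, not the real code).

    band is assumed to be None when --no-saturation-check skipped preflight.
    """
    # CL-primary fires only when both trigger conditions hold.
    if band == "weak_signal" and cl_configured:
        return "closed_loop"
    # healthy, no_headroom, uniform_failure, and skipped preflight all
    # fall through to the synthetic-only gate.
    return "synthetic"


def reason_synthetic_for(band: Optional[str]) -> Optional[str]:
    """Return the reason_synthetic value for a skipped preflight, else None."""
    return "preflight_skipped" if band is None else None


if __name__ == "__main__":
    for band in ("weak_signal", "healthy", "no_headroom", "uniform_failure", None):
        print(band, select_gate_path(band, cl_configured=True))
```

Keeping the trigger in one predicate makes it easy to check that the skill side and tool side branch on the same condition.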

Skill-specific differences from the tool-side equivalent

force_run(evolved_body) — NOT force_run(evolved_full). The cache was constructed with baseline_artifact_text=skill["body"], so the deploy-time call must use the body (no YAML frontmatter) for the cache key to match the preflight baseline. Otherwise we silently double-spend on the evolved eval. Test 11 (test_force_run_called_with_skill_body_not_full) pins this contract with belt-and-suspenders substring negative checks (--- / name: absent from the call arg).
Abort paths write evolved_FAILED.md (not .json) — matches the existing skill-side rejection convention. Test 12 pins this.
V4 skill-specific payload fields (bap_max_growth, bap_safety_margin, eval_source, fitness_profile, proposer_mode, knee_point.band_roster[*].holdout_score) all preserved in v5. Test 13 pins this.
growth_pct computed on evolved_full vs skill["raw"] (matches existing skill-side validate_growth_with_quality convention).
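To illustrate why the body/full distinction matters for the cache, here is a minimal sketch. It assumes a content-hash cache key and `---`-fenced YAML frontmatter; `split_frontmatter` and `cache_key` are hypothetical stand-ins, and the real keying scheme in closed_loop_cache may differ.

```python
import hashlib
from typing import Tuple


def split_frontmatter(raw: str) -> Tuple[str, str]:
    """Split a skill file into (frontmatter, body), assuming '---' fences."""
    if raw.startswith("---\n"):
        # Start at index 3 so an empty frontmatter block is also handled.
        end = raw.find("\n---\n", 3)
        if end != -1:
            return raw[4:end], raw[end + 5:]
    return "", raw


def cache_key(text: str) -> str:
    """Hypothetical content-addressed key; the real cache may key differently."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


raw = "---\nname: demo-skill\n---\nInvestigate before you edit.\n"
frontmatter, body = split_frontmatter(raw)

# A cache keyed on body text does not recognise the full document.
print(cache_key(body) == cache_key(raw))  # False
```

Under this assumption, passing the full text to a cache built from the body produces a different key, so the mismatch would never surface as an error, only as extra spend.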

Schema v5

Additive over v4. All 4 gate_decision.json write sites in evolve_skill.py now emit v5: static-fail, cl_eval_failed abort, cl_eval_incomplete abort, main success/reject. write_cost_ceiling_abort() call also bumped via the schema_version="5" kwarg added in PR #69. New fields surface only when CL-primary fired (same shape as tool side).

Commit sequence

refactor(evolve_skill): preserve SaturationReport fields for deploy gate — plumbing, no behavior change
feat(evolve_skill): branch deploy gate on saturation band — main behavior change + run_inputs hoist
feat(evolve_skill): gate_decision.json schema v5 with CL-primary fields — payload extension
test(evolve_skill): integration + schema regression for CL-aware gate — 13 tests
test(evolve_skill): pin CL-primary path in absolute-char-ceiling test — review feedback fix (the absolute_char_ceiling check turned out to actually live in the CL-primary branch, not in static gate — corrected the test to pin the verified production flow)
test: weakened systematic-debugging fixture for CL-aware-gate smoke — empirically-probed fixture that lands in weak_signal band

Empirical fixture-design finding

Task 5 surfaced a non-obvious empirical insight worth documenting: passive weakening (stripping methodology sections) of a skill does NOT move validator behavior on systematic_debugging.jsonl — gpt-5-mini gets 5/5 regardless of skill content because the user_message is explicit. gpt-5-nano gets 0/5 with any weakening. The only path to weak_signal was active misdirection — a skill that LOOKS plausible to the LLM judge (synthetic ≥ 0.95) but redirects the agent toward different behavior (write a report instead of editing). This is the same binary-tier issue that PR #68 worked around for the tool side; the active-misdirection technique is the skill-side analog.

Test plan

uv run pytest -q — 1130 passed locally (+13 from this PR: 10 tool-side analogs + 3 skill-specific guards)
CI green across 4 Python versions
Optional manual smoke (~$3–5): uv run python -m evolution.skills.evolve_skill --skill weakened-systematic-debugging --skill-source-dir tests/fixtures/skills --closed-loop-during-evolution evolution/validation/suites/systematic_debugging.jsonl --closed-loop-hermes-repo …/hermes-agent --closed-loop-agent-model openai/gpt-5-mini --iterations 10 --seed 42 --max-total-cost-usd 15 --closed-loop-mode trainset --closed-loop-in-valset --force-saturation-check --gepa-minibatch-size 8. Expected: decision: "deploy", decision_signal: "closed_loop".

Scope notes

Tool-side and skill-side now at lockstep schema v5.
Follow-up flagged: run_inputs literal is duplicated across evolve_skill.py and evolve_tool.py. After this PR there are 6 sites across 2 files. Worth extracting to evolution/core/run_inputs.py in a separate cleanup PR; not in scope here.
Memory-worthy finding (post-merge): the active-misdirection weakening technique for test_command-based skill suites. Add a project memory entry so future skill-side empirical work doesn't waste time on passive weakening that won't move validator behavior.

Symmetric to evolve_tool's CL-aware gate work. Today only sat_report.holdout_per_example survives past the preflight call site; the CL-aware deploy gate work needs band + cl_per_example + the two trigger scores too. Bind four new locals; all default to None on the --no-saturation-check path so the deploy gate can branch safely. No behavior change; existing tests pass unchanged.
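The plumbing can be pictured with the sketch below. The `SaturationReport` shape here is an assumption made for illustration: `band`, `holdout_per_example`, and `cl_per_example` come from the text above, while the two trigger-score field names and the `bind_preflight_locals` helper are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class SaturationReport:
    """Illustrative stand-in; the real class may have other fields."""
    band: str
    holdout_per_example: Dict[str, float]
    cl_per_example: Dict[str, float]
    holdout_trigger_score: float  # hypothetical field name
    cl_trigger_score: float       # hypothetical field name


def bind_preflight_locals(
    sat_report: Optional[SaturationReport],
) -> Tuple[Optional[str], Optional[Dict[str, float]], Optional[float], Optional[float]]:
    """Return the four values the deploy gate needs, or Nones if preflight was skipped."""
    if sat_report is None:  # the --no-saturation-check path
        return None, None, None, None
    return (
        sat_report.band,
        sat_report.cl_per_example,
        sat_report.holdout_trigger_score,
        sat_report.cl_trigger_score,
    )
```

Defaulting every local to `None` lets the later gate test `band == "weak_signal"` without first checking whether preflight ran.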

Symmetric to evolve_tool's CL-aware gate (already on main). When preflight reports weak_signal AND closed-loop is configured, run force_run(evolved_body) post-GEPA and gate on _check_cl_primary_gate.

Three abort paths writing schema-v5 gate_decision.json:
- cl_eval_failed: force_run raised
- cl_eval_incomplete: evolved CL task abstained (runner errored; distinguished from genuine task failure via TaskResult.abstained)
- cl_primary_gate reject: returned by the gate helper itself

_check_absolute_chars preserved in CL-primary path; all other bands fall through to today's synthetic path unchanged.

Key skill-specific differences from tool-side:
- force_run called with evolved_body (no YAML frontmatter) to match the cache key the preflight set up with baseline body.
- Abort path writes evolved_FAILED.md (matching existing skill reject convention), not evolved_FAILED.json.
- growth_pct uses evolved_full vs skill["raw"] (existing convention).

run_inputs literal hoisted to a local since the same dict now appears in 3 places (success + 2 abort paths).

Schema bumps from v4 to v5 (additive).

New fields:
- Always present: decision_signal
- Only when use_cl_primary fired: baseline_closed_loop_per_example, evolved_closed_loop_per_example, evolved_closed_loop_errored_tasks, cl_tasks_gained, cl_required_gain, synthetic_sanity_check, evolved_cl_eval_cost_usd, band_trigger_score, validator_agent_model

reason_synthetic: 'preflight_skipped' when --no-saturation-check was passed, so downstream consumers can distinguish 'preflight saw no weak_signal' from 'preflight didn't run.'

Existing v4 consumers see byte-identical output for synthetic-mode skill runs aside from the new decision_signal string. v4-specific skill fields (bap_max_growth, bap_safety_margin, eval_source, fitness_profile, proposer_mode, knee_point.band_roster[*].holdout_score) all preserved.

Lockstep with evolve_tool: both at schema v5 after this lands.
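The additive contract can be sketched as a function that layers v5 fields over an existing v4 payload. This is a sketch under assumptions: the `schema_version` JSON key, the `to_v5` helper, and the sample values are illustrative; only the field names in `CL_ONLY_FIELDS` come from the commit message.

```python
from typing import Any, Dict, Optional

# Fields that surface only when the CL-primary gate fired.
CL_ONLY_FIELDS = (
    "baseline_closed_loop_per_example",
    "evolved_closed_loop_per_example",
    "evolved_closed_loop_errored_tasks",
    "cl_tasks_gained",
    "cl_required_gain",
    "synthetic_sanity_check",
    "evolved_cl_eval_cost_usd",
    "band_trigger_score",
    "validator_agent_model",
)


def to_v5(
    v4_payload: Dict[str, Any],
    decision_signal: str,
    cl_fields: Optional[Dict[str, Any]] = None,
    preflight_skipped: bool = False,
) -> Dict[str, Any]:
    """Build a v5 payload without dropping or mutating any v4 field."""
    payload = dict(v4_payload)        # shallow copy: v4 fields preserved
    payload["schema_version"] = "5"   # key name is an assumption
    payload["decision_signal"] = decision_signal
    if preflight_skipped:
        payload["reason_synthetic"] = "preflight_skipped"
    if decision_signal == "closed_loop" and cl_fields:
        payload.update({k: cl_fields[k] for k in CL_ONLY_FIELDS if k in cl_fields})
    return payload


v4 = {"schema_version": "4", "bap_max_growth": 0.2, "eval_source": "synthetic"}
print(to_v5(v4, "synthetic", preflight_skipped=True))
```

A consumer written against v4 can keep reading the same keys; one that understands v5 can branch on `decision_signal` before looking for the CL-only fields.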

13 tests: 10 symmetric to tests/tools/test_evolve_tool_cl_aware_gate.py plus 3 skill-specific guards:
- force_run called with body, not full (frontmatter key bug guard)
- evolved_FAILED.md (not .json) on abort
- v4 skill payload fields (bap_*, eval_source, fitness_profile, proposer_mode, knee_point.band_roster) preserved in v5

Mocks via the existing test_evolve_skill_saturation_preflight pattern. Calls evolve() directly rather than via CliRunner (matches the pattern PR #69 settled on after its own CliRunner deviation).
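The first guard's belt-and-suspenders shape might look like the sketch below. It uses only `unittest.mock` from the standard library; `run_deploy_eval` is a hypothetical stand-in for the code under test, not the real evolve() call path.

```python
from unittest.mock import MagicMock


def run_deploy_eval(cache, skill: dict) -> None:
    """Hypothetical stand-in for the deploy-time call under test."""
    cache.force_run(skill["body"])


def check_force_run_called_with_body() -> str:
    body = "Investigate before you edit.\n"
    skill = {"body": body, "raw": "---\nname: demo-skill\n---\n" + body}
    cache = MagicMock()

    run_deploy_eval(cache, skill)

    # Positive check: exactly one call, with the body.
    cache.force_run.assert_called_once_with(body)
    # Negative checks: no frontmatter fragments leaked into the argument.
    arg = cache.force_run.call_args.args[0]
    assert "---" not in arg
    assert "name:" not in arg
    return arg


if __name__ == "__main__":
    check_force_run_called_with_body()
```

The negative substring checks would still catch a regression if the positive assertion were later loosened or the fixture body changed.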

Code review flagged test 10 as misnaming/misasserting the path under test (claiming static gate at evolve_skill.py:1034 fires first with decision_signal='synthetic'). Empirical verification shows validate_static at line 1034 only runs size_limit/non_empty/skill_structure; the absolute_char_ceiling check lives at line 1271 inside the use_cl_primary branch and runs AFTER force_run. Rejection carries decision_signal='closed_loop'.

Keep the original (accurate) test name, expand the docstring to spell out the verified production flow, and add two assertions that pin the real CL-primary path: decision_signal == 'closed_loop' and force_run.assert_called_once_with(body).

Also fix a minor miscount in the module docstring frontmatter math (was '43 chars', actually 42).

Lands in weak_signal band against evolution/validation/suites/systematic_debugging.jsonl at seed 42 with openai/gpt-5-mini validator. Mirrors tests/fixtures/tool_manifests/weakened_write_file_manifest.json from the tool-side work; checked in so the post-merge manual smoke for the CL-aware skill deploy gate is reproducible.

The fixture is a misdirection variant ("Python Bug Diagnostician") that looks plausible to the LLM judge (synthetic holdout ~0.95) while instructing the agent to write a diagnostic report instead of editing the file, so the planted-bug closed-loop suite scores in the 0.15-0.95 band rather than saturating.

Empirical probe results (seed 42, eval-model gpt-4.1-mini, closed-loop-agent-model gpt-5-mini, --eval-dataset-size 50):
- Probe 1: Band=weak_signal, Holdout=0.957, CL=0.800 (4/5)
- Probe 2: Band=weak_signal, Holdout=0.957, CL=0.200 (1/5)
- Probe 3: Band=weak_signal, Holdout=0.969, CL=0.600 (3/5)

CL pass-rate varies run-to-run because the 5-task suite has 0.2 granularity and gpt-5-mini's adherence to the "don't edit" framing is non-deterministic, but every probe landed in the weak_signal band, which is what the manual smoke needs to exercise the CL-primary gate path.

Earlier probes confirmed the binary-tier saturation trap from prior tool-side work:
- Passive weakenings (v1: stripped methodology, v3: one-sentence skill) both landed in no_headroom with gpt-5-mini (CL=1.0) and uniform_failure with gpt-5-nano (CL=0.0). The validator model capacity, not the skill content, drives the closed-loop signal on these planted bugs.
- Only active misdirection (this fixture) creates the synthetic / closed-loop disagreement the weak_signal band requires.
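The granularity point can be checked with a little arithmetic. The sketch below enumerates which pass counts on an n-task suite give a closed-loop pass rate inside the 0.15-0.95 range quoted above. Treating the bounds as strict inequalities is an assumption, and the real band classification may also depend on other inputs such as the synthetic holdout score.

```python
from typing import List


def pass_counts_in_range(n_tasks: int, low: float = 0.15, high: float = 0.95) -> List[int]:
    """Pass counts k whose rate k / n_tasks falls strictly between low and high."""
    return [k for k in range(n_tasks + 1) if low < k / n_tasks < high]


# For the 5-task suite, pass rates move in steps of 0.2.
print(pass_counts_in_range(5))  # [1, 2, 3, 4]
```

With five tasks, only 0/5 and 5/5 fall outside that range, which is consistent with the three probes (4/5, 1/5, 3/5) all landing inside it despite the run-to-run variance.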

jramos added 6 commits May 23, 2026 14:03

jramos merged commit a9fe000 into main May 24, 2026
4 checks passed

jramos deleted the feat/skill-deploy-gate-cl-awareness branch May 24, 2026 01:59
