feat: skill-side deploy-gate CL-awareness (lockstep with tool side) #70

Merged: jramos merged 6 commits into main from feat/skill-deploy-gate-cl-awareness on May 24, 2026
jramos (Owner) commented May 23, 2026

Summary

Ports the closed-loop-aware (CL-aware) deploy gate to evolve_skill.py. The tool side already shipped (the CL-aware gate is on main); this PR brings the skill side into schema v5 lockstep, with the same trigger condition (band == weak_signal AND CL configured), the same _check_cl_primary_gate helper, the same v5 additive contract, and the same abort-path discipline.

Behavior

  • weak_signal band: runs closed_loop_cache.force_run(evolved_body) post-GEPA, then gates on _check_cl_primary_gate (CL gain ≥ the growth-scaled required gain, and synthetic regression within tolerance) plus the preserved _check_absolute_char_ceiling (wallpaper protection).
  • All other bands (healthy, no_headroom, uniform_failure): today's synthetic-only gate path runs unchanged.
  • --no-saturation-check: falls through to synthetic gate; gate_decision.json records reason_synthetic: "preflight_skipped".
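
The branching in the list above can be sketched roughly as follows. This is an illustration only, not the code in evolve_skill.py: the function name `select_gate_path` and its parameters are hypothetical, and `band is None` is assumed to represent the `--no-saturation-check` case. Only the band names, the trigger condition, the two `decision_signal` values, and the `preflight_skipped` reason come from this PR.

```python
from typing import Optional


def select_gate_path(band: Optional[str], cl_configured: bool) -> dict:
    """Pick the deploy-gate signal for one run (hypothetical helper).

    band is None when the saturation preflight was skipped.
    """
    # CL-primary fires only for weak_signal AND a configured closed loop.
    if band == "weak_signal" and cl_configured:
        return {"decision_signal": "closed_loop"}

    # Every other case falls through to the synthetic-only gate.
    result = {"decision_signal": "synthetic"}
    if band is None:
        # Lets consumers tell "preflight didn't run" from "no weak_signal".
        result["reason_synthetic"] = "preflight_skipped"
    return result
```

Under these assumptions, a healthy band with CL configured still takes the synthetic path, as does weak_signal without CL configured.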

Skill-specific differences from the tool-side equivalent

  • force_run(evolved_body) — NOT force_run(evolved_full). The cache was constructed with baseline_artifact_text=skill["body"], so the deploy-time call must use the body (no YAML frontmatter) for the cache key to match the preflight baseline. Otherwise we silently double-spend on the evolved eval. Test 11 (test_force_run_called_with_skill_body_not_full) pins this contract with belt-and-suspenders substring negative checks (--- / name: absent from the call arg).
  • Abort paths write evolved_FAILED.md (not .json) — matches the existing skill-side rejection convention. Test 12 pins this.
  • V4 skill-specific payload fields (bap_max_growth, bap_safety_margin, eval_source, fitness_profile, proposer_mode, knee_point.band_roster[*].holdout_score) all preserved in v5. Test 13 pins this.
  • growth_pct computed on evolved_full vs skill["raw"] (matches existing skill-side validate_growth_with_quality convention).
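
To see why passing the full text would miss the cache, consider a minimal sketch that assumes the cache is keyed on a hash of the exact artifact text. That keying scheme, and the helpers `cache_key` and `strip_frontmatter`, are assumptions made for illustration; the PR only states that the key must match the preflight baseline, which was built from the body.

```python
import hashlib


def cache_key(artifact_text: str) -> str:
    # Assumed keying scheme: SHA-256 over the exact artifact text.
    return hashlib.sha256(artifact_text.encode("utf-8")).hexdigest()


def strip_frontmatter(raw: str) -> str:
    """Return the skill body, dropping a leading '---' YAML block if present."""
    if raw.startswith("---\n"):
        end = raw.find("\n---\n", 3)
        if end != -1:
            return raw[end + len("\n---\n"):]
    return raw


# Hypothetical skill file: frontmatter followed by the body.
raw = "---\nname: demo-skill\n---\nAlways reproduce the bug first.\n"
body = strip_frontmatter(raw)

# Same body gives the same key; the full text gives a different key.
same_key = cache_key(body) == cache_key(strip_frontmatter(raw))
full_text_misses = cache_key(raw) != cache_key(body)
```

With any content-derived key, the frontmatter bytes alone are enough to change the key, so a call with the full text would look like a new artifact and pay for a fresh evaluation.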

Schema v5

Additive over v4. All 4 gate_decision.json write sites in evolve_skill.py now emit v5: static-fail, cl_eval_failed abort, cl_eval_incomplete abort, main success/reject. write_cost_ceiling_abort() call also bumped via the schema_version="5" kwarg added in PR #69. New fields surface only when CL-primary fired (same shape as tool side).
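
One way to picture the additive contract is the sketch below. The helper `to_v5` is hypothetical, and storing the version under a `schema_version` key is an assumption based on the kwarg name mentioned above; the CL field names are taken from this PR's commit messages.

```python
from typing import Any, Optional


def to_v5(
    v4_payload: dict[str, Any],
    decision_signal: str,
    cl_fields: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Build a v5 payload from a v4 one without dropping any v4 field."""
    payload = dict(v4_payload)  # shallow copy; the v4 dict is left untouched
    payload["schema_version"] = "5"
    payload["decision_signal"] = decision_signal  # always present in v5
    if cl_fields is not None:
        # CL-specific fields appear only when the CL-primary gate fired.
        payload.update(cl_fields)
    return payload


v4 = {"schema_version": "4", "bap_max_growth": 0.2, "eval_source": "synthetic"}
synthetic_run = to_v5(v4, "synthetic")
cl_run = to_v5(v4, "closed_loop", {"cl_tasks_gained": 2, "cl_required_gain": 1})
```

A v4 consumer reading `synthetic_run` still finds every key it knew about; only `decision_signal` and the version string are new.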

Commit sequence

  1. refactor(evolve_skill): preserve SaturationReport fields for deploy gate — plumbing, no behavior change
  2. feat(evolve_skill): branch deploy gate on saturation band — main behavior change + run_inputs hoist
  3. feat(evolve_skill): gate_decision.json schema v5 with CL-primary fields — payload extension
  4. test(evolve_skill): integration + schema regression for CL-aware gate — 13 tests
  5. test(evolve_skill): pin CL-primary path in absolute-char-ceiling test — review feedback fix (the absolute_char_ceiling check turned out to actually live in the CL-primary branch, not in static gate — corrected the test to pin the verified production flow)
  6. test: weakened systematic-debugging fixture for CL-aware-gate smoke — empirically-probed fixture that lands in weak_signal band

Empirical fixture-design finding

Task 5 surfaced a non-obvious empirical insight worth documenting: passive weakening (stripping methodology sections) of a skill does NOT move validator behavior on systematic_debugging.jsonl — gpt-5-mini gets 5/5 regardless of skill content because the user_message is explicit. gpt-5-nano gets 0/5 with any weakening. The only path to weak_signal was active misdirection — a skill that LOOKS plausible to the LLM judge (synthetic ≥ 0.95) but redirects the agent toward different behavior (write a report instead of editing). This is the same binary-tier issue that PR #68 worked around for the tool side; the active-misdirection technique is the skill-side analog.

Test plan

  • uv run pytest -q — 1130 passed locally (+13 from this PR: 10 tool-side analogs + 3 skill-specific guards)
  • CI green across 4 Python versions
  • Optional manual smoke (~$3–5): uv run python -m evolution.skills.evolve_skill --skill weakened-systematic-debugging --skill-source-dir tests/fixtures/skills --closed-loop-during-evolution evolution/validation/suites/systematic_debugging.jsonl --closed-loop-hermes-repo …/hermes-agent --closed-loop-agent-model openai/gpt-5-mini --iterations 10 --seed 42 --max-total-cost-usd 15 --closed-loop-mode trainset --closed-loop-in-valset --force-saturation-check --gepa-minibatch-size 8. Expected: decision: "deploy", decision_signal: "closed_loop".
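
After a manual smoke run, the expected outcome could be checked with a few lines like these. This is a sketch: the helper name is hypothetical, and where gate_decision.json lands on disk is not stated in this PR, so the path has to be supplied by the caller.

```python
import json
from pathlib import Path


def smoke_passed(gate_decision_path: Path) -> bool:
    """True when the gate deployed on the closed-loop signal."""
    data = json.loads(gate_decision_path.read_text(encoding="utf-8"))
    return (
        data.get("decision") == "deploy"
        and data.get("decision_signal") == "closed_loop"
    )
```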

Scope notes

  • Tool-side and skill-side now at lockstep schema v5.
  • Follow-up flagged: run_inputs literal is duplicated across evolve_skill.py and evolve_tool.py. After this PR there are 6 sites across 2 files. Worth extracting to evolution/core/run_inputs.py in a separate cleanup PR; not in scope here.
  • Memory-worthy finding (post-merge): the active-misdirection weakening technique for test_command-based skill suites. Add a project memory entry so future skill-side empirical work doesn't waste time on passive weakening that won't move validator behavior.

jramos added 6 commits May 23, 2026 14:03

refactor(evolve_skill): preserve SaturationReport fields for deploy gate

Symmetric to evolve_tool's CL-aware gate work. Today only
sat_report.holdout_per_example survives past the preflight call site;
the CL-aware deploy gate work needs band + cl_per_example + the two
trigger scores too. Bind four new locals; all default to None on the
--no-saturation-check path so the deploy gate can branch safely.
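
A rough sketch of that plumbing, under stated assumptions: the `SaturationReport` stand-in below is not the real class, and the names of the two trigger-score fields are invented for illustration. Only `band`, `cl_per_example`, and `holdout_per_example` are named in this PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SaturationReport:
    # Hypothetical stand-in; the trigger-score field names are assumptions.
    band: str
    holdout_per_example: list[float]
    cl_per_example: Optional[list[float]]
    synthetic_trigger_score: float
    cl_trigger_score: Optional[float]


def bind_gate_locals(sat_report: Optional[SaturationReport]) -> tuple:
    """Return the four values the deploy gate needs, or all None."""
    if sat_report is None:
        # --no-saturation-check path: nothing to branch on.
        return None, None, None, None
    return (
        sat_report.band,
        sat_report.cl_per_example,
        sat_report.synthetic_trigger_score,
        sat_report.cl_trigger_score,
    )
```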

No behavior change; existing tests pass unchanged.

feat(evolve_skill): branch deploy gate on saturation band

Symmetric to evolve_tool's CL-aware gate (already on main). When
preflight reports weak_signal AND closed-loop is configured, run
force_run(evolved_body) post-GEPA and gate on _check_cl_primary_gate.

Three abort paths writing schema-v5 gate_decision.json:
  - cl_eval_failed: force_run raised
  - cl_eval_incomplete: evolved CL task abstained (runner errored —
    distinguished from genuine task failure via TaskResult.abstained)
  - cl_primary_gate reject: returned by the gate helper itself
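
The first two abort paths can be sketched like this. The `TaskResult` class here is a hypothetical stand-in (only its `abstained` flag is named in this commit), and `classify_cl_eval` is an invented helper, not the production code.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TaskResult:
    # Hypothetical stand-in; only `abstained` is named in this commit.
    task_id: str
    passed: bool
    abstained: bool = False


def classify_cl_eval(run: Callable[[], list[TaskResult]]) -> Optional[str]:
    """Return an abort reason, or None when the results can go to the gate."""
    try:
        results = run()
    except Exception:
        # The evaluation itself raised: nothing to gate on.
        return "cl_eval_failed"
    if any(r.abstained for r in results):
        # The runner errored on a task; that is not a genuine task failure.
        return "cl_eval_incomplete"
    return None
```

A task that ran and failed (`passed=False, abstained=False`) does not abort; it flows on to the gate helper, which may still reject.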

_check_absolute_char_ceiling preserved in CL-primary path; all other bands
fall through to today's synthetic path unchanged.

Key skill-specific differences from tool-side:
- force_run called with evolved_body (no YAML frontmatter) to match
  the cache key the preflight set up with baseline body.
- Abort path writes evolved_FAILED.md (matching existing skill reject
  convention), not evolved_FAILED.json.
- growth_pct uses evolved_full vs skill["raw"] (existing convention).

run_inputs literal hoisted to a local since the same dict now appears
in 3 places (success + 2 abort paths).

feat(evolve_skill): gate_decision.json schema v5 with CL-primary fields

Schema bumps from v4 to v5 (additive). New fields:
  decision_signal (always), baseline_closed_loop_per_example,
  evolved_closed_loop_per_example, evolved_closed_loop_errored_tasks,
  cl_tasks_gained, cl_required_gain, synthetic_sanity_check,
  evolved_cl_eval_cost_usd, band_trigger_score, validator_agent_model
  (only when use_cl_primary fired).

reason_synthetic: 'preflight_skipped' when --no-saturation-check was
passed, so downstream consumers can distinguish 'preflight saw no
weak_signal' from 'preflight didn't run.'

Existing v4 consumers see byte-identical output for synthetic-mode
skill runs aside from the new decision_signal string. v4-specific
skill fields (bap_max_growth, bap_safety_margin, eval_source,
fitness_profile, proposer_mode, knee_point.band_roster[*].holdout_score)
all preserved.

Lockstep with evolve_tool: both at schema v5 after this lands.

test(evolve_skill): integration + schema regression for CL-aware gate

13 tests: 10 symmetric to tests/tools/test_evolve_tool_cl_aware_gate.py
plus 3 skill-specific guards:
  - force_run called with body, not full (frontmatter key bug guard)
  - evolved_FAILED.md (not .json) on abort
  - v4 skill payload fields (bap_*, eval_source, fitness_profile,
    proposer_mode, knee_point.band_roster) preserved in v5

Mocks via the existing test_evolve_skill_saturation_preflight pattern.
Calls evolve() directly rather than via CliRunner (matches the pattern
PR #69 settled on after its own CliRunner deviation).

test(evolve_skill): pin CL-primary path in absolute-char-ceiling test

Code review flagged test 10 as misnaming/misasserting the path under
test (claiming static gate at evolve_skill.py:1034 fires first with
decision_signal='synthetic'). Empirical verification shows
validate_static at line 1034 only runs size_limit/non_empty/skill_structure
— the absolute_char_ceiling check lives at line 1271 inside the
use_cl_primary branch and runs AFTER force_run. Rejection carries
decision_signal='closed_loop'.

Keep the original (accurate) test name, expand the docstring to spell
out the verified production flow, and add two assertions that pin the
real CL-primary path: decision_signal == 'closed_loop' and
force_run.assert_called_once_with(body).
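
For readers unfamiliar with this style of pin, the pattern looks roughly like the following with `unittest.mock`. This is a sketch, not the test in this PR: the skill text is invented, and the direct `cache.force_run(body)` call stands in for whatever the code under test does.

```python
from unittest.mock import MagicMock

# Hypothetical skill text: frontmatter plus body.
body = "Always reproduce the bug first.\n"
full = "---\nname: demo-skill\n---\n" + body

cache = MagicMock()
cache.force_run(body)  # stands in for the call made by the code under test

# Positive pin: exactly one call, with the body.
cache.force_run.assert_called_once_with(body)

# Negative substring checks: no frontmatter leaked into the argument.
call_arg = cache.force_run.call_args.args[0]
assert "---" not in call_arg
assert "name:" not in call_arg
```

The negative checks are redundant with the equality check as long as the expected body is right, which is why they work as a second line of defense if the fixture itself is ever built incorrectly.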

Also fix a minor miscount in the module docstring frontmatter math
(was '43 chars', actually 42).

test: weakened systematic-debugging fixture for CL-aware-gate smoke

Lands in weak_signal band against evolution/validation/suites/systematic_debugging.jsonl
at seed 42 with openai/gpt-5-mini validator. Mirrors
tests/fixtures/tool_manifests/weakened_write_file_manifest.json from
the tool-side work — checked in so the post-merge manual smoke for the
CL-aware skill deploy gate is reproducible.

The fixture is a misdirection variant ("Python Bug Diagnostician") that
looks plausible to the LLM judge (synthetic holdout ~0.95) while
instructing the agent to write a diagnostic report instead of editing
the file, so the planted-bug closed-loop suite scores in the
0.15-0.95 band rather than saturating.

Empirical probe results (seed 42, eval-model gpt-4.1-mini,
closed-loop-agent-model gpt-5-mini, --eval-dataset-size 50):
  Probe 1: Band=weak_signal, Holdout=0.957, CL=0.800 (4/5)
  Probe 2: Band=weak_signal, Holdout=0.957, CL=0.200 (1/5)
  Probe 3: Band=weak_signal, Holdout=0.969, CL=0.600 (3/5)

CL pass-rate varies run-to-run because the 5-task suite has 0.2
granularity and gpt-5-mini's adherence to the "don't edit" framing is
non-deterministic, but every probe landed in the weak_signal band,
which is what the manual smoke needs to exercise the CL-primary gate
path.
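
The granularity point can be made concrete with a little arithmetic. The sketch below treats the 0.15-0.95 range quoted above as inclusive, which is an assumption, and it ignores the synthetic-score side of the band decision entirely.

```python
def reachable_pass_rates(n_tasks: int) -> list[float]:
    """All pass rates an n-task suite can produce (steps of 1/n)."""
    return [k / n_tasks for k in range(n_tasks + 1)]


def in_cl_range(rate: float, low: float = 0.15, high: float = 0.95) -> bool:
    # Bounds taken from the range quoted above; inclusivity is assumed.
    return low <= rate <= high


rates = reachable_pass_rates(5)
non_saturated = [r for r in rates if in_cl_range(r)]
```

With five tasks, only 1/5 through 4/5 fall inside the range, so the three probe results (0.8, 0.2, 0.6) all qualify, while 0/5 and 5/5 are the two saturated outcomes seen with the passive weakenings.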

Earlier probes confirmed the binary-tier saturation trap from prior
tool-side work:
  - Passive weakenings (v1: stripped methodology, v3: one-sentence
    skill) both landed in no_headroom with gpt-5-mini (CL=1.0) and
    uniform_failure with gpt-5-nano (CL=0.0). The validator model
    capacity, not the skill content, drives the closed-loop signal on
    these planted bugs.
  - Only active misdirection (this fixture) creates the synthetic /
    closed-loop disagreement the weak_signal band requires.
jramos merged commit a9fe000 into main on May 24, 2026
4 checks passed
jramos deleted the feat/skill-deploy-gate-cl-awareness branch on May 24, 2026 01:59