feat: skill-side closed-loop validation for systematic-debugging by jramos · Pull Request #62 · jramos/agent-self-evolution

jramos · 2026-05-17T21:23:12Z

Summary

Brings closed-loop validation to the skill-evolution path. Five commits, all on top of the closed-loop infrastructure shipped on the tool path. Closes PLAN.md Phase 1 deviation #3 for systematic-debugging; arxiv known-paper-recall and github-code-review planted-issue remain deferred (different verdict mechanisms).

The framework can now evaluate evolved skills not just by LM-judge composite score but by whether following the skill actually causes a real Hermes agent to debug + fix a planted bug correctly.

Commits:

test_command verdict mode for skill-shaped tasks — extends Task with optional test_command; score_task branches on it (run the command in fixture_dir, pass iff exit code zero). Tool-side scoring unchanged.
SkillFileInstaller + cache artifact_writer injection + runner skills-source plumbing — installer copies baseline skill into a writable workdir, runner stages it into each per-task sandbox via TaskRunContext.skills_src. ClosedLoopFeedbackCache parameterized to accept any artifact writer (default = current MCP manifest, preserving tool path).
systematic-debugging planted-bug suite (5 tasks: off-by-one, inverted condition, missing base case, wrong operator, mutable default arg) + fake target skill fixture (deliberately weak baseline so GEPA has headroom). Sanity tests verify every planted bug is real and every known fix passes.
Wire closed-loop end-to-end through evolve_skill — five --closed-loop-* CLI flags mirroring evolve_tool, _maybe_build_closed_loop_cache_skill helper, SkillModule.forward behavioral branch, behavioral-example injection into trainset/valset, metric construction with the cache.
Docs + manual smoke harness + PLAN deviation closure.

Test coverage: 1004 passing (was 950; +54 new). All mock-only in CI. Manual smoke at tests/manual/skill_closed_loop_smoke.py drives the closed-loop layer end-to-end against real hermes -z subprocesses (10 calls per run, real LM spend; not in CI).

Default --closed-loop-mode is feedback, not trainset. Skill bodies mutate heavily, so trainset mode's gate_mode=always (validator fires on every novel candidate) is 3-10x more expensive than on the tool path. Users opt into trainset explicitly when the cost is acceptable.

Manual validation outcome

Stage 1 smoke confirmed the closed-loop wiring works end-to-end against real hermes -z subprocesses:

Decision: pass
Baseline:  5/5 (100%)
Evolved:   5/5 (100%)
Per-task:  0 wins, 0 losses, 5 ties

All 10 hermes calls completed cleanly in ~4 min, candidate-text cache dedup'd on the second invocation, no orphaned per-task tmpdirs.

The 5/5 baseline pass rate is itself the headline finding: a reasoning-class model (gpt-5.4-mini in this case) solves the textbook planted bugs even with the deliberately weak baseline skill text. This is the saturation outcome (#3 from the design plan's three-outcome triage) — proves the framework is wired correctly but tells us the bug suite is too easy to discriminate skill text variants on capable models.

Stage 2 (heavy-budget GEPA run with --closed-loop-mode trainset) was skipped because the saturated baseline guarantees zero behavioral signal for GEPA to act on. Two follow-ups are scoped to fix this in a future PR: (a) add an optional --model kwarg to HermesAgentRunner so closed-loop validation can run against a deliberately weaker agent model independently of the user's Hermes default; (b) expand the suite with harder bugs (subtle binary-search off-by-one, generator exhaustion, mutable cached return, float-precision equality, async/state ordering) that survive a textbook debugging methodology.

Test plan

uv run pytest tests/ -q --ignore=tests/manual — 1004 passed
CI green on Python 3.10–3.13 (4 versions, mock-only tests)
Tool-path regression check: test_closed_loop_feedback.py, test_validator.py, test_safety.py, test_evolve_tool_closed_loop.py all green after cache kwarg rename + verify_backup Protocol extension
Manual smoke (tests/manual/skill_closed_loop_smoke.py) — passed end-to-end, ~4 min, all assertions green; wiring confirmed via real hermes -z × 10
Heavy-budget GEPA classification — deferred; baseline saturates at 5/5 against the configured reasoning model, leaving GEPA no behavioral signal to act on. Follow-up PR adds model override + harder bugs to make the classification meaningful.

Tool-side closed-loop scoring is a tool-call membership test (was an expected tool invoked, was a forbidden tool avoided). Skill-side suites need a different verdict: did the agent's edits make a planted test pass. Extends Task with an optional test_command field. When set, score_task runs the command in fixture_dir after the agent finishes and passes iff exit code is zero — tool-call membership is bypassed. Command failure modes (nonzero exit, timeout, FileNotFoundError) all map to "not passed" since the verdict is exit-code-driven. Runner errors still abstain. Uses shlex.split (no shell) so suite-controlled commands don't pick up shell metacharacters. Default timeout 60s. Tool-side suites (patch, write_file, search_files) are unchanged — they never set test_command and follow the existing tool-call rule.

Adds the plumbing needed to run closed-loop validation on a SKILL.md, not just a tool description. Three coordinated pieces: SkillFileInstaller copies the baseline skill directory (SKILL.md + any siblings) into a caller-owned writable workdir at construction. install() then atomically overwrites the target with candidate text. The original on-disk skill — possibly in a read-only plugin cache — is never touched. The installer exposes `skills_src`, the directory the runner needs to stage into its per-task sandbox so `hermes -z` discovers the candidate. ClosedLoopFeedbackCache grows an `artifact_writer` callable kwarg (default = the existing single-tool MCP manifest JSON, preserving the tool path bit-for-bit) and an `artifact_suffix` for the temp filenames. Skill-side callers inject `write_text_artifact` and `.md`. Constructor kwargs renamed `tool_name`→`artifact_name` and `baseline_description`→`baseline_artifact_text` for semantic neutrality; tool-side call site updated. HermesAgentRunner._prime_sandbox reads `TaskRunContext.skills_src` (new optional field) and copies its contents into `<sandbox>/skills/` per task — fresh copy each task so an in-task agent write can't corrupt the installer's persistent candidate state. Confirmed empirically that Hermes loads skills from `$HERMES_HOME/skills/`. ArtifactInstaller Protocol grows `verify_backup(backup_path)` so each installer validates its own backup format. Tool installer parses Python (as before); skill installer checks non-empty + UTF-8 decodable.

…t skill Five planted-bug tasks at evolution/validation/suites/systematic_debugging.jsonl covering bug shapes a debugging methodology should be able to handle: off-by-one in range, inverted condition, missing recursive base case, wrong arithmetic operator, mutable-default-argument aliasing. Each task ships: - solution.py with one planted bug and a docstring stating the spec - test_solution.py asserting the spec across the bug case plus enough edge cases that a too-narrow "fix" still fails - test_command="python test_solution.py" — the verdict from test_command-mode scoring Sanity tests in tests/validation/test_systematic_debugging_suite.py: - Suite shape: 5 tasks, every task has test_command, solution.py, and test_solution.py - Planted bugs are real: materializing the buggy fixture and running its test exits non-zero - Known fixes pass: applying the documented fix and re-running the test exits zero The last two are critical — they would catch a typo'd assertion or a pathologically-broken test that no candidate could satisfy. Fake target skill at tests/fixtures/skills/systematic_debugging/SKILL.md gives the closed-loop machinery a stable substrate for unit tests + manual smoke. Deliberately weak baseline ("try random changes") so GEPA has measurable headroom to evolve toward a real methodology.

Removes the placeholder UsageError on --closed-loop-during-evolution. The skill-evolution path now matches the tool-evolution path: closed-loop validator feedback flows into the reflection LM, and opt-in trainset mode lets behavioral verdicts break judge ties in GEPA's sum(scores) acceptance rule. Five coordinated changes: - CLI grows --closed-loop-saturation-threshold, --closed-loop-min-iters, --closed-loop-window-size, --closed-loop-mode, --closed-loop-in-valset, mirroring evolve_tool. No --closed-loop-hermes-repo: SkillSource discovery + the installer's writable workdir replace that. - New _maybe_build_closed_loop_cache_skill helper constructs the SkillFileInstaller, ClosedLoopValidator, and ClosedLoopFeedbackCache wired with write_text_artifact and .md suffix. - SkillModule.forward grows an optional closed_loop_task_id parameter. When present, skips the predictor LM call and returns a Prediction carrying _closed_loop_task_id + _candidate_text markers the metric routes on. - Trainset/valset get behavioral examples appended when mode is trainset or both. - Metric construction passes closed_loop_cache through. build_behavioral_examples gets a task_field kwarg (default "task" preserves tool-path behavior; skill-side passes "task_input" so the example shape matches SkillModule.forward). Default --closed-loop-mode is feedback for skills (not trainset). Skill bodies mutate heavily, so the "always" gate mode that trainset needs would fire the validator on every novel candidate — costlier than the tool path. Users opt into trainset explicitly.

…ness PLAN.md Phase 1 deviation #3 was the unfinished-skill-side-closed-loop note. systematic-debugging is now built (commits 1-4); update the deviation to reflect that, and explicitly call out which equivalents remain deferred (arxiv known-paper-recall, github-code-review planted-issue — same harness, different verdict mechanisms). docs/workflows.md grows a "Skill-path equivalent" subsection under Workflow 11 covering the two structural differences (test_command verdict vs tool-call membership; SkillFileInstaller vs in-place tool splice) and the rationale for the different default --closed-loop-mode (feedback for skills, since heavy mutation makes "always" gate mode 3-10x costlier than tool-path). tests/manual/skill_closed_loop_smoke.py drives the closed-loop layer end-to-end against the fake systematic-debugging skill + the planted-bug suite, with real hermes -z subprocesses. Pre-flight checks confirm hermes and python are on PATH. Asserts the cache fires, per-task verdicts come back, candidate-text caching dedups the second call, and no orphaned per-task tmpdirs leak. Not in CI — heavyweight (10 hermes -z calls, real LM spend).

jramos added 5 commits May 17, 2026 14:56

jramos merged commit a798701 into main May 17, 2026
4 checks passed

jramos deleted the feat/skill-closed-loop-systematic-debugging branch May 17, 2026 22:07

jramos mentioned this pull request May 17, 2026

feat: closed-loop validation follow-ups (agent model override + advanced bug suite) #63

Merged

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: skill-side closed-loop validation for systematic-debugging#62

feat: skill-side closed-loop validation for systematic-debugging#62
jramos merged 5 commits into
mainfrom
feat/skill-closed-loop-systematic-debugging

jramos commented May 17, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jramos commented May 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Manual validation outcome

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

jramos commented May 17, 2026 •

edited

Loading