feat: skill-side closed-loop validation for systematic-debugging#62
Merged
Conversation
Tool-side closed-loop scoring is a tool-call membership test (was an expected tool invoked, was a forbidden tool avoided). Skill-side suites need a different verdict: did the agent's edits make a planted test pass. Extends Task with an optional test_command field. When set, score_task runs the command in fixture_dir after the agent finishes and passes iff exit code is zero — tool-call membership is bypassed. Command failure modes (nonzero exit, timeout, FileNotFoundError) all map to "not passed" since the verdict is exit-code-driven. Runner errors still abstain. Uses shlex.split (no shell) so suite-controlled commands don't pick up shell metacharacters. Default timeout 60s. Tool-side suites (patch, write_file, search_files) are unchanged — they never set test_command and follow the existing tool-call rule.
Adds the plumbing needed to run closed-loop validation on a SKILL.md, not just a tool description. Three coordinated pieces: SkillFileInstaller copies the baseline skill directory (SKILL.md + any siblings) into a caller-owned writable workdir at construction. install() then atomically overwrites the target with candidate text. The original on-disk skill — possibly in a read-only plugin cache — is never touched. The installer exposes `skills_src`, the directory the runner needs to stage into its per-task sandbox so `hermes -z` discovers the candidate. ClosedLoopFeedbackCache grows an `artifact_writer` callable kwarg (default = the existing single-tool MCP manifest JSON, preserving the tool path bit-for-bit) and an `artifact_suffix` for the temp filenames. Skill-side callers inject `write_text_artifact` and `.md`. Constructor kwargs renamed `tool_name`→`artifact_name` and `baseline_description`→`baseline_artifact_text` for semantic neutrality; tool-side call site updated. HermesAgentRunner._prime_sandbox reads `TaskRunContext.skills_src` (new optional field) and copies its contents into `<sandbox>/skills/` per task — fresh copy each task so an in-task agent write can't corrupt the installer's persistent candidate state. Confirmed empirically that Hermes loads skills from `$HERMES_HOME/skills/`. ArtifactInstaller Protocol grows `verify_backup(backup_path)` so each installer validates its own backup format. Tool installer parses Python (as before); skill installer checks non-empty + UTF-8 decodable.
…t skill
Five planted-bug tasks at evolution/validation/suites/systematic_debugging.jsonl
covering bug shapes a debugging methodology should be able to handle:
off-by-one in range, inverted condition, missing recursive base case,
wrong arithmetic operator, mutable-default-argument aliasing.
Each task ships:
- solution.py with one planted bug and a docstring stating the spec
- test_solution.py asserting the spec across the bug case plus
enough edge cases that a too-narrow "fix" still fails
- test_command="python test_solution.py" — the verdict from
test_command-mode scoring
Sanity tests in tests/validation/test_systematic_debugging_suite.py:
- Suite shape: 5 tasks, every task has test_command, solution.py,
and test_solution.py
- Planted bugs are real: materializing the buggy fixture and
running its test exits non-zero
- Known fixes pass: applying the documented fix and re-running
the test exits zero
The last two are critical — they would catch a typo'd assertion or
a pathologically-broken test that no candidate could satisfy.
Fake target skill at tests/fixtures/skills/systematic_debugging/SKILL.md
gives the closed-loop machinery a stable substrate for unit tests +
manual smoke. Deliberately weak baseline ("try random changes") so
GEPA has measurable headroom to evolve toward a real methodology.
Removes the placeholder UsageError on --closed-loop-during-evolution. The skill-evolution path now matches the tool-evolution path: closed-loop validator feedback flows into the reflection LM, and opt-in trainset mode lets behavioral verdicts break judge ties in GEPA's sum(scores) acceptance rule. Five coordinated changes: - CLI grows --closed-loop-saturation-threshold, --closed-loop-min-iters, --closed-loop-window-size, --closed-loop-mode, --closed-loop-in-valset, mirroring evolve_tool. No --closed-loop-hermes-repo: SkillSource discovery + the installer's writable workdir replace that. - New _maybe_build_closed_loop_cache_skill helper constructs the SkillFileInstaller, ClosedLoopValidator, and ClosedLoopFeedbackCache wired with write_text_artifact and .md suffix. - SkillModule.forward grows an optional closed_loop_task_id parameter. When present, skips the predictor LM call and returns a Prediction carrying _closed_loop_task_id + _candidate_text markers the metric routes on. - Trainset/valset get behavioral examples appended when mode is trainset or both. - Metric construction passes closed_loop_cache through. build_behavioral_examples gets a task_field kwarg (default "task" preserves tool-path behavior; skill-side passes "task_input" so the example shape matches SkillModule.forward). Default --closed-loop-mode is feedback for skills (not trainset). Skill bodies mutate heavily, so the "always" gate mode that trainset needs would fire the validator on every novel candidate — costlier than the tool path. Users opt into trainset explicitly.
…ness PLAN.md Phase 1 deviation #3 was the unfinished-skill-side-closed-loop note. systematic-debugging is now built (commits 1-4); update the deviation to reflect that, and explicitly call out which equivalents remain deferred (arxiv known-paper-recall, github-code-review planted-issue — same harness, different verdict mechanisms). docs/workflows.md grows a "Skill-path equivalent" subsection under Workflow 11 covering the two structural differences (test_command verdict vs tool-call membership; SkillFileInstaller vs in-place tool splice) and the rationale for the different default --closed-loop-mode (feedback for skills, since heavy mutation makes "always" gate mode 3-10x costlier than tool-path). tests/manual/skill_closed_loop_smoke.py drives the closed-loop layer end-to-end against the fake systematic-debugging skill + the planted-bug suite, with real hermes -z subprocesses. Pre-flight checks confirm hermes and python are on PATH. Asserts the cache fires, per-task verdicts come back, candidate-text caching dedups the second call, and no orphaned per-task tmpdirs leak. Not in CI — heavyweight (10 hermes -z calls, real LM spend).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Brings closed-loop validation to the skill-evolution path. Five commits, all on top of the closed-loop infrastructure shipped on the tool path. Closes PLAN.md Phase 1 deviation #3 for
systematic-debugging; arxiv known-paper-recall and github-code-review planted-issue remain deferred (different verdict mechanisms).The framework can now evaluate evolved skills not just by LM-judge composite score but by whether following the skill actually causes a real Hermes agent to debug + fix a planted bug correctly.
Commits:
test_commandverdict mode for skill-shaped tasks — extendsTaskwith optionaltest_command;score_taskbranches on it (run the command infixture_dir, pass iff exit code zero). Tool-side scoring unchanged.SkillFileInstaller+ cacheartifact_writerinjection + runner skills-source plumbing — installer copies baseline skill into a writable workdir, runner stages it into each per-task sandbox viaTaskRunContext.skills_src.ClosedLoopFeedbackCacheparameterized to accept any artifact writer (default = current MCP manifest, preserving tool path).systematic-debuggingplanted-bug suite (5 tasks: off-by-one, inverted condition, missing base case, wrong operator, mutable default arg) + fake target skill fixture (deliberately weak baseline so GEPA has headroom). Sanity tests verify every planted bug is real and every known fix passes.evolve_skill— five--closed-loop-*CLI flags mirroringevolve_tool,_maybe_build_closed_loop_cache_skillhelper,SkillModule.forwardbehavioral branch, behavioral-example injection into trainset/valset, metric construction with the cache.Test coverage: 1004 passing (was 950; +54 new). All mock-only in CI. Manual smoke at
tests/manual/skill_closed_loop_smoke.pydrives the closed-loop layer end-to-end against realhermes -zsubprocesses (10 calls per run, real LM spend; not in CI).Default
--closed-loop-modeisfeedback, nottrainset. Skill bodies mutate heavily, so trainset mode'sgate_mode=always(validator fires on every novel candidate) is 3-10x more expensive than on the tool path. Users opt into trainset explicitly when the cost is acceptable.Manual validation outcome
Stage 1 smoke confirmed the closed-loop wiring works end-to-end against real
hermes -zsubprocesses:All 10 hermes calls completed cleanly in ~4 min, candidate-text cache dedup'd on the second invocation, no orphaned per-task tmpdirs.
The 5/5 baseline pass rate is itself the headline finding: a reasoning-class model (gpt-5.4-mini in this case) solves the textbook planted bugs even with the deliberately weak baseline skill text. This is the saturation outcome (#3 from the design plan's three-outcome triage) — proves the framework is wired correctly but tells us the bug suite is too easy to discriminate skill text variants on capable models.
Stage 2 (heavy-budget GEPA run with
--closed-loop-mode trainset) was skipped because the saturated baseline guarantees zero behavioral signal for GEPA to act on. Two follow-ups are scoped to fix this in a future PR: (a) add an optional--modelkwarg toHermesAgentRunnerso closed-loop validation can run against a deliberately weaker agent model independently of the user's Hermes default; (b) expand the suite with harder bugs (subtle binary-search off-by-one, generator exhaustion, mutable cached return, float-precision equality, async/state ordering) that survive a textbook debugging methodology.Test plan
uv run pytest tests/ -q --ignore=tests/manual— 1004 passedtest_closed_loop_feedback.py,test_validator.py,test_safety.py,test_evolve_tool_closed_loop.pyall green after cache kwarg rename +verify_backupProtocol extensiontests/manual/skill_closed_loop_smoke.py) — passed end-to-end, ~4 min, all assertions green; wiring confirmed via realhermes -z× 10