Skip to content

feat: skill-side closed-loop validation for systematic-debugging#62

Merged
jramos merged 5 commits into
mainfrom
feat/skill-closed-loop-systematic-debugging
May 17, 2026
Merged

feat: skill-side closed-loop validation for systematic-debugging#62
jramos merged 5 commits into
mainfrom
feat/skill-closed-loop-systematic-debugging

Conversation

@jramos
Copy link
Copy Markdown
Owner

@jramos jramos commented May 17, 2026

Summary

Brings closed-loop validation to the skill-evolution path. Five commits, all on top of the closed-loop infrastructure shipped on the tool path. Closes PLAN.md Phase 1 deviation #3 for systematic-debugging; arxiv known-paper-recall and github-code-review planted-issue remain deferred (different verdict mechanisms).

The framework can now evaluate evolved skills not just by LM-judge composite score but by whether following the skill actually causes a real Hermes agent to debug + fix a planted bug correctly.

Commits:

  1. test_command verdict mode for skill-shaped tasks — extends Task with optional test_command; score_task branches on it (run the command in fixture_dir, pass iff exit code zero). Tool-side scoring unchanged.
  2. SkillFileInstaller + cache artifact_writer injection + runner skills-source plumbing — installer copies baseline skill into a writable workdir, runner stages it into each per-task sandbox via TaskRunContext.skills_src. ClosedLoopFeedbackCache parameterized to accept any artifact writer (default = current MCP manifest, preserving tool path).
  3. systematic-debugging planted-bug suite (5 tasks: off-by-one, inverted condition, missing base case, wrong operator, mutable default arg) + fake target skill fixture (deliberately weak baseline so GEPA has headroom). Sanity tests verify every planted bug is real and every known fix passes.
  4. Wire closed-loop end-to-end through evolve_skill — five --closed-loop-* CLI flags mirroring evolve_tool, _maybe_build_closed_loop_cache_skill helper, SkillModule.forward behavioral branch, behavioral-example injection into trainset/valset, metric construction with the cache.
  5. Docs + manual smoke harness + PLAN deviation closure.

Test coverage: 1004 passing (was 950; +54 new). All mock-only in CI. Manual smoke at tests/manual/skill_closed_loop_smoke.py drives the closed-loop layer end-to-end against real hermes -z subprocesses (10 calls per run, real LM spend; not in CI).

Default --closed-loop-mode is feedback, not trainset. Skill bodies mutate heavily, so trainset mode's gate_mode=always (validator fires on every novel candidate) is 3-10x more expensive than on the tool path. Users opt into trainset explicitly when the cost is acceptable.

Manual validation outcome

Stage 1 smoke confirmed the closed-loop wiring works end-to-end against real hermes -z subprocesses:

Decision: pass
Baseline:  5/5 (100%)
Evolved:   5/5 (100%)
Per-task:  0 wins, 0 losses, 5 ties

All 10 hermes calls completed cleanly in ~4 min, candidate-text cache dedup'd on the second invocation, no orphaned per-task tmpdirs.

The 5/5 baseline pass rate is itself the headline finding: a reasoning-class model (gpt-5.4-mini in this case) solves the textbook planted bugs even with the deliberately weak baseline skill text. This is the saturation outcome (#3 from the design plan's three-outcome triage) — proves the framework is wired correctly but tells us the bug suite is too easy to discriminate skill text variants on capable models.

Stage 2 (heavy-budget GEPA run with --closed-loop-mode trainset) was skipped because the saturated baseline guarantees zero behavioral signal for GEPA to act on. Two follow-ups are scoped to fix this in a future PR: (a) add an optional --model kwarg to HermesAgentRunner so closed-loop validation can run against a deliberately weaker agent model independently of the user's Hermes default; (b) expand the suite with harder bugs (subtle binary-search off-by-one, generator exhaustion, mutable cached return, float-precision equality, async/state ordering) that survive a textbook debugging methodology.

Test plan

  • uv run pytest tests/ -q --ignore=tests/manual — 1004 passed
  • CI green on Python 3.10–3.13 (4 versions, mock-only tests)
  • Tool-path regression check: test_closed_loop_feedback.py, test_validator.py, test_safety.py, test_evolve_tool_closed_loop.py all green after cache kwarg rename + verify_backup Protocol extension
  • Manual smoke (tests/manual/skill_closed_loop_smoke.py) — passed end-to-end, ~4 min, all assertions green; wiring confirmed via real hermes -z × 10
  • Heavy-budget GEPA classification — deferred; baseline saturates at 5/5 against the configured reasoning model, leaving GEPA no behavioral signal to act on. Follow-up PR adds model override + harder bugs to make the classification meaningful.

jramos added 5 commits May 17, 2026 14:56
Tool-side closed-loop scoring is a tool-call membership test (was an
expected tool invoked, was a forbidden tool avoided). Skill-side suites
need a different verdict: did the agent's edits make a planted test
pass.

Extends Task with an optional test_command field. When set, score_task
runs the command in fixture_dir after the agent finishes and passes iff
exit code is zero — tool-call membership is bypassed. Command failure
modes (nonzero exit, timeout, FileNotFoundError) all map to "not
passed" since the verdict is exit-code-driven. Runner errors still
abstain.

Uses shlex.split (no shell) so suite-controlled commands don't pick up
shell metacharacters. Default timeout 60s.

Tool-side suites (patch, write_file, search_files) are unchanged — they
never set test_command and follow the existing tool-call rule.
Adds the plumbing needed to run closed-loop validation on a SKILL.md,
not just a tool description. Three coordinated pieces:

SkillFileInstaller copies the baseline skill directory (SKILL.md + any
siblings) into a caller-owned writable workdir at construction. install()
then atomically overwrites the target with candidate text. The original
on-disk skill — possibly in a read-only plugin cache — is never touched.
The installer exposes `skills_src`, the directory the runner needs to
stage into its per-task sandbox so `hermes -z` discovers the candidate.

ClosedLoopFeedbackCache grows an `artifact_writer` callable kwarg
(default = the existing single-tool MCP manifest JSON, preserving the
tool path bit-for-bit) and an `artifact_suffix` for the temp filenames.
Skill-side callers inject `write_text_artifact` and `.md`. Constructor
kwargs renamed `tool_name`→`artifact_name` and
`baseline_description`→`baseline_artifact_text` for semantic neutrality;
tool-side call site updated.

HermesAgentRunner._prime_sandbox reads `TaskRunContext.skills_src`
(new optional field) and copies its contents into `<sandbox>/skills/`
per task — fresh copy each task so an in-task agent write can't corrupt
the installer's persistent candidate state. Confirmed empirically that
Hermes loads skills from `$HERMES_HOME/skills/`.

ArtifactInstaller Protocol grows `verify_backup(backup_path)` so each
installer validates its own backup format. Tool installer parses Python
(as before); skill installer checks non-empty + UTF-8 decodable.
…t skill

Five planted-bug tasks at evolution/validation/suites/systematic_debugging.jsonl
covering bug shapes a debugging methodology should be able to handle:
off-by-one in range, inverted condition, missing recursive base case,
wrong arithmetic operator, mutable-default-argument aliasing.

Each task ships:
  - solution.py with one planted bug and a docstring stating the spec
  - test_solution.py asserting the spec across the bug case plus
    enough edge cases that a too-narrow "fix" still fails
  - test_command="python test_solution.py" — the verdict from
    test_command-mode scoring

Sanity tests in tests/validation/test_systematic_debugging_suite.py:
  - Suite shape: 5 tasks, every task has test_command, solution.py,
    and test_solution.py
  - Planted bugs are real: materializing the buggy fixture and
    running its test exits non-zero
  - Known fixes pass: applying the documented fix and re-running
    the test exits zero
The last two are critical — they would catch a typo'd assertion or
a pathologically-broken test that no candidate could satisfy.

Fake target skill at tests/fixtures/skills/systematic_debugging/SKILL.md
gives the closed-loop machinery a stable substrate for unit tests +
manual smoke. Deliberately weak baseline ("try random changes") so
GEPA has measurable headroom to evolve toward a real methodology.
Removes the placeholder UsageError on --closed-loop-during-evolution.
The skill-evolution path now matches the tool-evolution path:
closed-loop validator feedback flows into the reflection LM, and
opt-in trainset mode lets behavioral verdicts break judge ties in
GEPA's sum(scores) acceptance rule.

Five coordinated changes:

- CLI grows --closed-loop-saturation-threshold, --closed-loop-min-iters,
  --closed-loop-window-size, --closed-loop-mode, --closed-loop-in-valset,
  mirroring evolve_tool. No --closed-loop-hermes-repo: SkillSource
  discovery + the installer's writable workdir replace that.
- New _maybe_build_closed_loop_cache_skill helper constructs the
  SkillFileInstaller, ClosedLoopValidator, and ClosedLoopFeedbackCache
  wired with write_text_artifact and .md suffix.
- SkillModule.forward grows an optional closed_loop_task_id parameter.
  When present, skips the predictor LM call and returns a Prediction
  carrying _closed_loop_task_id + _candidate_text markers the metric
  routes on.
- Trainset/valset get behavioral examples appended when mode is
  trainset or both.
- Metric construction passes closed_loop_cache through.

build_behavioral_examples gets a task_field kwarg (default "task"
preserves tool-path behavior; skill-side passes "task_input" so the
example shape matches SkillModule.forward).

Default --closed-loop-mode is feedback for skills (not trainset).
Skill bodies mutate heavily, so the "always" gate mode that trainset
needs would fire the validator on every novel candidate — costlier
than the tool path. Users opt into trainset explicitly.
…ness

PLAN.md Phase 1 deviation #3 was the unfinished-skill-side-closed-loop
note. systematic-debugging is now built (commits 1-4); update the
deviation to reflect that, and explicitly call out which equivalents
remain deferred (arxiv known-paper-recall, github-code-review
planted-issue — same harness, different verdict mechanisms).

docs/workflows.md grows a "Skill-path equivalent" subsection under
Workflow 11 covering the two structural differences (test_command
verdict vs tool-call membership; SkillFileInstaller vs in-place tool
splice) and the rationale for the different default --closed-loop-mode
(feedback for skills, since heavy mutation makes "always" gate mode
3-10x costlier than tool-path).

tests/manual/skill_closed_loop_smoke.py drives the closed-loop layer
end-to-end against the fake systematic-debugging skill + the planted-bug
suite, with real hermes -z subprocesses. Pre-flight checks confirm
hermes and python are on PATH. Asserts the cache fires, per-task
verdicts come back, candidate-text caching dedups the second call,
and no orphaned per-task tmpdirs leak. Not in CI — heavyweight (10
hermes -z calls, real LM spend).
@jramos jramos merged commit a798701 into main May 17, 2026
4 checks passed
@jramos jramos deleted the feat/skill-closed-loop-systematic-debugging branch May 17, 2026 22:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant