feat: closed-loop validation follow-ups (agent model override + advanced bug suite) by jramos · Pull Request #63 · jramos/agent-self-evolution

jramos · 2026-05-17T22:22:24Z

Summary

Direct follow-ups to the manual validation finding from #62: a capable reasoning model solved every textbook planted bug regardless of skill text, leaving the closed-loop verdict at 5/5 across the board and hiding the behavioral signal. Three coordinated changes to recover discrimination, plus an honest accounting of what they did and didn't achieve.

Commits:

- --closed-loop-agent-model override on HermesAgentRunner: when set, the validator's subprocess invocation becomes hermes -m MODEL -z ..., running the agent against a deliberately weaker model than the user's daily-driver Hermes default without mutating ~/.hermes/config.yaml. Plumbed through both evolve_skill and evolve_tool cache helpers + CLI flags. Default unset → existing behavior bit-for-bit preserved.
- systematic_debugging_advanced.jsonl: 5 harder planted bugs designed to discriminate skill-text variants on capable agents. Each exercises a different cognitive failure mode (iteration, state, precision, algorithm, OOP semantics) and breaks edge cases the obvious patch misses. Sanity tests verify every planted bug is real and every documented fix passes, plus a regression guard against task_id collisions with the basic suite.
- --closed-loop-task-timeout-seconds knob: bump the per-task wall-clock budget for slow reasoning models (o1/o3-family typically need 200-300s/task). Surfaced by validation: even when the override flag works mechanically, the default 120s budget abstains most tasks before they finish.
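
For reviewers who want a picture of what the advanced-suite sanity tests check, here is a minimal sketch rather than the actual test code. It assumes each JSONL line is a JSON object with a `task_id` field (the only field the description above names); the helper names `colliding_task_ids` and `planted_bug_is_real`, and the float example, are hypothetical illustrations.

```python
import json
import math


def load_task_ids(jsonl_text):
    """Collect task_id values from JSONL text (one JSON object per line)."""
    return [json.loads(line)["task_id"]
            for line in jsonl_text.splitlines() if line.strip()]


def colliding_task_ids(basic_jsonl, advanced_jsonl):
    """Return task_ids that appear in both suites (should be empty)."""
    return sorted(set(load_task_ids(basic_jsonl)) & set(load_task_ids(advanced_jsonl)))


def planted_bug_is_real(buggy, fixed, check):
    """A planted bug is 'real' if the check fails on the buggy code
    and passes on the documented fix. check(fn) returns a bool."""
    return (not check(buggy)) and check(fixed)


# Hypothetical stand-in for a float-precision task: == on a computed sum.
def sums_to_buggy(parts, target):
    return sum(parts) == target


def sums_to_fixed(parts, target):
    # Tolerance chosen for illustration only.
    return math.isclose(sum(parts), target, rel_tol=1e-9)
```

A check such as `lambda fn: fn([0.1, 0.2], 0.3)` fails on the buggy version and passes on the fixed one, which is the property the sanity tests are described as asserting for every task.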

Plus: the manual smoke harness at tests/manual/skill_closed_loop_smoke.py grows --suite {basic|advanced}, --agent-model MODEL, and --task-timeout-seconds N flags. docs/workflows.md documents all three knobs along with the empirical caveat below.

Test coverage: 1029 passing (was 1010; +19 new — 8 advanced-suite sanity + 11 plumbing/CLI tests across both paths). All mock-only in CI.

Manual validation outcome

Honest accounting of what was tested with real hermes -z subprocesses against the user's actual setup (gpt-5.4-mini default):

Run	Result	What it proves
Smoke, basic suite, default model	5/5 baseline → 5/5 evolved (saturation)	wiring works on real hermes calls
Smoke, advanced suite, default model	5/5 baseline → 5/5 evolved (saturation)	harder bugs don't help against capable reasoning models on this domain
Smoke, advanced suite, `--agent-model o3-mini`	9 of 10 task runs abstained (timed out at default 120s)	flag flows end-to-end; default timeout too short for o3-mini

What this PR ships ≠ what it validates. The --closed-loop-agent-model and --closed-loop-task-timeout-seconds flags are plumbed correctly and the advanced suite is a genuinely harder regression check. But on this user's setup (capable reasoning model default + only o3-mini as a compatible weaker model + 120s default timeout), the closed-loop signal saturates either at 5/5 (capable model) or at abstention (slow model). No "improvement caught" or "regression caught" outcome was produced; only saturation.
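
To make the outcome labels concrete, the following is a rough sketch of how per-task results could be folded into one label. It is not the validator's actual logic: the function name, the `"pass"`/`"fail"`/`"abstain"` strings, the `"tie"` label, and the majority-abstention threshold are all assumptions made for illustration.

```python
def classify_closed_loop(baseline, evolved):
    """Fold per-task outcomes into one label.

    baseline / evolved: lists of "pass", "fail", or "abstain", one per task.
    The majority-abstention rule below is an assumed threshold.
    """
    runs = baseline + evolved
    abstained = sum(1 for outcome in runs if outcome == "abstain")
    if abstained * 2 > len(runs):
        return "abstention"

    baseline_passed = baseline.count("pass")
    evolved_passed = evolved.count("pass")
    if evolved_passed > baseline_passed:
        return "improvement caught"
    if evolved_passed < baseline_passed:
        return "regression caught"
    if baseline_passed == len(baseline):
        return "saturation"
    return "tie"
```

Under these assumptions, 5/5 → 5/5 maps to "saturation" and 9 of 10 abstained runs maps to "abstention", matching the two outcomes seen in the table above.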

The honest interpretation. For this user, Python-debugging-of-textbook-bugs may not be a discriminating evaluation surface for the closed-loop signal regardless of which knob is turned. The flags are useful for future setups (a user with a faster weaker model would get real signal, and the timeout knob lets o3-mini-class models actually finish). The framework's central hypothesis (behavioral wins break judge ties on saturated baselines) remains untested end-to-end and likely needs a different evaluation domain to demonstrate — multi-file refactoring, ambiguous specs with edge cases, iterative hypothesis-testing tasks where methodology matters more than recognition.

Test plan

uv run pytest tests/ -q --ignore=tests/manual — 1029 passed
CI green on Python 3.10–3.13 (pre-timeout-commit; re-checking with the new commit)
Manual smoke, basic suite, default model — 5/5 saturation, wiring confirmed
Manual smoke, advanced suite, default model — 5/5 saturation, harder bugs don't discriminate
Manual smoke, advanced suite, --agent-model o3-mini — flag works; timeout exposes need for --closed-loop-task-timeout-seconds
Future: heavy-budget evolve_skill against a setup that produces non-saturated baseline — requires a different evaluation surface, out of scope for this PR

…unner

Adds an optional ``model`` kwarg to HermesAgentRunner. When set, the subprocess invocation becomes ``hermes -m MODEL -z ...`` so the agent runs against a deliberately weaker model than the user's daily-driver Hermes default, without mutating ~/.hermes/config.yaml.

Surfaced by the manual validation run on the systematic-debugging suite: a capable reasoning model solved all 5 textbook planted bugs on the deliberately-weak baseline skill, leaving the closed-loop verdict at 5/5 across the board and hiding the behavioral signal the validator is supposed to surface. Swapping the model via ``hermes config set`` collapsed the structured ``model:`` dict (provider/base_url/api_key) into a bare string and broke the user's custom-endpoint config, so per-invocation override via the CLI is the cleaner seam.

Plumbed through the existing cache helpers on both paths:

evolve_skill: --closed-loop-agent-model → _maybe_build_closed_loop_cache_skill
evolve_tool: --closed-loop-agent-model → _maybe_build_closed_loop_cache

The flag is unset by default; existing behavior is bit-for-bit preserved when callers don't opt in. ``-m MODEL`` is inserted before ``-z`` so hermes parses it as a global flag, not as part of the prompt.
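
The flag ordering described above can be sketched as follows. This is an illustration, not the code in HermesAgentRunner: `build_hermes_argv` is a hypothetical helper, and it assumes the prompt is passed as the argument immediately after ``-z``.

```python
def build_hermes_argv(prompt, model=None):
    """Build the argv list for a hermes invocation.

    When model is None the command is unchanged from the default path.
    When set, -m MODEL goes before -z so it is read as a global flag
    rather than as part of the prompt.
    """
    argv = ["hermes"]
    if model is not None:
        argv += ["-m", model]
    argv += ["-z", prompt]  # assumption: the prompt directly follows -z
    return argv
```

Passing the list form to a subprocess call (rather than a shell string) also avoids quoting problems when the prompt contains spaces or shell metacharacters.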

The textbook 5-bug suite at systematic_debugging.jsonl saturates at 5/5 against capable reasoning models: every bug is the kind of thing the agent solves on first try regardless of what the skill text says. That zeroes out the behavioral signal closed-loop validation is supposed to surface.

New suite at systematic_debugging_advanced.jsonl ships 5 bugs that require structured debugging because the obvious patch only fixes the failing test, leaving the spec's edge cases broken. Each exercises a different cognitive failure mode:

- debug_generator_exhaustion: function iterates input twice; works on lists, breaks on single-pass generators
- debug_shared_mutable_return: cached default returns a live ref; caller mutation leaks into the cache
- debug_float_precision_equality: == on a computed sum; fix requires math.isclose with documented tolerance
- debug_binary_search_boundary: bisect_right code masquerading as bisect_left; middle insertion works, leftmost-of-equals fails
- debug_class_vs_instance_attribute: class-level mutable default; instances share state instead of being independent

Sanity tests prove every planted bug is real (baseline test fails) and every documented fix passes, plus a regression guard that task_ids don't collide with the basic suite.

Smoke harness grows --suite {basic|advanced} and --agent-model flags so both wiring sanity and headroom validation are driven by the same script. Docs updated with the two-knob guidance (suite + agent model) for recovering discrimination when the default Hermes model saturates.
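
Two of the failure modes listed above, sketched in miniature. These are illustrations written for this description, not the contents of systematic_debugging_advanced.jsonl; the function names are hypothetical.

```python
# Generator exhaustion: the input is iterated twice.
def normalize_buggy(values):
    total = sum(values)                  # first pass consumes a generator
    return [v / total for v in values]   # second pass then yields nothing


def normalize_fixed(values):
    items = list(values)                 # materialize once, iterate freely
    total = sum(items)
    return [v / total for v in items]


# Binary search boundary: <= gives bisect_right behaviour,
# while the spec asks for the leftmost insertion point (bisect_left).
def insertion_index_buggy(sorted_items, target):
    lo, hi = 0, len(sorted_items)
    while lo < hi:
        mid = (lo + hi) // 2
        if sorted_items[mid] <= target:
            lo = mid + 1
        else:
            hi = mid
    return lo


def insertion_index_fixed(sorted_items, target):
    lo, hi = 0, len(sorted_items)
    while lo < hi:
        mid = (lo + hi) // 2
        if sorted_items[mid] < target:   # strict comparison: leftmost slot
            lo = mid + 1
        else:
            hi = mid
    return lo
```

In both cases the common-path test passes on the buggy version (a list input, a target not already present), which is why a patch aimed only at the failing test can leave the edge case broken.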

…ion finding

Quality-of-life knob plumbed through both evolve_skill and evolve_tool: override HermesAgentRunner's per-task wall-clock budget when the chosen agent model is slow. Reasoning models other than the smallest typically take 200-300s per debugging task; the default 120s causes them to abstain (timeout) silently rather than producing a verdict.

Manual validation finding: both planted-bug suites (basic + advanced) saturate at 5/5 against capable reasoning models (gpt-5.4-mini on this setup), and the only confirmed-compatible weaker reasoning model (o3-mini) is slow enough to abstain most tasks at the default timeout. The framework's wiring is correct end-to-end, but Python-debugging-of-textbook-bugs may not discriminate skill-text variants on capable models regardless of the suite or model swap. Documented in docs/workflows.md alongside the suggestion to evaluate on surfaces where methodology matters more than recognition (multi-file refactoring, ambiguous-spec tasks, iterative hypothesis-testing scenarios).

Smoke harness grows --task-timeout-seconds to match.
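
The timeout-to-abstention behavior can be sketched with the standard library as below. This is an assumption about the shape of the mechanism, not HermesAgentRunner's implementation; `run_task` and the returned dict keys are hypothetical, and only the 120s default comes from the text above.

```python
import subprocess
import sys


def run_task(argv, timeout_seconds=120.0):
    """Run one task as a subprocess; treat a timeout as an abstention.

    subprocess.run kills the child and raises TimeoutExpired once the
    wall-clock budget is exceeded.
    """
    try:
        proc = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_seconds
        )
    except subprocess.TimeoutExpired:
        return {"status": "abstained", "reason": "timeout"}
    return {"status": "completed", "returncode": proc.returncode,
            "stdout": proc.stdout}


if __name__ == "__main__":
    # A stand-in "slow model": sleeps longer than the budget allows.
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_task(slow, timeout_seconds=0.5)["status"])
```

Raising the budget for a slow model then amounts to passing a larger `timeout_seconds`, which is the role the new flag plays at the CLI level.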

jramos added 3 commits May 17, 2026 16:13

jramos merged commit 8a8893e into main May 18, 2026
4 checks passed

jramos deleted the feat/closed-loop-followups branch May 18, 2026 01:39
