Skip to content

feat: closed-loop validation follow-ups (agent model override + advanced bug suite)#63

Merged
jramos merged 3 commits into
mainfrom
feat/closed-loop-followups
May 18, 2026
Merged

feat: closed-loop validation follow-ups (agent model override + advanced bug suite)#63
jramos merged 3 commits into
mainfrom
feat/closed-loop-followups

Conversation

@jramos
Copy link
Copy Markdown
Owner

@jramos jramos commented May 17, 2026

Summary

Direct follow-ups to the manual validation finding from #62: a capable reasoning model solved every textbook planted bug regardless of skill text, leaving the closed-loop verdict at 5/5 across the board and hiding the behavioral signal. Three coordinated changes to recover discrimination, plus an honest accounting of what they did and didn't achieve.

Commits:

  1. --closed-loop-agent-model override on HermesAgentRunner — when set, the validator's subprocess invocation becomes hermes -m MODEL -z ..., running the agent against a deliberately weaker model than the user's daily-driver Hermes default without mutating ~/.hermes/config.yaml. Plumbed through both evolve_skill and evolve_tool cache helpers + CLI flags. Default unset → existing behavior bit-for-bit preserved.

  2. systematic_debugging_advanced.jsonl — 5 harder planted bugs designed to discriminate skill-text variants on capable agents. Each exercises a different cognitive failure mode (iteration, state, precision, algorithm, OOP semantics) and breaks edge cases the obvious patch misses. Sanity tests verify every planted bug is real and every documented fix passes, plus a regression guard against task_id collisions with the basic suite.

  3. --closed-loop-task-timeout-seconds knob — bump the per-task wall-clock budget for slow reasoning models (o1/o3-family typically need 200-300s/task). Surfaced by validation: even when the override flag works mechanically, the default 120s budget abstains most tasks before they finish.

Plus: the manual smoke harness at tests/manual/skill_closed_loop_smoke.py grows --suite {basic|advanced}, --agent-model MODEL, and --task-timeout-seconds N flags. docs/workflows.md documents all three knobs along with the empirical caveat below.

Test coverage: 1029 passing (was 1010; +19 new — 8 advanced-suite sanity + 11 plumbing/CLI tests across both paths). All mock-only in CI.

Manual validation outcome

Honest accounting of what was tested with real hermes -z subprocesses against the user's actual setup (gpt-5.4-mini default):

Run Result What it proves
Smoke, basic suite, default model 5/5 baseline → 5/5 evolved (saturation) wiring works on real hermes calls
Smoke, advanced suite, default model 5/5 baseline → 5/5 evolved (saturation) harder bugs don't help against capable reasoning models on this domain
Smoke, advanced suite, --agent-model o3-mini 9 of 10 task runs abstained (timed out at default 120s) flag flows end-to-end; default timeout too short for o3-mini

What this PR ships ≠ what it validates. The --closed-loop-agent-model and --closed-loop-task-timeout-seconds flags are plumbed correctly and the advanced suite is a genuinely harder regression check. But on this user's setup (capable reasoning model default + only o3-mini as a compatible weaker model + 120s default timeout), the closed-loop signal saturates either at 5/5 (capable model) or at abstention (slow model). No "improvement caught" or "regression caught" outcome was produced; only saturation.

The honest interpretation. For this user, Python-debugging-of-textbook-bugs may not be a discriminating evaluation surface for the closed-loop signal regardless of which knob is turned. The flags are useful for future setups (a user with a faster weaker model would get real signal, and the timeout knob lets o3-mini-class models actually finish). The framework's central hypothesis (behavioral wins break judge ties on saturated baselines) remains untested end-to-end and likely needs a different evaluation domain to demonstrate — multi-file refactoring, ambiguous specs with edge cases, iterative hypothesis-testing tasks where methodology matters more than recognition.

Test plan

  • uv run pytest tests/ -q --ignore=tests/manual — 1029 passed
  • CI green on Python 3.10–3.13 (pre-timeout-commit; re-checking with the new commit)
  • Manual smoke, basic suite, default model — 5/5 saturation, wiring confirmed
  • Manual smoke, advanced suite, default model — 5/5 saturation, harder bugs don't discriminate
  • Manual smoke, advanced suite, --agent-model o3-mini — flag works; timeout exposes need for --closed-loop-task-timeout-seconds
  • Future: heavy-budget evolve_skill against a setup that produces non-saturated baseline — requires a different evaluation surface, out of scope for this PR

jramos added 3 commits May 17, 2026 16:13
…unner

Adds an optional ``model`` kwarg to HermesAgentRunner. When set, the
subprocess invocation becomes ``hermes -m MODEL -z ...`` so the agent
runs against a deliberately weaker model than the user's daily-driver
Hermes default, without mutating ~/.hermes/config.yaml.

Surfaced by the manual validation run on the systematic-debugging
suite: a capable reasoning model solved all 5 textbook planted bugs
on the deliberately-weak baseline skill, leaving the closed-loop
verdict at 5/5 across the board and hiding the behavioral signal
the validator is supposed to surface. Swapping the model via
``hermes config set`` collapsed the structured ``model:`` dict
(provider/base_url/api_key) into a bare string and broke the user's
custom-endpoint config, so per-invocation override via the CLI is
the cleaner seam.

Plumbed through the existing cache helpers on both paths:
  evolve_skill: --closed-loop-agent-model → _maybe_build_closed_loop_cache_skill
  evolve_tool:  --closed-loop-agent-model → _maybe_build_closed_loop_cache

The flag is unset by default — existing behavior is bit-for-bit
preserved when callers don't opt in. ``-m MODEL`` is inserted before
``-z`` so hermes parses it as a global flag, not as part of the prompt.
The textbook 5-bug suite at systematic_debugging.jsonl saturates at 5/5
against capable reasoning models — every bug is the kind of thing the
agent solves on first try regardless of what the skill text says. That
zeroes out the behavioral signal closed-loop validation is supposed to
surface.

New suite at systematic_debugging_advanced.jsonl ships 5 bugs that
require structured debugging because the obvious patch only fixes the
failing test, leaving the spec's edge cases broken. Each exercises a
different cognitive failure mode:

  - debug_generator_exhaustion: function iterates input twice; works
    on lists, breaks on single-pass generators
  - debug_shared_mutable_return: cached default returns a live ref;
    caller mutation leaks into the cache
  - debug_float_precision_equality: == on a computed sum; fix requires
    math.isclose with documented tolerance
  - debug_binary_search_boundary: bisect_right code masquerading as
    bisect_left; middle insertion works, leftmost-of-equals fails
  - debug_class_vs_instance_attribute: class-level mutable default;
    instances share state instead of being independent

Sanity tests prove every planted bug is real (baseline test fails) and
every documented fix passes, plus a regression guard that task_ids
don't collide with the basic suite.

Smoke harness grows --suite {basic|advanced} and --agent-model flags so
both wiring sanity and headroom validation are driven by the same
script. Docs updated with the two-knob guidance (suite + agent model)
for recovering discrimination when the default Hermes model saturates.
…ion finding

Quality-of-life knob plumbed through both evolve_skill and evolve_tool:
override HermesAgentRunner's per-task wall-clock budget when the chosen
agent model is slow. Reasoning models other than the smallest typically
take 200-300s per debugging task; the default 120s causes them to
abstain (timeout) silently rather than producing a verdict.

Manual validation finding: both planted-bug suites (basic + advanced)
saturate at 5/5 against capable reasoning models (gpt-5.4-mini on this
setup), and the only confirmed-compatible weaker reasoning model
(o3-mini) is slow enough to abstain most tasks at the default timeout.
The framework's wiring is correct end-to-end, but Python-debugging-of-
textbook-bugs may not discriminate skill-text variants on capable
models regardless of the suite or model swap. Documented in
docs/workflows.md alongside the suggestion to evaluate on surfaces
where methodology matters more than recognition (multi-file refactoring,
ambiguous-spec tasks, iterative hypothesis-testing scenarios).

Smoke harness grows --task-timeout-seconds to match.
@jramos jramos merged commit 8a8893e into main May 18, 2026
4 checks passed
@jramos jramos deleted the feat/closed-loop-followups branch May 18, 2026 01:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant