feat: add test-both override for empirical deadlock resolution #1

Open
claytona500 wants to merge 1 commit into peteromallet:main from claytona500:feat/test-both-override

Summary

When the critique loop hits ESCALATE, the current options are add-note, force-proceed, or abort — all of which punt the decision to the human without evidence. This PR adds a test-both override that breaks deadlocks empirically:

  • Invokes a judge agent to evaluate the current plan (approach A) against an alternative approach (approach B) that addresses the unresolved flags
  • The judge renders a structured verdict: approach_a, approach_b, or synthesis
  • The verdict determines the next step: approach_a wins → gate, approach_b/synthesis → integrate with the judge's recommendations
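
The verdict-to-next-step mapping described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the `Verdict` enum and the `next_step_for` helper are hypothetical names, and only the three verdict values and the `gate`/`integrate` targets come from the description.

```python
from enum import Enum


class Verdict(str, Enum):
    """The three verdict values named in the PR description."""
    APPROACH_A = "approach_a"
    APPROACH_B = "approach_b"
    SYNTHESIS = "synthesis"


def next_step_for(verdict: str) -> str:
    """Map a judge verdict to the next step (hypothetical helper).

    approach_a keeps the current plan, so the plan moves on to gate.
    approach_b and synthesis both change the plan, so they go to integrate.
    """
    parsed = Verdict(verdict)  # raises ValueError on an unknown verdict
    if parsed is Verdict.APPROACH_A:
        return "gate"
    return "integrate"
```

Parsing the raw string through the enum first means a malformed judge output fails loudly instead of silently falling through to `integrate`.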

Motivation

Inspired by adversarial convergence patterns where competing approaches are tested empirically rather than debated endlessly. When the same critique concerns recur across iterations and neither force-proceed nor add-note resolves the impasse, test-both gives the orchestrator an evidence-based path forward.

Changes

  • schemas.py — New test-both.json schema with structured approach comparison and verdict enum
  • prompts.py — Judge prompt that evaluates both approaches against unresolved flags
  • workers.py — Mock worker, schema filename mapping, session key for test-both step
  • _core.py — Default agent routing (claude) for test-both
  • cli.py — _override_test_both handler with full state machine integration; updated infer_next_steps and argparse choices
  • instructions.md — Documentation for the new override option
  • tests/test_test_both.py — 15 new tests covering all verdict paths, state transitions, schema, and mock
  • tests/test_megaplan.py, tests/test_schemas.py — Updated existing parametrized tests

Usage

megaplan override test-both --plan <name> --reason "critique loop stagnated"
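
For readers unfamiliar with how such a command line is typically wired, here is a minimal argparse sketch that accepts the invocation above. It is an assumption about the general shape, not the actual contents of cli.py: `build_parser` and `OVERRIDE_ACTIONS` are hypothetical names, and whether `--reason` is optional is a guess.

```python
import argparse

# Override actions named in the PR description, plus the new test-both.
OVERRIDE_ACTIONS = ["add-note", "force-proceed", "abort", "test-both"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for `megaplan override <action> --plan ... --reason ...`."""
    parser = argparse.ArgumentParser(prog="megaplan")
    subparsers = parser.add_subparsers(dest="command", required=True)

    override = subparsers.add_parser("override")
    # argparse rejects any action that is not listed in choices.
    override.add_argument("action", choices=OVERRIDE_ACTIONS)
    override.add_argument("--plan", required=True)
    override.add_argument("--reason", default="")  # assumed optional
    return parser


args = build_parser().parse_args(
    ["override", "test-both", "--plan", "demo", "--reason", "critique loop stagnated"]
)
```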

Test plan

  • All 15 new tests pass
  • Full suite of 289 tests passes (274 existing + 15 new)
  • test-both only available from EVALUATED state with ESCALATE/ABORT recommendation
  • All three verdict paths (approach_a, approach_b, synthesis) produce correct state transitions
  • History entry, override metadata, and artifacts written correctly
  • Manual test with real agents on a stagnated plan

🤖 Generated with Claude Code

When the critique loop stagnates (ESCALATE), the only options today are
add-note, force-proceed, or abort — all of which punt the decision to the
human without evidence. This adds a test-both override that invokes a judge
agent to evaluate the current plan against an alternative approach, then
renders a verdict (approach_a, approach_b, or synthesis) based on empirical
assessment.

Changes:
- New test-both.json schema for structured judge output
- Judge prompt in prompts.py that evaluates both approaches against
  unresolved flags
- _override_test_both handler in cli.py with full state machine integration
- Mock worker for test-both in workers.py
- Default agent routing (claude) in _core.py
- Updated infer_next_steps to surface test-both for ESCALATE/ABORT
- Documentation in instructions.md
- 15 new tests covering all verdict paths, state transitions, and schema

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>