factory:go: stop marking issues 'done' when the agent wrote nothing #4754

Draft
jurgenwerk wants to merge 1 commit into cs-11034-software-factory-replace-openrouter-backend-with-opencode from cs-11102-fail-vacuous-validation

jurgenwerk commented May 11, 2026

What this fixes

factory:go marked an issue as done even though the agent wrote no card, no tests, no Spec: nothing. The catalog showed a green DONE badge and the JSON output reported outcome=all_issues_done. This was observed on a real test-58 run.

Why the agent under-delivered

On the SN-1 iteration, the agent ran for about 70 seconds, made one bash call, edited the issue file once, and called signal_done. No card definition, no tests, no Spec, no instances. The session closed cleanly — no network error, no credit gate hit, no crash. The agent just decided that was enough.

This is the long-horizon-agent failure mode the orchestrator has to be robust against. We see the tool calls the agent made; we don't see the reasoning chain that led there. The same prompts and skills produced a complete sticky note implementation on test-55 a day earlier and on test-55-claude / test-58-claude this week — so it isn't a deterministic prompt bug. It's the kind of under-delivery you have to plan for in any agent loop: rare, hard to reproduce, harder to engineer away upstream.

Improving the agent's reliability is its own track of work (prompts, skill content, evals, model selection). Until that lands, the orchestrator should not accept signal_done on its face when the workspace contradicts it. This PR is that backstop.

Why the orchestrator accepted it

Each validator asks only one question: "are the files I'm responsible for broken?" With no .gts files, lint honestly returns "passed, nothing to lint." The same goes for parse, eval, test, and instantiate. Five truthful "nothing to validate" answers roll up into one validationResults.passed = true.

The orchestrator's done-check is signal_done + validation passed + sync ok. All three were true. The issue gets marked done. Nobody is asking "should there have been a card here?" The validators don't know what the ticket wanted; the orchestrator only reads the boolean.
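The roll-up described above can be illustrated with a minimal sketch. Only `passed`, `agentSignaledDone`, and `syncFailed` are names taken from this PR; the `StepResult` shape, `runStep`, `aggregate`, and `isDone` are hypothetical stand-ins, not the orchestrator's actual code.

```typescript
// Hypothetical shapes for illustration only.
interface StepResult {
  step: string;
  passed: boolean;
  itemsChecked: number; // stand-in for the real per-step counts
}

// A validator passes when nothing it checked is broken,
// so an empty input set passes as well.
function runStep(
  step: string,
  files: string[],
  isBroken: (file: string) => boolean,
): StepResult {
  const broken = files.filter(isBroken);
  return { step, passed: broken.length === 0, itemsChecked: files.length };
}

// The roll-up only ANDs the booleans; it never looks at how much was checked.
function aggregate(steps: StepResult[]): { passed: boolean; steps: StepResult[] } {
  return { passed: steps.every((s) => s.passed), steps };
}

// The done-check: signal_done + validation passed + sync ok.
function isDone(
  agentSignaledDone: boolean,
  validationPassed: boolean,
  syncFailed: boolean,
): boolean {
  return agentSignaledDone && validationPassed && !syncFailed;
}

// An empty workspace: every step has nothing to check.
const stepNames = ["lint", "parse", "eval", "test", "instantiate"];
const emptyWorkspace = aggregate(
  stepNames.map((name) => runStep(name, [], () => false)),
);
const markedDone = isDone(true, emptyWorkspace.passed, false);
```

In this sketch `markedDone` ends up `true` even though every step checked zero items, which is the gap this PR closes.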

The fix

After validation runs, if it's a non-bootstrap issue and every validator reports zero items checked, force the result to failed. The agent gets a clear "you wrote nothing, write the files the ticket asked for" message on the next iteration. If it never recovers, the existing max-iterations path marks the issue blocked — not silently done.

Bootstrap is exempt (it correctly produces only tracker cards, no .gts).

Tests

Three cases in tests/issue-loop.test.ts:

  • Non-bootstrap + nothing written → blocked (the bug, reproduced)
  • Bootstrap + nothing-to-validate → still done (exempt)
  • Non-bootstrap + real work → still done (no regression)

pnpm test:node: 333/333 pass.

Why target cs-11034 instead of main

To exercise this end-to-end you need PR #4653's prompt fixes. Once #4653 merges and its branch is deleted, GitHub retargets this PR to main automatically.

🤖 Generated with Claude Code

CS-11102.

When all five validator steps report "nothing to validate" on a
non-bootstrap issue, the orchestrator now treats that as a failed
validation instead of trusting the aggregated `passed: true`.

The bug: validators correctly return `passed: true` with zero items
checked when their input set is empty ("no .gts → nothing to lint"),
but the pipeline aggregates five vacuous passes into one overall
`passed: true`. An agent that calls `signal_done` without writing
any of the .gts / .test.gts / Spec / instance artifacts the ticket
requires then satisfies the done-gate (`agentSignaledDone &&
validationResults.passed && !syncFailed`) and the issue is marked
done with nothing actually implemented. Observed on a real test-58
factory:go run.

The fix lives in the issue-loop right after validation completes:

1. Skip bootstrap issues — they legitimately produce only tracker
   cards (Issue / Project / KnowledgeArticle) and have no .gts /
   test / Spec, so a vacuous validation pass is expected and
   correct for them.
2. Inspect each step's counts (`details.filesChecked`,
   `modulesChecked`, `cardsChecked`, `passedCount + failedCount +
   skippedCount`). If every step reports zero, override
   `validationResults.passed = false` and inject a clear
   `validationContext` for the agent's next iteration: "validation
   failed: every step said nothing to validate; you must write the
   card definition, test, Spec, and at least one sample instance
   before signaling done."
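A minimal sketch of the two steps above, assuming a simplified result shape. The count field names and the message text come from this commit message; the interfaces, `itemsChecked`, and `applyVacuousGuard` are hypothetical and may not match the real issue-loop code.

```typescript
// Count fields named in the commit message; everything else is assumed.
interface StepDetails {
  filesChecked?: number;
  modulesChecked?: number;
  cardsChecked?: number;
  passedCount?: number;
  failedCount?: number;
  skippedCount?: number;
}

interface ValidationStep {
  step: string;
  passed: boolean;
  details?: StepDetails;
}

interface ValidationResults {
  passed: boolean;
  steps: ValidationStep[];
}

// Sum every count a step may report; missing fields count as zero.
function itemsChecked(d: StepDetails = {}): number {
  return (
    (d.filesChecked ?? 0) +
    (d.modulesChecked ?? 0) +
    (d.cardsChecked ?? 0) +
    (d.passedCount ?? 0) +
    (d.failedCount ?? 0) +
    (d.skippedCount ?? 0)
  );
}

const VACUOUS_MESSAGE =
  "validation failed: every step said nothing to validate; you must write " +
  "the card definition, test, Spec, and at least one sample instance " +
  "before signaling done.";

// Hypothetical guard: bootstrap issues are exempt; otherwise an all-zero
// validation run is turned into a failure with a message for the agent.
function applyVacuousGuard(
  results: ValidationResults,
  isBootstrap: boolean,
): { results: ValidationResults; validationContext?: string } {
  if (isBootstrap) return { results };
  const vacuous = results.steps.every((s) => itemsChecked(s.details) === 0);
  if (!vacuous) return { results };
  return {
    results: { ...results, passed: false },
    validationContext: VACUOUS_MESSAGE,
  };
}
```

Note that in this sketch a step with no `details` at all counts as zero items checked, so test fixtures need realistic counts to avoid tripping the guard.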

The existing max-iterations path takes over from there — if the
agent never recovers across the configured iteration budget, the
issue ends up cleanly `blocked` with a comment instead of silently
`done`.
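How the guard interacts with the iteration budget can be sketched as a simple loop. This is an illustration under assumptions: `runIssueLoop`, `IterationOutcome`, and the `runIteration` callback are hypothetical, and the real loop also posts a comment and handles cases not modeled here.

```typescript
type IssueStatus = "done" | "blocked";

interface IterationOutcome {
  agentSignaledDone: boolean;
  validationPassed: boolean; // value after the vacuous-pass guard has run
  syncFailed: boolean;
}

// Hypothetical loop: `runIteration` stands in for one agent turn plus
// validation and sync. The issue is done only if the done-gate holds;
// otherwise it is blocked once the budget is exhausted.
function runIssueLoop(
  maxIterations: number,
  runIteration: (iteration: number) => IterationOutcome,
): { status: IssueStatus; iterations: number } {
  for (let i = 1; i <= maxIterations; i++) {
    const outcome = runIteration(i);
    if (
      outcome.agentSignaledDone &&
      outcome.validationPassed &&
      !outcome.syncFailed
    ) {
      return { status: "done", iterations: i };
    }
  }
  return { status: "blocked", iterations: maxIterations };
}

// An agent that keeps signaling done with nothing written never passes.
const neverRecovers = runIssueLoop(3, () => ({
  agentSignaledDone: true,
  validationPassed: false,
  syncFailed: false,
}));

// An agent that writes the files on its second attempt.
const recovers = runIssueLoop(3, (i) => ({
  agentSignaledDone: true,
  validationPassed: i >= 2,
  syncFailed: false,
}));
```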

Tests cover three cases:
- Non-bootstrap + signal_done + vacuous pass on every iteration →
  guard fires repeatedly, issue exits `blocked` (not `done`).
- Bootstrap + signal_done + vacuous pass → issue marks `done`
  (guard exempts bootstrap).
- Non-bootstrap + signal_done + non-vacuous pass → issue marks
  `done` (sanity check: real passes still work).

Also touches `makePassingValidation()` / `makeFailingValidation()`
fixtures in the test file to include realistic `details` counts —
real validators always populate these, and without them the new
guard would (incorrectly) fire on every existing happy-path test.
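To see why the fixtures needed counts, here is a small sketch, assuming the guard treats a missing `details` object as zero items checked. The shapes and the names `FixtureStep`, `countChecked`, and `looksVacuous` are hypothetical; these are not the actual `makePassingValidation()` fixtures.

```typescript
// Hypothetical, trimmed-down shapes for illustration.
interface FixtureDetails {
  filesChecked?: number;
  passedCount?: number;
  failedCount?: number;
  skippedCount?: number;
}

interface FixtureStep {
  step: string;
  passed: boolean;
  details?: FixtureDetails;
}

// Missing details or missing fields are treated as zero (an assumption).
function countChecked(step: FixtureStep): number {
  const d = step.details ?? {};
  return (
    (d.filesChecked ?? 0) +
    (d.passedCount ?? 0) +
    (d.failedCount ?? 0) +
    (d.skippedCount ?? 0)
  );
}

function looksVacuous(steps: FixtureStep[]): boolean {
  return steps.every((s) => countChecked(s) === 0);
}

// A fixture with only booleans is indistinguishable from an empty workspace.
const bareFixture: FixtureStep[] = [
  { step: "lint", passed: true },
  { step: "test", passed: true },
];

// A fixture with counts reads as real work.
const realisticFixture: FixtureStep[] = [
  { step: "lint", passed: true, details: { filesChecked: 2 } },
  {
    step: "test",
    passed: true,
    details: { passedCount: 3, failedCount: 0, skippedCount: 0 },
  },
];
```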

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>