factory:go: stop marking issues 'done' when the agent wrote nothing #4754
Draft
jurgenwerk wants to merge 1 commit into cs-11034-software-factory-replace-openrouter-backend-with-opencode from
CS-11102.
When all five validator steps report "nothing to validate" on a
non-bootstrap issue, the orchestrator now treats that as a failed
validation instead of trusting the aggregated `passed: true`.
The bug: validators correctly return `passed: true` with zero items
checked when their input set is empty ("no .gts → nothing to lint"),
but the pipeline aggregates five vacuous passes into one overall
`passed: true`. An agent that calls `signal_done` without writing
any of the .gts / .test.gts / Spec / instance artifacts the ticket
requires then satisfies the done-gate (`agentSignaledDone &&
validationResults.passed && !syncFailed`) and the issue is marked
done with nothing actually implemented. Observed on a real test-58
factory:go run.
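
A minimal sketch of how the vacuous passes roll up, assuming a simplified result shape. The `StepResult` type and the `aggregate` helper are hypothetical stand-ins, not the orchestrator's real code; the step names follow the PR description and the gate expression is the one quoted above.

```typescript
// Hypothetical shape: only `passed` and the `details` counts are named in this
// message; everything else is an assumption for illustration.
interface StepResult {
  step: string;
  passed: boolean;
  details: { filesChecked: number };
}

// Assumed aggregation: the overall result passes iff every step passed.
function aggregate(steps: StepResult[]): { passed: boolean; steps: StepResult[] } {
  return { passed: steps.every((s) => s.passed), steps };
}

// Five validators, each facing an empty input set.
const vacuousSteps: StepResult[] = ["parse", "lint", "eval", "test", "instantiate"].map(
  (step) => ({ step, passed: true, details: { filesChecked: 0 } }),
);

const validationResults = aggregate(vacuousSteps);
const agentSignaledDone = true;
const syncFailed = false;

// The done-gate is satisfied even though no step checked anything.
const markedDone = agentSignaledDone && validationResults.passed && !syncFailed;
console.log(markedDone); // true
```
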
The fix lives in the issue-loop right after validation completes:
1. Skip bootstrap issues — they legitimately produce only tracker
cards (Issue / Project / KnowledgeArticle) and have no .gts /
test / Spec, so a vacuous validation pass is expected and
correct for them.
2. Inspect each step's per-step counts (`details.filesChecked`,
`modulesChecked`, `cardsChecked`, `passedCount + failedCount +
skippedCount`). If every step reports zero, override
`validationResults.passed = false` and inject a clear
`validationContext` for the agent's next iteration: "validation
failed: every step said nothing to validate; you must write the
card definition, test, Spec, and at least one sample instance
before signaling done."
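
The two steps above can be sketched roughly as follows. The count field names and the message text come from this commit message; the type names, `itemsChecked`, `applyVacuousGuard`, and the `isBootstrap` parameter are hypothetical, and the real issue-loop code may be structured differently.

```typescript
// Count fields named in the commit message; all optional because each
// validator is assumed to populate only the ones relevant to it.
interface StepDetails {
  filesChecked?: number;
  modulesChecked?: number;
  cardsChecked?: number;
  passedCount?: number;
  failedCount?: number;
  skippedCount?: number;
}

interface StepResult {
  step: string;
  passed: boolean;
  details?: StepDetails;
}

interface ValidationResults {
  passed: boolean;
  steps: StepResult[];
}

// Total number of items a single step actually looked at.
function itemsChecked(d: StepDetails = {}): number {
  return (
    (d.filesChecked ?? 0) +
    (d.modulesChecked ?? 0) +
    (d.cardsChecked ?? 0) +
    (d.passedCount ?? 0) +
    (d.failedCount ?? 0) +
    (d.skippedCount ?? 0)
  );
}

const VACUOUS_MESSAGE =
  "validation failed: every step said nothing to validate; you must write the " +
  "card definition, test, Spec, and at least one sample instance before signaling done.";

// Hypothetical guard: returns the (possibly overridden) results plus the
// context to feed into the agent's next iteration.
function applyVacuousGuard(
  results: ValidationResults,
  isBootstrap: boolean,
): { results: ValidationResults; validationContext?: string } {
  // Step 1: bootstrap issues legitimately produce only tracker cards.
  if (isBootstrap) return { results };
  // Step 2: fire only when every step reports zero items checked.
  const vacuous = results.steps.every((s) => itemsChecked(s.details) === 0);
  if (!vacuous) return { results };
  return {
    results: { ...results, passed: false },
    validationContext: VACUOUS_MESSAGE,
  };
}
```

Note that `every` returns `true` for an empty `steps` array, so in this sketch a run with no steps at all also counts as vacuous.
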
The existing max-iterations path takes over from there — if the
agent never recovers across the configured iteration budget, the
issue ends up cleanly `blocked` with a comment instead of silently
`done`.
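
As a rough illustration of that hand-off, the loop below models the iteration budget. `runIssueLoop`, `IterationResult`, and the callback are hypothetical simplifications; the real loop also posts the comment and carries `validationContext` forward, which this sketch omits.

```typescript
type Outcome = "done" | "blocked";

// Result of one agent iteration after validation and the vacuous-pass guard.
interface IterationResult {
  agentSignaledDone: boolean;
  validationPassed: boolean;
  syncFailed: boolean;
}

// Hypothetical loop: `runIteration` stands in for "run agent, validate, apply guard".
function runIssueLoop(
  maxIterations: number,
  runIteration: (iteration: number) => IterationResult,
): Outcome {
  for (let i = 1; i <= maxIterations; i++) {
    const r = runIteration(i);
    if (r.agentSignaledDone && r.validationPassed && !r.syncFailed) {
      return "done";
    }
  }
  // Budget exhausted without a real pass: block instead of marking done.
  return "blocked";
}

// An agent that never writes anything: the guard fails validation every time.
const neverRecovers = runIssueLoop(3, () => ({
  agentSignaledDone: true,
  validationPassed: false,
  syncFailed: false,
}));

// An agent that writes the required files on its second attempt.
const recovers = runIssueLoop(3, (i) => ({
  agentSignaledDone: true,
  validationPassed: i >= 2,
  syncFailed: false,
}));

console.log(neverRecovers, recovers); // blocked done
```
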
Tests cover three cases:
- Non-bootstrap + signal_done + vacuous pass on every iteration →
guard fires repeatedly, issue exits `blocked` (not `done`).
- Bootstrap + signal_done + vacuous pass → issue marks `done`
(guard exempts bootstrap).
- Non-bootstrap + signal_done + non-vacuous pass → issue marks
`done` (sanity check: real passes still work).
Also touches `makePassingValidation()` / `makeFailingValidation()`
fixtures in the test file to include realistic `details` counts —
real validators always populate these, and without them the new
guard would (incorrectly) fire on every existing happy-path test.
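
For illustration only, fixtures with non-zero counts might look like the sketch below. The function names match the fixtures mentioned above, but their real signatures and return shapes are not shown in this PR; the mapping of count fields to steps and the specific numbers are assumptions.

```typescript
// Assumed result shape; `details` holds whichever counts a step reports.
interface StepResult {
  step: string;
  passed: boolean;
  details: Record<string, number>;
}

interface ValidationResults {
  passed: boolean;
  steps: StepResult[];
}

// Every step reports at least one item checked, so a zero-count guard stays quiet.
function makePassingValidation(): ValidationResults {
  return {
    passed: true,
    steps: [
      { step: "parse", passed: true, details: { filesChecked: 1 } },
      { step: "lint", passed: true, details: { filesChecked: 1 } },
      { step: "eval", passed: true, details: { modulesChecked: 1 } },
      { step: "test", passed: true, details: { passedCount: 1, failedCount: 0, skippedCount: 0 } },
      { step: "instantiate", passed: true, details: { cardsChecked: 1 } },
    ],
  };
}

// Same shape, but the test step fails on a real (non-vacuous) test run.
function makeFailingValidation(): ValidationResults {
  const results = makePassingValidation();
  results.passed = false;
  results.steps[3] = {
    step: "test",
    passed: false,
    details: { passedCount: 0, failedCount: 1, skippedCount: 0 },
  };
  return results;
}

// Helper for assertions: total items checked across all steps.
function totalChecked(results: ValidationResults): number {
  return results.steps
    .flatMap((s) => Object.values(s.details))
    .reduce((sum, n) => sum + n, 0);
}
```
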
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
What this fixes
`factory:go` marked an issue as done even though the agent wrote no card, no tests, no Spec: nothing. The catalog showed a green DONE badge and `outcome=all_issues_done` in the JSON output. Real test-58 run, observed.
Why the agent under-delivered
On the SN-1 iteration, the agent ran for about 70 seconds, made one `bash` call, edited the issue file once, and called `signal_done`. No card definition, no tests, no Spec, no instances. The session closed cleanly: no network error, no credit gate hit, no crash. The agent just decided that was enough.
This is the long-horizon-agent failure mode the orchestrator has to be robust against. We see the tool calls the agent made; we don't see the reasoning chain that led there. The same prompts and skills produced a complete sticky note implementation on test-55 a day earlier and on test-55-claude / test-58-claude this week, so it isn't a deterministic prompt bug. It's the kind of under-delivery you have to plan for in any agent loop: rare, hard to reproduce, harder to engineer away upstream.
Improving the agent's reliability is its own track of work (prompts, skill content, evals, model selection). Until that lands, the orchestrator should not accept `signal_done` on its face when the workspace contradicts it. This PR is that backstop.
Why the orchestrator accepted it
Each validator only asks one question: "are the files I'm responsible for broken?" No `.gts` → `lint` honestly returns "passed, nothing to lint." Same for parse, eval, test, instantiate. Five truthful "nothing to validate" answers roll up into one `validationResults.passed = true`.
The orchestrator's done-check is `signal_done + validation passed + sync ok`. All three were true. The issue gets marked done. Nobody is asking "should there have been a card here?" The validators don't know what the ticket wanted; the orchestrator only reads the boolean.
The fix
After validation runs, if it's a non-bootstrap issue and every validator reports zero items checked, force the result to `failed`. The agent gets a clear "you wrote nothing, write the files the ticket asked for" message on the next iteration. If it never recovers, the existing max-iterations path marks the issue `blocked`, not silently done.
Bootstrap is exempt (it correctly produces only tracker cards, no `.gts`).
Tests
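Three cases in `tests/issue-loop.test.ts`: a non-bootstrap issue with a vacuous pass exits `blocked`, a bootstrap issue with a vacuous pass is marked `done`, and a non-bootstrap issue with a non-vacuous pass is marked `done`.
333/333 `pnpm test:node` pass.
Why target `cs-11034` instead of `main`
To actually exercise this end-to-end you need PR #4653's prompt fixes. Once #4653 merges, GitHub rebases this onto main automatically.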
🤖 Generated with Claude Code