factory:go: stop marking issues 'done' when the agent wrote nothing by jurgenwerk · Pull Request #4754 · cardstack/boxel

jurgenwerk · 2026-05-11T09:21:52Z

What this fixes

factory:go marked an issue as done even though the agent wrote nothing: no card, no tests, no Spec. The catalog showed a green DONE badge, and the JSON output reported outcome=all_issues_done. Observed on a real test-58 run.

Why the agent under-delivered

On the SN-1 iteration, the agent ran for about 70 seconds, made one bash call, edited the issue file once, and called signal_done. No card definition, no tests, no Spec, no instances. The session closed cleanly — no network error, no credit gate hit, no crash. The agent just decided that was enough.

This is the long-horizon-agent failure mode the orchestrator has to be robust against. We see the tool calls the agent made; we don't see the reasoning chain that led there. The same prompts and skills produced a complete sticky note implementation on test-55 a day earlier and on test-55-claude / test-58-claude this week — so it isn't a deterministic prompt bug. It's the kind of under-delivery you have to plan for in any agent loop: rare, hard to reproduce, harder to engineer away upstream.

Improving the agent's reliability is its own track of work (prompts, skill content, evals, model selection). Until that lands, the orchestrator should not accept signal_done on its face when the workspace contradicts it. This PR is that backstop.

Why the orchestrator accepted it

Each validator only asks one question: "are the files I'm responsible for broken?" No .gts → lint honestly returns "passed, nothing to lint." Same for parse, eval, test, instantiate. Five truthful "nothing to validate" answers roll up into one validationResults.passed = true.

The orchestrator's done-check is signal_done + validation passed + sync ok. All three were true. The issue gets marked done. Nobody is asking "should there have been a card here?" The validators don't know what the ticket wanted; the orchestrator only reads the boolean.
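The roll-up described above can be reproduced in a few lines. This is an illustrative sketch only: `FileCheck`, `StepResult`, `runStep`, and `aggregate` are hypothetical names, not the repository's validator API. Only the done-gate expression is taken from this PR's commit message.

```typescript
// Hypothetical types for illustration; the real validator shapes may differ.
interface FileCheck {
  path: string;
  broken: boolean;
}

interface StepResult {
  step: string;
  passed: boolean;
  itemsChecked: number;
}

// Each step answers one question: "are the files I'm responsible for broken?"
// Array.prototype.every returns true for an empty array, so no files means "passed".
function runStep(step: string, files: FileCheck[]): StepResult {
  return {
    step,
    passed: files.every((f) => !f.broken),
    itemsChecked: files.length,
  };
}

// The overall result is the conjunction of the per-step booleans.
function aggregate(steps: StepResult[]): { passed: boolean } {
  return { passed: steps.every((s) => s.passed) };
}

// The agent wrote nothing, so every step receives an empty input set.
const stepNames = ['lint', 'parse', 'eval', 'test', 'instantiate'];
const steps = stepNames.map((stepName) => runStep(stepName, []));
const validationResults = aggregate(steps);

// Done-gate as described in the commit message.
const agentSignaledDone = true;
const syncFailed = false;
const markedDone = agentSignaledDone && validationResults.passed && !syncFailed;

console.log(markedDone); // true
```

Every step is individually truthful, yet the gate opens, because nothing in the conjunction encodes "something should have been checked".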

The fix

After validation runs, if it's a non-bootstrap issue and every validator reports zero items checked, force the result to failed. The agent gets a clear "you wrote nothing, write the files the ticket asked for" message on the next iteration. If it never recovers, the existing max-iterations path marks the issue blocked — not silently done.

Bootstrap is exempt (it correctly produces only tracker cards, no .gts).
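A minimal sketch of what such a guard could look like. The count field names (`filesChecked`, `modulesChecked`, `cardsChecked`, `passedCount`, `failedCount`, `skippedCount`) and the message text come from this PR's commit message. Everything else is assumed for illustration, including the `steps` array, the `applyVacuousPassGuard` name, and the return shape, and may not match the actual issue-loop code.

```typescript
// Count fields are named in the commit message; the surrounding shapes are assumed.
interface StepDetails {
  filesChecked?: number;
  modulesChecked?: number;
  cardsChecked?: number;
  passedCount?: number;
  failedCount?: number;
  skippedCount?: number;
}

interface ValidationStep {
  step: string;
  passed: boolean;
  details?: StepDetails;
}

interface ValidationResults {
  passed: boolean;
  steps: ValidationStep[];
}

interface GuardOutcome {
  results: ValidationResults;
  validationContext?: string;
}

const NOTHING_WRITTEN_MESSAGE =
  'validation failed: every step said nothing to validate; you must write the ' +
  'card definition, test, Spec, and at least one sample instance before signaling done.';

// Sum whichever counters a step reports; missing counters count as zero.
function itemsChecked(details: StepDetails = {}): number {
  return (
    (details.filesChecked ?? 0) +
    (details.modulesChecked ?? 0) +
    (details.cardsChecked ?? 0) +
    (details.passedCount ?? 0) +
    (details.failedCount ?? 0) +
    (details.skippedCount ?? 0)
  );
}

// Hypothetical guard: returns the (possibly overridden) results plus agent feedback.
function applyVacuousPassGuard(results: ValidationResults, isBootstrap: boolean): GuardOutcome {
  // Bootstrap issues produce only tracker cards, so an empty validation is expected.
  if (isBootstrap) return { results };

  const everyStepVacuous = results.steps.every((s) => itemsChecked(s.details) === 0);
  if (!everyStepVacuous) return { results };

  // Every step checked nothing: override the aggregated pass and explain why.
  return {
    results: { ...results, passed: false },
    validationContext: NOTHING_WRITTEN_MESSAGE,
  };
}

const vacuous: ValidationResults = {
  passed: true,
  steps: [
    { step: 'lint', passed: true, details: { filesChecked: 0 } },
    { step: 'test', passed: true, details: { passedCount: 0, failedCount: 0, skippedCount: 0 } },
  ],
};

console.log(applyVacuousPassGuard(vacuous, false).results.passed); // false
console.log(applyVacuousPassGuard(vacuous, true).results.passed); // true
```

The sketch returns a new results object rather than mutating the input, which keeps the guard easy to test in isolation; the real code may well assign `validationResults.passed = false` directly.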

Tests

Three cases in tests/issue-loop.test.ts:

Non-bootstrap + nothing written → blocked (the bug, reproduced)
Bootstrap + nothing-to-validate → still done (exempt)
Non-bootstrap + real work → still done (no regression)
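The three cases can be illustrated against a toy loop. This is a simplified stand-in, not the code in tests/issue-loop.test.ts: `runIssueLoop`, `Iteration`, and the single `itemsChecked` counter are hypothetical, and the sync check is omitted for brevity.

```typescript
type Outcome = 'done' | 'blocked';

interface Validation {
  passed: boolean;
  itemsChecked: number; // hypothetical roll-up of the per-step counts
}

interface Iteration {
  agentSignaledDone: boolean;
  validation: Validation;
}

interface LoopOptions {
  isBootstrap: boolean;
  maxIterations: number;
  iterate: () => Iteration;
}

// Toy issue loop: retries until the done-gate opens or the iteration budget runs out.
function runIssueLoop(opts: LoopOptions): Outcome {
  for (let i = 0; i < opts.maxIterations; i++) {
    const { agentSignaledDone, validation } = opts.iterate();
    // Guard: a pass with zero items checked does not count on non-bootstrap issues.
    const vacuous = !opts.isBootstrap && validation.itemsChecked === 0;
    const passed = validation.passed && !vacuous;
    if (agentSignaledDone && passed) return 'done';
  }
  // Budget exhausted without a real pass.
  return 'blocked';
}

const vacuousPass = (): Iteration => ({
  agentSignaledDone: true,
  validation: { passed: true, itemsChecked: 0 },
});

const realPass = (): Iteration => ({
  agentSignaledDone: true,
  validation: { passed: true, itemsChecked: 4 },
});

const outcomes = {
  nonBootstrapNothingWritten: runIssueLoop({ isBootstrap: false, maxIterations: 3, iterate: vacuousPass }),
  bootstrapNothingToValidate: runIssueLoop({ isBootstrap: true, maxIterations: 3, iterate: vacuousPass }),
  nonBootstrapRealWork: runIssueLoop({ isBootstrap: false, maxIterations: 3, iterate: realPass }),
};

console.log(outcomes);
// { nonBootstrapNothingWritten: 'blocked',
//   bootstrapNothingToValidate: 'done',
//   nonBootstrapRealWork: 'done' }
```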

pnpm test:node: 333/333 tests pass.

Why target `cs-11034` instead of `main`

To actually exercise this end-to-end you need PR #4653's prompt fixes. Once #4653 merges, GitHub retargets this PR to main automatically.

🤖 Generated with Claude Code

CS-11102. When all five validator steps report "nothing to validate" on a non-bootstrap issue, the orchestrator now treats that as a failed validation instead of trusting the aggregated `passed: true`.

The bug: validators correctly return `passed: true` with zero items checked when their input set is empty ("no .gts → nothing to lint"), but the pipeline aggregates five vacuous passes into one overall `passed: true`. An agent that calls `signal_done` without writing any of the .gts / .test.gts / Spec / instance artifacts the ticket requires then satisfies the done-gate (`agentSignaledDone && validationResults.passed && !syncFailed`) and the issue is marked done with nothing actually implemented. Observed on a real test-58 factory:go run.

The fix lives in the issue-loop right after validation completes:

1. Skip bootstrap issues: they legitimately produce only tracker cards (Issue / Project / KnowledgeArticle) and have no .gts / test / Spec, so a vacuous validation pass is expected and correct for them.
2. Inspect each step's per-step counts (`details.filesChecked`, `modulesChecked`, `cardsChecked`, `passedCount + failedCount + skippedCount`). If every step reports zero, override `validationResults.passed = false` and inject a clear `validationContext` for the agent's next iteration: "validation failed: every step said nothing to validate; you must write the card definition, test, Spec, and at least one sample instance before signaling done."

The existing max-iterations path takes over from there: if the agent never recovers across the configured iteration budget, the issue ends up cleanly `blocked` with a comment instead of silently `done`.

Tests cover three cases:

- Non-bootstrap + signal_done + vacuous pass on every iteration → guard fires repeatedly, issue exits `blocked` (not `done`).
- Bootstrap + signal_done + vacuous pass → issue marks `done` (guard exempts bootstrap).
- Non-bootstrap + signal_done + non-vacuous pass → issue marks `done` (sanity check: real passes still work).

Also touches `makePassingValidation()` / `makeFailingValidation()` fixtures in the test file to include realistic `details` counts; real validators always populate these, and without them the new guard would (incorrectly) fire on every existing happy-path test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jurgenwerk wants to merge 1 commit into cs-11034-software-factory-replace-openrouter-backend-with-opencode from cs-11102-fail-vacuous-validation
