fix(sdlc): teach cross-model review to use Bash background-mode (closes #364)#366
Merged
Merged
Conversation
#364) The SDLC skill's Cross-Model Review section told Claude that "xhigh runs take 1-5 min" and didn't say to use `run_in_background: true` on the Bash tool. The Bash tool clamps `timeout` to 600000 ms (10 min) and silently force-kills foreground codex at the wall — even if Claude passes `timeout: 2400000` under the reasonable assumption that the harness honors it. Real session (2026-05-27): a foreground codex on a multi-artifact bundle got killed at the 10-min wall, the Stop hook re-invoked Claude 9 times before the harness `CLAUDE_CODE_STOP_HOOK_BLOCK_CAP` safeguard fired. Total burn: ~70 min of compute on what would have been a single 7-min review. Changes: - `skills/sdlc/SKILL.md` — flip the run-reviewer line to mandate `run_in_background: true` + `dangerouslyDisableSandbox: true` and explain why (`< /dev/null` prevents stdin-hang, background avoids the 10-min cap). Also compressed three sibling paragraphs to stay under the 5K-token cap (audit was 4999 → 5099 after add; now 4950, 50-token buffer). - `CLAUDE_CODE_SDLC_WIZARD.md` — two sibling callouts (one per Cross-Model Review section), with the general rule: "any long-running wrapper invoked through the CC Bash tool — codex, slow builds, long test suites — should use `run_in_background: true` unconditionally and let the wrapper's own stall watchdog be the timeout authority." - `tests/test-doc-consistency.sh` — new quality test asserts both SKILL.md and wizard doc carry the `run_in_background: true` guidance AND cite the 10-min cap reason. Mutation: removing either string from either file flips the test to FAIL. Verified locally: test-doc-consistency 41/41, test-docs-usability 29/29, test-audit-session-load 10/10 (SKILL.md back under 5K cap), test-hooks 156/156.
PR #366 trims to lines 138 + 144 saved tokens but dropped three strings that test-self-update.sh greps for: - "verification checklist" (space) — I had collapsed to JSON key form `"verification_checklist"` only. Test pattern is case-insensitive with the literal space form. - "respond to each" / "each.*review" — I had compressed to "respond per-reviewer". The test pattern uses BRE where `?` is literal, so per.?review only matches per?review / per-review literally as per + any-char + ?. The "respond to each" alternative is the safer match. - "non-code domains" — I had trimmed to "Non-code:". The pattern is non-code.*domain so the suffix is load-bearing. Net char change: +40 chars (4960 tokens, 40-token buffer under the 5K cap). Verified locally: test-self-update 153/153, test-audit-session-load 10/10, test-doc-consistency 41/41, test-hooks 156/156.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The SDLC skill's Cross-Model Review section told Claude that "xhigh runs take 1-5 min" and didn't say to use
run_in_background: trueon the Bash tool. The Bash tool clampstimeoutto 600000 ms (10 min) and silently force-kills foreground codex at the wall — even if Claude passestimeout: 2400000under the reasonable assumption the harness honors it.Real session (2026-05-27, fixbot-audit repo): foreground codex on a multi-artifact bundle got killed at the 10-min wall; the Stop hook re-invoked Claude 9 times before the harness
CLAUDE_CODE_STOP_HOOK_BLOCK_CAPsafeguard fired. Total burn: ~70 min of compute on what would have been a single 7-min review. The same prompt fired again withrun_in_background: truecompleted in ~7 min.Changes
skills/sdlc/SKILL.mdrun_in_background: true+dangerouslyDisableSandbox: true; explains why (< /dev/nullprevents stdin-hang at S/0% CPU; background avoids 10-mintimeoutcap). Bundle time corrected from "1–5 min" to "1–30 min" (single ≤5; bundle 5–30). Three sibling paragraphs compressed to keep SKILL.md under the 5K-token cap (#236: 4999 → 5099 after my add → 4950 after trim).CLAUDE_CODE_SDLC_WIZARD.md< /dev/nullcallouts. Includes the general rule: "any long-running wrapper invoked through the CC Bash tool — codex, slow builds, long test suites — should userun_in_background: trueunconditionally and let the wrapper's own stall watchdog be the timeout authority."tests/test-doc-consistency.shtest_codex_review_guidance_teaches_background_modequality test asserts both SKILL.md and wizard doc carryrun_in_background: trueAND cite the 10-min cap reason (600000/10 min/10-min). Mutation: removing either string from either file flips test to FAIL.Closes
Test plan
tests/test-doc-consistency.sh— 41/41 (new [bug] SDLC codex-review guidance lets foreground Bash kill xhigh runs at 10-min cap #364 test included)tests/test-docs-usability.sh— 29/29tests/test-audit-session-load.sh— 10/10 (SKILL.md back under 5K cap, 50-token buffer)tests/test-hooks.sh— 156/156 (pr_numberschema coverage maintained after the trim)validateWhy this matters
The wizard is the recommendation engine for downstream consumer repos. Every consumer that runs
/sdlccross-model review on a multi-artifact bundle was at risk of the same 70-min burn pattern. Closing the guidance gap upstream is the lowest-cost fix.