You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Anthropic released Claude Opus 4.8 on 2026-05-28 (~09:48 PT) — system card published the same day. The wizard currently pins/recommends Opus 4.7 in ~10 files (SDLC.md, CLAUDE_CODE_SDLC_WIZARD.md, skills/sdlc/SKILL.md, skills/setup/SKILL.md, skills/update/SKILL.md, hooks/model-effort-check.sh, cli/lib/repo-complexity.js). A version bump is a real PR, not paperwork — it changes the recommended-model contract every consumer repo inherits.
This issue tracks the decision-to-bump using the wizard's own Prove-It Gate philosophy.
Why we're not bumping immediately
Anthropic's system card is unusually candid about known regressions, and 4 of the 5 named regressions are directly relevant to wizard use:
⚠️"Somewhat less robust than Opus 4.7 in several agentic contexts (vulnerability to prompt injection attacks)" — the wizard is tool-call heavy (UserPromptSubmit + PreToolUse + PreCompact hooks).
⚠️"Deleting files in cases where this is only debatably necessary for the task" — direct risk for cli/init.js init --force and update-wizard Step 6 (overwrite/merge of managed files). Cannot ship a wizard recommendation for a model with a known file-deletion regression on our own primary tool path.
⚠️"Excessive hesitation and early stopping, often pausing in interactive agentic settings to ask unnecessary follow-up questions or (in a strange recurring issue) telling the user to go to bed" — would break /goal flow (long-running agentic loops are the whole point).
⚠️"Concerning hints related to evaluation awareness and a tendency for the model to reason about how its outputs will be graded ... may suggest Opus 4.8 prioritizes the appearance of task success over actual task success" — could mean benchmarks look great but real-world SDLC behavior degrades.
What Anthropic flags as wins
(For balance — these are real if they hold up in practice.)
4× less likely to overlook code flaws in self-review (aligns with our /sdlc self-review step).
"Reckless and destructive actions and over-refusals both substantially reduced."
"Honesty in agentic settings markedly improved."
Pricing unchanged ($5/$25; fast mode $10/$50 — 3× cheaper than 4.7 fast).
Performance "superior to 4.7 across nearly all evaluations."
Decision rule
Bump the wizard's recommended model to claude-opus-4-8 only if all three hold:
Sentiment gate (48–72h after launch): No clustered "feels lazier" / "deletes my files" / "won't finish" / "keeps asking me to confirm trivial things" / "telling me to go to bed" reports across r/ClaudeAI, r/ClaudeCode, r/Anthropic, HN. Day-0 hype does NOT count as signal.
A/B parity gate:tests/e2e/local-shepherd.sh run on at least 3 PRs comparing --model claude-opus-4-7 vs --model claude-opus-4-8 shows overlapping 95% CI on Tier 1+2 scores. (This is exactly the gate we built in ROADMAP Community Digest: Week of 2026-04-23 #212.)
Self-dogfood gate: At least one substantive PR shepherded end-to-end (/sdlc plan → TDD → review → CI green → merge) on 4.8 with no SDLC-relevant regressions encountered.
If any gate fails, hold at 4.7 and re-evaluate at the next Anthropic patch (4.8.1 or 4.9).
Schedule
2026-05-30 or 2026-05-31: Re-run sentiment recon — Reddit JSON API + HN Algolia, looking for clustered regression reports. Auto-close this issue with HOLD if sentiment is net negative; proceed to gate 2 if net positive.
Post-sentiment, if green: Run local-shepherd A/B on 3 representative PRs (one feature, one refactor, one bugfix shape).
Post-A/B, if green: Use 4.8 for one full SDLC arc as the dogfood gate.
Post-dogfood, if green: Open the model-bump PR (~10 files, see scope below).
Also: the cross-model review tier should bump to claude-opus-4-8 max (or equivalent flagship) per ROADMAP #233's flagship-reviewer rule.
Why this matters strategically
The wizard is part of the XDLC ecosystem and installed into many repos. A premature recommendation propagates the file-deletion regression to every consumer. The cautious bump is the dogfood-discipline cost of being a recommended-model authority for downstream users.
Context
Anthropic released Claude Opus 4.8 on 2026-05-28 (~09:48 PT) — system card published the same day. The wizard currently pins/recommends Opus 4.7 in ~10 files (
SDLC.md,CLAUDE_CODE_SDLC_WIZARD.md,skills/sdlc/SKILL.md,skills/setup/SKILL.md,skills/update/SKILL.md,hooks/model-effort-check.sh,cli/lib/repo-complexity.js). A version bump is a real PR, not paperwork — it changes the recommended-model contract every consumer repo inherits.This issue tracks the decision-to-bump using the wizard's own Prove-It Gate philosophy.
Why we're not bumping immediately
Anthropic's system card is unusually candid about known regressions, and 4 of the 5 named regressions are directly relevant to wizard use:
cli/init.js init --forceandupdate-wizardStep 6 (overwrite/merge of managed files). Cannot ship a wizard recommendation for a model with a known file-deletion regression on our own primary tool path./goalflow (long-running agentic loops are the whole point).What Anthropic flags as wins
(For balance — these are real if they hold up in practice.)
/sdlcself-review step).Decision rule
Bump the wizard's recommended model to
claude-opus-4-8only if all three hold:tests/e2e/local-shepherd.shrun on at least 3 PRs comparing--model claude-opus-4-7vs--model claude-opus-4-8shows overlapping 95% CI on Tier 1+2 scores. (This is exactly the gate we built in ROADMAP Community Digest: Week of 2026-04-23 #212.)/sdlcplan → TDD → review → CI green → merge) on 4.8 with no SDLC-relevant regressions encountered.If any gate fails, hold at 4.7 and re-evaluate at the next Anthropic patch (4.8.1 or 4.9).
Schedule
Bump scope (when/if we proceed)
SDLC.mdCLAUDE_CODE_SDLC_WIZARD.mdskills/sdlc/SKILL.mdopus[1m]description + cross-model review tier notesskills/setup/SKILL.mdskills/update/SKILL.mdhooks/model-effort-check.shcli/lib/repo-complexity.jsAlso: the cross-model review tier should bump to
claude-opus-4-8 max(or equivalent flagship) per ROADMAP #233's flagship-reviewer rule.Why this matters strategically
The wizard is part of the XDLC ecosystem and installed into many repos. A premature recommendation propagates the file-deletion regression to every consumer. The cautious bump is the dogfood-discipline cost of being a recommended-model authority for downstream users.
References
tests/e2e/local-shepherd.sh(ROADMAP Community Digest: Week of 2026-04-23 #212, shipped v1.59.0)Filed 2026-05-28 during sentiment recon at user request. Next action 2026-05-30/31.