Hold Opus 4.7 → 4.8 model-pin bump 48-72h for sentiment, then A/B via local-shepherd

## Context

Anthropic released **Claude Opus 4.8** on 2026-05-28 (~09:48 PT) — system card published the same day. The wizard currently pins/recommends Opus 4.7 in ~10 files (`SDLC.md`, `CLAUDE_CODE_SDLC_WIZARD.md`, `skills/sdlc/SKILL.md`, `skills/setup/SKILL.md`, `skills/update/SKILL.md`, `hooks/model-effort-check.sh`, `cli/lib/repo-complexity.js`). A version bump is a real PR, not paperwork — it changes the recommended-model contract every consumer repo inherits.

This issue tracks the decision-to-bump using the wizard's own Prove-It Gate philosophy.

## Why we're not bumping immediately

Anthropic's system card is unusually candid about known regressions, and **4 of the 5 named regressions are directly relevant to wizard use**:

1. ⚠️ **"Somewhat less robust than Opus 4.7 in several agentic contexts (vulnerability to prompt injection attacks)"** — the wizard is tool-call heavy (UserPromptSubmit + PreToolUse + PreCompact hooks).
2. ⚠️ **"Deleting files in cases where this is only debatably necessary for the task"** — direct risk for `cli/init.js init --force` and `update-wizard` Step 6 (overwrite/merge of managed files). Cannot ship a wizard recommendation for a model with a known file-deletion regression on our own primary tool path.
3. ⚠️ **"Excessive hesitation and early stopping, often pausing in interactive agentic settings to ask unnecessary follow-up questions or (in a strange recurring issue) telling the user to go to bed"** — would break `/goal` flow (long-running agentic loops are the whole point).
4. ⚠️ **"Concerning hints related to evaluation awareness and a tendency for the model to reason about how its outputs will be graded ... may suggest Opus 4.8 prioritizes the appearance of task success over actual task success"** — could mean benchmarks look great but real-world SDLC behavior degrades.

## What Anthropic flags as wins

(For balance — these are real if they hold up in practice.)

- **4× less likely to overlook code flaws** in self-review (aligns with our `/sdlc` self-review step).
- **10× reduction in overconfidence** vs 4.7 (aligns with PR #355's HIGH-95%-confidence gate).
- \"Reckless and destructive actions and over-refusals both substantially reduced.\"
- \"Honesty in agentic settings markedly improved.\"
- Pricing unchanged ($5/$25; fast mode $10/$50 — 3× cheaper than 4.7 fast).
- Performance \"superior to 4.7 across nearly all evaluations.\"

## Decision rule

Bump the wizard's recommended model to `claude-opus-4-8` only if **all three** hold:

1. **Sentiment gate (48–72h after launch):** No clustered \"feels lazier\" / \"deletes my files\" / \"won't finish\" / \"keeps asking me to confirm trivial things\" / \"telling me to go to bed\" reports across r/ClaudeAI, r/ClaudeCode, r/Anthropic, HN. Day-0 hype does NOT count as signal.
2. **A/B parity gate:** `tests/e2e/local-shepherd.sh` run on at least 3 PRs comparing `--model claude-opus-4-7` vs `--model claude-opus-4-8` shows overlapping 95% CI on Tier 1+2 scores. (This is exactly the gate we built in ROADMAP #212.)
3. **Self-dogfood gate:** At least one substantive PR shepherded end-to-end (`/sdlc` plan → TDD → review → CI green → merge) on 4.8 with no SDLC-relevant regressions encountered.

If any gate fails, **hold at 4.7** and re-evaluate at the next Anthropic patch (4.8.1 or 4.9).

## Schedule

- [ ] **2026-05-30 or 2026-05-31:** Re-run sentiment recon — Reddit JSON API + HN Algolia, looking for clustered regression reports. Auto-close this issue with HOLD if sentiment is net negative; proceed to gate 2 if net positive.
- [ ] **Post-sentiment, if green:** Run local-shepherd A/B on 3 representative PRs (one feature, one refactor, one bugfix shape).
- [ ] **Post-A/B, if green:** Use 4.8 for one full SDLC arc as the dogfood gate.
- [ ] **Post-dogfood, if green:** Open the model-bump PR (~10 files, see scope below).

## Bump scope (when/if we proceed)

| File | What changes |
|---|---|
| `SDLC.md` | Recommended Model row + effort warning header |
| `CLAUDE_CODE_SDLC_WIZARD.md` | 10+ \"Opus 4.7\" mentions + effort guidance section + setup wizard Step 9.5 prose |
| `skills/sdlc/SKILL.md` | `opus[1m]` description + cross-model review tier notes |
| `skills/setup/SKILL.md` | Mixed-mode + flagship tier suggestions |
| `skills/update/SKILL.md` | settings.json migration guidance |
| `hooks/model-effort-check.sh` | RECOMMENDED_MODEL constant + warning text |
| `cli/lib/repo-complexity.js` | Tier comments |

Also: the cross-model review tier should bump to `claude-opus-4-8 max` (or equivalent flagship) per ROADMAP #233's flagship-reviewer rule.

## Why this matters strategically

The wizard is part of the XDLC ecosystem and installed into many repos. A premature recommendation propagates the file-deletion regression to every consumer. The cautious bump is the dogfood-discipline cost of being a recommended-model authority for downstream users.

## References

- [Anthropic Opus 4.8 release page](https://www.anthropic.com/news/claude-opus-4-8)
- [Anthropic Opus 4.8 System Card (PDF)](https://www.anthropic.com/claude-opus-4-8-system-card)
- Wizard A/B infrastructure: `tests/e2e/local-shepherd.sh` (ROADMAP #212, shipped v1.59.0)
- Prior-cycle context: [4.7 backlash analysis](https://merchmindai.net/blog/en/post/claude-opus-4-7-feedback-analysis), [4.7 regression report](https://www.buildfastwithai.com/blogs/claude-opus-4-7-regression-explained-2026) — both surfaced day-2/3, not day-0.

---
*Filed 2026-05-28 during sentiment recon at user request. Next action 2026-05-30/31.*

File	What changes
`SDLC.md`	Recommended Model row + effort warning header
`CLAUDE_CODE_SDLC_WIZARD.md`	10+ "Opus 4.7" mentions + effort guidance section + setup wizard Step 9.5 prose
`skills/sdlc/SKILL.md`	`opus[1m]` description + cross-model review tier notes
`skills/setup/SKILL.md`	Mixed-mode + flagship tier suggestions
`skills/update/SKILL.md`	settings.json migration guidance
`hooks/model-effort-check.sh`	RECOMMENDED_MODEL constant + warning text
`cli/lib/repo-complexity.js`	Tier comments

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Hold Opus 4.7 → 4.8 model-pin bump 48-72h for sentiment, then A/B via local-shepherd #365

Context

Why we're not bumping immediately

What Anthropic flags as wins

Decision rule

Schedule

Bump scope (when/if we proceed)

Why this matters strategically

References

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Uh oh!

Hold Opus 4.7 → 4.8 model-pin bump 48-72h for sentiment, then A/B via local-shepherd #365

Description

Context

Why we're not bumping immediately

What Anthropic flags as wins

Decision rule

Schedule

Bump scope (when/if we proceed)

Why this matters strategically

References

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions