Commit 4cd4d11

garrytan and claude authored
feat: design outside voices — cross-model design critique (v0.11.3.0) (#347)
* feat(gen-skill-docs): add design outside voices + hard rules resolvers

Add generateDesignOutsideVoices() — parallel Codex + Claude subagent dispatch for cross-model design critique with litmus scorecard synthesis. Branches per skillName (plan-design-review, design-review, design-consultation) with task-specific reasoning effort (high for analytical, medium for creative).

Add generateDesignHardRules() — OpenAI Frontend Skill hard rules + gstack AI slop blacklist unified into one shared block with classifier step (landing page vs app UI vs hybrid).

Extract AI_SLOP_BLACKLIST constant from inline prose in generateDesignMethodology() for DRY. Extend generateDesignReviewLite() with lightweight Codex block. Extend generateDesignSketch() with outside voices opt-in after wireframe.

Source: OpenAI "Designing Delightful Frontends with GPT-5.4" (Mar 2026)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat(design skills): add outside voices + hard rules to all design templates

Insert {{DESIGN_OUTSIDE_VOICES}} in plan-design-review (between Step 0D and Pass 1), design-review (between Phase 6 and Phase 7), and design-consultation (between Phase 2 and Phase 3). Insert {{DESIGN_HARD_RULES}} in plan-design-review Pass 4 and design-review Phase 3 checklist.

DESIGN_REVIEW_LITE in /ship and /review now includes a Codex design voice block with litmus checks. DESIGN_SKETCH in /office-hours now includes outside voices opt-in after wireframe approval.

Regenerated all SKILL.md files (both Claude and Codex hosts).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add resolver tests + touchfiles for design outside voices

Add 18 test cases across 4 new describe blocks:
- DESIGN_OUTSIDE_VOICES: host guard, skillName branching, reasoning effort
- DESIGN_HARD_RULES: classifier, 3 rule sets, slop blacklist, OpenAI criteria
- DESIGN_SKETCH extended: outside voices step, original wireframe preserved
- DESIGN_REVIEW_LITE extended: Codex block, codex host exclusion

Update touchfiles: add scripts/gen-skill-docs.ts to design skill E2E test dependencies for accurate diff-based test selection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.11.3.0)

Design outside voices — parallel Codex + Claude subagent for cross-model design critique with litmus scorecard synthesis. OpenAI hard rules + gstack slop blacklist unified. Classifier for landing page vs app UI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: generate .agents/ on demand in tests (not checked in since v0.11.2.0)

.agents/ is gitignored since v0.11.2.0 — tests that read Codex-host SKILL.md files now generate them on demand via `bun run gen-skill-docs.ts --host codex` before reading. Fixes test failures on fresh clones.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent b7a3bf1 commit 4cd4d11

16 files changed

Lines changed: 999 additions & 23 deletions

CHANGELOG.md

Lines changed: 10 additions & 0 deletions
@@ -1,5 +1,15 @@
 # Changelog
 
+## [0.11.3.0] - 2026-03-23 — Design Outside Voices
+
+### Added
+
+- **Every design review now gets a second opinion.** `/plan-design-review`, `/design-review`, and `/design-consultation` dispatch both Codex (OpenAI) and a fresh Claude subagent in parallel to independently evaluate your design — then synthesize findings with a litmus scorecard showing where they agree and disagree. Cross-model agreement = high confidence; disagreement = investigate.
+- **OpenAI's design hard rules baked in.** 7 hard rejection criteria, 7 litmus checks, and a landing-page vs app-UI classifier from OpenAI's "Designing Delightful Frontends" framework — merged with gstack's existing 10-item AI slop blacklist. Your design gets evaluated against the same rules OpenAI recommends for their own models.
+- **Codex design voice in every PR.** The lightweight design review that runs in `/ship` and `/review` now includes a Codex design check when frontend files change — automatic, no opt-in needed.
+- **Outside voices in /office-hours brainstorming.** After wireframe sketches, you can now get Codex + Claude subagent design perspectives on your approaches before committing to a direction.
+- **AI slop blacklist extracted as shared constant.** The 10 anti-patterns (purple gradients, 3-column icon grids, centered everything, etc.) are now defined once and shared across all design skills. Easier to maintain, impossible to drift.
+
 ## [0.11.2.0] - 2026-03-22 — Codex Just Works
 
 ### Fixed

TODOS.md

Lines changed: 24 additions & 0 deletions
@@ -432,6 +432,30 @@ Shipped: Default model changed to Sonnet for structure tests (~30), Opus retaine
 
 Shipped as v0.5.0 on main. Includes `/plan-design-review` (report-only design audit), `/qa-design-review` (audit + fix loop), and `/design-consultation` (interactive DESIGN.md creation). `{{DESIGN_METHODOLOGY}}` resolver provides shared 80-item design audit checklist.
 
+### Design outside voices in /plan-eng-review
+
+**What:** Extend the parallel dual-voice pattern (Codex + Claude subagent) to /plan-eng-review's architecture review section.
+
+**Why:** The design beachhead (v0.11.3.0) proves cross-model consensus works for subjective reviews. Architecture reviews have similar subjectivity in tradeoff decisions.
+
+**Context:** Depends on learnings from the design beachhead. If the litmus scorecard format proves useful, adapt it for architecture dimensions (coupling, scaling, reversibility).
+
+**Effort:** S
+**Priority:** P3
+**Depends on:** Design outside voices shipped (v0.11.3.0)
+
+### Outside voices in /qa visual regression detection
+
+**What:** Add Codex design voice to /qa for detecting visual regressions during bug-fix verification.
+
+**Why:** When fixing bugs, the fix can introduce visual regressions that code-level checks miss. Codex could flag "the fix broke the responsive layout" during re-test.
+
+**Context:** Depends on /qa having design awareness. Currently /qa focuses on functional testing.
+
+**Effort:** M
+**Priority:** P3
+**Depends on:** Design outside voices shipped (v0.11.3.0)
+
 ## Document-Release
 
 ### Auto-invoke /document-release from /ship — SHIPPED

VERSION

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-0.11.2.0
+0.11.3.0

design-consultation/SKILL.md

Lines changed: 65 additions & 0 deletions
@@ -423,6 +423,71 @@ If the user said no research, skip entirely and proceed to Phase 3 using your bu
 
 ---
 
+## Design Outside Voices (parallel)
+
+Use AskUserQuestion:
+> "Want outside design voices? Codex evaluates against OpenAI's design hard rules + litmus checks; Claude subagent does an independent design direction proposal."
+>
+> A) Yes — run outside design voices
+> B) No — proceed without
+
+If user chooses B, skip this step and continue.
+
+**Check Codex availability:**
+```bash
+which codex 2>/dev/null && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE"
+```
+
+**If Codex is available**, launch both voices simultaneously:
+
+1. **Codex design voice** (via Bash):
+```bash
+TMPERR_DESIGN=$(mktemp /tmp/codex-design-XXXXXXXX)
+codex exec "Given this product context, propose a complete design direction:
+- Visual thesis: one sentence describing mood, material, and energy
+- Typography: specific font names (not defaults — no Inter/Roboto/Arial/system) + hex colors
+- Color system: CSS variables for background, surface, primary text, muted text, accent
+- Layout: composition-first, not component-first. First viewport as poster, not document
+- Differentiation: 2 deliberate departures from category norms
+- Anti-slop: no purple gradients, no 3-column icon grids, no centered everything, no decorative blobs
+
+Be opinionated. Be specific. Do not hedge. This is YOUR design direction — own it." -s read-only -c 'model_reasoning_effort="medium"' --enable web_search_cached 2>"$TMPERR_DESIGN"
+```
+Use a 5-minute timeout (`timeout: 300000`). After the command completes, read stderr:
+```bash
+cat "$TMPERR_DESIGN" && rm -f "$TMPERR_DESIGN"
+```
+
+2. **Claude design subagent** (via Agent tool):
+Dispatch a subagent with this prompt:
+"Given this product context, propose a design direction that would SURPRISE. What would the cool indie studio do that the enterprise UI team wouldn't?
+- Propose an aesthetic direction, typography stack (specific font names), color palette (hex values)
+- 2 deliberate departures from category norms
+- What emotional reaction should the user have in the first 3 seconds?
+
+Be bold. Be specific. No hedging."
+
+**Error handling (all non-blocking):**
+- **Auth failure:** If stderr contains "auth", "login", "unauthorized", or "API key": "Codex authentication failed. Run `codex login` to authenticate."
+- **Timeout:** "Codex timed out after 5 minutes."
+- **Empty response:** "Codex returned no response."
+- On any Codex error: proceed with Claude subagent output only, tagged `[single-model]`.
+- If Claude subagent also fails: "Outside voices unavailable — continuing with primary review."
+
+Present Codex output under a `CODEX SAYS (design direction):` header.
+Present subagent output under a `CLAUDE SUBAGENT (design direction):` header.
+
+**Synthesis:** Claude main references both Codex and subagent proposals in the Phase 3 proposal. Present:
+- Areas of agreement between all three voices (Claude main + Codex + subagent)
+- Genuine divergences as creative alternatives for the user to choose from
+- "Codex and I agree on X. Codex suggested Y where I'm proposing Z — here's why..."
+
+**Log the result:**
+```bash
+~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"design-outside-voices","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","source":"SOURCE","commit":"'"$(git rev-parse --short HEAD)"'"}'
+```
+Replace STATUS with "clean" or "issues_found", SOURCE with "codex+subagent", "codex-only", "subagent-only", or "unavailable".
+
 ## Phase 3: The Complete Proposal
 
 This is the soul of the skill. Propose EVERYTHING as one coherent package.
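
The error-handling rules and the SOURCE values in the design-consultation/SKILL.md hunk can be sketched as two small shell helpers. This is a minimal sketch under stated assumptions, not gstack's code: the function names `classify_codex_result` and `review_source`, the category labels, the case-insensitive keyword match, the order of the checks, and the use of exit status 124 (what GNU `timeout` returns on expiry) are all illustrative choices the commit does not specify.

```bash
# Hypothetical helper, not part of gstack: classify one Codex run.
#   $1 = exit status of the codex command (assumption: 124 means a timeout,
#        as with GNU `timeout`; the skill itself uses a tool-level timeout)
#   $2 = file holding captured stderr
#   $3 = file holding captured stdout
classify_codex_result() {
  local status="$1" stderr_file="$2" stdout_file="$3"
  if [ "$status" -eq 124 ]; then
    echo "timeout"
  elif grep -qiE 'auth|login|unauthorized|API key' "$stderr_file" 2>/dev/null; then
    echo "auth_failure"      # keywords taken from the skill text
  elif [ ! -s "$stdout_file" ]; then
    echo "empty_response"    # a missing or zero-length stdout file
  elif [ "$status" -ne 0 ]; then
    echo "error"             # assumed catch-all for any other failure
  else
    echo "ok"
  fi
}

# Hypothetical helper: pick the SOURCE label for the review log.
#   $1 = 1 if Codex produced usable output, else 0
#   $2 = 1 if the Claude subagent produced usable output, else 0
review_source() {
  case "$1$2" in
    11) echo "codex+subagent" ;;
    10) echo "codex-only" ;;
    01) echo "subagent-only" ;;
    *)  echo "unavailable" ;;
  esac
}
```

Because the keyword check runs before the exit-status check, a successful run whose stderr happens to mention "login" would still be reported as an auth failure; a real implementation might want to gate that check on a non-zero status.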

design-consultation/SKILL.md.tmpl

Lines changed: 2 additions & 0 deletions
@@ -131,6 +131,8 @@ If the user said no research, skip entirely and proceed to Phase 3 using your bu
 
 ---
 
+{{DESIGN_OUTSIDE_VOICES}}
+
 ## Phase 3: The Complete Proposal
 
 This is the soul of the skill. Propose EVERYTHING as one coherent package.
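
The `{{DESIGN_OUTSIDE_VOICES}}` line added to this template is a placeholder. Per the commit message, the generated `SKILL.md` files are regenerated from the templates by `scripts/gen-skill-docs.ts`, which is TypeScript and is not shown on this page. The following is only a minimal shell sketch of the substitution idea. The function name `render_placeholder` is hypothetical, and the real generator may resolve placeholders quite differently (for example, per skill name and per host).

```bash
# Hypothetical sketch, not the gstack generator: replace every occurrence of
# a {{NAME}} token in a template string with a replacement block.
#   $1 = template text, $2 = placeholder name, $3 = replacement text
render_placeholder() {
  local template="$1" token="{{$2}}" replacement="$3" out=""
  while :; do
    case "$template" in
      *"$token"*)
        # Keep the text before the first token, then append the replacement.
        out="$out${template%%"$token"*}$replacement"
        # Continue with whatever follows that token.
        template="${template#*"$token"}"
        ;;
      *) break ;;
    esac
  done
  printf '%s\n' "$out$template"
}
```

For example, `render_placeholder "$(cat SKILL.md.tmpl)" DESIGN_OUTSIDE_VOICES "$block"` would print the template with the token expanded, where `$block` is assumed to hold the generated section text.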

design-review/SKILL.md

Lines changed: 150 additions & 0 deletions
@@ -856,6 +856,75 @@ Tie everything to user goals and product objectives. Always suggest specific imp
 10. **Depth over breadth.** 5-10 well-documented findings with screenshots and specific suggestions > 20 vague observations.
 11. **Show screenshots to the user.** After every `$B screenshot`, `$B snapshot -a -o`, or `$B responsive` command, use the Read tool on the output file(s) so the user can see them inline. For `responsive` (3 files), Read all three. This is critical — without it, screenshots are invisible to the user.
 
+### Design Hard Rules
+
+**Classifier — determine rule set before evaluating:**
+- **MARKETING/LANDING PAGE** (hero-driven, brand-forward, conversion-focused) → apply Landing Page Rules
+- **APP UI** (workspace-driven, data-dense, task-focused: dashboards, admin, settings) → apply App UI Rules
+- **HYBRID** (marketing shell with app-like sections) → apply Landing Page Rules to hero/marketing sections, App UI Rules to functional sections
+
+**Hard rejection criteria** (instant-fail patterns — flag if ANY apply):
+1. Generic SaaS card grid as first impression
+2. Beautiful image with weak brand
+3. Strong headline with no clear action
+4. Busy imagery behind text
+5. Sections repeating same mood statement
+6. Carousel with no narrative purpose
+7. App UI made of stacked cards instead of layout
+
+**Litmus checks** (answer YES/NO for each — used for cross-model consensus scoring):
+1. Brand/product unmistakable in first screen?
+2. One strong visual anchor present?
+3. Page understandable by scanning headlines only?
+4. Each section has one job?
+5. Are cards actually necessary?
+6. Does motion improve hierarchy or atmosphere?
+7. Would design feel premium with all decorative shadows removed?
+
+**Landing page rules** (apply when classifier = MARKETING/LANDING):
+- First viewport reads as one composition, not a dashboard
+- Brand-first hierarchy: brand > headline > body > CTA
+- Typography: expressive, purposeful — no default stacks (Inter, Roboto, Arial, system)
+- No flat single-color backgrounds — use gradients, images, subtle patterns
+- Hero: full-bleed, edge-to-edge, no inset/tiled/rounded variants
+- Hero budget: brand, one headline, one supporting sentence, one CTA group, one image
+- No cards in hero. Cards only when card IS the interaction
+- One job per section: one purpose, one headline, one short supporting sentence
+- Motion: 2-3 intentional motions minimum (entrance, scroll-linked, hover/reveal)
+- Color: define CSS variables, avoid purple-on-white defaults, one accent color default
+- Copy: product language not design commentary. "If deleting 30% improves it, keep deleting"
+- Beautiful defaults: composition-first, brand as loudest text, two typefaces max, cardless by default, first viewport as poster not document
+
+**App UI rules** (apply when classifier = APP UI):
+- Calm surface hierarchy, strong typography, few colors
+- Dense but readable, minimal chrome
+- Organize: primary workspace, navigation, secondary context, one accent
+- Avoid: dashboard-card mosaics, thick borders, decorative gradients, ornamental icons
+- Copy: utility language — orientation, status, action. Not mood/brand/aspiration
+- Cards only when card IS the interaction
+- Section headings state what area is or what user can do ("Selected KPIs", "Plan status")
+
+**Universal rules** (apply to ALL types):
+- Define CSS variables for color system
+- No default font stacks (Inter, Roboto, Arial, system)
+- One job per section
+- "If deleting 30% of the copy improves it, keep deleting"
+- Cards earn their existence — no decorative card grids
+
+**AI Slop blacklist** (the 10 patterns that scream "AI-generated"):
+1. Purple/violet/indigo gradient backgrounds or blue-to-purple color schemes
+2. **The 3-column feature grid:** icon-in-colored-circle + bold title + 2-line description, repeated 3x symmetrically. THE most recognizable AI layout.
+3. Icons in colored circles as section decoration (SaaS starter template look)
+4. Centered everything (`text-align: center` on all headings, descriptions, cards)
+5. Uniform bubbly border-radius on every element (same large radius on everything)
+6. Decorative blobs, floating circles, wavy SVG dividers (if a section feels empty, it needs better content, not decoration)
+7. Emoji as design elements (rockets in headings, emoji as bullet points)
+8. Colored left-border on cards (`border-left: 3px solid <accent>`)
+9. Generic hero copy ("Welcome to [X]", "Unlock the power of...", "Your all-in-one solution for...")
+10. Cookie-cutter section rhythm (hero → 3 features → testimonials → pricing → CTA, every section same height)
+
+Source: [OpenAI "Designing Delightful Frontends with GPT-5.4"](https://developers.openai.com/blog/designing-delightful-frontends-with-gpt-5-4) (Mar 2026) + gstack design methodology.
+
 Record baseline design score and AI slop score at end of Phase 6.
 
 ---
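
The classifier step in the Design Hard Rules hunk decides which rule sets apply. As a rough illustration of that mapping (the classification itself is a judgment call made by the reviewing model, not something a script decides), a lookup could look like the sketch below. The function name `rule_sets_for` and the short labels `landing-page`, `app-ui`, and `universal` are hypothetical and do not appear in the commit.

```bash
# Hypothetical sketch: map a classifier result to the rule sets it selects.
# Universal rules apply to every type, per the "Universal rules" list.
rule_sets_for() {
  case "$1" in
    "MARKETING/LANDING PAGE") echo "landing-page universal" ;;
    "APP UI")                 echo "app-ui universal" ;;
    # HYBRID applies landing-page rules to hero/marketing sections and
    # app-ui rules to functional sections, so both sets are in play.
    "HYBRID")                 echo "landing-page app-ui universal" ;;
    *) echo "unknown classification: $1" >&2; return 1 ;;
  esac
}
```

Calling `rule_sets_for "HYBRID"` prints all three labels; an unrecognized classification returns a non-zero status rather than silently defaulting to one rule set.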
@@ -879,6 +948,87 @@ Record baseline design score and AI slop score at end of Phase 6.
 
 ---
 
+## Design Outside Voices (parallel)
+
+**Automatic:** Outside voices run automatically when Codex is available. No opt-in needed.
+
+**Check Codex availability:**
+```bash
+which codex 2>/dev/null && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE"
+```
+
+**If Codex is available**, launch both voices simultaneously:
+
+1. **Codex design voice** (via Bash):
+```bash
+TMPERR_DESIGN=$(mktemp /tmp/codex-design-XXXXXXXX)
+codex exec "Review the frontend source code in this repo. Evaluate against these design hard rules:
+- Spacing: systematic (design tokens / CSS variables) or magic numbers?
+- Typography: expressive purposeful fonts or default stacks?
+- Color: CSS variables with defined system, or hardcoded hex scattered?
+- Responsive: breakpoints defined? calc(100svh - header) for heroes? Mobile tested?
+- A11y: ARIA landmarks, alt text, contrast ratios, 44px touch targets?
+- Motion: 2-3 intentional animations, or zero / ornamental only?
+- Cards: used only when card IS the interaction? No decorative card grids?
+
+First classify as MARKETING/LANDING PAGE vs APP UI vs HYBRID, then apply matching rules.
+
+LITMUS CHECKS — answer YES/NO:
+1. Brand/product unmistakable in first screen?
+2. One strong visual anchor present?
+3. Page understandable by scanning headlines only?
+4. Each section has one job?
+5. Are cards actually necessary?
+6. Does motion improve hierarchy or atmosphere?
+7. Would design feel premium with all decorative shadows removed?
+
+HARD REJECTION — flag if ANY apply:
+1. Generic SaaS card grid as first impression
+2. Beautiful image with weak brand
+3. Strong headline with no clear action
+4. Busy imagery behind text
+5. Sections repeating same mood statement
+6. Carousel with no narrative purpose
+7. App UI made of stacked cards instead of layout
+
+Be specific. Reference file:line for every finding." -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached 2>"$TMPERR_DESIGN"
+```
+Use a 5-minute timeout (`timeout: 300000`). After the command completes, read stderr:
+```bash
+cat "$TMPERR_DESIGN" && rm -f "$TMPERR_DESIGN"
+```
+
+2. **Claude design subagent** (via Agent tool):
+Dispatch a subagent with this prompt:
+"Review the frontend source code in this repo. You are an independent senior product designer doing a source-code design audit. Focus on CONSISTENCY PATTERNS across files rather than individual violations:
+- Are spacing values systematic across the codebase?
+- Is there ONE color system or scattered approaches?
+- Do responsive breakpoints follow a consistent set?
+- Is the accessibility approach consistent or spotty?
+
+For each finding: what's wrong, severity (critical/high/medium), and the file:line."
+
+**Error handling (all non-blocking):**
+- **Auth failure:** If stderr contains "auth", "login", "unauthorized", or "API key": "Codex authentication failed. Run `codex login` to authenticate."
+- **Timeout:** "Codex timed out after 5 minutes."
+- **Empty response:** "Codex returned no response."
+- On any Codex error: proceed with Claude subagent output only, tagged `[single-model]`.
+- If Claude subagent also fails: "Outside voices unavailable — continuing with primary review."
+
+Present Codex output under a `CODEX SAYS (design source audit):` header.
+Present subagent output under a `CLAUDE SUBAGENT (design consistency):` header.
+
+**Synthesis — Litmus scorecard:**
+
+Use the same scorecard format as /plan-design-review (shown above). Fill in from both outputs.
+Merge findings into the triage with `[codex]` / `[subagent]` / `[cross-model]` tags.
+
+**Log the result:**
+```bash
+~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"design-outside-voices","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","source":"SOURCE","commit":"'"$(git rev-parse --short HEAD)"'"}'
+```
+Replace STATUS with "clean" or "issues_found", SOURCE with "codex+subagent", "codex-only", "subagent-only", or "unavailable".
+
 ## Phase 7: Triage
 
 Sort all discovered findings by impact, then decide which to fix:
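
The litmus scorecard synthesis compares the two voices' YES/NO answers to the same seven checks. The scorecard format itself lives in /plan-design-review and is not shown on this page, so the sketch below does not reproduce it. It only illustrates the comparison rule stated in the changelog (agreement means high confidence, disagreement means investigate). The function name `litmus_consensus` and its output wording are hypothetical.

```bash
# Hypothetical sketch: compare two space-separated lists of YES/NO answers,
# one per litmus check, and report agreement for each check.
#   $1 = Codex answers, $2 = Claude subagent answers
litmus_consensus() {
  local codex_answers="$1" subagent_answers="$2" check=1 c s
  # Word-split the subagent answers into the positional parameters so they
  # can be consumed one at a time alongside the Codex answers.
  set -- $subagent_answers
  for c in $codex_answers; do
    s="${1:-MISSING}"
    if [ "$#" -gt 0 ]; then shift; fi
    if [ "$c" = "$s" ]; then
      echo "check $check: $c agree (high confidence)"
    else
      echo "check $check: codex=$c subagent=$s disagree (investigate)"
    fi
    check=$((check + 1))
  done
}
```

A call such as `litmus_consensus "YES NO YES YES NO YES YES" "YES YES YES YES NO NO YES"` prints one line per check. If the subagent supplies fewer answers than Codex, the missing ones are reported as `MISSING` and therefore count as disagreements.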

design-review/SKILL.md.tmpl

Lines changed: 4 additions & 0 deletions
@@ -84,6 +84,8 @@ mkdir -p "$REPORT_DIR/screenshots"
 
 {{DESIGN_METHODOLOGY}}
 
+{{DESIGN_HARD_RULES}}
+
 Record baseline design score and AI slop score at end of Phase 6.
 
 ---
@@ -107,6 +109,8 @@ Record baseline design score and AI slop score at end of Phase 6.
 
 ---
 
+{{DESIGN_OUTSIDE_VOICES}}
+
 ## Phase 7: Triage
 
 Sort all discovered findings by impact, then decide which to fix:

office-hours/SKILL.md

Lines changed: 29 additions & 0 deletions
@@ -731,6 +731,35 @@ Reference the wireframe screenshot in the design doc's "Recommended Approach" se
 The screenshot file at `/tmp/gstack-sketch.png` can be referenced by downstream skills
 (`/plan-design-review`, `/design-review`) to see what was originally envisioned.
 
+**Step 6: Outside design voices** (optional)
+
+After the wireframe is approved, offer outside design perspectives:
+
+```bash
+which codex 2>/dev/null && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE"
+```
+
+If Codex is available, use AskUserQuestion:
+> "Want outside design perspectives on the chosen approach? Codex proposes a visual thesis, content plan, and interaction ideas. A Claude subagent proposes an alternative aesthetic direction."
+>
+> A) Yes — get outside design voices
+> B) No — proceed without
+
+If user chooses A, launch both voices simultaneously:
+
+1. **Codex** (via Bash, `model_reasoning_effort="medium"`):
+```bash
+TMPERR_SKETCH=$(mktemp /tmp/codex-sketch-XXXXXXXX)
+codex exec "For this product approach, provide: a visual thesis (one sentence — mood, material, energy), a content plan (hero → support → detail → CTA), and 2 interaction ideas that change page feel. Apply beautiful defaults: composition-first, brand-first, cardless, poster not document. Be opinionated." -s read-only -c 'model_reasoning_effort="medium"' --enable web_search_cached 2>"$TMPERR_SKETCH"
+```
+Use a 5-minute timeout (`timeout: 300000`). After completion: `cat "$TMPERR_SKETCH" && rm -f "$TMPERR_SKETCH"`
+
+2. **Claude subagent** (via Agent tool):
+"For this product approach, what design direction would you recommend? What aesthetic, typography, and interaction patterns fit? What would make this approach feel inevitable to the user? Be specific — font names, hex colors, spacing values."
+
+Present Codex output under `CODEX SAYS (design sketch):` and subagent output under `CLAUDE SUBAGENT (design direction):`.
+Error handling: all non-blocking. On failure, skip and continue.
+
 ---
 
 ## Phase 4.5: Founder Signal Synthesis
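
The office-hours step says error handling is "all non-blocking": a failed outside voice is skipped and the flow continues. In plain shell terms that pattern can be sketched as a wrapper that runs a command, notes a failure on stderr, and always reports success to its caller. The name `run_nonblocking` and the message format are hypothetical; the skill itself leaves this behavior to the agent rather than to a script.

```bash
# Hypothetical sketch: run a command without letting its failure stop the
# surrounding script. Output passes through; failures are only reported.
run_nonblocking() {
  local status=0
  "$@" || status=$?
  if [ "$status" -ne 0 ]; then
    echo "skipped: '$1' exited with status $status" >&2
  fi
  return 0
}
```

Under `set -e`, wrapping an optional step as `run_nonblocking some_optional_step` keeps a failure in that step from aborting the rest of the script.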
