15 changes: 15 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,20 @@
# Changelog

## [0.9.6.0] - 2026-03-21 — Prism Foundation Fix

### Added

- **Prism now researches before building.** When a feature involves external APIs, services, or tools, Prism spends up to 60 seconds checking official docs, pricing, and constraints before recommending an approach. No more suggesting broken APIs or silently picking the cheapest option.
- **Decision Boundary — Prism knows what to decide silently and what to ask you about.** Engineering decisions (file structure, frameworks, dependency versions) are silent. Product decisions (paid vs free API, capability tradeoffs, anything involving your money) surface as confident recommendations you can approve or redirect.
- **Operator Boundary — Prism never sends you to another terminal.** If it can install a dependency, run a command, or configure a tool, it does it itself. It only asks you to act when it genuinely needs your credentials, legal consent, or subjective taste.
- **Behavioral E2E eval for Prism.** The X/Twitter crawler scenario that originally exposed the "watered-down Claude" problem is now an automated test. If Prism regresses, the eval catches it.

### Changed

- **Prism's chunk build cycle is now a single 11-step numbered sequence.** Research gate → specificity gate → build → code review → TDD → tests → LLM drift comparison → precedence. No more separate bullet-point "invisible expert team" wish list.
- **Six contradictory "just build" instructions rewritten.** Prism's momentum is preserved (it still builds with confidence), but it now applies the Decision Boundary during the build flow.
- **Precedence table expanded from 6 to 8 entries** with research approach checkpoints and operator-boundary disclosures.

## [0.9.5.0] - 2026-03-21 — CEO Review ↔ Office Hours Chaining

### Added
2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
0.9.5.0
0.9.6.0
2 changes: 2 additions & 0 deletions prism/.gitignore
@@ -0,0 +1,2 @@
.prism/
prism
1,416 changes: 1,416 additions & 0 deletions prism/SKILL.md

Large diffs are not rendered by default.

29 changes: 29 additions & 0 deletions prism/planning/TODOS.md
@@ -0,0 +1,29 @@
# TODOS — Prism Triad MVP

## Post-Phase 1

### Verification calibration loop
**What:** After 3-5 real builds, review instrumentation logs (verification outcomes, override rates, false positive rates) and tune the LLM comparison prompt.
**Why:** First-version LLM self-evaluation prompts are notoriously noisy. Without calibration, users either learn to ignore warnings (defeating the purpose) or get frustrated by false alarms. The instrumentation data from Issue 9 tells you exactly where to adjust.
**Pros:** Targeted improvement with real data instead of guessing.
**Cons:** Requires 3-5 real sessions before it's actionable.
**Context:** The precedence hierarchy (tests > LLM advisory > user override) means false positives from the LLM layer are advisories, not blockers. But too many advisories train users to ignore them. The calibration pass adjusts the comparison prompt sensitivity based on observed false positive rates.
**Depends on:** Phase 1 implementation + at least 3 real builds with instrumentation logging.
**Added:** 2026-03-20 (eng review)

### Prompt precedence documentation
**What:** Define a precedence section in SKILL.md for when behavioral layers conflict (e.g., drift detection fires during verification, scope protection triggers during Socratic questioning).
**Why:** The skill has 6 existing guardrails + communication rules + stage behaviors. The triad adds Socratic depth, verification loops, and smart interrupts. Without explicit precedence, Claude makes inconsistent choices across sessions.
**Pros:** Consistent behavior across sessions. Debuggable when things go wrong.
**Cons:** Requires thinking through ~6 potential pairwise conflicts.
**Context:** Codex flagged this: "no one has defined prompt precedence when these behaviors conflict." Best done after Phase 1 when the actual behaviors exist and conflicts can be observed rather than predicted.
**Depends on:** Phase 1 implementation (need to observe actual conflicts before defining rules).
**Added:** 2026-03-20 (eng review)

## Included in Phase 1 (from eng review)

### Acceptance criteria self-check
**What:** During criteria generation, Claude validates each machine-layer assertion: "Could this actually fail? Is it specific enough to catch a real problem?"
**Why:** Addresses the critical gap where vague assertions cascade into weak tests. The two-layer criteria system is only as good as the machine-layer translation.
**Status:** Include in Phase 1 implementation (not deferred).
**Added:** 2026-03-20 (eng review)
121 changes: 121 additions & 0 deletions prism/planning/ceo-plans/2026-03-20-prism-triad-mvp.md
@@ -0,0 +1,121 @@
---
status: APPROVED
---
# CEO Plan: Prism — The Triad MVP

Generated by /plan-ceo-review on 2026-03-20
Branch: unknown (product strategy) | Mode: SELECTIVE EXPANSION
Repo: Prism
Supersedes strategic direction of: 2026-03-20-prism-ai-cofounder.md

## Context

This plan follows a reframe discovered in /office-hours on 2026-03-20. The previous CEO plan positioned Prism as a broad AI co-founder (8 scope items, 5-6 weeks CC). This plan narrows to the **Triad MVP** — enhancing the existing 915-line /prism gstack skill to include all three legs of the intent orchestration triad:

1. **Translation** — Socratic questioning that finds the real requirement
2. **Verification** — checking that what was built matches what was intended
3. **Silent Guidance** — invisible guardrails (already exists in current skill)

## Vision

### 10x Check
The 10x version of Prism isn't a bigger product — it's a smarter one. Prism that has run 1,000 builds with 100 different non-technical creators and knows: which questions unlock the real requirement fastest, which verification patterns catch the most fabrications, which build chunk sizes prevent the 80% wall. The product gets better with every session because it learns what works.

### What Changed Since the Prior CEO Plan
The prior plan treated Prism as an "AI company in a box" — business context vault, proactive scanning, learning loop, dashboard canvas. This plan says: **none of that matters if the intent translation doesn't work.** The triad (translation + verification + guidance) is the load-bearing foundation. Everything else is expansion that comes after the foundation is proven.

Key reframe: the product IS the conversation. Not the dashboard, not the agent swarms, not the deployment pipeline. The Socratic questioning + verification loop is Prism's soul. Everything else serves it.

## Scope Decisions

| # | Proposal | Effort | Decision | Reasoning |
|---|----------|--------|----------|-----------|
| 1 | Deepen Phase 2 Socratic questioning | S | ACCEPTED | Core of triad — transforms shallow "what's alive in you?" into deep "why, why, why" |
| 2 | Chunk verification with judgment checkpoints | S | ACCEPTED | Core of triad — surfaces "does this match intent?" after each chunk |
| 3 | Acceptance criteria generation from intent | S | ACCEPTED | Core of triad — defines "done" before code starts |
| 4 | Protocol Template Export | S | ACCEPTED | Auto-generates structured intent doc usable outside Prism |
| 5 | Obsidian vault write-only integration | S | ACCEPTED | Intent docs written to Obsidian for persistence and browsing |
| 6 | Acceptance Criteria Verification Loop | S | ACCEPTED | Compares build output to acceptance criteria, surfaces mismatches |
| 7 | Socratic Depth Calibration (Quick/Standard/Deep) | S | ACCEPTED | Adaptive questioning depth based on user clarity |
| 8 | Obsidian two-way sync | M | DEFERRED | Read business context from vault — important but premature |
| 9 | Multi-machine session handoff | M | DEFERRED | Sharing intent between machines — important but premature |

## Accepted Scope (added to this plan)

**Core triad enhancements (items 1-3):**
- **Item 1: Deeper Socratic questioning.** Deepen Phase 2 with "why" drilling and requirement extraction. If intent remains too vague after questioning, prompt for another round before generating criteria. Depth is adaptive: classified by the LLM via a system prompt that maps the user's opening answer to one of three depth levels ("I have a clear idea" → Quick, "I have a feeling" → Standard, "I'm stuck" → Deep). User can override at any point.
- **Item 2: Judgment checkpoints.** A judgment checkpoint is a blocking AskUserQuestion call that pauses the build until the user confirms or rejects the chunk output. Chunk boundaries align with features in the intent doc (one feature = one chunk, ordered by dependency — rejection of chunk N blocks chunks N+1... until resolved). After each chunk, Prism reads the acceptance criteria for that feature and compares the build output against them. The comparison mechanism is an LLM self-evaluation prompt ("Given these acceptance criteria and this build output, does the output satisfy the criteria? List any mismatches."). On reject: user specifies what's wrong → Prism rebuilds the chunk with the correction → re-verifies. On confirm: log success and proceed.
- **Item 3: Acceptance criteria generation.** From the confirmed intent, generate measurable acceptance criteria (input → expected output pairs) for each feature. Written to `.prism/acceptance-criteria.md`. These are the "definition of done" — used by the verification loop (item 6) during the CREATING stage.
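As a sketch of what item 3 might produce: the criterion values and the `render_criteria` helper below are hypothetical; only the `.prism/acceptance-criteria.md` path and the input → expected-output pairing come from the plan.

```python
from pathlib import Path

# Illustrative criterion shape (feature/input/expected values are made up):
# one measurable input -> expected-output pair per feature, per item 3.
criteria = [
    {
        "feature": "Signup",
        "input": "new visitor submits the form with a valid email",
        "expected": "account is created and a confirmation appears within 30 seconds",
    },
]

def render_criteria(entries):
    """Render criteria as the markdown written to .prism/acceptance-criteria.md."""
    lines = ["# Acceptance Criteria", ""]
    for c in entries:
        lines += [
            f"## {c['feature']}",
            f"- **Input:** {c['input']}",
            f"- **Expected:** {c['expected']}",
            "",
        ]
    return "\n".join(lines)

Path(".prism").mkdir(exist_ok=True)
Path(".prism/acceptance-criteria.md").write_text(render_criteria(criteria))
```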

**Cherry-picked expansions (items 4-7):**
- **Item 4: Protocol Template Export.** Auto-generate a structured protocol markdown doc with sections: Problem Statement, Confirmed Intent, Target User, Acceptance Criteria, Build Plan. Usable outside Prism (e.g., with Cursor, Windsurf, or manual builds).
- **Item 5: Obsidian write-only integration.** Write intent docs to configurable Obsidian vault path. Config schema: `{ "obsidian_vault_path": "~/Obsidian/Prism" }` in `.prism/config.json`. If path doesn't exist, warn once and fall back to `.prism/`. Default: disabled (no vault path configured).
- **Item 6: Acceptance Criteria Verification Loop (belt and suspenders).** During CREATING stage, after each chunk is built, two-layer verification: (1) Claude does a quick comparison of output against acceptance criteria for fast feedback — if mismatch, surface as judgment checkpoint (item 2); (2) tdd-guide agent generates deterministic tests derived FROM the acceptance criteria (not generic code-quality tests). Tests are ground truth — they pass or fail without hallucination risk. Claude comparison catches intent drift; tests catch fabrications.
- **Item 7: Socratic Depth Calibration.** Quick (1-2 Qs) for users with a clear plan who just need acceptance criteria generated. Standard (3-5 Qs) default. Deep (5-10 Qs) for users who are exploring. Auto-detected from opening answer, user can override.
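Item 5's fallback behavior can be sketched as below. The config key and paths are from the plan; the `resolve_export_dir` function is a hypothetical illustration (and warns on every call, where a real implementation would warn only once).

```python
import json
import os
from pathlib import Path

def resolve_export_dir(config_path=".prism/config.json"):
    """Return the directory for intent-doc export per item 5:
    the configured vault path if it exists, else .prism/."""
    try:
        cfg = json.loads(Path(config_path).read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        cfg = {}
    vault = cfg.get("obsidian_vault_path")
    if not vault:
        return Path(".prism")  # default: integration disabled
    vault_dir = Path(os.path.expanduser(vault))
    if vault_dir.is_dir():
        return vault_dir
    # Warn and fall back rather than failing the export.
    print(f"warning: vault path {vault} not found; falling back to .prism/")
    return Path(".prism")
```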

## Deferred to TODOS.md
- Obsidian two-way sync (read business context as Socratic input) — Phase 2
- Multi-machine session handoff (share intent between devices) — Phase 2
- Dashboard / visual layer — after triad is validated
- Proactive task discovery (Sentry/GitHub scanning) — after triad is validated
- Agent learning loop — after triad is validated
- Cost tracking per agent/task/feature — after triad is validated

## Architecture

**No new architecture.** This enhances the existing /prism gstack skill:
- Single file: `~/.claude/skills/gstack/prism/SKILL.md`
- Existing state: `.prism/intent.md`, `.prism/state.json`, `.prism/history.jsonl`
- New: `.prism/acceptance-criteria.md` (generated from Phase 2)
- New: optional Obsidian vault path config in `.prism/config.json`
- Existing quality pipeline: code-reviewer, tdd-guide, security-reviewer agents
- New: acceptance criteria comparison step in stage machine

```
EXISTING STAGE MACHINE (enhanced):

VISIONING ─────────► CREATING ──────────► POLISHING ──► SHIPPING ──► DONE
(Phase 1-4) (build chunks) (quality) (deploy)
│ │
NEW: Deeper "why" NEW: After each chunk:
questioning. 1. Read acceptance criteria
Generate acceptance 2. Compare output to criteria
criteria. 3. If mismatch → judgment checkpoint
Adaptive depth 4. If match → log + next chunk
(Quick/Std/Deep).
NEW: Export protocol
template to Obsidian
vault (write-only)
```
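The four per-chunk steps in the diagram can be sketched as a loop. This is an illustration only: the callables (`read_criteria`, `compare`, `checkpoint`, `log`) are hypothetical stand-ins for the LLM comparison and AskUserQuestion mechanics described in the scope items.

```python
def creating_stage(chunks, read_criteria, compare, checkpoint, log):
    """Sketch of the enhanced CREATING stage: one feature per chunk,
    verified against its acceptance criteria before moving on."""
    for chunk in chunks:
        output = chunk["build"]()
        criteria = read_criteria(chunk["feature"])  # 1. read acceptance criteria
        mismatches = compare(output, criteria)      # 2. compare output to criteria
        if mismatches:
            checkpoint(chunk, mismatches)           # 3. mismatch -> judgment checkpoint
        else:
            log(chunk, "match")                     # 4. match -> log + next chunk
```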

## Effort Estimate

All items are S-effort. Recommended delivery in two phases: (1) Core triad items 1-3 first (~4-5 hours CC), (2) Cherry-picked expansions 4-7 (~3-4 hours CC). Total: ~7-9 hours CC. Prompt tuning for Socratic questioning and verification is iterative and may require additional sessions beyond the initial implementation.

This is a single-file enhancement to an existing 915-line skill. No new infrastructure, no new services, no new dependencies. State persistence across chunks uses the existing `.prism/state.json` checkpointing mechanism.

## Success Criteria

1. The founder activates /prism and experiences deeper intent capture than before
2. Acceptance criteria are generated before code starts
3. After each build chunk, the founder sees a judgment checkpoint: "Does this match?"
4. Mismatches between build output and acceptance criteria are caught by the verification loop before the founder discovers them manually (fabrication detection is covered by the existing tdd-guide agent running tests)
5. Intent docs appear in Obsidian vault and are readable/searchable
6. Patrick can use /prism on a real build and say "I knew what was happening the whole time"

## Risk Assessment

| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| Socratic questioning annoys impatient users | Medium | Medium | Depth calibration (Quick mode) |
| Verification loop is too strict (false positives) | Medium | Low | User can dismiss checkpoint and proceed. First-pass comparison is LLM self-evaluation prompt; will iterate based on false positive rate in real usage |
| LLM self-evaluation unreliable | Medium | Medium | Belt-and-suspenders: LLM comparison for fast feedback + deterministic tests from tdd-guide for ground truth. Highest technical risk in this plan — plan for iteration |
| Obsidian path config is fragile | Low | Low | Sensible defaults, clear error messages |
| Changes break existing /prism behavior | Low | High | Test with existing intent.md files first |

## Relationship to Prior CEO Plan

The prior plan (2026-03-20-prism-ai-cofounder.md) described the full product vision. This plan is the **foundation** that must work before any of that vision is built. If the triad doesn't work — if Socratic questioning + verification + guidance can't prevent the 80% wall — then none of the broader vision matters.

Think of it as: Prior plan = the house. This plan = the foundation. Build the foundation first. If it holds, build the house.
78 changes: 78 additions & 0 deletions prism/planning/eng-review-2026-03-20.md
@@ -0,0 +1,78 @@
# Engineering Review: Prism Triad MVP
Date: 2026-03-20
Status: CLEARED (0 unresolved, 2 critical gaps mitigated)
Reviewed: CEO plan at ~/.gstack/projects/prism/ceo-plans/2026-03-20-prism-triad-mvp.md
Codex review: included (GPT-5.4, 14 issues, incorporated into review)

## 10 Decisions Locked In

### 1. Precedence hierarchy for verification
Tests are ground truth (pass/fail). LLM comparison is advisory (surfaces concerns but doesn't block). User judgment is final override (can dismiss either, but overrides are logged with reason).
```
Tests PASS + LLM OK → auto-proceed
Tests PASS + LLM flags → surface advisory to user
Tests FAIL → always block, regardless of LLM
User override → log override + reason, proceed
```
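The table above reduces to a small decision function. A sketch; the function name and return strings are illustrative, not part of the plan.

```python
def verification_decision(tests_pass, llm_flags, user_override=False):
    """Apply the precedence hierarchy: tests > LLM advisory > user override."""
    if user_override:
        return "proceed"            # override is logged with reason elsewhere
    if not tests_pass:
        return "block"              # tests are ground truth and always win
    if llm_flags:
        return "surface advisory"   # advisory only, never blocks
    return "auto-proceed"
```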

### 2. Hybrid ordering: Claude engineers, user vibes
Claude silently handles: dependency analysis, build order optimization, sub-chunk splitting, technical sequencing.
Claude asks the user (plain language only): "Which part matters most to you?" / "Should we start with what people see, or what makes everything work?"
NEVER asks engineering questions: "Auth depends on DB schema, build first?"

### 3. Two-layer acceptance criteria
- **User-facing** (acceptance-criteria.md): plain language, experience-focused. "People can sign up in under 30 seconds."
- **Machine-facing** (.prism/test-criteria.json): testable assertions Claude derives silently. "POST /signup returns 201 within 2s."
- User only ever sees user-facing layer.
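Using the two example criteria from this decision, the same requirement at both layers might look like the following. The machine-layer field names are assumptions; the decision only specifies the file and the assertion content.

```python
# Same criterion, both layers (examples taken from decision 3):
user_facing = "People can sign up in under 30 seconds."  # acceptance-criteria.md

machine_facing = {  # .prism/test-criteria.json (field names are illustrative)
    "request": "POST /signup",
    "expected_status": 201,
    "max_latency_seconds": 2,
}
```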

### 4. Smart interrupts (not blocking checkpoints)
Prism verifies every chunk silently. Only interrupts when:
- Tests fail (after 2 silent fix attempts)
- LLM detects significant intent drift
- A feature is substantially different from what was described
Green chunks auto-proceed with a brief status message. Result: ~1-2 interruptions per build instead of 5-10.
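The interrupt conditions above amount to a small predicate. A sketch; the argument names are hypothetical.

```python
def should_interrupt(tests_failed, fix_attempts, intent_drift, feature_mismatch):
    """Decision 4: verify every chunk silently, interrupt only on real problems."""
    if tests_failed and fix_attempts >= 2:  # two silent fix attempts come first
        return True
    if intent_drift or feature_mismatch:    # significant drift always surfaces
        return True
    return False  # green chunk: brief status message, auto-proceed
```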

### 5. Graceful exit for Socratic questioning
Max rounds per depth: Quick (2), Standard (5), Deep (10). At max, Prism says "I have enough to start — we'll refine as we go" and generates best-effort acceptance criteria.
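The round budgets can be sketched as a lookup plus an exit check (the function name is illustrative; the budgets and exit line are from the decision):

```python
MAX_ROUNDS = {"quick": 2, "standard": 5, "deep": 10}

def graceful_exit_message(depth, rounds_asked):
    """Return the exit line once the depth's round budget is spent, else None."""
    if rounds_asked >= MAX_ROUNDS[depth]:
        return "I have enough to start — we'll refine as we go"
    return None  # budget remains: ask another question
```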

### 6. State migration for existing sessions
When resuming a session without acceptance-criteria.md: generate silently from intent.md features. Without config.json: use defaults. No user interruption.

### 7. Socratic rejection UX
When user rejects a chunk with vague feedback ("it feels off"), Prism asks follow-ups: "Is it doing the wrong thing, or doing the right thing the wrong way?" / "What did you picture instead?" Claude translates vibe into engineering changes silently.

### 8. Test generation from machine-layer criteria
tdd-guide receives machine-layer assertions (testable, specific), not user-layer criteria (vibes). Translation from vibes to assertions happens once during criteria generation.

### 9. Lightweight instrumentation
Log to history.jsonl: verification outcomes (pass/fail/override), Socratic depth used, chunks rejected vs accepted, time per chunk. Not a formal eval — just enough data to learn.
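A minimal sketch of the JSONL logging. The decision names the metrics but not the record schema, so the field names below are assumptions.

```python
import json
import time
from pathlib import Path

def log_event(event, path=".prism/history.jsonl"):
    """Append one instrumentation record as a JSON line (decision 9)."""
    Path(path).parent.mkdir(exist_ok=True)
    record = {"ts": time.time(), **event}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

log_event({
    "type": "verification",
    "outcome": "pass",          # pass | fail | override
    "socratic_depth": "standard",
    "chunk_accepted": True,
    "seconds_in_chunk": 412,    # illustrative value
})
```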

### 10. Test efficiency
Generate tests once per feature during criteria generation. On verification, run existing tests inline — don't re-invoke tdd-guide. Only re-invoke if fix changes feature scope.

## Critical Gaps (mitigated)
1. **Vague machine-layer criteria** → Add self-check during generation: "Could each assertion actually fail?" (Included in Phase 1)
2. **Silent protocol export failure** → Log to history.jsonl, mention in status message.

## Phase 1 Implementation Scope (Core Triad)
Items 1-3 from CEO plan + acceptance criteria self-check:
- Deeper Socratic questioning with adaptive depth + max rounds
- Two-layer acceptance criteria generation with self-check
- Smart verification loop (silent verify, interrupt only on problems)
- Socratic rejection UX for vague feedback
- State migration for existing sessions
- Lightweight instrumentation logging

## Phase 2 Implementation Scope (Expansions)
Items 4-7 from CEO plan:
- Protocol Template Export
- Obsidian vault write-only integration
- Full verification loop (belt-and-suspenders with tdd-guide)
- Socratic Depth Calibration UI

## Files
- Test plan: ~/.gstack/projects/prism/foxy-no-branch-test-plan-20260320-230000.md
- TODOS: ~/.gstack/projects/prism/TODOS.md
- CEO plan: ~/.gstack/projects/prism/ceo-plans/2026-03-20-prism-triad-mvp.md
- Design doc: ~/.gstack/projects/prism/foxy-unknown-design-20260320-221541.md
- Existing skill: ~/.claude/skills/gstack/prism/SKILL.md (915 lines)