# Proposals Audit — Clustering, Top 10, and Five Novel Ideas

## Purpose

This proposal looks across the other ~137 proposals in `docs/proposals/`, classifies them
by current relevance, picks the ten most promising directions and explains how to take
them to the next level, then adds five novel ideas that are not yet on the table.

The intent is curatorial. The proposal folder has grown faster than it has been pruned.
A reader landing in `docs/proposals/` today cannot tell which ideas are dead, which are
half-built, and which are the next place to invest energy. This document is the answer
to "what should we work on next, and which proposals can we stop re-reading?"

The audit was performed by surveying every numbered proposal (00–137) plus the three
unnumbered files (`AGENTS.md`, `ensemble-judge-refinement.md`, `parallel-racing-robustness.md`),
cross-checking each against the codebase under `planexe/` and recent git history.

---

## Classification scheme

Each proposal is tagged on three axes:

- **STATUS** — `implemented`, `partial`, `untouched`, `superseded`, `stale`.
- `implemented` = the proposal's core idea is shipped.
- `partial` = some scaffolding exists but the proposal is not fully realised.
- `untouched` = no code yet.
- `superseded` = a newer proposal subsumes it.
- `stale` = the premise no longer matches the project's direction.
- **RELEVANCE** — `high`, `medium`, `low`.
- **THEME** — `agentic`, `finance`, `plugin`, `verification`, `ranking`, `ui`,
`governance`, `infra`, `prompt-quality`, `data`, `other`.

---

## Cluster overview

### By status

| Status | Count (approx.) | Reading guidance |
|---|---:|---|
| Implemented | ~16 | Treat as historical record. Worth reading only when chasing the rationale behind a current code path. |
| Partial | ~24 | The interesting middle. These are stalled mid-build and most of them will benefit from a finishing push, not a fresh design. |
| Untouched, high-relevance | ~30 | The candidate pool for next-quarter work. Most of the "Top 10" below comes from here. |
| Untouched, medium-relevance | ~40 | Useful seed material but not the next step. Keep on the shelf. |
| Untouched, low / stale | ~28 | Candidates for retirement (move to `docs/proposals/archive/` or just stop linking from indexes). |

### By theme

| Theme | Hot proposals | Status of the cluster |
|---|---|---|
| `verification` | 21, 22, 27, 38, 42, 43, 46, 56, 88, 102, 107, 112, 118, 123, 133, 135, ensemble-judge | Largest cluster, mostly untouched; this is the project's biggest under-invested area. |
| `agentic` | 26, 40, 41, 60, 61, 64, 66, 69, 71, 111, 114, 120, 121, parallel-racing | Second biggest; lots of "post-plan agents" overlap that needs consolidation. |
| `ranking` / quality | 07, 29, 68, 89–92, 96, 119, 122, 132 | 122 already shipped; 132/07 are the two ELO drafts that should be merged. |
| `finance` | 11–14, 33–37, 44, 45, 76, 100, 105 | 11–14 are stale (investor-portfolio framing); 33–37, 44 are a coherent untouched module. |
| `prompt-quality` | 57–59, 82, 83, 94, 117, 128, 129, 130 | Live area — the self-improve loop (117/94) is the project's actual quality engine today. |
| `infra` | 03, 31, 50–53, 55, 70, 73–75, 79–81, 87, 93, 101, 103, 104, 108–110, 131, 134 | Most of the recent shipping work landed here; remaining items are smaller cleanups. |
| `plugin` | 16–20, 47, 115, 127 | Conceptually solid but no momentum. Likely premature until the agentic story lands. |
| `ui` | 02, 04, 08, 32, 62, 95, 116 | 08 shipped (home.planexe.org); 95 (routing UX modes) is the next high-leverage UI. |
| `data` | 05, 77, 98, 105, 106, 110, 137 | 05 (semantic plan graph) is the pearl; the rest are niche. |
| `governance` | 00, 15, 25, 30, 54, 78, 99, 136, AGENTS | Mostly meta/process docs; not blocking anything. |
| `other` | 48, 67, 72, 84, 85, 125, 126 | The retirement bucket. |

### Likely retirement candidates

These are good to read once for context, then stop linking:

`11-investor-thesis-matching-engine`, `12-evidence-based-founder-execution-index`,
`13-portfolio-aware-capital-allocation`, `14-confidence-weighted-funding-auctions`
(all assume a "PlanExe as VC infrastructure" framing the project no longer pursues),
`28`/`29`/`30` (autonomous bid factory — same framing problem),
`48-moltbook-reputation-bridge`, `49-distributed-physical-task-dispatch-protocol`,
`51-decentralized-planexe-survivability`, `67-buildinpublic-twitter-automation`,
`72-ai-replacing-c-level-roles`, `84`/`85` (business-idea critique — narrow domain),
`99-human-utility-show-pedigree`, `100-lobster-museum-donation-routing`,
`126-arc-agi3-structured-exploration`.

### Implemented (no longer "proposals")

`00-triage`, `08-ui-for-editing-plan`, `55-taskitem-activity-log-decomposition-and-secure-downloads`,
`73-rename-task-prefix-in-mcp-tools`, `74-rename-taskitem-to-planitem`,
`75-post-rename-cleanup-issues`, `79-multiple-api-keys`, `81-mcp-api-key-validation`,
`87-plan-resume-mcp-tool`, `93-local-model-roadmap`, `109-llm-executor-retry-improvements`,
`110-usage-metrics-local-runs`, `113-llm-error-traceability`, `122-deduplicate-levers`,
`127-mcp-feedback`, `AGENTS`. These should ideally be moved into a `docs/proposals/done/`
subfolder so the active set stays scannable.

---

## Top 10 promising directions

Each entry below names the proposal(s) being built on, why it matters now, and a
concrete "next level" step that would move it from idea to shipped capability.

### 1. ELO ranking as the quality signal — `07`, `132`

**Why now.** Without an automated scalar quality signal, every other quality-related
proposal — prompt optimiser (117, 94), drift measurement (82, 83), ensemble judge,
benchmark prompts (98) — has to invent its own metric. ELO solves that once.

**Next level.**
1. Pick the *unit* of comparison: full plan, single pipeline stage output, or single
answer to a benchmark prompt. Stage-output is the highest-leverage choice because it
plugs directly into self-improve.
2. Implement a pairwise-judge primitive (`planexe/quality/elo_judge.py`) that takes two
stage outputs + the rubric for that stage and returns a winner with a one-paragraph
justification.
3. Run it nightly across the canonical benchmark suite (98) to produce a leaderboard
per stage; surface that leaderboard in the self-improve runner so each prompt edit
gets a delta in ELO points.
4. Once the leaderboard is stable, retire ad-hoc "iter N is better" judgements from
the self-improve loop.
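
As a concrete anchor for steps 2 and 3, here is a minimal sketch of the rating bookkeeping the pairwise judge would feed. The module path comes from the proposal itself, but `JudgeVerdict`, `K_FACTOR`, and the function names are placeholders, and the actual judge prompt and model call are deliberately elided.

```python
from dataclasses import dataclass

K_FACTOR = 32  # standard Elo sensitivity constant; worth tuning per stage


@dataclass
class JudgeVerdict:
    winner: str         # "a" or "b", as returned by the pairwise judge
    justification: str  # the judge's one-paragraph rationale


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that output A beats output B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float, a_won: bool) -> tuple[float, float]:
    """Return the new (rating_a, rating_b) after one pairwise comparison."""
    e_a = expected_score(rating_a, rating_b)
    s_a = 1.0 if a_won else 0.0
    new_a = rating_a + K_FACTOR * (s_a - e_a)
    new_b = rating_b + K_FACTOR * ((1.0 - s_a) - (1.0 - e_a))
    return new_a, new_b
```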

### 2. Semantic plan search graph — `05`

**Why now.** PlanExe accumulates more plans every week, and the only way to find one
today is by directory name. A pgvector index over plan text + structured metadata
(domain, scale, lever taxonomy) unlocks: few-shot retrieval into the pipeline,
duplicate-detection across runs, "show me plans like this one" in the UI, and the
data substrate for novel ideas N1, N3, N5 below.

**Next level.**
1. PostgreSQL is already deployed (memory: `database_postgres`). Add a `plans` table
with embeddings on (initial_prompt, executive_summary, lever_set).
2. Backfill from the plan archive on home.planexe.org; do it lazily via a Luigi task.
3. Expose `plan_search` as an MCP tool first; ship the UI surface afterwards.
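
A minimal sketch of what step 1 could look like, assuming pgvector is available on the deployed PostgreSQL instance. The table name comes from the proposal; the column names, embedding dimension, and index choice are illustrative, not an agreed schema.

```python
# Hypothetical DDL for the `plans` table from step 1. Column names and the
# embedding dimension are assumptions; adjust to the chosen embedding model.
PLANS_DDL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS plans (
    run_id            text PRIMARY KEY,
    created_at        timestamptz NOT NULL DEFAULT now(),
    domain            text,
    scale             text,
    initial_prompt    text NOT NULL,
    executive_summary text,
    lever_set         jsonb,
    embedding         vector(1536)
);

CREATE INDEX IF NOT EXISTS plans_embedding_idx
    ON plans USING ivfflat (embedding vector_cosine_ops);
"""

# Nearest-neighbour query (cosine distance) that the `plan_search` MCP tool
# from step 3 could wrap.
PLAN_SEARCH_SQL = """
SELECT run_id, domain, executive_summary
FROM plans
ORDER BY embedding <=> %(query_embedding)s
LIMIT %(k)s;
"""
```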

### 3. Routing trio: task complexity + cache-aware handoff + UX modes — `92`, `95`, `96`

**Why now.** The 26-Feb routing post-mortem (89) showed that the project pays Opus
prices for tasks that Haiku could handle. These three proposals are different angles
on the same problem and should be designed together, not separately.

**Next level.**
1. Use the complexity-assessment work (90, 91) to label every pipeline stage with a
complexity tier in `LLMConfig`.
2. Wire 96 (cache-aware handoff) into `LLMExecutor` so a stage that needs Sonnet for
its hard step doesn't blow the cache built up by Haiku in the easy steps.
3. Ship 95 (UX modes) last — it is a thin chooser on top of the routing primitive,
not a separate system.
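
A minimal sketch of the step-1 labelling, assuming `LLMConfig` can carry a per-stage tier. The enum, the tier assignments in the mapping, and the default-to-HIGH fallback are illustrative choices, not the agreed design.

```python
from enum import Enum


class ComplexityTier(Enum):
    LOW = "low"        # cheap, fast model is sufficient (Haiku-class)
    MEDIUM = "medium"
    HIGH = "high"      # frontier model required (Opus-class)


# Hypothetical per-stage labels, produced by the complexity-assessment work (90, 91).
STAGE_COMPLEXITY: dict[str, ComplexityTier] = {
    "deduplicate_levers": ComplexityTier.LOW,
    "identify_potential_levers": ComplexityTier.MEDIUM,
    "executive_summary": ComplexityTier.HIGH,
}


def model_for_stage(stage: str, tier_to_model: dict[ComplexityTier, str]) -> str:
    """Resolve a pipeline stage to a model id; unknown stages default to HIGH."""
    tier = STAGE_COMPLEXITY.get(stage, ComplexityTier.HIGH)
    return tier_to_model[tier]
```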

### 4. Autonomous prompt optimisation, fully closed-loop — `94`, `117`

**Why now.** The self-improve loop documented in `MEMORY.md` is already the project's
quality engine in practice (40 iterations on `identify_potential_levers`, 52 on
`deduplicate_levers`). The proposal-level next step is to remove the human-in-the-loop
verdict step and let the loop run unattended overnight.

**Next level.**
1. Replace the current "Claude reads the assessment and writes a verdict" step with
the ELO judge from #1 above. The judge is the only thing that needs to be trusted
to be hands-off.
2. Add a budget guard (cost ceiling per night) and a regression guard (don't promote
a prompt that loses on >X% of benchmark prompts).
3. Add a per-stage rotation so the loop doesn't only optimise one prompt for weeks.
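
A minimal sketch of the two guards in step 2, assuming the nightly runner can report spend and per-benchmark score deltas; the names and thresholds are placeholders.

```python
from dataclasses import dataclass


@dataclass
class NightlyGuards:
    cost_ceiling_usd: float     # budget guard: stop the loop once spend crosses this
    max_regression_rate: float  # regression guard: e.g. 0.10 = reject if >10% of prompts get worse


def within_budget(spent_usd: float, guards: NightlyGuards) -> bool:
    """Budget guard: keep iterating only while the nightly cost ceiling holds."""
    return spent_usd < guards.cost_ceiling_usd


def should_promote(per_prompt_deltas: list[float], guards: NightlyGuards) -> bool:
    """Regression guard: promote a prompt edit only if few benchmark prompts regressed."""
    if not per_prompt_deltas:
        return False
    regressions = sum(1 for delta in per_prompt_deltas if delta < 0)
    return regressions / len(per_prompt_deltas) <= guards.max_regression_rate
```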

### 5. Post-plan agent execution — `61`, `64`, `66`, `120`, `121`

**Why now.** Five proposals describe variations on "the plan should not be the end of
the pipeline; agents should pick up the plan and start doing the work." This is the
single biggest user-visible upgrade on the table, but the five proposals contradict
each other on scope. They need consolidation into one architecture before any code
gets written.

**Next level.**
1. Write a consolidating proposal (139?) that picks one execution model: Luigi-driven,
MCP-tool-driven, or Claude Agent SDK-driven. The Luigi-driven option dovetails
with 65 (git as state machine) and 101 (Luigi resume hooks).
2. Define exactly *which* plan artifacts become agent inputs (project plan? WBS? next
action list?) and what "done" means for an agent step.
3. Pilot with a single safe domain (e.g., literature-review plans) before expanding.

### 6. Adversarial red-team + critical-premises gate — `56`, `123`, `135`

**Why now.** PlanExe plans look authoritative whether or not they are correct. Three
proposals describe adversarial verification: 56 generates an attacker, 135 extracts
the load-bearing premises, 123 demands evidence calibration. Together they form a
"premortem-as-a-pipeline-stage" capability that would catch the most embarrassing
failure mode (confident, false plans).

**Next level.**
1. Implement 135 first: a structured extraction of the "if this premise is wrong, the
plan collapses" premises, emitted as a list of N premises with confidence scores
(`planexe/critical_premises/`); a sketch of the output shape follows this list.
2. Layer 56 on top: a red-team agent that attacks each high-load premise and produces
a counter-narrative.
3. Layer 123 last: an evidence-discipline scorer that grades how well the plan
defends each premise. Block low-scoring plans from finishing.
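
The sketch referenced in step 1: a hypothetical output record for `planexe/critical_premises/`, with the fields that steps 2 and 3 would later fill in. The field names are assumptions.

```python
from dataclasses import dataclass


@dataclass
class CriticalPremise:
    """Hypothetical output record for the step-1 extraction stage."""
    statement: str                        # e.g. "skilled carpentry is available locally at ~$X/day"
    load_bearing: bool                    # True if the plan collapses when this premise is wrong
    confidence: float                     # model-reported confidence in the premise, 0.0-1.0
    counter_narrative: str | None = None  # filled in by the red-team agent (step 2)
    evidence_score: float | None = None   # filled in by the evidence-discipline scorer (step 3)
```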

### 7. Fermi sanity gate + domain-aware normaliser — `88`, `107`

**Why now.** Numbers in PlanExe plans are currently un-audited. A user can get a
$2.4M budget that quietly assumes Nairobi carpenters earn US wages. 88 proposes a
sanity-check gate that runs Fermi estimates against extracted parameters; 107
proposes a domain-aware normaliser. They are the same idea at two layers.

**Next level.**
1. Build the parameter extractor first (proposal 137 is a useful upstream design).
2. Pipe extracted parameters through an order-of-magnitude check against a small
reference table (regional wages, material prices, conversion rates); a sketch of
the check follows this list. The initial table can be hard-coded; a future
iteration can pull from a service (105, 106).
3. When a parameter is >1 order of magnitude off, raise a `FermiViolation` that
forces a regeneration of the offending paragraph rather than a global rerun.
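
The check referenced in step 2, as a minimal sketch. The reference table, tolerance, and parameter naming are placeholders, and a real implementation would source the table from 105/106 rather than hard-coding it.

```python
import math


class FermiViolation(Exception):
    """Raised when an extracted parameter is implausibly far from its reference value."""


# Hypothetical seed table; keys are (parameter, region), values are reference magnitudes.
REFERENCE: dict[tuple[str, str], float] = {
    ("carpenter_daily_wage_usd", "nairobi"): 15.0,
}


def check_order_of_magnitude(name: str, region: str, value: float, tolerance: float = 1.0) -> None:
    """Raise FermiViolation when `value` is more than `tolerance` orders of magnitude off."""
    reference = REFERENCE.get((name, region))
    if reference is None or value <= 0:
        return  # nothing to compare against: skip rather than block the plan
    if abs(math.log10(value) - math.log10(reference)) > tolerance:
        raise FermiViolation(
            f"{name} in {region}: extracted {value}, reference ~{reference} "
            f"(more than {tolerance} order(s) of magnitude apart)"
        )
```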

### 8. Ensemble judge + parallel model racing — `ensemble-judge-refinement`, `parallel-racing-robustness`

**Why now.** The two unnumbered proposals are both about robustness on early-pipeline
tasks where a single model failure cascades downstream. They are complementary:
parallel racing fixes the *availability* side (pick whichever model returns first
with a valid answer), the ensemble judge fixes the *quality* side (when multiple
models return, pick the best). Code in `redline_gate.py` already gestures at this.

**Next level.**
1. Add `ParallelRacer` to `LLMExecutor` for stages tagged `racing-eligible`.
2. When >1 result comes back inside the deadline, hand them to the ELO judge from #1.
3. Log the loser results — they are training data for prompt optimisation (#4).
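
A minimal sketch of the racing primitive from step 1, assuming asynchronous provider calls. The function name and the string result type are placeholders, and the integration point inside `LLMExecutor` is left open.

```python
import asyncio
from typing import Awaitable, Callable


async def race_stage(
    calls: list[Callable[[], Awaitable[str]]],
    is_valid: Callable[[str], bool],
    deadline_s: float,
) -> list[str]:
    """Run all providers concurrently; return every valid result that beat the deadline."""
    tasks = [asyncio.create_task(call()) for call in calls]
    done, pending = await asyncio.wait(tasks, timeout=deadline_s)
    for task in pending:
        task.cancel()  # missed the deadline: drop out of the race
    valid = [
        task.result()
        for task in done
        if task.exception() is None and is_valid(task.result())
    ]
    if not valid:
        raise RuntimeError("no model returned a valid result before the deadline")
    # If len(valid) > 1, hand the set to the ELO judge (#1) and log the
    # non-winning results as prompt-optimisation data (step 3, feeding #4).
    return valid
```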

### 9. Finance triad — `33`, `34`, `35` (with `36`, `37`, `44` as follow-ups)

**Why now.** Proposals 33 (CBS), 34 (top-down), 35 (bottom-up reconciliation) form a
self-contained financial-modelling layer. They sit untouched while the project ships
narrower finance ideas (100, 105, 106). The triad would close the biggest credibility
gap in PlanExe outputs: budgets that are made of numbers, not vibes.

**Next level.**
1. Implement 33 (CBS) as a new pipeline stage that emits a structured cost tree.
2. Use the same stage to feed both 34 (top-down) and 35 (bottom-up) and reconcile
them in a single report section. The reconciliation delta *is* the credibility
signal.
3. Once the triad is in place, 36 (Monte Carlo) and 44 (audit pack) become small
add-ons rather than new modules.
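
A minimal sketch of the cost tree from step 1 and the reconciliation delta from step 2. The node shape and the relative-delta formula are assumptions about how the stage might emit its output.

```python
from dataclasses import dataclass, field


@dataclass
class CostNode:
    """One node of the cost breakdown structure (33); leaves carry amounts, groups carry children."""
    name: str
    amount: float = 0.0
    children: list["CostNode"] = field(default_factory=list)

    def total(self) -> float:
        return self.amount + sum(child.total() for child in self.children)


def reconciliation_delta(top_down_estimate: float, cbs_root: CostNode) -> float:
    """Relative gap between the top-down figure (34) and the bottom-up roll-up (35).
    Step 2 would surface this delta in the report as the credibility signal."""
    bottom_up = cbs_root.total()
    return (top_down_estimate - bottom_up) / max(bottom_up, 1e-9)
```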

### 10. Multi-stage expert verification — `21`, `22`, `27`

**Why now.** PlanExe currently invents experts in-prompt. These three proposals
describe a coherent flow: discover the right expert profile (21), run multi-stage
verification with that expert (22), and triangulate across angles (27). It would
turn "Pretend Expert" into a structured second opinion.

**Next level.**
1. Start with 21 in narrow form: a deterministic expert-profile generator that emits
2–3 personas with credentials and known biases for any plan domain.
2. Use those personas inside the existing critique/premortem stages instead of the
currently anonymous expert voice. This is a one-week change and immediately
raises plan quality.
3. Defer 27 (multi-angle) until steps 1 and 2 have run on the benchmark suite for a
month and produced ELO data.
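
A minimal sketch of the step-1 contract: one record per persona, with biases declared up front so the critique stage can discount them. The field names and the example personas are purely illustrative; the real generator would be LLM-backed.

```python
from dataclasses import dataclass


@dataclass
class ExpertPersona:
    """One generated expert profile for use inside critique/premortem stages."""
    name: str
    credentials: str
    known_biases: str


# Illustrative output for a small-construction plan domain (hypothetical values).
EXAMPLE_PERSONAS = [
    ExpertPersona(
        name="Municipal permitting officer",
        credentials="15 years reviewing small-commercial building applications",
        known_biases="Overweights procedural risk; underweights schedule pressure",
    ),
    ExpertPersona(
        name="Chartered quantity surveyor",
        credentials="Regional cost databases, East Africa",
        known_biases="Anchors on historical unit rates",
    ),
]
```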

---

## Five novel ideas

These are not in the existing proposal set.

### N1. Plan lineage and time-travel diff

Every plan run produces a snapshot today, but the snapshots aren't related. Treat
each rerun as a commit on the same plan: store the parent run id, the prompt diff,
and the per-stage output diff. The user can then ask "what changed when I added
'in Nairobi' to the prompt?" and see exactly which downstream sections moved.

This pairs with the prompt optimiser (117, 94) — it gives every prompt edit a
visible diff in plan-space, not just in metric-space — and with semantic search
(#2) which provides the join key.
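
A minimal sketch of the lineage record, assuming each rerun stores a pointer to its parent run; the record name and fields are placeholders for whatever the plan store from #2 actually uses.

```python
from dataclasses import dataclass


@dataclass
class PlanRunRecord:
    """Hypothetical lineage record: one row per pipeline run, keyed to its parent run."""
    run_id: str
    parent_run_id: str | None           # None for the first run of a plan
    prompt_diff: str                    # unified diff of the initial prompt vs. the parent run
    stage_output_diffs: dict[str, str]  # stage name -> unified diff vs. the parent run's output
```

"What changed when I added 'in Nairobi'?" then reduces to reading `stage_output_diffs` on the child run.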

### N2. Counterfactual plan forking

Once plans are stored as a graph (N1 + #2), let the user fork at any node and
re-run only the downstream tasks under a "what-if" assumption ("what if the budget
is half?", "what if the deadline slips 6 months?"). Today the user has to re-run
the entire pipeline. Branching reruns turn PlanExe into a planning sandbox instead
of a one-shot generator. Ties into proposal 45 (counterfactual scenario explorer)
but reframes it as a graph operation rather than a separate engine.

### N3. Replay Lab — quality drift across model generations

Capture the full LLM I/O for every plan run (already partially possible via 113).
Build a tool that takes any historical plan and re-runs *just the LLM calls*
against today's models, using yesterday's prompts. The output is a curve of
"quality vs. model generation" for every plan in the archive — the dataset that
would let PlanExe answer "is Sonnet 4.7 actually better than 4.6 for our pipeline?"
without manual labelling. Also unlocks "we shipped a regression in stage X three
weeks ago" detection.

### N4. Risk curriculum — actionable next-24h checklist

Most users close the plan tab and never act on it. Add a short closing section
called "Before you do anything else" that surfaces the top three most fragile
premises (from #6) as questions the user should answer in 24 hours, with the
exact source the user could check (e.g., "Confirm with city hall: is permit X
required for buildings under 200 m²?"). This converts a long plan into one
short feedback loop. The scope is small; the impact on user follow-through is plausibly
large.

### N5. Federated lever library

Levers (the dedup pipeline, proposal 122) are extracted from each plan and then
discarded after that run. Opt users into a shared library keyed by domain +
scale, so the dedup stage can borrow patterns: "for renewable-energy plans at
city scale, the canonical lever set is roughly these eight." This compounds with
every plan written and is a moat the project doesn't currently have. Privacy
boundary: only lever taxonomies and abstract patterns are shared, never the
user's prompt or specific numbers.

---

## Conclusion

The proposal folder reads like an idea graveyard partly because nobody is
allowed to bury anything. The first practical action this audit recommends is
operational, not technical: move the 16 implemented proposals into a `done/`
subfolder and the ~14 stale ones into an `archive/` subfolder, so the active
working set is the ~80 proposals that still represent genuine choices.

Once the noise is gone, the signal is sharp. Five clusters dominate the
"untouched but high-relevance" set:

1. **Quality measurement** — ELO judge as the underlying scalar (#1), used by
the prompt optimiser (#4), the ensemble judge (#8), and the routing tier (#3).
2. **Verification** — adversarial red-team and critical-premise gates (#6),
Fermi sanity check (#7), and structured expert personas (#10).
3. **Agentic execution** — the five overlapping post-plan-agents proposals
collapsed into one design (#5).
4. **Knowledge substrate** — the semantic plan graph (#2), which is the
precondition for novel ideas N1–N5.
5. **Financial credibility** — the CBS / top-down / bottom-up triad (#9),
the only cluster that addresses the "PlanExe budgets aren't auditable"
weakness.

The two recurring failure modes in the existing proposals are: ideas that
arrived before their preconditions (most plugin proposals; most agentic
proposals before there was an executor), and ideas that drift in scope until
they need their own quarter (the bid-factory / portfolio-VC framing). The
remedy on both sides is the same — finish the substrate work (ELO, semantic
graph, post-plan agent architecture) before opening the next batch.

The five novel ideas are deliberately built on top of that substrate rather
than alongside it, so any investment in #1, #2, and #5 above pays for N1–N5
as well.