PlanExeOrg · neoneye · May 21, 2026 · May 21, 2026
diff --git a/experiments/napkin_math/docs/20260522_plan.md b/experiments/napkin_math/docs/20260522_plan.md
@@ -1,6 +1,6 @@
 # 2026-05-22 — Plan: addressing Gemini's Monte Carlo methodology critique
 
-*Originally drafted 2026-05-20; renamed and refreshed 2026-05-22 to track the post-#753 ship-set.*
+*Originally drafted 2026-05-20; renamed and refreshed 2026-05-22 to track the post-#753 ship-set. Reset again after the v58 corpus review so the next work is finite and regression-gated.*
 
 ## Context
 
@@ -72,6 +72,105 @@ decomposition problem: the plan names the deterministic equation
 explicitly, and the extract should preserve it. The two themes are
 separated below.
 
+## 2026-05-22 reset after the v58 corpus review
+
+The original Gemini critique was small: fix disconnected aggregates,
+target-centered bounds, assumed independence, bounded tails, and
+hallucinated citations. The work then expanded because each prompt fix
+exposed a new class of uncaught regression. That is useful evidence, but
+the plan has drifted from "close Gemini's critique" to "keep improving
+napkin_math until it feels done," which is not a usable stop condition.
+
+The immediate problem is not that Phase 5 / 8 / 9 are unfinished. The
+larger problem is that the corpus can pass validation while silently
+dropping prior signals and threshold tests. Until that is fixed, more
+methodology work is hard to evaluate because two snapshots are not
+comparable.
+
+### What went wrong
+
+- **Prompt-side rules landed faster than orchestration and gates.**
+  PR #753 tells extract what to do when a prior baseline is supplied,
+  but the v58 extract inputs did not contain a `# Prior Signal Ledger`.
+  The discipline existed; the run did not use it.
+- **The source-preservation audit is implemented but not operational.**
+  It can classify prior-vs-current losses, but it is advisory, not wired
+  into the pipeline, and not used as an acceptance gate.
+- **Monte Carlo threshold selection is still informal.** The pipeline
+  can validate and run even when many executable calculation outputs are
+  not threshold-tested. That makes verdict movement difficult to trust.
+- **Several "DONE" labels were too optimistic.** The code or prompt
+  piece may be merged, but the user-visible protection is incomplete
+  until the pipeline supplies the needed context and fails on regressions.
+
+### Observed v49 -> v58 regressions
+
+Every plan directory in the compared snapshots points at the same
+source-zip commit ids. The differences below are therefore
+pipeline/output changes, not input-document changes.
+
+Source-preservation audit, with v49 as prior and v58 as current:
+
+| Class | Count |
+|---|---:|
+| prior signals | 301 |
+| preserved by id | 42 |
+| preserved by output_name | 1 |
+| preserved as formula dependency | 0 |
+| explained_drop | 0 |
+| likely_renamed | 134 |
+| absent_unexplained | 124 |
+
+This is the central regression signal. Even if many `likely_renamed`
+items are legitimate, 124 unexplained absences and 0 explained drops
+show that Proposal 141 has not yet protected the run.
+
+Threshold coverage also regressed:
+
+| Snapshot | Monte Carlo thresholds |
+|---|---:|
+| 46 | 61 |
+| 49 | 70 |
+| 58 | 44 |
+
+The v58 files still generally contain five recommended calculations per
+plan, but many outputs are intermediate calculations rather than gates.
+Across the corpus, calculated outputs stayed roughly flat while
+threshold-tested outputs fell:
+
+| Snapshot | Calculated outputs | Threshold-tested outputs | Untested outputs |
+|---|---:|---:|---:|
+| 49 | 83 | 70 | 12 |
+| 58 | 82 | 44 | 37 |
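
The drop is easier to compare as a ratio. A minimal sketch in Python that uses only the counts from the table above; the `coverage` helper is a hypothetical name for illustration, not part of the pipeline:

```python
# Counts copied from the table above: snapshot -> (calculated, threshold-tested).
SNAPSHOTS = {49: (83, 70), 58: (82, 44)}


def coverage(calculated: int, tested: int) -> float:
    """Share of calculated outputs that have a Monte Carlo threshold."""
    return tested / calculated


for snapshot, (calculated, tested) in SNAPSHOTS.items():
    print(f"v{snapshot}: {coverage(calculated, tested):.1%} threshold-tested")
```

With these counts, coverage falls from roughly 84% in v49 to roughly 54% in v58.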
+
+Examples seen in v58: `4DWW_India` computes `sme_share_margin` and
+`optin_margin` in `derived_questions`, but only two thresholds are run;
+`crate_recovery_campaign` leaves `volume_target_surplus`,
+`foundation_donation_margin_dkk`, and `recovery_rate_margin` untested;
+`datacenter_in_france` leaves intermediate cost and FX outputs untested.
+Some untested outputs are valid intermediates, but the current artifact
+does not make that distinction explicit enough for automation.
+
+Warnings improved in one narrow sense: snapshot 49 had 9 Monte Carlo
+warnings for functions not found in `calculations.py`; snapshot 58 has 4
+warnings, all about declared correlations being ignored until the
+Gaussian-copula sampler exists. That is progress on missing functions,
+but each remaining warning still marks a plan whose tail risk is understated.
+
+### New stop condition
+
+Do not judge the next run by "all 14 validate" or "all Monte Carlo runs
+complete." The next accepted snapshot needs at least these properties:
+
+- Prior-baseline context is actually supplied to extract.
+- Source-preservation audit runs v(N-1) -> v(N) and reports no large
+  unexplained signal loss.
+- Threshold count and gate coverage cannot collapse without an explicit
+  explanation.
+- Monte Carlo warnings are reported as acceptance evidence, not buried
+  in per-plan logs.
+- Verdict changes are interpreted only after the above checks pass.
+
 ## Status as of 2026-05-22
 
 ### Landed on main
@@ -246,10 +345,10 @@ too:
 | Phase | Skill / module | Status |
 |---|---|---|
 | 1 | `compress_report_section.py` | **DONE on main via PR #737 + PR #743 + PR #744 + PR #750** (R2.3 numeric_values, R2.3 missing_data, R2.5 gates_and_thresholds, OPTIMIZE_INSTRUCTIONS banner; per-bucket emission-layer second pass for run-to-run variance; paraphrase-tolerant quote verification on the ranking layer; cross-bucket promoter for gate-shaped tripwires misfiled under `risks_and_shocks`). |
-| 2 | `extract-parameters-from-{full,digest}` | **DONE for prompt-side directives on main via PR #740 + PR #753** — threshold-pairing on `from-digest` shipped in PR #737; source-arithmetic preservation (Patterns 1/2/3 for R1.1, R2.3, R2.4), threshold-pairing parity into `from-full`, aggregate-sum tightening, and source_text truncation discipline shipped in PR #740; prior-signal ledger discipline (PR #753) added so the LLM records every prior id/output_name absence in `dropped_signals` when a prior baseline is in context. Behavioural validation on a different LLM and orchestrator-side prior-baseline injection both remain follow-ups. |
+| 2 | `extract-parameters-from-{full,digest}` | **PROMPT DIRECTIVES LANDED, NOT FULLY OPERATIONAL** via PR #740 + PR #753 — threshold-pairing on `from-digest` shipped in PR #737; source-arithmetic preservation (Patterns 1/2/3 for R1.1, R2.3, R2.4), threshold-pairing parity into `from-full`, aggregate-sum tightening, and source_text truncation discipline shipped in PR #740; prior-signal ledger discipline (PR #753) added so the LLM records every prior id/output_name absence in `dropped_signals` when a prior baseline is in context. v58 showed the missing piece: the run did not actually inject the prior ledger into extract, so the prompt rule had no input. Behavioural validation on a different LLM and orchestrator-side prior-baseline injection both remain follow-ups. |
 | 3 | `validate-parameters` | **DONE on main via PR #746 + PR #752** (R1.1 `aggregate_not_bounded`, R2.5 `requirement_has_margin`, and proposal-141 `dropped_signals_schema` structural checks; total now 19). The `sampling_discipline` enum expansion bullet that the original plan tucked here actually lives in `run_monte_carlo.py` and was routed to Phase 4 (PR #747) to avoid a silent shim. |
 | 4 | `generate-bounds` | **DONE on main via PR #747 + PR #749** — runtime, schema readiness, and the four prompt-side LLM rules (R1.2 base-anchoring source-tag asymmetry, R1.5 citation context-leak self-audit, R2.2 megaproject CAPEX lognormal default (criterion-based), R1.3 correlations selection rules). `lognormal` / `pert` remain schema-reserved with `NotImplementedError` at sample time; the Gaussian-copula sampler ships in Phase 8. `pert` carries no selection rule yet. |
-| 4.5 | source-preservation audit | **DONE on main via PR #751 + PR #752 + PR #753** (proposal 141). Standalone Python `audit_source_preservation.py` plus `dropped_signals` schema in extract + validator + audit. Advisory only; Fork A (source-digest regex scan) and orchestrator-side prior-baseline injection are explicit follow-ups. |
+| 4.5 | source-preservation audit | **IMPLEMENTED, NOT AN ACCEPTANCE GATE** via PR #751 + PR #752 + PR #753 (proposal 141). Standalone Python `audit_source_preservation.py` plus `dropped_signals` schema in extract + validator + audit. v58 exposed the gap: advisory output alone did not stop 124 prior signals from being classified `absent_unexplained`. Still pending: strict-mode policy, run wiring, Fork A (source-digest regex scan), and orchestrator-side prior-baseline injection. |
 | 5 | `verify-bounds-citations` (new) | not started |
 | 6 | `generate-calculations` | no change required per the original plan |
 | 7 | `run-scenarios` | not started |
@@ -276,6 +375,12 @@ same-LLM regeneration cannot distinguish "the pipeline is more honest"
 from "the LLM that wrote the rules is applying them consistently."
 Different-LLM validation remains an open follow-up.
 
+After the v58-vs-v49 audit, this snapshot should be treated as a
+diagnostic artifact, not the new accepted baseline. The validator and
+runner succeeded, but the corpus-level checks above show major signal
+and threshold coverage loss. The next accepted snapshot needs to prove
+that those losses are explained or fixed.
+
 Two recurring sub-agent observations during the run:
 
 - **Megaproject lognormal default fires.** Multiple agents selected
@@ -291,58 +396,55 @@ Two recurring sub-agent observations during the run:
   tail risk is structurally understated for those plans until the
   copula sampler ships.
 
-### Next likely move
-
-With proposal 141 (PR #751 + #752 + #753) shipped, the immediate
-backlog is reordered. The remaining work is roughly:
-
-1. **Phase 5 — `verify-bounds-citations`** (new deterministic step,
-   R1.5 backstop). Parse Risk / Issue / Decision N tokens from each
-   rationale string in bounds.json; fetch the corresponding section
-   from the source report; compare topical keywords. Fail-loud /
-   re-run-`generate-bounds` posture. The R1.5 self-audit prompt rule
-   that landed in PR #749 reduces the citation context-leak surface,
-   but a deterministic post-processor is the right guardrail since
-   the LLM is asked to self-audit its own output. This is now the
-   single largest unrunshipped guardrail in the pipeline.
-2. **Phase 8 samplers — Gaussian copula + lognormal/PERT** in
-   `run_monte_carlo.py`. The schema and the loud-failure posture
-   landed in PR #747/#749; the v58 run produced concrete cases where
-   both gaps now bite (megaproject CAPEX fallback to triangular;
-   declared correlations warning rather than sampling). Order:
-   lognormal first (single-variable, no copula dependency), then
-   PERT, then Gaussian-copula for declared groups. Each ships as
-   its own PR.
-3. **Phase 9 — composite-band cap on `summarize-assessment`** (R2.1
-   "megaproject illusion"). The cap rule has a calibration
-   dependency on the actual spread of `unmodelled_gates.length`
-   across the v58 corpus; the empirical snapshot above provides
-   that calibration surface for the first time. Suggested next move:
-   tabulate the v58 corpus's `unmodelled_gates.length` distribution
-   and pick K thresholds (placeholders 1–2 / 3–4 / 5+) against the
-   observed spread.
-4. **Different-LLM behavioural validation** of the rules now on main.
-   A Self-Improve run with the default napkin_math LLM (Gemini Flash
-   Lite) against the v58 digests would close the same-LLM same-
-   session confound. Especially load-bearing for the PR #749 rules
-   (worked example, megaproject criterion, correlations selection)
-   and the PR #753 prior-signal ledger rule. Validation of prompt
-   generality, not a quality fix.
-5. **Proposal 141 follow-ups**: Fork A (source-digest regex scan
-   against the current artifact, distinct from Fork B's prior-vs-
-   current diff) and the orchestrator-side wiring that injects the
-   prior baseline into the extract call (without it the LLM cannot
-   actually emit `prior_baseline`-origin drops, so PR #753's
-   discipline runs without input).
-6. **Prompt-hygiene pass** for the remaining domain-specific
-   examples (e.g. `european_prepper_active_buyers`) in either
-   extract prompt. Worthwhile and small, but not load-bearing for
-   the currently observed napkin_math failures.
-
-These are separate PRs. Each ships independently; bundling
-verification, sampler implementation, composite-band cap, behavioural
-validation, audit follow-ups, or prompt hygiene into one PR would
-obscure which piece moved which metric.
+### Path forward: stop silent regressions first
+
+The next work should be reordered around comparability. Do not start
+Phase 5 / 8 / 9 as the next PR unless the regression gates below are
+already in place; otherwise the next corpus run can improve one
+methodology detail while silently losing other gates or signals.
+
+1. **Wire prior-baseline context into extract.** Update the napkin_math
+   run path so Stage 0/Extract can receive the prior snapshot's
+   `parameters.json` and include the `# Prior Signal Ledger` context
+   that PR #753 expects. Update the run skill/docs at the same time so
+   future manual runs use the same command shape. Acceptance evidence:
+   v(N) extract inputs visibly include prior-ledger context.
+2. **Make source-preservation audit strict enough to block regressions.**
+   Add a strict mode or wrapper that compares prior -> current and exits
+   non-zero on unexplained loss beyond an explicit threshold. The first
+   strict policy can be conservative: fail on any prior threshold output
+   disappearing without `explained_drop`, and fail on large aggregate
+   `absent_unexplained` counts. Acceptance evidence: v49 -> v58 fails
+   under the new policy, and the next regenerated snapshot either
+   preserves, explains, or intentionally waives the losses.
+3. **Make Monte Carlo gate selection deterministic.** Replace the
+   hand-written/heuristic threshold list with either an explicit schema
+   on calculations (`is_gate`, `threshold_operator`, `threshold_value`,
+   `threshold_basis`, `intermediate`), or a deterministic builder that
+   selects all positive-pass outputs unless explicitly marked
+   intermediate. The schema path is better because it forces extract to
+   say whether an output is a gate or a helper. Acceptance evidence:
+   threshold count cannot collapse from 70 to 44 without named
+   intermediate/drop explanations.
+4. **Add a corpus regression report for the 14-plan baseline.** One
+   command should summarize threshold counts by plan, prior-signal
+   preservation classes, untested executable outputs, Monte Carlo
+   warnings, verdict movement, base/target failures, and unmodelled-gate
+   counts. This report becomes the PR review surface; per-plan logs are
+   too easy to miss.
+5. **Regenerate a new snapshot only after items 1-4.** Treat v58 as the
+   failing diagnostic run. The next accepted snapshot should have the
+   same source zips, the new gates enabled, and an explicit
+   before/after report against v49 and v58.
+6. **Resume the original Gemini work in finite order.** Once the corpus
+   is regression-gated, continue with the bounded methodology backlog:
+   citation verifier (R1.5), lognormal sampler, PERT sampler,
+   Gaussian-copula sampler, composite unmodelled-gate cap (R2.1),
+   different-LLM validation, then prompt hygiene.
+
+Each item above should ship independently. The goal is not to finish all
+possible napkin_math improvements; it is to make the pipeline
+falsifiable enough that future improvements can be trusted.
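
The schema path in item 3 can be sketched as follows. The field names (`is_gate`, `threshold_operator`, `threshold_value`, `threshold_basis`, `intermediate`) come from the plan; the types, the `CalculationOutput` class, the `build_thresholds` function, the `>= 0` threshold, and the `fx_adjusted_cost` output name are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CalculationOutput:
    """One calculated output with explicit gate metadata (assumed types)."""
    name: str
    is_gate: bool = False
    intermediate: bool = False
    threshold_operator: Optional[str] = None  # e.g. ">="
    threshold_value: Optional[float] = None
    threshold_basis: Optional[str] = None     # where the threshold comes from


def build_thresholds(outputs):
    """Select thresholds deterministically; collect every unclassified output."""
    thresholds, problems = [], []
    for out in outputs:
        if out.is_gate and out.intermediate:
            problems.append(f"{out.name}: marked both gate and intermediate")
        elif out.is_gate:
            if out.threshold_operator is None or out.threshold_value is None:
                problems.append(f"{out.name}: gate without operator/value")
            else:
                thresholds.append(
                    (out.name, out.threshold_operator, out.threshold_value))
        elif not out.intermediate:
            # Neither gate nor helper: the ambiguity seen in v58.
            problems.append(f"{out.name}: neither gate nor intermediate")
    return thresholds, problems


outputs = [
    CalculationOutput("recovery_rate_margin", is_gate=True,
                      threshold_operator=">=", threshold_value=0.0,
                      threshold_basis="illustrative"),
    CalculationOutput("volume_target_surplus"),                # unclassified
    CalculationOutput("fx_adjusted_cost", intermediate=True),  # hypothetical name
]
thresholds, problems = build_thresholds(outputs)
print(thresholds)
print(problems)
```

Because every output must be either a gate or an explicit intermediate, an untested output shows up in `problems` instead of silently lowering the threshold count.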
 
 ## Per-theme mapping