Merged
212 changes: 157 additions & 55 deletions experiments/napkin_math/docs/20260522_plan.md
@@ -1,6 +1,6 @@
# 2026-05-22 — Plan: addressing Gemini's Monte Carlo methodology critique

*Originally drafted 2026-05-20; renamed and refreshed 2026-05-22 to track the post-#753 ship-set.*
*Originally drafted 2026-05-20; renamed and refreshed 2026-05-22 to track the post-#753 ship-set. Reset again after the v58 corpus review so the next work is finite and regression-gated.*

## Context

@@ -72,6 +72,105 @@ decomposition problem: the plan names the deterministic equation
explicitly, and the extract should preserve it. The two themes are
separated below.

## 2026-05-22 reset after the v58 corpus review

The original Gemini critique was small: fix disconnected aggregates,
target-centered bounds, independence, bounded tails, and hallucinated
citations. The work then expanded because each prompt fix exposed a new
class of uncaught regression. That is useful evidence, but it also means
the plan has drifted from "close Gemini's critique" into "keep improving
napkin_math until it feels done." That is not a usable stop condition.

The immediate problem is not that Phase 5 / 8 / 9 are unfinished. The
larger problem is that the corpus can pass validation while silently
dropping prior signals and threshold tests. Until that is fixed, more
methodology work is hard to evaluate because two snapshots are not
comparable.

### What went wrong

- **Prompt-side rules landed faster than orchestration and gates.**
PR #753 tells extract what to do when a prior baseline is supplied,
but the v58 extract inputs did not contain a `# Prior Signal Ledger`.
The discipline existed; the run did not use it.
- **The source-preservation audit is implemented but not operational.**
It can classify prior-vs-current losses, but it is advisory, not wired
into the pipeline, and not used as an acceptance gate.
- **Monte Carlo threshold selection is still informal.** The pipeline
can validate and run even when many executable calculation outputs are
not threshold-tested. That makes verdict movement difficult to trust.
- **Several "DONE" labels were too optimistic.** The code or prompt
piece may be merged, but the user-visible protection is incomplete
until the pipeline supplies the needed context and fails on regressions.

### Observed v49 -> v58 regressions

All compared snapshot plan directories point at the same source zip
commit ids. Differences below are therefore pipeline/output changes, not
input-document changes.

Source-preservation audit, v49 (prior) to v58 (current):

| Class | Count |
|---|---:|
| prior signals | 301 |
| preserved by id | 42 |
| preserved by output_name | 1 |
| preserved as formula dependency | 0 |
| explained_drop | 0 |
| likely_renamed | 134 |
| absent_unexplained | 124 |

This is the central regression signal. Even if many `likely_renamed`
items are legitimate, 124 unexplained absences and 0 explained drops
show that Proposal 141 has not yet protected the run.
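
The arithmetic behind that reading can be checked with a minimal sketch. The counts are copied from the table above; the dictionary layout and the `share` helper are illustrative assumptions, not the output format of `audit_source_preservation.py`.

```python
# Counts copied from the v49 -> v58 audit table above.
# The dict layout is an assumption made for this sketch only.
audit_counts = {
    "preserved by id": 42,
    "preserved by output_name": 1,
    "preserved as formula dependency": 0,
    "explained_drop": 0,
    "likely_renamed": 134,
    "absent_unexplained": 124,
}
prior_signals = 301

# Every prior signal should land in exactly one class.
assert sum(audit_counts.values()) == prior_signals


def share(count: int, total: int = prior_signals) -> float:
    """Return count as a percentage of total, rounded to one decimal."""
    return round(100 * count / total, 1)


preserved = (
    audit_counts["preserved by id"]
    + audit_counts["preserved by output_name"]
    + audit_counts["preserved as formula dependency"]
)
preserved_share = share(preserved)
renamed_share = share(audit_counts["likely_renamed"])
unexplained_share = share(audit_counts["absent_unexplained"])

print(f"preserved:          {preserved_share}%")
print(f"likely_renamed:     {renamed_share}%")
print(f"absent_unexplained: {unexplained_share}%")
```

Under these counts roughly 14% of prior signals are preserved outright, about 45% are probable renames, and about 41% are absent with no explanation recorded.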

Threshold coverage also regressed:

| Snapshot | Monte Carlo thresholds |
|---|---:|
| 46 | 61 |
| 49 | 70 |
| 58 | 44 |

The v58 files still generally contain five recommended calculations per
plan, but many outputs are intermediate calculations rather than gates.
Across the corpus, calculated outputs stayed roughly flat while
threshold-tested outputs fell:

| Snapshot | Calculated outputs | Threshold-tested outputs | Untested outputs |
|---|---:|---:|---:|
| 49 | 83 | 70 | 12 |
| 58 | 82 | 44 | 37 |

Examples seen in v58: `4DWW_India` computes `sme_share_margin` and
`optin_margin` in `derived_questions`, but only two thresholds are run;
`crate_recovery_campaign` leaves `volume_target_surplus`,
`foundation_donation_margin_dkk`, and `recovery_rate_margin` untested;
`datacenter_in_france` leaves intermediate cost and FX outputs untested.
Some untested outputs are valid intermediates, but the current artifact
does not make that distinction explicit enough for automation.
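
The coverage drop can be expressed as a ratio of threshold-tested to calculated outputs. The sketch below uses only those two columns from the table above; the `Coverage` class is a hypothetical helper, and the reported "Untested outputs" column is deliberately not recomputed, since it differs by one from the plain difference in both rows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Coverage:
    """Threshold coverage for one snapshot (hypothetical helper type)."""

    snapshot: int
    calculated: int
    threshold_tested: int

    @property
    def ratio(self) -> float:
        # Fraction of calculated outputs that have a Monte Carlo threshold.
        return self.threshold_tested / self.calculated


v49 = Coverage(snapshot=49, calculated=83, threshold_tested=70)
v58 = Coverage(snapshot=58, calculated=82, threshold_tested=44)

v49_pct = round(100 * v49.ratio, 1)
v58_pct = round(100 * v58.ratio, 1)
drop_points = round(100 * (v49.ratio - v58.ratio), 1)

print(f"v49 coverage: {v49_pct}%")
print(f"v58 coverage: {v58_pct}%")
print(f"drop: {drop_points} percentage points")
```

On these figures coverage falls from about 84% to about 54% while the number of calculated outputs barely moves.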

Warnings improved in one narrow sense: snapshot 49 had 9 Monte Carlo
warnings for functions not found in `calculations.py`; snapshot 58 has 4
warnings, all about declared correlations being ignored until the
Gaussian-copula sampler exists. That is progress on missing functions,
but the remaining warnings still signal real understatement of tail risk
in the model.

### New stop condition

Do not judge the next run by "all 14 validate" or "all Monte Carlo runs
complete." The next accepted snapshot needs at least these properties:

- Prior-baseline context is actually supplied to extract.
- Source-preservation audit runs v(N-1) -> v(N) and reports no large
unexplained signal loss.
- Threshold count and gate coverage cannot collapse without an explicit
explanation.
- Monte Carlo warnings are reported as acceptance evidence, not buried
in per-plan logs.
- Verdict changes are interpreted only after the above checks pass.
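
One way to read the list above is as a single acceptance check that returns named failures instead of a pass/fail flag. This is a sketch under stated assumptions: `SnapshotReport`, its field names, and the `max_unexplained` default of 10 are all hypothetical, and nothing here reflects an existing napkin_math interface.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SnapshotReport:
    """Corpus-level facts about one snapshot (all field names hypothetical)."""

    prior_ledger_supplied: bool      # did extract inputs include the prior ledger?
    absent_unexplained: int          # from the source-preservation audit
    prior_threshold_count: int       # Monte Carlo thresholds in v(N-1)
    threshold_count: int             # Monte Carlo thresholds in v(N)
    explained_threshold_drops: int   # threshold drops with a named explanation
    mc_warning_count: Optional[int]  # None means warnings were not surfaced


def acceptance_failures(report: SnapshotReport, max_unexplained: int = 10) -> List[str]:
    """Return the reasons a snapshot cannot be accepted (empty list = accept)."""
    failures = []
    if not report.prior_ledger_supplied:
        failures.append("prior-baseline context was not supplied to extract")
    if report.absent_unexplained > max_unexplained:
        failures.append(f"{report.absent_unexplained} prior signals absent without explanation")
    unexplained_loss = (
        report.prior_threshold_count
        - report.threshold_count
        - report.explained_threshold_drops
    )
    if unexplained_loss > 0:
        failures.append(f"{unexplained_loss} thresholds lost without explanation")
    if report.mc_warning_count is None:
        failures.append("Monte Carlo warnings were not surfaced in the report")
    return failures


# v58 figures from this document; zero explained threshold drops is an assumption.
v58 = SnapshotReport(
    prior_ledger_supplied=False,
    absent_unexplained=124,
    prior_threshold_count=70,
    threshold_count=44,
    explained_threshold_drops=0,
    mc_warning_count=4,
)
failures = acceptance_failures(v58)
for reason in failures:
    print("FAIL:", reason)
```

Verdict changes would be read only when the returned list is empty, which matches the ordering in the last bullet.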

## Status as of 2026-05-22

### Landed on main
@@ -246,10 +345,10 @@ too:
| Phase | Skill / module | Status |
|---|---|---|
| 1 | `compress_report_section.py` | **DONE on main via PR #737 + PR #743 + PR #744 + PR #750** (R2.3 numeric_values, R2.3 missing_data, R2.5 gates_and_thresholds, OPTIMIZE_INSTRUCTIONS banner; per-bucket emission-layer second pass for run-to-run variance; paraphrase-tolerant quote verification on the ranking layer; cross-bucket promoter for gate-shaped tripwires misfiled under `risks_and_shocks`). |
| 2 | `extract-parameters-from-{full,digest}` | **DONE for prompt-side directives on main via PR #740 + PR #753** — threshold-pairing on `from-digest` shipped in PR #737; source-arithmetic preservation (Patterns 1/2/3 for R1.1, R2.3, R2.4), threshold-pairing parity into `from-full`, aggregate-sum tightening, and source_text truncation discipline shipped in PR #740; prior-signal ledger discipline (PR #753) added so the LLM records every prior id/output_name absence in `dropped_signals` when a prior baseline is in context. Behavioural validation on a different LLM and orchestrator-side prior-baseline injection both remain follow-ups. |
| 2 | `extract-parameters-from-{full,digest}` | **PROMPT DIRECTIVES LANDED, NOT FULLY OPERATIONAL** via PR #740 + PR #753 — threshold-pairing on `from-digest` shipped in PR #737; source-arithmetic preservation (Patterns 1/2/3 for R1.1, R2.3, R2.4), threshold-pairing parity into `from-full`, aggregate-sum tightening, and source_text truncation discipline shipped in PR #740; prior-signal ledger discipline (PR #753) added so the LLM records every prior id/output_name absence in `dropped_signals` when a prior baseline is in context. v58 showed the missing piece: the run did not actually inject the prior ledger into extract, so the prompt rule had no input. Behavioural validation on a different LLM and orchestrator-side prior-baseline injection both remain follow-ups. |
| 3 | `validate-parameters` | **DONE on main via PR #746 + PR #752** (R1.1 `aggregate_not_bounded`, R2.5 `requirement_has_margin`, and proposal-141 `dropped_signals_schema` structural checks; total now 19). The `sampling_discipline` enum expansion bullet that the original plan tucked here actually lives in `run_monte_carlo.py` and was routed to Phase 4 (PR #747) to avoid a silent shim. |
| 4 | `generate-bounds` | **DONE on main via PR #747 + PR #749** — runtime, schema readiness, and the four prompt-side LLM rules (R1.2 base-anchoring source-tag asymmetry, R1.5 citation context-leak self-audit, R2.2 megaproject CAPEX lognormal default (criterion-based), R1.3 correlations selection rules). `lognormal` / `pert` remain schema-reserved with `NotImplementedError` at sample time; the Gaussian-copula sampler ships in Phase 8. `pert` carries no selection rule yet. |
| 4.5 | source-preservation audit | **DONE on main via PR #751 + PR #752 + PR #753** (proposal 141). Standalone Python `audit_source_preservation.py` plus `dropped_signals` schema in extract + validator + audit. Advisory only; Fork A (source-digest regex scan) and orchestrator-side prior-baseline injection are explicit follow-ups. |
| 4.5 | source-preservation audit | **IMPLEMENTED, NOT AN ACCEPTANCE GATE** via PR #751 + PR #752 + PR #753 (proposal 141). Standalone Python `audit_source_preservation.py` plus `dropped_signals` schema in extract + validator + audit. v58 exposed the gap: advisory output alone did not stop 124 prior signals from becoming absent/unexplained. Still pending: strict-mode policy, run wiring, Fork A (source-digest regex scan), and orchestrator-side prior-baseline injection. |
| 5 | `verify-bounds-citations` (new) | not started |
| 6 | `generate-calculations` | no change required per the original plan |
| 7 | `run-scenarios` | not started |
@@ -276,6 +375,12 @@ same-LLM regeneration cannot distinguish "the pipeline is more honest"
from "the LLM that wrote the rules is applying them consistently."
Different-LLM validation remains an open follow-up.

After the v58-vs-v49 audit, this snapshot should be treated as a
diagnostic artifact, not the new accepted baseline. The validator and
runner succeeded, but the corpus-level checks above show major signal
and threshold coverage loss. The next accepted snapshot needs to prove
that those losses are explained or fixed.

Two recurring sub-agent observations during the run:

- **Megaproject lognormal default fires.** Multiple agents selected
@@ -291,58 +396,55 @@ Two recurring sub-agent observations during the run:
tail risk is structurally understated for those plans until the
copula sampler ships.

### Next likely move

With proposal 141 (PR #751 + #752 + #753) shipped, the immediate
backlog is reordered. The remaining work is roughly:

1. **Phase 5 — `verify-bounds-citations`** (new deterministic step,
R1.5 backstop). Parse Risk / Issue / Decision N tokens from each
rationale string in bounds.json; fetch the corresponding section
from the source report; compare topical keywords. Fail-loud /
re-run-`generate-bounds` posture. The R1.5 self-audit prompt rule
that landed in PR #749 reduces the citation context-leak surface,
but a deterministic post-processor is the right guardrail since
the LLM is asked to self-audit its own output. This is now the
single largest unshipped guardrail in the pipeline.
2. **Phase 8 samplers — Gaussian copula + lognormal/PERT** in
`run_monte_carlo.py`. The schema and the loud-failure posture
landed in PR #747/#749; the v58 run produced concrete cases where
both gaps now bite (megaproject CAPEX fallback to triangular;
declared correlations warning rather than sampling). Order:
lognormal first (single-variable, no copula dependency), then
PERT, then Gaussian-copula for declared groups. Each ships as
its own PR.
3. **Phase 9 — composite-band cap on `summarize-assessment`** (R2.1
"megaproject illusion"). The cap rule has a calibration
dependency on the actual spread of `unmodelled_gates.length`
across the v58 corpus; the empirical snapshot above provides
that calibration surface for the first time. Suggested next move:
tabulate the v58 corpus's `unmodelled_gates.length` distribution
and pick K thresholds (placeholders 1–2 / 3–4 / 5+) against the
observed spread.
4. **Different-LLM behavioural validation** of the rules now on main.
A Self-Improve run with the default napkin_math LLM (Gemini Flash
Lite) against the v58 digests would close the same-LLM
same-session confound. Especially load-bearing for the PR #749 rules
(worked example, megaproject criterion, correlations selection)
and the PR #753 prior-signal ledger rule. Validation of prompt
generality, not a quality fix.
5. **Proposal 141 follow-ups**: Fork A (source-digest regex scan
against the current artifact, distinct from Fork B's
prior-vs-current diff) and the orchestrator-side wiring that injects the
prior baseline into the extract call (without it the LLM cannot
actually emit `prior_baseline`-origin drops, so PR #753's
discipline runs without input).
6. **Prompt-hygiene pass** for the remaining domain-specific
examples (e.g. `european_prepper_active_buyers`) in either
extract prompt. Worthwhile and small, but not load-bearing for
the currently observed napkin_math failures.

These are separate PRs. Each ships independently; bundling
verification, sampler implementation, composite-band cap, behavioural
validation, audit follow-ups, or prompt hygiene into one PR would
obscure which piece moved which metric.
### Path forward: stop silent regressions first

The next work should be reordered around comparability. Do not start
Phase 5 / 8 / 9 as the next PR unless the regression gates below are
already in place; otherwise the next corpus run can improve one method
detail while silently losing different gates or signals.

1. **Wire prior-baseline context into extract.** Update the napkin_math
run path so Stage 0/Extract can receive the prior snapshot's
`parameters.json` and include the `# Prior Signal Ledger` context
that PR #753 expects. Update the run skill/docs at the same time so
future manual runs use the same command shape. Acceptance evidence:
v(N) extract inputs visibly include prior-ledger context.
2. **Make source-preservation audit strict enough to block regressions.**
Add a strict mode or wrapper that compares prior -> current and exits
non-zero on unexplained loss beyond an explicit threshold. The first
strict policy can be conservative: fail on any prior threshold output
disappearing without `explained_drop`, and fail on large aggregate
`absent_unexplained` counts. Acceptance evidence: v49 -> v58 fails
under the new policy, and the next regenerated snapshot either
preserves, explains, or intentionally waives the losses.
3. **Make Monte Carlo gate selection deterministic.** Replace the
hand-written/heuristic threshold list with either an explicit schema
on calculations (`is_gate`, `threshold_operator`, `threshold_value`,
`threshold_basis`, `intermediate`), or a deterministic builder that
selects all positive-pass outputs unless explicitly marked
intermediate. The schema path is better because it forces extract to
say whether an output is a gate or a helper. Acceptance evidence:
threshold count cannot collapse from 70 to 44 without named
intermediate/drop explanations.
4. **Add a corpus regression report for the 14-plan baseline.** One
command should summarize threshold counts by plan, prior-signal
preservation classes, untested executable outputs, Monte Carlo
warnings, verdict movement, base/target failures, and unmodelled-gate
counts. This report becomes the PR review surface; per-plan logs are
too easy to miss.
5. **Regenerate a new snapshot only after 1-4.** Treat v58 as the
failing diagnostic run. The next accepted snapshot should have the
same source zips, the new gates enabled, and an explicit
before/after report against v49 and v58.
6. **Resume the original Gemini work in finite order.** Once the corpus
is regression-gated, continue with the bounded methodology backlog:
citation verifier (R1.5), lognormal sampler, PERT sampler,
Gaussian-copula sampler, composite unmodelled-gate cap (R2.1),
different-LLM validation, then prompt hygiene.
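
The schema path in item 3 can be sketched as a builder that derives the threshold list from per-calculation flags. The field names `is_gate`, `threshold_operator`, `threshold_value`, and `intermediate` come from the item itself; the record layout, the `build_thresholds` and `passes` functions, the zero threshold values, and the output names `fx_adjusted_cost` and `unlabelled_output` are hypothetical.

```python
import operator

# Comparison operators a gate may declare.
OPERATORS = {
    ">=": operator.ge,
    ">": operator.gt,
    "<=": operator.le,
    "<": operator.lt,
}


def build_thresholds(calculations):
    """Split calculation records into gates and unclassified outputs.

    Returns (thresholds, unclassified). Records are processed in name order,
    so the same input always yields the same threshold list.
    """
    thresholds, unclassified = [], []
    for calc in sorted(calculations, key=lambda c: c["output_name"]):
        if calc.get("intermediate", False):
            continue  # explicitly marked helper: never threshold-tested
        if calc.get("is_gate", False):
            op = calc["threshold_operator"]
            if op not in OPERATORS:
                raise ValueError(f"unknown operator {op!r} on {calc['output_name']}")
            thresholds.append((calc["output_name"], op, calc["threshold_value"]))
        else:
            # Neither gate nor intermediate: surface it instead of dropping it.
            unclassified.append(calc["output_name"])
    return thresholds, unclassified


def passes(value, op, threshold):
    """Evaluate one sampled value against a gate."""
    return OPERATORS[op](value, threshold)


# Illustrative records; threshold values and the last two names are made up.
calculations = [
    {"output_name": "sme_share_margin", "is_gate": True,
     "threshold_operator": ">=", "threshold_value": 0.0},
    {"output_name": "optin_margin", "is_gate": True,
     "threshold_operator": ">=", "threshold_value": 0.0},
    {"output_name": "fx_adjusted_cost", "intermediate": True},
    {"output_name": "unlabelled_output"},
]

thresholds, unclassified = build_thresholds(calculations)
print(thresholds)
print(unclassified)
```

Returning the unclassified outputs separately means a threshold count can only fall when an output is explicitly marked `intermediate` or shows up in that list, which is the property the acceptance evidence in item 3 asks for.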

Each item above should ship independently. The goal is not to finish all
possible napkin_math improvements; it is to make the pipeline
falsifiable enough that future improvements can be trusted.

## Per-theme mapping
