fix(ce-plan): compress synthesis confirmation to prose + call-outs #819
Merged
The synthesis gate before research/plan-write was producing too much volume for users to weigh in on: full Stated/Inferred/Out buckets with 15+ bullets even when granularity rules were followed. Restructure into a two-stage shape:

- Stage 1 (internal): the three-bucket draft the agent uses to think comprehensively; routes into plan body sections as before
- Stage 2 (chat-time): 1-3 line prose summary plus 0-3 "Call outs", only the forks where another reasonable agent might choose differently

Add a keep test (affirmability + four categories: real fork, non-obvious behavioral choice, non-obvious exclusion, cheap-now-expensive-later) that gates each call-out. Cap is tiered by plan depth with a hard 6 ceiling; above that, re-cut at higher abstraction rather than raising the cap. When zero call-outs survive, skip the blocking question and emit a mandatory auto-proceed announcement so the agent never proceeds silently.

Tighten granularity rules with an explicit anti-pattern list for call-outs (file paths, flags, JSON shapes, HTTP codes, implementation flow) and surface the affirmability test in the SKILL.md phase stubs so it loads reliably. Soft-cut now tracks call-outs by decision dimension rather than surface wording so re-cuts don't reset revision counts.

Headless mode unchanged in routing: the internal draft still dissolves into Requirements / Assumptions / Scope Boundaries.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 63a175bc06
- Align top-level call-out count statements with the tiered cap table. The two-stage shape paragraph, stage-2 structure list, and synthesis-as-plan-pitch anti-pattern now all defer to the table under "How many call-outs are right?" so the agent receives a single deterministic limit (Lightweight: 0-3, Standard: 1-4, Deep: 2-6).
Summary
The synthesis gate before research/plan-write now surfaces only the decisions worth weighing in on: a 1-3 line prose summary plus 0-3 "Call outs" instead of the full Stated/Inferred/Out audit. When the prose fully captures the scope with no forks to flag, the gate skips entirely and the agent announces it's auto-proceeding.
Before, the confirmation regularly produced 15+ bullets for a moderate plan, even when granularity rules were followed. The volume made approval feel like rubber-stamping rather than a real checkpoint.
What changed
The synthesis is now a two-stage shape:
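- Stage 1 (internal): the three-bucket draft the agent uses to think comprehensively. It routes into plan body sections as before (Requirements, `## Assumptions` in headless, Scope Boundaries).
- Stage 2 (chat-time): a 1-3 line prose summary plus 0-3 "Call outs", limited to the forks where another reasonable agent might choose differently.

Each candidate call-out passes an affirmability test (can the user evaluate this without reading code?) and one of four categories: real fork, non-obvious behavioral choice, non-obvious exclusion, or cheap-now-expensive-later correction. Mechanical bets and implementation-flow specifics are cut before reaching chat.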
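The skill expresses the keep test as prose instructions to the agent, not as code. As a rough model of the rule only, the gate might be sketched like this; the `Candidate` type, the `keepCallOut` function, and the sample candidates are all hypothetical and do not exist in the repo.

```typescript
// Hypothetical model of the keep test; the skill states this rule in prose.
type Category =
  | "real-fork"
  | "non-obvious-behavioral-choice"
  | "non-obvious-exclusion"
  | "cheap-now-expensive-later";

interface Candidate {
  text: string;
  // Affirmability: can the user evaluate this without reading code?
  affirmable: boolean;
  // null means the candidate fits none of the four categories.
  category: Category | null;
}

// A candidate reaches chat only if it is affirmable AND fits a category.
function keepCallOut(c: Candidate): boolean {
  return c.affirmable && c.category !== null;
}

// Illustrative candidates (invented for this sketch).
const candidates: Candidate[] = [
  {
    text: "Migrate existing users eagerly instead of lazily",
    affirmable: true,
    category: "real-fork",
  },
  {
    // Implementation-flow specific: cut before reaching chat.
    text: "Return HTTP 409 on duplicate submit",
    affirmable: false,
    category: null,
  },
];

const kept = candidates.filter(keepCallOut);
console.log(kept.map((c) => c.text));
```

In this model the HTTP-code candidate is dropped because it is an implementation detail the user cannot weigh without reading code, which matches the anti-pattern list in the commit message.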
Design decisions
Cap is tiered with a hard 6 ceiling that triggers re-cut, not cap-raising. Lightweight tops out at 3, Standard at 4, Deep at 6. Above 6, the synthesis is misshapen. Usually 2-3 of the call-outs are sub-decisions of one larger fork, so the rule directs the agent to collapse to higher abstraction rather than raise the cap.
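To make the tiering concrete, here is a minimal sketch of the cap check, assuming only the upper bounds stated above. The names `CAP`, `HARD_CEILING`, and `checkCap` are hypothetical, and the `"over-cap"` outcome is a placeholder: this description does not say what the agent does when it exceeds a tier cap but stays at or under 6.

```typescript
type Depth = "lightweight" | "standard" | "deep";

// Upper bounds from the PR description: Lightweight 3, Standard 4, Deep 6.
const CAP: Record<Depth, number> = { lightweight: 3, standard: 4, deep: 6 };

// Above this count the synthesis is considered misshapen at any depth.
const HARD_CEILING = 6;

// "over-cap" is an assumed label for exceeding the tier cap without
// crossing the hard ceiling; the source does not name this case.
type CapDecision = "ok" | "over-cap" | "re-cut";

function checkCap(depth: Depth, count: number): CapDecision {
  // Crossing the ceiling never raises the cap; it triggers a re-cut
  // at a higher level of abstraction.
  if (count > HARD_CEILING) return "re-cut";
  return count <= CAP[depth] ? "ok" : "over-cap";
}

console.log(checkCap("deep", 7));
```

The point of the separate `"re-cut"` outcome is that it is independent of depth: seven call-outs are a shape problem, not a budget problem.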
Conditional skip is allowed but never silent. When zero call-outs survive, the agent emits a mandatory "Planning: [prose]. No open decisions for you to weigh in on, proceeding to [next phase]" announcement. The "why" must be visible. A default-to-keep rule on borderline call-outs keeps the failure mode bounded.
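The skip-but-announce behavior can be modeled as a small function. This is a sketch under stated assumptions: `synthesisGate`, `GateResult`, and the layout of the blocking message are hypothetical, and only the wording of the auto-proceed announcement is taken from the description above.

```typescript
interface GateResult {
  // true when the agent must wait for the user before continuing
  blocking: boolean;
  message: string;
}

function synthesisGate(
  prose: string,
  callOuts: string[],
  nextPhase: string,
): GateResult {
  if (callOuts.length === 0) {
    // Never proceed silently: the announcement is mandatory.
    return {
      blocking: false,
      message:
        `Planning: ${prose}. No open decisions for you to weigh in on, ` +
        `proceeding to ${nextPhase}`,
    };
  }
  // Assumed layout for the blocking case: prose, then a bullet per call-out.
  const bullets = callOuts.map((c) => `- ${c}`).join("\n");
  return { blocking: true, message: `${prose}\n\nCall outs:\n${bullets}` };
}

const skipped = synthesisGate("add CSV export to the reports page", [], "research");
console.log(skipped.message);
```

Both branches return a message, so there is no code path in this model where the gate is skipped without the user seeing why.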
The affirmability test surfaces in SKILL.md, not only in the reference. The Phase 0.7 and 5.1.5 stubs carry the test inline so it loads on every invocation, even when the reference loads shallow. References can be skipped; SKILL.md is always loaded.
Soft-cut tracks call-outs by decision dimension, not surface wording. Stage 2 re-derivation after a revision can rephrase or merge call-outs, so identity needs to be the underlying fork. When a re-cut collapses multiple call-outs into one, the combined call-out inherits the "touched" status of any constituent.
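One way to picture the inheritance rule is a merge that keys on the decision dimension and ORs the "touched" flags. The `TrackedCallOut` shape and `mergeCallOuts` helper below are hypothetical illustrations; the skill itself tracks this in prose, and the sample dimensions are invented.

```typescript
interface TrackedCallOut {
  // Identity is the underlying fork, not the surface wording.
  dimension: string;
  wording: string;
  // Whether the user has already revised this decision.
  touched: boolean;
}

// Collapse several call-outs into one at a higher level of abstraction.
// The combined call-out is touched if any constituent was touched.
function mergeCallOuts(
  parts: TrackedCallOut[],
  dimension: string,
  wording: string,
): TrackedCallOut {
  return { dimension, wording, touched: parts.some((p) => p.touched) };
}

const merged = mergeCallOuts(
  [
    { dimension: "retry-policy", wording: "Retry failed jobs 3 times", touched: true },
    { dimension: "backoff", wording: "Use exponential backoff", touched: false },
  ],
  "failure-handling",
  "Failed jobs are retried with backoff rather than surfaced immediately",
);
console.log(merged.touched);
```

Because the flag survives the merge, a re-cut cannot reset the revision history simply by rephrasing or combining call-outs.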
Headless mode is unchanged in routing. The internal draft still dissolves into Requirements / Assumptions / Scope Boundaries. Stage 2 is moot when there's no synchronous user.
Test plan
`bun test` and `bun run release:validate` pass.

Behavioral changes to skill prose are not exercised by automated tests. Verify by running `/ce-plan` on: