feat(ce-compound): add headless mode:autofix with depth selector by benwilson · Pull Request #662 · EveryInc/compound-engineering-plugin

benwilson · 2026-04-23T23:11:08Z

Summary

Adds a headless affordance to /ce-compound for programmatic callers such as post-merge hooks, CI jobs, and orchestrator pipelines.

mode:autofix enables unattended operation. It suppresses the mode-selection prompt, session-history opt-in, Discoverability consent, Phase 2.5 selective-refresh prompt, and the post-run "What's next?" menu.
depth:lightweight|standard|deep selects the unattended execution depth and defaults to lightweight.
- depth:lightweight -> single-pass documentation, no subagents, no overlap check.
- depth:standard -> Phase 1 parallel research, overlap-update behavior, and Phase 3 specialized reviewers. Session history is off.
- depth:deep -> Standard path plus Session Historian dispatch.

depth: is parsed only when mode:autofix is present. Invalid autofix depth values halt before subagent dispatch with ce-compound failed. Reason: unknown depth:<value>. Valid values: lightweight, standard, deep.

Motivation

ce-compound previously required interactive prompts, which made subprocess invocation unreliable in unattended contexts. A call such as claude -p "/ce-compound ..." or codex exec "Run /ce-compound" --ephemeral could hang or degrade because there was no user available to answer the documented fallback prompt.

Sibling skills in this plugin already have explicit unattended modes (ce-compound-refresh uses mode:autofix; ce-code-review supports mode:autofix, mode:headless, and mode:report-only; ce-doc-review supports mode:headless). This PR brings ce-compound in line with that pattern.

Output Contract

The full parser-facing contract lives at plugins/compound-engineering/skills/ce-compound/references/autofix-output-parse-contract.md; it is not loaded by the skill at runtime.

Reliable anchors

Search anywhere in stdout — first-line discipline does not hold; the agent prepends narration before any template block.

Shape detector substrings:

Anchor	Shape	Meaning
`✓ Documentation complete`	success-complete	A doc was written. The `(mode:autofix)` suffix is unreliable on this line; see best-effort.
`✓ No documentation written (mode:autofix)`	no-op	Preconditions unmet; nothing written.

No-op block supporting anchors (present when the no-op detector substring appears):

Anchor	Meaning
`Depth:`	Execution path taken (`lightweight\|standard\|deep`).
`Reason:`	One-line rationale; may continue on indented lines.
`Context considered:`	Bulleted summary of available context.

Best-effort fields

Callers MUST NOT hard-fail when these are absent — they are emitted opportunistically and frequently drop in headless runs:

✓ Documentation updated substring (success-updated path is unreachable in claude -p — demoted by construction)
(mode:autofix) suffix on success-shape lines (lightweight mode emits its own suffix variants)
File created: / File updated: paths
Track: / Category: / Overlap detected: fields
Refresh candidate: and Discoverability recommendation: blocks
Phase 1 subagents: / Phase 3 specialized reviewers: blocks

Testing

bun test tests/compound-support-files.test.ts tests/pipeline-review-contract.test.ts - 51 pass / 0 fail.
bun run release:validate - clean.
bun test overall - 896 pass / 9 fail. The 9 failures are all in tests/skills/ce-polish-beta-project-type.test.ts and predate this branch.

…ded documentation ce-compound had no headless affordance. Its Execution Strategy enforces four mandatory blocking prompts (Full-vs-Lightweight mode selection, session-history opt-in, Discoverability Check consent, post-completion "What's next?" menu), each using the platform's blocking question tool with "Never silently skip" guarantees. This blocked programmatic callers (post-merge hooks, CI, orchestrator pipelines) — the blocking question tool errors in headless sessions and the documented fallback is "present in chat," with no user to respond. The inconsistency was visible against sibling skills in the same plugin: ce-compound-refresh ships mode:autofix, ce-code-review ships mode:autofix/mode:headless/mode:report-only, ce-doc-review ships mode:headless. ce-compound was the lone holdout. This PR adds two mode tokens: - mode:autofix — autofix + Lightweight execution (default headless). Low-cost single-pass capture; overlap drift caught later by ce-compound-refresh. - mode:autofix-full — autofix + Full execution (opt-in thorough). Phase 1 research subagents, overlap-update, Phase 3 specialized reviewers; session-history defaults to off. Mutually exclusive with a conflict error mirroring ce-code-review. All four interactive prompts are guarded in both modes. Discoverability Check becomes a recommendation in the output rather than an instruction- file edit. "What's next?" menu is suppressed. No-op output shape explains which precondition blocked when preconditions aren't met. Learning doc added at docs/solutions/skill-design/compound-skill-improvements.md covers the design decisions (explicit opt-in, Lightweight as default depth, two distinct mode tokens vs nested colons, tokenization rule for the mode:autofix-full prefix ambiguity). Tests: compound-specific (compound-support-files, pipeline-review-contract) 51/51 pass. release:validate clean. Full test suite: same 896/9 as main (the 9 failures are pre-existing in tests/skills/ce-polish-beta-project-type.test.ts, verified by stashing edits and rerunning).

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9afa056dbe

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Phase 2.5's multi-candidate branch unconditionally asked the user whether to run a targeted refresh, which would block unattended mode:autofix-full runs (mode:autofix already dodged it via the lightweight rule, but the guard was implicit). Add an explicit autofix guard: recommend ce-compound-refresh via the Follow-up line instead of prompting, and enumerate this prompt in the shared autofix skip list. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

tmchow

Thanks for putting this together. The goal is right and I want to land this, but the execution is roughly five times the size of the sibling pattern it claims to follow, and it crosses a few of the plugin's own authoring rules in plugins/compound-engineering/AGENTS.md. I've flagged the specifics inline.

Headline asks:

Collapse Mode Detection to the shape used by ce-compound-refresh and ce-code-review. One table, one rules block, state each rule once.
Reconsider the two-token design. Orthogonal tokens (mode:autofix plus depth:full) or reusing mode:headless would sidestep the prefix tokenization rule and remove the parallel Use-when blocks.
Unify the four Autofix output templates into one parameterized block, or extract them to references/. Four near-duplicate blocks loaded on every invocation is exactly the case the AGENTS.md extraction rule was written for.
Cut runtime-neutral rationale: the Use-when bullets, the subagent-safety paragraph, the Note: inside the output template, and the caller-parsing instructions.
Drop or radically shrink docs/solutions/skill-design/compound-skill-improvements.md. ce-compound-refresh did not ship a parallel learning doc when it added mode:autofix and I don't want the precedent.

Happy to iterate. Once the size comes down I expect the rest of the review to move quickly.

tmchow

added comments.

@tmchow

…KILL.md Responds to @tmchow's PR 662 review. Replaces `mode:autofix` / `mode:autofix-full` with orthogonal `mode:autofix` + optional `depth:lightweight|full|thorough` (default `lightweight`). Collapses Mode Detection to sibling shape (one table + one rules block, 23 lines). Consolidates four ~90-line output templates into one parameterized success template + one short no-op template. Cuts runtime-neutral rationale flagged in review (per-variant Use-when bullets, subagent- safety paragraph, Note: in the output template, parser-contract paragraph). Moves the parser-facing contract to a new reference file not loaded at runtime. Deletes docs/solutions/skill-design/compound- skill-improvements.md — it does not meet the docs/solutions/ institutional-memory bar. BREAKING CHANGE: removes `mode:autofix-full`. Callers must use `mode:autofix depth:full` (full path, session history off) or `mode:autofix depth:thorough` (full path, session history on). No alias or migration shim — the token was introduced in this PR and has no merged users. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

benwilson · 2026-04-24T22:15:20Z

Thanks for the review @tmchow — all five asks addressed in 1988594:

Mode Detection collapsed (line 34) to 23 lines: one two-row table + depth:<preset> bullets + invalid-depth halt + four-bullet Autofix mode rules. Matches the ce-compound-refresh shape.
Token grammar reworked (line 44) to orthogonal mode:autofix + optional depth:lightweight|full|thorough. No prefix trap, no tokenization rule, no conflicting-mode-flags clause.
Output templates consolidated (line 520) to one parameterized success template + one short no-op template (~60 lines combined, stays in-body — 630 × 20% = 126-line extraction threshold).
Runtime-neutral rationale cut (lines 57, 70, 522, 544): per-variant Use-when bullets gone with their parent subsections, subagent-safety paragraph gone, Note: inside the output template gone, parser-contract paragraph moved to references/autofix-output-parse-contract.md (not loaded at runtime; contract also appears in the refreshed PR body for caller authors). "Print the full block" directive survives in the Autofix mode rules block.
Learning doc deleted — compound-skill-improvements.md is gone with no shrunken successor.

Net: -209 / +91 lines in the commit; SKILL.md dropped from 686 to 630. bun test 896/9 (same 9 pre-existing ce-polish-beta-project-type failures); bun run release:validate clean.

One pre-authorized divergence from your line-44 suggestion (follow-up to #discussion_r3140321365): went with three depth presets (lightweight|full|thorough) rather than a binary depth:full. Rationale: depth:thorough is the only way to expose the session-history opt-in to a headless caller without prompting. If you'd rather this collapse to binary, say the word and I'll fold thorough into full (or drop it entirely). Not planning to defend the three-preset choice past one round.

PR body refreshed to match the new grammar and carries the output contract table.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1988594053

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-24T22:17:49Z

+
+### Autofix mode rules
+
+- **Skip all user questions.** Never pause — the mode-selection prompt, Phase 2.5 selective-refresh prompt (multi-candidate case), session-history opt-in, Discoverability Check consent, and "What's next?" menu are all suppressed.


Guard autofix from post-run "What's next?" prompt

Autofix mode is documented here as suppressing all user questions, but the Success Output section still has an unconditional "present the What's next? options ... do not continue ... without the user's selection" instruction. In unattended invocations (/ce-compound mode:autofix ...), that conflicting requirement can still trigger a blocking question after writing docs and hang the caller. Add an explicit autofix guard at the post-output routing step so headless runs always terminate without waiting for input.

Useful? React with 👍 / 👎.

Fixed in a8cf13a.

You were right — the Mode Detection suppression alone wasn't strong enough to override the "present the options ... do not end the turn without the user's selection" imperative sitting at the prompt site. A headless claude -p --permission-mode bypassPermissions "/ce-compound mode:autofix depth:thorough ..." run before the fix showed the agent emitting the full numbered menu with a confused inline disclaimer ("Autofix mode was specified, so I'm stopping here rather than prompting"), which is exactly the conflicting-requirements case you flagged.

Added a seven-word guard at the prompt site: "In autofix mode, skip this step." Same test run after the fix returns no "What's next?" header, no options, no prompt — clean termination.

Context: my earlier U2 refactor (commit 1988594) deleted a longer version of this guard that used to sit below the prompt instruction, judging it redundant with the Autofix mode rules block. That was too aggressive — the rule is load-bearing at the point of action even when stated in Mode Detection.

chatgpt-codex-connector · 2026-04-24T22:17:49Z


+## Mode Detection
+
+Parse `$ARGUMENTS` for `mode:autofix` and an optional `depth:<value>` token. Strip matched tokens; treat the remainder as the brief context hint.


Restrict depth parsing to autofix mode only

These parsing rules treat any depth:<value> token as control input and pair that with a hard failure for unknown values, even when mode:autofix is absent. That means normal interactive calls can fail if the free-form context includes a depth: token (for example from pasted logs or prose) that is not one of the three presets. To avoid regressing interactive usage, only interpret/validate depth: when mode:autofix is present; otherwise leave it in the context hint.

Useful? React with 👍 / 👎.

Fixed in 3d032e1. Two minimal edits:

Parse instruction now conditional: Parse $ARGUMENTS for mode:autofix. If present, also parse an optional depth:<value> token. Strip matched tokens; the remainder is the context hint. — depth: is only parsed when autofix is present.

Halt rule scoped: In autofix mode, invalid depth: values halt before subagent dispatch and emit ... — invalid-depth error no longer fires outside autofix.

Verified headless: /ce-compound depth:bogus (no mode:autofix) now enters Interactive mode instead of emitting the halt envelope. The autofix-path halt still fires correctly on /ce-compound mode:autofix depth:bogus.

Net change: 2 insertions / 2 deletions — zero line growth, since both fixes fit on the existing single lines.

…ofix Phase 1 step 4 was written for interactive mode only ("only if the user opted in"); it never had an autofix branch. In `mode:autofix depth:thorough` — which the Mode Detection contract promises dispatches the Session Historian — the executing agent correctly followed the literal Phase 1 text and skipped dispatch. Rewrites the step 4 header and lead bullets to explicitly list both dispatch and skip conditions with their mode triggers, so the rule at the dispatch site matches the contract in Mode Detection. Caught by a headless `claude -p` test run against a scratch repo, which showed "Session Historian — mode:autofix arg implies skip" even though depth:thorough was passed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Prior commit (6f409f8) restated the dispatch rule at the Phase 1 step 4 site, which was necessary to fix the skipped-dispatch bug, but included parenthetical rationale that just repeated what the condition already said: "(autofix mode — depth:thorough is the one autofix path that dispatches Session Historian)" and "(the depth:lightweight default and depth:full both skip Session Historian)". Per AGENTS.md rationale discipline, rationale that would not change runtime behavior if removed belongs elsewhere, not in SKILL.md. Also collapses the paired Dispatch/Skip bullets into one: the Skip condition was the logical inverse of Dispatch, so two bullets doubled text without adding runtime-actionable information. Net: 3 lines of dispatch guidance down to 1; same runtime effect. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The Success Output section still carried an unconditional "present the 'What's next?' options ... Do not continue the workflow or end the turn without the user's selection" instruction at the prompt site. The Autofix mode rules block in Mode Detection lists the "What's next?" menu as suppressed, but that was too weak to override the direct imperative next to the prompt template, and a headless test run confirmed agents were emitting the menu anyway. Adds a four-word guard at the prompt site: "In autofix mode, skip this step." Load-bearing, no rationale, at the exact location where the drift was happening. Addresses EveryInc#662 (comment) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Prior text treated any depth:<value> token as control input regardless of mode, and the invalid-depth halt rule applied unconditionally. An interactive call whose free-form context happened to include a depth: token (pasted logs, prose, error messages) would hard-fail with an autofix-shaped error envelope even though the user never opted into autofix routing. Scopes both parsing and the halt rule to mode:autofix being present. When mode:autofix is absent, depth: is not parsed, not stripped, not validated — it stays in the context hint. Verified headless: /ce-compound depth:bogus (no mode:autofix) now enters Interactive mode instead of emitting the halt envelope. Addresses EveryInc#662 (comment) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

benwilson · 2026-04-24T23:54:38Z

Follow-ups since 1988594:

6f409f8b — dispatch Session Historian on depth:thorough in autofix. Caught by headless test: Phase 1 step 4 header still said "only if the user opted in," so the autofix dispatch branch never fired even though Mode Detection promised it would.
77d9de66 — tighten the Session Historian dispatch rule per AGENTS.md rationale discipline. Collapses parenthetical restatements 6f409f8 had introduced.
a8cf13a4 — guard the "What's next?" prompt at the prompt site. Codex P1 #discussion_r3140572986. Mode Detection suppression alone was too weak to override the direct "present the options" imperative nearby.
3d032e1d — scope depth: parsing and validation to autofix mode. Codex P2 #discussion_r3140572989. Interactive calls whose context happened to contain depth:... no longer hit the invalid-depth halt.

Known drift worth flagging before re-review: in headless claude -p --permission-mode bypassPermissions "/ce-compound mode:autofix depth:full|thorough ..." runs, agents reliably emit ✓ Documentation complete and File created: / File updated: anchors, but frequently drop the (mode:autofix) suffix, Depth:, Track:/Category:, and the structured Phase 1 subagents: / Phase 3 specialized reviewers: blocks — synthesizing their own summary with improvised section names instead. I tested six SKILL.md-text interventions (stronger imperatives, extraction to references/, explicit tool-call directives, structural marker changes) and none produced reliable contract compliance. Looks like a pattern that doesn't yield to modest text changes — a reliable fix would likely need subagent dispatch for the output step or a structured output format (YAML/JSON), both beyond this PR's review scope. Downstream parsers keying on the reliable anchors (first-line ✓ string + File created: / File updated:) will work; those depending on the unreliable anchors won't. Flagging so it isn't a surprise if you test headlessly — happy to address in a follow-up PR if it matters.

tmchow

@benwilson I updated the PR description to be better as your agent was weirdly making it into a changelog which wasn't necessary.

I'd like to change the mode values to lightweight, standard and deep to match what we do in other skills (vs lightweight, full, and thorough)

@tmchow

Renames `depth:full` → `depth:standard` and `depth:thorough` → `depth:deep` to match the naming convention tmchow set for depth values across skills. Also renames the prose "Full Mode" / "Full" interactive picker option to "Standard Mode" / "Standard" so the prose and tokens stay aligned. Per-PR feedback (EveryInc#662) from @tmchow. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

tmchow · 2026-05-05T22:03:08Z

@benwilson reconcile teh PR description to ensure it's accurate given the changes please

benwilson · 2026-05-05T22:06:34Z

Done — PR description reconciled with the current branch. It now uses depth:lightweight|standard|deep, updates the invalid-depth error text, and updates the output-contract table.

tmchow

On the headless drift you flagged in #662 (comment) — please address before merge rather than deferring. The current references/autofix-output-parse-contract.md promises stability for fields you've already confirmed don't hold ((mode:autofix) suffix, Depth:, Track:, Category:, Phase 1 subagents: / Phase 3 specialized reviewers: blocks). Shipping a contract that misleads downstream parsers is worse than shipping a smaller, accurate one — post-merge hooks and CI consumers will key off documented anchors and silently break when they drop.

Trim the contract in references/autofix-output-parse-contract.md to what actually holds reliably — the first-line ✓ Documentation complete|updated|No documentation written string and File created: / File updated: paths. Move every other anchor under a clearly labeled Best-effort fields (may be dropped in headless runs) section, with a one-line note that callers must not hard-fail when these are missing.
Update the Output contract table in the PR description to match.

Not asking you to land the structural fix here, that's a separate PR can tackle later. Just don't let the doc overstate what the runtime delivers today.

@tmchow

…iability @tmchow review #pullrequestreview-4231940863 asked us to split the parse contract into reliable / best-effort sections grounded in what actually holds in headless runs. Backed the split with an 18-run sweep against this branch covering {success-complete, no-op} x {lightweight, standard, deep} x 3 runs. Reliable: ✓ Documentation complete and ✓ No documentation written (mode:autofix) as shape-detector substrings (search anywhere; first-line discipline holds 0/18). When the no-op substring is present, Depth:, Reason:, Context considered: co-occur 100% (9/9 in measured cells). Best-effort: ✓ Documentation updated (success-updated path unreachable in claude -p), (mode:autofix) suffix on success lines (lightweight emits its own variants), File created: / File updated:, Track: / Category: / Overlap detected:, Refresh candidate: / Discoverability recommendation:, and the Phase 1 subagents: / Phase 3 specialized reviewers: blocks (0/18 anywhere). Callers must not hard-fail when these are missing. Structural fix to make best-effort anchors reliable is held for a follow-up PR per the original review. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

benwilson · 2026-05-06T06:27:08Z

Headless reliability sweep — empirical findings (PR 662)

Following #pullrequestreview-4231940863, I ran a formal sweep of every parser-anchor field in references/autofix-output-parse-contract.md (and the SKILL.md template anchors not currently in the contract) so the reliable/best-effort split is data-driven rather than asserted.

Bench

Branch: feat/ce-compound-autofix @ 7c8156d4
CWD: /Users/bwilson/Documents/Source/compound-engineering-plugin (repo root)
docs/solutions/ entry count at SHA: 26
claude CLI: 2.1.129
Sweep elapsed: 13 minutes (18 total runs, all exit 0)

Methodology

For each cell {intended-shape} × {depth}, I ran 3 invocations of:

claude -p --permission-mode bypassPermissions "/ce-compound mode:autofix depth:<preset> <prompt>"

The success-updated shape was skipped by construction — overlap-update needs the Solution Extractor to produce a "new fix" alongside Related Docs Finder finding "high overlap," and the Solution Extractor reads conversation history, which claude -p doesn't have. Overlap detected: and File updated: are demoted to best-effort by construction.

Two prompts:

success-complete: detailed Jest+Redis fix description with named file, root cause, verification evidence (50 reruns).
no-op: vague "Something seems off with how the build is sometimes slow. Looking into it."

After each run, untracked docs/solutions/ files were cleaned so subsequent runs started from the same bench state.

Intended-vs-actual shape map

Cell	Run 1	Run 2	Run 3
complete, lightweight	complete	complete	other (free-form refusal)
complete, standard	noop	noop	noop
complete, deep	noop	noop	noop
noop, lightweight	other (free-form refusal)	other	other
noop, standard	noop	other	other
noop, deep	noop	other	noop

The standard/deep "complete" cells all flipped to no-op because Phase 1 verification caught that the claimed fix file (tests/setup/redis-pool.ts) doesn't exist in this checkout — the agent correctly refused to fabricate a learning doc. Lightweight mode skips that verification step and is more permissive.

First-line discipline (literal head -1)

Across all 18 runs, the documented ✓ ... first-line string is the literal first non-blank line in 0/18 runs. The agent always emits narration before the template block. Callers using head -1 will fail every time.

✓ line suffix observed when the shape was emitted

Run	Suffix
complete-lightweight-1	`(lightweight mode)`
complete-lightweight-2	`(lightweight + autofix mode)`
complete-lightweight-3	(no `✓` line — free-form refusal)
complete-standard-1..3	`(mode:autofix)` (but on the `✓ No documentation written` line, not `complete`)
complete-deep-1..3	`(mode:autofix)` (same as above)
noop-lightweight-1..3	(no `✓` line — free-form refusal)
noop-standard-1	`(mode:autofix)`
noop-standard-2..3	(no `✓` line — free-form refusal)
noop-deep-1, noop-deep-3	`(mode:autofix)`
noop-deep-2	(no `✓` line — free-form refusal)

The lightweight mode emits its own suffix variants ((lightweight mode) / (lightweight + autofix mode)), never (mode:autofix). The lightweight template at SKILL.md:385 ((lightweight mode) was added in #528, predates this PR) wasn't updated when this PR added the autofix grammar at SKILL.md:505 promising (mode:autofix). Autofix-lightweight runs fall through to the older lightweight template — a gap this PR introduced by promising a suffix the lightweight path doesn't produce.

Anchor presence (anywhere in stdout body)

Anchor	complete-lw	complete-std	complete-deep	noop-lw	noop-std	noop-deep
`✓ Documentation complete`	2/3	0/3	0/3	0/3	0/3	0/3
`✓ Documentation updated`	0/3	0/3	0/3	0/3	0/3	0/3
`✓ No documentation written`	0/3	3/3	3/3	0/3	1/3	2/3
`(mode:autofix)` on `✓` line	0/3	3/3	3/3	0/3	1/3	2/3
`^Depth:`	0/3	3/3	3/3	0/3	1/3	2/3
`^File created:`	2/3	0/3	0/3	0/3	0/3	0/3
`^File updated:`	0/3	0/3	0/3	0/3	0/3	0/3
`^Track:`	1/3	0/3	0/3	0/3	0/3	0/3
`^Category:`	0/3	0/3	0/3	0/3	0/3	0/3
`^Overlap detected:`	0/3	0/3	0/3	0/3	0/3	0/3
`^Refresh candidate:`	0/3	0/3	0/3	0/3	0/3	0/3
`^Discoverability recommendation:`	0/3	0/3	0/3	0/3	0/3	0/3
`^Reason:`	0/3	3/3	3/3	0/3	1/3	2/3
`^Context considered:`	0/3	3/3	3/3	0/3	1/3	2/3
`^Phase 1 subagents:`	0/3	0/3	0/3	0/3	0/3	0/3
`^Phase 3 specialized reviewers:`	0/3	0/3	0/3	0/3	0/3	0/3

Reliability classification

The data does not support keeping anything as "reliable" under a 100% / zero-drift bar. The pragmatic story:

No first-line anchor is reliable. Callers must search anywhere in stdout, not at line 1.
Phase 1 / Phase 3 subagent blocks are never emitted in headless mode (0/18 anywhere). They are template aspiration, not runtime reality.
Lightweight mode has its own incompatible output surface — different ✓ suffix on success, free-form prose on refusal. The contract that nominally covers all three depths actually only fits standard/deep, and only on the no-op path.
The no-op output is the only template that fires consistently when the shape is taken. When ✓ No documentation written (mode:autofix) appears, Depth:, Reason:, Context considered: co-occur 9/9 times. That is the strongest reliability signal in the data.
Success-complete anchors (File created:, Track:, Category:) are inconsistent even within the lightweight cell that did emit the success-complete shape — File created: appeared 2/3, Track: 1/3, Category: 0/3.

Resulting contract trim

Reliable section keeps the shape-detector substrings (search anywhere, not first-line) plus the no-op block's supporting anchors:

✓ Documentation complete — success-complete shape; (mode:autofix) suffix on this line is unreliable (see best-effort)
✓ No documentation written (mode:autofix) — no-op shape
When the no-op substring is present: Depth:, Reason:, Context considered: co-occur 100% (9/9 in measured cells)

Best-effort section: ✓ Documentation updated (unreachable in -p, demoted by construction), (mode:autofix) suffix on success lines, File created:, File updated:, Track:, Category:, Overlap detected:, Refresh candidate:, Discoverability recommendation:, Phase 1 subagents:, Phase 3 specialized reviewers:. Callers must not hard-fail when these are missing.

Raw run logs are at ${TMPDIR}/compound-engineering/pr-662-headless-reliability/20260505-155532/runs/ on my machine; the methodology and prompts above are sufficient to reproduce from any checkout of the same SHA.

The structural fix to make the demoted anchors reliable (subagent dispatch for the output step, structured output format, or aligning the lightweight template's suffix with the autofix template) is deferred to a follow-up PR per the original review.

benwilson · 2026-05-06T06:27:19Z

Re: #pullrequestreview-4231940863

Took the empirical route on this — ran an 18-run headless sweep against the current branch so the demotion list is data-grounded rather than asserted. Full methodology and per-cell tallies in the empirical findings comment.

The data largely supports your reliable set, with two adjustments:

Substring matching, not first-line discipline. No ✓ ... anchor was the literal first non-blank line of stdout in any run (0/18). The agent always prepends narration before the template block. Reliable claims are scoped to substring presence anywhere in stdout.
Demoted ✓ Documentation updated, File created:, File updated: to best-effort. ✓ Documentation updated is unreachable from claude -p (success-updated needs the Solution Extractor to read conversation history, which -p doesn't have) and was 0/18. File created: was 2/3 only in complete-lightweight runs that actually wrote a doc, 0 elsewhere. File updated: is in the same demoted-by-construction bucket as ✓ Documentation updated.

Trimmed references/autofix-output-parse-contract.md and the PR body's Output Contract table accordingly:

Reliable: ✓ Documentation complete and ✓ No documentation written (mode:autofix) as shape-detector substrings; when the no-op substring is present, Depth:, Reason:, Context considered: co-occur 100% (9/9 in measured cells).
Best-effort: the demoted anchors above plus (mode:autofix) suffix on success lines, Track:, Category:, Overlap detected:, Refresh candidate:, Discoverability recommendation:, and the Phase 1 subagents: / Phase 3 specialized reviewers: blocks (0/18 anywhere).

One runtime finding worth flagging beyond the contract trim: autofix-lightweight invocations emit a different suffix than the autofix template promises. Successes use (lightweight mode) or (lightweight + autofix mode), never (mode:autofix). The lightweight template at SKILL.md:385 (predates this PR — (lightweight mode) was added in #528) wasn't updated when this PR added the autofix grammar at SKILL.md:505 promising (mode:autofix). Autofix-lightweight runs fall through to the older lightweight template. Reconciling the two is a follow-up PR scope item.

The structural fix to make best-effort anchors reliable (subagent dispatch for the output step or a structured output format) is held for a follow-up PR per your earlier guidance.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 579e5da9c2

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-06T06:31:15Z

+- `✓ Documentation complete` — success-complete shape (substring; the `(mode:autofix)` suffix is unreliable on this line, see best-effort)
+- `✓ No documentation written (mode:autofix)` — no-op shape


Recognize updated autofix runs as a reliable success shape

The contract currently treats only ✓ Documentation complete and the no-op line as reliable shape detectors, but this same change also defines a valid autofix success path that emits ✓ Documentation updated (mode:autofix) on overlap-high updates (depth:standard|deep). Parsers that follow this contract can misclassify successful update runs as unknown/failure, which can break headless callers (CI/hooks) that gate on parseable success. Please make ✓ Documentation updated a reliable success anchor (or ensure update output also includes the complete anchor).

Useful? React with 👍 / 👎.

benwilson marked this pull request as ready for review April 23, 2026 23:14

chatgpt-codex-connector Bot reviewed Apr 23, 2026

View reviewed changes

Comment thread plugins/compound-engineering/skills/ce-compound/SKILL.md Outdated

tmchow requested changes Apr 24, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Apr 24, 2026

View reviewed changes

benwilson and others added 4 commits April 24, 2026 15:54

benwilson requested a review from tmchow April 24, 2026 23:56

tmchow changed the title ~~feat(ce-compound): add mode:autofix and mode:autofix-full for unattended documentation~~ feat(ce-compound): add headless mode:autofix with depth selector Apr 28, 2026

tmchow requested changes Apr 28, 2026

View reviewed changes

benwilson requested a review from tmchow May 5, 2026 22:01

tmchow requested changes May 5, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed May 6, 2026

View reviewed changes


		### Autofix mode rules

		- Skip all user questions. Never pause — the mode-selection prompt, Phase 2.5 selective-refresh prompt (multi-candidate case), session-history opt-in, Discoverability Check consent, and "What's next?" menu are all suppressed.


		## Mode Detection

		Parse `$ARGUMENTS` for `mode:autofix` and an optional `depth:<value>` token. Strip matched tokens; treat the remainder as the brief context hint.

		- `✓ Documentation complete` — success-complete shape (substring; the `(mode:autofix)` suffix is unreliable on this line, see best-effort)
		- `✓ No documentation written (mode:autofix)` — no-op shape

Conversation

benwilson commented Apr 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation

Output Contract

Reliable anchors

Best-effort fields

Testing

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

tmchow left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

tmchow left a comment

Choose a reason for hiding this comment

Uh oh!

benwilson commented Apr 24, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

benwilson Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

benwilson Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

benwilson commented Apr 24, 2026

Uh oh!

tmchow left a comment

Choose a reason for hiding this comment

Uh oh!

tmchow commented May 5, 2026

Uh oh!

benwilson commented May 5, 2026

Uh oh!

tmchow left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

benwilson commented May 6, 2026

Headless reliability sweep — empirical findings (PR 662)

Uh oh!

benwilson commented May 6, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 6, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

benwilson commented Apr 23, 2026 •

edited

Loading

tmchow left a comment •

edited

Loading