Add extended-thinking support; default reviewer to adaptive thinking #17

Merged

mattgodbolt merged 7 commits into main from add-thinking-support on May 6, 2026
Conversation

@mattgodbolt-molty
Contributor

Summary

Plumb Anthropic's extended thinking through the prompt-testing pipeline and the production explain handler, and turn it on by default for the eval reviewer. Production behaviour is unchanged — production prompt.yaml does not set a thinking field, so the explainer path is identical to today.

What's wired up

  • Prompt reads an optional model.thinking block (e.g. {type: adaptive} or {type: enabled, budget_tokens: 2000}) and surfaces it via generate_messages().
  • app/explain.py and prompt_testing/runner.py build api_kwargs conditionally: when thinking is set, drop temperature (the API rejects it) and pick the last text block from the response (thinking blocks precede text); see the sketch after this list.
  • CorrectnessReviewer accepts an optional thinking config and uses the same multi-block extraction.
  • CLI: prompt-test run gains --reviewer-thinking off|adaptive and prompt-test review gains --thinking off|adaptive. Default is adaptive for both — the rigor win on eval is worth the cost; pass off to compare or save money.
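
A minimal sketch of that conditional construction and extraction, pieced together from the diff hunks quoted in the review threads below. The function name and client wiring are illustrative assumptions, not the production code:

```python
from typing import Any

from anthropic import AsyncAnthropic


async def call_with_optional_thinking(client: AsyncAnthropic, prompt_data: dict[str, Any]) -> str:
    api_kwargs: dict[str, Any] = {
        "model": prompt_data["model"],
        "max_tokens": prompt_data["max_tokens"],
        "system": prompt_data["system"],
        "messages": prompt_data["messages"],
    }
    if prompt_data.get("thinking") is not None:
        # temperature deliberately omitted: the API rejects temperature with thinking.
        api_kwargs["thinking"] = prompt_data["thinking"]
    else:
        api_kwargs["temperature"] = prompt_data["temperature"]

    message = await client.messages.create(**api_kwargs)

    # Thinking blocks precede the final text block, so take the last text block.
    # (A later commit in this PR turns the empty case into a structured error.)
    text_blocks = [c for c in message.content if getattr(c, "type", None) == "text"]
    return text_blocks[-1].text.strip() if text_blocks else ""
```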

Why default reviewer thinking on?

21-case eval suite; the baseline is a Sonnet 4.6 explainer with an Opus 4.7 reviewer:

|                  | correct | errors | warns | avg out tok | avg latency | total $ |
| ---------------- | ------- | ------ | ----- | ----------- | ----------- | ------- |
| baseline         | 18/21   | 5      | 11    | 525         | 11.0s       | $0.55   |
| explainer thinks | 20/21   | 1      | 10    | 1363        | 22.3s       | $0.78   |
| reviewer thinks  | 18/21   | 1      | 6     | 525         | 11.0s       | $0.73   |
| both think       | 18/21   | 2      | 5     | 1217        | 21.0s       | $0.93   |
  • Explainer thinking dramatically improves output quality (errors 5 → 1; eliminates the recurring `imul eax, edi, edi` invention, 2/5 → 0/5 in focused 5-run testing). Held back from production: +11s latency is too much for an interactive endpoint.
  • Reviewer thinking is more rigorous and catches different errors (e.g. it correctly flags the `factorial_beginner_assembly` mis-naming of TCO that the no-thinking reviewer missed). Roughly +70% reviewer cost, with no user-facing latency since reviews run offline. Worth defaulting on.

Important gotcha for future explainer-thinking experiments

max_tokens must be large enough to leave room for the text response after the thinking budget. Production's 1536 starved the visible output entirely on 4 complex cases when adaptive thinking was enabled — bump to ≥4096 (I used 8192 in testing) before turning thinking on for any explainer prompt.

Test plan

  • uv run pytest — 92 tests pass (mock updated to set type="text" for the new multi-block extraction).
  • uv run pre-commit run --all-files.
  • Live runs (results above): baseline, explainer-thinking, reviewer-thinking, both-thinking — across the full 21-case suite, confirming the default change behaves correctly end-to-end.
  • Smoke-tested prompt-test run --reviewer-thinking off and --reviewer-thinking adaptive to confirm both code paths.

🤖 Generated with Claude Code

Plumbs Anthropic's extended thinking through the prompt-testing pipeline
and the production explain handler. The production prompt is unchanged
(no `thinking` field set), so behaviour and cost are identical until a
config opts in.

Findings that motivated this change (21-case suite, full results in PR
description):

- Sonnet 4.6 explainer with adaptive thinking: errors flagged 5 -> 1.
  Eliminates the recurring `imul eax, edi, edi` invention (2/5 -> 0/5
  in focused 5-run testing). Cost +~$0.013/req, latency +11s. Held back
  from production: too much wall-clock for an interactive endpoint.
- Opus 4.7 reviewer with adaptive thinking: catches different errors
  and finds real issues the no-thinking reviewer misses (e.g. correctly
  flags `factorial_beginner_assembly` mis-naming TCO). ~70% extra
  reviewer cost, no user-facing latency since reviews are offline.
  Worth defaulting on for eval rigor.

Implementation:

- `Prompt` reads an optional `model.thinking` block (e.g. `{type:
  adaptive}` or `{type: enabled, budget_tokens: 2000}`) and surfaces it
  via `generate_messages()`.
- `app/explain.py` and `prompt_testing/runner.py` both now construct
  api_kwargs conditionally: with thinking, drop `temperature` (the API
  rejects it); pick the last text block from the response (thinking
  blocks come first when enabled).
- `CorrectnessReviewer` accepts an optional `thinking` config and
  applies the same multi-block extraction.
- `prompt-test run` and `prompt-test review` gain `--reviewer-thinking
  / --thinking` flags. Default is `adaptive` because the rigor win on
  eval is worth the extra cost; pass `off` to compare or save money.

Important gotcha: with adaptive thinking, `max_tokens` must be large
enough to leave room for the visible response after the thinking
budget. Production prompt's 1536 starves the text output on complex
cases — bump to >=4096 if a future variant enables thinking.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a short section covering the non-obvious things this PR's session
turned up that would burn time to rediscover: max_tokens-includes-thinking,
the reviewer's temperature deprecation, the new default reviewer thinking,
the deliberate hold on production explainer thinking, and the multi-block
response pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Plumbs Anthropic “extended thinking” configuration through prompt generation, the production explain handler, and the prompt-testing pipeline, and enables adaptive thinking by default for the offline correctness reviewer (with production behavior unchanged unless thinking is set in prompt.yaml).

Changes:

  • Add optional model.thinking support in Prompt and propagate it to message generation.
  • Conditionally build Anthropic API kwargs: omit temperature when thinking is set; extract the final response from the last type="text" content block.
  • Extend prompt-test CLI/reviewer to accept thinking config (defaulting reviewer thinking to adaptive).

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

| File | Description |
| ---- | ----------- |
| app/prompt.py | Reads optional model.thinking and includes it in generated prompt data. |
| app/explain.py | Builds API kwargs conditionally and extracts last text block from multi-block responses. |
| app/test_explain.py | Updates mock response blocks to include type="text". |
| prompt_testing/runner.py | Builds API kwargs conditionally and extracts last text block from multi-block responses. |
| prompt_testing/reviewer.py | Adds optional reviewer thinking config and last-text-block extraction for multi-block responses. |
| prompt_testing/cli.py | Adds CLI flags to control reviewer thinking (default adaptive) and wires config through review runs. |


Comment thread app/explain.py Outdated
@@ -92,16 +92,24 @@ async def _call_anthropic_api(
# Call Claude API
LOGGER.info("Using Anthropic client with model: %s", {prompt_data["model"]})
Contributor Author


Good catch — pre-existing bug, fixed in acdb10d. Dropped the braces so the model id is logged as a string rather than a single-element set.
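
For reference, the shape of that fix, reconstructed from the diff context above rather than copied from the commit:

```python
# Before: the braces build a one-element set, so the log reads {'claude-...'}
LOGGER.info("Using Anthropic client with model: %s", {prompt_data["model"]})

# After (acdb10d): pass the model id directly
LOGGER.info("Using Anthropic client with model: %s", prompt_data["model"])
```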

Comment thread app/explain.py
Comment on lines +109 to +112
# Pick the last text block — when thinking is enabled the response
# contains thinking blocks before the final text block.
text_blocks = [c for c in message.content if getattr(c, "type", None) == "text"]
explanation = text_blocks[-1].text.strip() if text_blocks else ""
Contributor Author


Agreed — and exactly the failure mode the CLAUDE.md gotcha calls out. Fixed in acdb10d: now raises RuntimeError with the stop_reason so we don't cache or serve an empty success. Added a test in app/test_explain.py for the no-text-block case.

Comment thread prompt_testing/runner.py
Comment on lines +88 to +93
+     text_blocks = [c for c in msg.content if getattr(c, "type", None) == "text"]
+     explanation = text_blocks[-1].text.strip() if text_blocks else ""
      return {
          "case_id": case_id,
          "success": True,
-         "explanation": msg.content[-1].text.strip(),
+         "explanation": explanation,
Contributor Author


Yes — I literally hit this on 4/21 cases during the thinking experiments and it skewed the suite metrics. Fixed in acdb10d: returns success=False with the stop_reason and token counts so the eval clearly surfaces the failure.
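
A hedged sketch of that failure path, based on this comment and the later commit notes; field names beyond success, stop_reason, and the token counts are assumptions:

```python
if not text_blocks:
    # Don't record an empty explanation as a pass: surface why it happened and
    # what it cost, so the suite metrics reflect reality.
    return {
        "case_id": case_id,
        "success": False,
        "error": f"no text block in response (stop_reason={msg.stop_reason})",
        "input_tokens": msg.usage.input_tokens,
        "output_tokens": msg.usage.output_tokens,
    }
```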

Comment thread app/test_explain.py
Comment on lines 59 to 63
mock_content = MagicMock()
mock_content.type = "text"
mock_content.text = "This assembly code implements a simple square function..."
mock_message.content = [mock_content]

Contributor Author


Added in acdb10d: test_picks_last_text_block_with_thinking (thinking block before text, asserts the text block is selected) and test_raises_when_no_text_block (thinking-only response raises). 94 tests passing.
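
Roughly what those two tests look like. The test names come from the comment above; the `_block` helper and the `_extract_explanation` entry point are stand-ins invented for this sketch:

```python
from unittest.mock import MagicMock

import pytest

# Hypothetical name for wherever app/explain.py does the block selection.
from app.explain import _extract_explanation


def _block(block_type: str, text: str = "") -> MagicMock:
    mock = MagicMock()
    mock.type = block_type
    mock.text = text
    return mock


def test_picks_last_text_block_with_thinking() -> None:
    message = MagicMock()
    message.content = [_block("thinking"), _block("text", "the real answer")]
    assert _extract_explanation(message) == "the real answer"


def test_raises_when_no_text_block() -> None:
    message = MagicMock()
    message.content = [_block("thinking")]
    message.stop_reason = "max_tokens"
    with pytest.raises(RuntimeError):
        _extract_explanation(message)
```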

- app/explain.py: drop the accidental set literal in the model log so it
  prints the model id rather than `{'claude-...'}`.
- app/explain.py: raise RuntimeError when the response contains no text
  block (e.g. extended thinking exhausting max_tokens), so we don't
  cache or serve an empty success response.
- prompt_testing/runner.py: same empty-output guard, returning
  success=False with stop_reason and token counts so suite metrics
  reflect reality instead of silently absorbing empty responses.
- app/test_explain.py: add coverage for multi-block responses
  (thinking + text picks the text block) and the empty-block case
  (raises RuntimeError).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

app/prompt.py:286

  • Prompt.generate_messages() now returns a thinking field, but the docstring’s “Returns a dict with:” list doesn’t mention it. Update the docstring to reflect the new return shape so callers and tests don’t rely on outdated documentation.
        return {
            "model": self.model,
            "max_tokens": self.max_tokens,
            "temperature": self.temperature,
            "thinking": self.thinking,
            "system": system_prompt,
            "messages": messages,
            "structured_data": structured_data,  # Include for reference
        }

Comment on lines +118 to 122
# When thinking is enabled the response contains thinking blocks
# before the final text block; pick the last text block.
text_blocks = [c for c in msg.content if getattr(c, "type", None) == "text"]
text = text_blocks[-1].text.strip() if text_blocks else ""

Contributor Author


Fixed in acdb10d. The empty-text branch now returns a structured result with correct=None, summary containing stop_reason and token counts, and reviewer_input/output_tokens populated — no more falling through to the JSON parse failure path.

Comment thread app/explain.py Outdated
Comment on lines +95 to +100
api_kwargs: dict = {
    "model": prompt_data["model"],
    "max_tokens": prompt_data["max_tokens"],
    "system": prompt_data["system"],
    "messages": prompt_data["messages"],
}
Contributor Author


Fixed in acdb10d: `api_kwargs` is now annotated `dict[str, Any]` (with `Any` imported from `typing`).

Comment thread app/explain.py
Comment on lines +113 to +119
if not explanation:
    # Can happen if extended thinking exhausts max_tokens before any
    # text block is emitted. Don't cache or return an empty response.
    raise RuntimeError(
        f"Claude returned no text content (stop_reason={message.stop_reason}). "
        f"If thinking is enabled, max_tokens may be too low."
    )
Contributor Author


Fixed in 01d80c8 (different shape than your suggestion): instead of raising and converting to a 500, _call_anthropic_api now returns ExplainResponse(status="error") with usage populated, emits ClaudeExplainEmptyResponse, and process_request skips caching error responses. Token usage is included in the error message so it's diagnosable from logs.
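
A hedged sketch of that shape; ExplainResponse, the status value, and the metric name come from this comment, while the exact constructor arguments are assumptions:

```python
if not text_blocks:
    # Return a structured error instead of raising: the caller still gets usage
    # data, and process_request skips caching responses with status="error".
    metrics.increment("ClaudeExplainEmptyResponse")
    return ExplainResponse(
        status="error",
        error=(
            f"Claude returned no text content (stop_reason={message.stop_reason}, "
            f"output_tokens={message.usage.output_tokens})"
        ),
        usage=message.usage,
    )
```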

Comment thread prompt_testing/cli.py Outdated
Comment on lines 205 to 210
  async def _run_reviews(project_root: Path, results: dict, model: str, thinking: dict | None = None) -> dict:
      """Run correctness reviews on all successful results."""
      from prompt_testing.reviewer import CorrectnessReviewer

-     reviewer = CorrectnessReviewer(model=model)
+     reviewer = CorrectnessReviewer(model=model, thinking=thinking)
      test_dir = project_root / "prompt_testing" / "test_cases"
Contributor Author


Fixed in 01d80c8: `_run_reviews` and `_print_review_summary` now use `dict[str, Any]` consistently (with `Any` imported).

mattgodbolt-molty and others added 2 commits May 6, 2026 16:59
Findings beyond Copilot's pass:

- app/explain.py: empty-text responses no longer raise (which surfaced
  to users as a generic 500 with no payload). They now return
  ExplainResponse(status="error") with usage populated, emit a
  ClaudeExplainEmptyResponse metric, and are not cached so retries hit
  the API.
- prompt_testing/runner.py: empty-response failures now capture
  input_tokens/output_tokens/elapsed_ms/model and contribute to
  total_cost_usd. Previously the spent tokens vanished from suite cost.
- app/cache.py: cache_data explicitly includes the thinking config.
  Currently captured via prompt_version's hash, but the documented
  cache contract is the explicit field list — listing thinking keeps
  that contract honest.
- app/prompt.py: Prompt.__init__ refuses thinking-enabled configs with
  max_tokens < 4096 (sketched after this commit message). Below this,
  thinking can consume the whole budget and starve the visible text.
  Fail at load time rather than silently producing empty responses.
- prompt_testing/cli.py: distinguish review_failures (correct=None,
  reviewer infrastructure failed) from errors_found (correct=False,
  real factual error). Previously conflated under errors_found, which
  hid the difference between "model said something wrong" and "we
  couldn't review the response at all."
- prompt_testing/cli.py: tighten bare dict annotations to
  dict[str, Any] for consistency.
- CLAUDE.md: note redacted_thinking blocks (rare, encrypted reasoning
  when safety filters trip) and document the structured-error contract
  for empty responses.

Tests: 94 -> 96 passing (added two for the new Prompt validation, plus
updated the empty-text test to assert the structured error response
shape).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
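
A hedged sketch of the load-time guard from the app/prompt.py bullet above; the attribute names follow earlier excerpts, but the config shape and the constant are assumptions:

```python
THINKING_MIN_MAX_TOKENS = 4096  # below this, thinking can starve the visible text


class Prompt:
    def __init__(self, config: dict) -> None:
        model_cfg = config["model"]
        self.max_tokens: int = model_cfg["max_tokens"]
        self.thinking: dict | None = model_cfg.get("thinking")
        if self.thinking is not None and self.max_tokens < THINKING_MIN_MAX_TOKENS:
            # Fail at load time rather than silently producing empty responses.
            raise ValueError(
                f"thinking requires max_tokens >= {THINKING_MIN_MAX_TOKENS}, "
                f"got {self.max_tokens}"
            )
```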
This file is per-user session state and shouldn't be tracked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 8 out of 9 changed files in this pull request and generated 2 comments.

Comment thread app/cache.py
Comment on lines 141 to 150
cache_data = {
    "model": prompt_data["model"],
    "max_tokens": prompt_data["max_tokens"],
    "temperature": prompt_data["temperature"],
    # Thinking config affects the API response so it must affect the cache
    # key. Currently captured via `prompt_version` too, but listing it
    # explicitly keeps the contract clear if thinking ever gets set
    # outside the YAML (e.g. per-request override).
    "thinking": prompt_data.get("thinking"),
    "system": prompt_data["system"],
Contributor Author


Good catch on the cache invalidation cost. I initially gated this behind if thinking is not None (49df42b) to preserve existing entries, but the maintainer preferred the simpler unconditional include and is fine with the one-time flush. So df76c7a reverts to the original unconditional shape.

Comment thread prompt_testing/runner.py
Comment on lines 151 to 157
  successful = [r for r in results if r["success"]]
+ # Cost includes failures that consumed tokens (e.g. thinking exhausted
+ # max_tokens before any text was emitted) — those aren't free.
  total_cost = sum(
-     r["input_tokens"] * 3 / 1e6 + r["output_tokens"] * 15 / 1e6  # Sonnet pricing
-     for r in successful
+     r.get("input_tokens", 0) * 3 / 1e6 + r.get("output_tokens", 0) * 15 / 1e6  # Sonnet pricing
+     for r in results
  )
Contributor Author


Real bug — fixed in 49df42b. Now uses get_model_cost(prompt.model) to compute per-token rates, matching the pattern in cli.py's reviewer cost calc. Cost stays correct for any explainer model loaded via the runner.
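
A sketch of the corrected calculation; get_model_cost comes from this comment, but its return shape (per-million-token input and output rates) is an assumption:

```python
input_rate, output_rate = get_model_cost(prompt.model)  # USD per million tokens, assumed shape
total_cost = sum(
    r.get("input_tokens", 0) * input_rate / 1e6
    + r.get("output_tokens", 0) * output_rate / 1e6
    for r in results  # includes failed cases: their tokens still cost money
)
```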

mattgodbolt-molty and others added 2 commits May 6, 2026 17:45
- app/cache.py: only include `thinking` in cache_data when it's set.
  Always serialising the field (even as null) would silently invalidate
  every existing S3 cache entry on deploy, costing a wave of fresh API
  calls for cases where nothing about the prompt has changed.
- prompt_testing/runner.py: compute total_cost via get_model_cost(prompt.model)
  rather than hardcoded Sonnet rates. The runner can load arbitrary
  prompt YAMLs, so hardcoding misreports cost whenever the prompt
  model isn't Sonnet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per maintainer preference, accept the one-time S3 cache flush in
exchange for unconditionally including `thinking` in cache_data. Keeps
the cache-key contract aligned with the docstring (explicit field
list).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mattgodbolt merged commit a88f8a1 into main on May 6, 2026
2 checks passed
mattgodbolt deleted the add-thinking-support branch on May 6, 2026 at 22:46