Add extended-thinking support; default reviewer to adaptive thinking #17

Merged

mattgodbolt merged 7 commits into main from add-thinking-support on May 6, 2026
Conversation

@mattgodbolt-molty
Contributor

Summary

Plumb Anthropic's extended thinking through the prompt-testing pipeline and the production explain handler, and turn it on by default for the eval reviewer. Production behaviour is unchanged — production prompt.yaml does not set a thinking field, so the explainer path is identical to today.

What's wired up

  • Prompt reads an optional model.thinking block (e.g. {type: adaptive} or {type: enabled, budget_tokens: 2000}) and surfaces it via generate_messages().
  • app/explain.py and prompt_testing/runner.py build api_kwargs conditionally: when thinking is set, drop temperature (the API rejects it) and pick the last text block from the response (thinking blocks precede text); see the sketch after this list.
  • CorrectnessReviewer accepts an optional thinking config and uses the same multi-block extraction.
  • CLI: prompt-test run gains --reviewer-thinking off|adaptive and prompt-test review gains --thinking off|adaptive. Default is adaptive for both — the rigor win on eval is worth the cost; pass off to compare or save money.
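
A minimal sketch of that conditional construction and extraction, pieced together from the diff hunks quoted in the review threads below. The function name and client wiring are illustrative assumptions, not the production code:

```python
from typing import Any

from anthropic import AsyncAnthropic


async def call_with_optional_thinking(client: AsyncAnthropic, prompt_data: dict[str, Any]) -> str:
    api_kwargs: dict[str, Any] = {
        "model": prompt_data["model"],
        "max_tokens": prompt_data["max_tokens"],
        "system": prompt_data["system"],
        "messages": prompt_data["messages"],
    }
    if prompt_data.get("thinking") is not None:
        # temperature deliberately omitted: the API rejects temperature with thinking.
        api_kwargs["thinking"] = prompt_data["thinking"]
    else:
        api_kwargs["temperature"] = prompt_data["temperature"]

    message = await client.messages.create(**api_kwargs)

    # Thinking blocks precede the final text block, so take the last text block.
    # (A later commit in this PR turns the empty case into a structured error.)
    text_blocks = [c for c in message.content if getattr(c, "type", None) == "text"]
    return text_blocks[-1].text.strip() if text_blocks else ""
```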

Why default reviewer thinking on?

21-case eval suite; the baseline is a Sonnet 4.6 explainer with an Opus 4.7 reviewer:

|                  | correct | errors | warns | avg out tok | avg latency | total $ |
| ---------------- | ------- | ------ | ----- | ----------- | ----------- | ------- |
| baseline         | 18/21   | 5      | 11    | 525         | 11.0s       | $0.55   |
| explainer thinks | 20/21   | 1      | 10    | 1363        | 22.3s       | $0.78   |
| reviewer thinks  | 18/21   | 1      | 6     | 525         | 11.0s       | $0.73   |
| both think       | 18/21   | 2      | 5     | 1217        | 21.0s       | $0.93   |
  • Explainer thinking dramatically improves output quality (errors 5 → 1; eliminates the recurring `imul eax, edi, edi` invention, 2/5 → 0/5 in focused 5-run testing). Held back from production: +11s latency is too much for an interactive endpoint.
  • Reviewer thinking is more rigorous and catches different errors (e.g. it correctly flags the `factorial_beginner_assembly` mis-naming of TCO that the no-thinking reviewer missed). Roughly +70% reviewer cost, with no user-facing latency since reviews run offline. Worth defaulting on.

Important gotcha for future explainer-thinking experiments

max_tokens must be large enough to leave room for the text response after the thinking budget. Production's 1536 starved the visible output entirely on 4 complex cases when adaptive thinking was enabled — bump to ≥4096 (I used 8192 in testing) before turning thinking on for any explainer prompt.

Test plan

  • uv run pytest — 92 tests pass (mock updated to set type="text" for the new multi-block extraction).
  • uv run pre-commit run --all-files.
  • Live runs (results above): baseline, explainer-thinking, reviewer-thinking, both-thinking — across the full 21-case suite, confirming the default change behaves correctly end-to-end.
  • Smoke-tested prompt-test run --reviewer-thinking off and --reviewer-thinking adaptive to confirm both code paths.

🤖 Generated with Claude Code

Plumbs Anthropic's extended thinking through the prompt-testing pipeline
and the production explain handler. The production prompt is unchanged
(no `thinking` field set), so behaviour and cost are identical until a
config opts in.

Findings that motivated this change (21-case suite, full results in PR
description):

- Sonnet 4.6 explainer with adaptive thinking: errors flagged 5 -> 1.
  Eliminates the recurring `imul eax, edi, edi` invention (2/5 -> 0/5
  in focused 5-run testing). Cost +~$0.013/req, latency +11s. Held back
  from production: too much wall-clock for an interactive endpoint.
- Opus 4.7 reviewer with adaptive thinking: catches different errors
  and finds real issues the no-thinking reviewer misses (e.g. correctly
  flags `factorial_beginner_assembly` mis-naming TCO). ~70% extra
  reviewer cost, no user-facing latency since reviews are offline.
  Worth defaulting on for eval rigor.

Implementation:

- `Prompt` reads an optional `model.thinking` block (e.g. `{type:
  adaptive}` or `{type: enabled, budget_tokens: 2000}`) and surfaces it
  via `generate_messages()`.
- `app/explain.py` and `prompt_testing/runner.py` both now construct
  api_kwargs conditionally: with thinking, drop `temperature` (the API
  rejects it); pick the last text block from the response (thinking
  blocks come first when enabled).
- `CorrectnessReviewer` accepts an optional `thinking` config and
  applies the same multi-block extraction.
- `prompt-test run` and `prompt-test review` gain `--reviewer-thinking
  / --thinking` flags. Default is `adaptive` because the rigor win on
  eval is worth the extra cost; pass `off` to compare or save money.

Important gotcha: with adaptive thinking, `max_tokens` must be large
enough to leave room for the visible response after the thinking
budget. Production prompt's 1536 starves the text output on complex
cases — bump to >=4096 if a future variant enables thinking.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a short section covering the non-obvious things this PR's session
turned up that would burn time to rediscover: max_tokens-includes-thinking,
the reviewer's temperature deprecation, the new default reviewer thinking,
the deliberate hold on production explainer thinking, and the multi-block
response pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Plumbs Anthropic “extended thinking” configuration through prompt generation, the production explain handler, and the prompt-testing pipeline, and enables adaptive thinking by default for the offline correctness reviewer (with production behavior unchanged unless thinking is set in prompt.yaml).

Changes:

  • Add optional model.thinking support in Prompt and propagate it to message generation.
  • Conditionally build Anthropic API kwargs: omit temperature when thinking is set; extract the final response from the last type="text" content block.
  • Extend prompt-test CLI/reviewer to accept thinking config (defaulting reviewer thinking to adaptive).

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

| File | Description |
| ---- | ----------- |
| app/prompt.py | Reads optional model.thinking and includes it in generated prompt data. |
| app/explain.py | Builds API kwargs conditionally and extracts last text block from multi-block responses. |
| app/test_explain.py | Updates mock response blocks to include type="text". |
| prompt_testing/runner.py | Builds API kwargs conditionally and extracts last text block from multi-block responses. |
| prompt_testing/reviewer.py | Adds optional reviewer thinking config and last-text-block extraction for multi-block responses. |
| prompt_testing/cli.py | Adds CLI flags to control reviewer thinking (default adaptive) and wires config through review runs. |


Comment thread app/explain.py Outdated
@@ -92,16 +92,24 @@ async def _call_anthropic_api(
# Call Claude API
LOGGER.info("Using Anthropic client with model: %s", {prompt_data["model"]})
Contributor Author


Good catch — pre-existing bug, fixed in acdb10d. Dropped the braces so the model id is logged as a string rather than a single-element set.
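
For reference, the shape of that fix, reconstructed from the diff context above rather than copied from the commit:

```python
# Before: the braces build a one-element set, so the log reads {'claude-...'}
LOGGER.info("Using Anthropic client with model: %s", {prompt_data["model"]})

# After (acdb10d): pass the model id directly
LOGGER.info("Using Anthropic client with model: %s", prompt_data["model"])
```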

Comment thread app/explain.py
Comment on lines +109 to +112
# Pick the last text block — when thinking is enabled the response
# contains thinking blocks before the final text block.
text_blocks = [c for c in message.content if getattr(c, "type", None) == "text"]
explanation = text_blocks[-1].text.strip() if text_blocks else ""
Contributor Author


Agreed — and exactly the failure mode the CLAUDE.md gotcha calls out. Fixed in acdb10d: now raises RuntimeError with the stop_reason so we don't cache or serve an empty success. Added a test in app/test_explain.py for the no-text-block case.

Comment thread prompt_testing/runner.py
Comment on lines +88 to +93
+     text_blocks = [c for c in msg.content if getattr(c, "type", None) == "text"]
+     explanation = text_blocks[-1].text.strip() if text_blocks else ""
      return {
          "case_id": case_id,
          "success": True,
-         "explanation": msg.content[-1].text.strip(),
+         "explanation": explanation,
Contributor Author


Yes — I literally hit this on 4/21 cases during the thinking experiments and it skewed the suite metrics. Fixed in acdb10d: returns success=False with the stop_reason and token counts so the eval clearly surfaces the failure.
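
A hedged sketch of that failure path, based on this comment and the later commit notes; field names beyond success, stop_reason, and the token counts are assumptions:

```python
if not text_blocks:
    # Don't record an empty explanation as a pass: surface why it happened and
    # what it cost, so the suite metrics reflect reality.
    return {
        "case_id": case_id,
        "success": False,
        "error": f"no text block in response (stop_reason={msg.stop_reason})",
        "input_tokens": msg.usage.input_tokens,
        "output_tokens": msg.usage.output_tokens,
    }
```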

Comment thread app/test_explain.py
Comment on lines 59 to 63
mock_content = MagicMock()
mock_content.type = "text"
mock_content.text = "This assembly code implements a simple square function..."
mock_message.content = [mock_content]

Contributor Author


Added in acdb10d: test_picks_last_text_block_with_thinking (thinking block before text, asserts the text block is selected) and test_raises_when_no_text_block (thinking-only response raises). 94 tests passing.
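
Roughly what those two tests look like. The test names come from the comment above; the `_block` helper and the `_extract_explanation` entry point are stand-ins invented for this sketch:

```python
from unittest.mock import MagicMock

import pytest

# Hypothetical name for wherever app/explain.py does the block selection.
from app.explain import _extract_explanation


def _block(block_type: str, text: str = "") -> MagicMock:
    mock = MagicMock()
    mock.type = block_type
    mock.text = text
    return mock


def test_picks_last_text_block_with_thinking() -> None:
    message = MagicMock()
    message.content = [_block("thinking"), _block("text", "the real answer")]
    assert _extract_explanation(message) == "the real answer"


def test_raises_when_no_text_block() -> None:
    message = MagicMock()
    message.content = [_block("thinking")]
    message.stop_reason = "max_tokens"
    with pytest.raises(RuntimeError):
        _extract_explanation(message)
```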

- app/explain.py: drop the accidental set literal in the model log so it
  prints the model id rather than `{'claude-...'}`.
- app/explain.py: raise RuntimeError when the response contains no text
  block (e.g. extended thinking exhausting max_tokens), so we don't
  cache or serve an empty success response.
- prompt_testing/runner.py: same empty-output guard, returning
  success=False with stop_reason and token counts so suite metrics
  reflect reality instead of silently absorbing empty responses.
- app/test_explain.py: add coverage for multi-block responses
  (thinking + text picks the text block) and the empty-block case
  (raises RuntimeError).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

app/prompt.py:286

  • Prompt.generate_messages() now returns a thinking field, but the docstring’s “Returns a dict with:” list doesn’t mention it. Update the docstring to reflect the new return shape so callers and tests don’t rely on outdated documentation.
        return {
            "model": self.model,
            "max_tokens": self.max_tokens,
            "temperature": self.temperature,
            "thinking": self.thinking,
            "system": system_prompt,
            "messages": messages,
            "structured_data": structured_data,  # Include for reference
        }

Comment on lines +118 to 122
# When thinking is enabled the response contains thinking blocks
# before the final text block; pick the last text block.
text_blocks = [c for c in msg.content if getattr(c, "type", None) == "text"]
text = text_blocks[-1].text.strip() if text_blocks else ""

Contributor Author


Fixed in acdb10d. The empty-text branch now returns a structured result with correct=None, summary containing stop_reason and token counts, and reviewer_input/output_tokens populated — no more falling through to the JSON parse failure path.

Comment thread app/explain.py Outdated
Comment on lines +95 to +100
api_kwargs: dict = {
    "model": prompt_data["model"],
    "max_tokens": prompt_data["max_tokens"],
    "system": prompt_data["system"],
    "messages": prompt_data["messages"],
}
Contributor Author


Fixed in acdb10d: `api_kwargs` is now annotated `dict[str, Any]` (with `Any` imported from `typing`).

Comment thread app/explain.py
Comment on lines +113 to +119
if not explanation:
    # Can happen if extended thinking exhausts max_tokens before any
    # text block is emitted. Don't cache or return an empty response.
    raise RuntimeError(
        f"Claude returned no text content (stop_reason={message.stop_reason}). "
        f"If thinking is enabled, max_tokens may be too low."
    )
Contributor Author


Fixed in 01d80c8 (different shape than your suggestion): instead of raising and converting to a 500, _call_anthropic_api now returns ExplainResponse(status="error") with usage populated, emits ClaudeExplainEmptyResponse, and process_request skips caching error responses. Token usage is included in the error message so it's diagnosable from logs.
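
A hedged sketch of that shape; ExplainResponse, the status value, and the metric name come from this comment, while the exact constructor arguments are assumptions:

```python
if not text_blocks:
    # Return a structured error instead of raising: the caller still gets usage
    # data, and process_request skips caching responses with status="error".
    metrics.increment("ClaudeExplainEmptyResponse")
    return ExplainResponse(
        status="error",
        error=(
            f"Claude returned no text content (stop_reason={message.stop_reason}, "
            f"output_tokens={message.usage.output_tokens})"
        ),
        usage=message.usage,
    )
```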

Comment thread prompt_testing/cli.py Outdated
Comment on lines 205 to 210
  async def _run_reviews(project_root: Path, results: dict, model: str, thinking: dict | None = None) -> dict:
      """Run correctness reviews on all successful results."""
      from prompt_testing.reviewer import CorrectnessReviewer

-     reviewer = CorrectnessReviewer(model=model)
+     reviewer = CorrectnessReviewer(model=model, thinking=thinking)
      test_dir = project_root / "prompt_testing" / "test_cases"
Contributor Author


Fixed in 01d80c8: `_run_reviews` and `_print_review_summary` now use `dict[str, Any]` consistently (with `Any` imported).

mattgodbolt-molty and others added 2 commits May 6, 2026 16:59
Findings beyond Copilot's pass:

- app/explain.py: empty-text responses no longer raise (which surfaced
  to users as a generic 500 with no payload). They now return
  ExplainResponse(status="error") with usage populated, emit a
  ClaudeExplainEmptyResponse metric, and are not cached so retries hit
  the API.
- prompt_testing/runner.py: empty-response failures now capture
  input_tokens/output_tokens/elapsed_ms/model and contribute to
  total_cost_usd. Previously the spent tokens vanished from suite cost.
- app/cache.py: cache_data explicitly includes the thinking config.
  Currently captured via prompt_version's hash, but the documented
  cache contract is the explicit field list — listing thinking keeps
  that contract honest.
- app/prompt.py: Prompt.__init__ refuses thinking-enabled configs with
  max_tokens < 4096 (sketched after this commit message). Below this,
  thinking can consume the whole budget and starve the visible text.
  Fail at load time rather than silently producing empty responses.
- prompt_testing/cli.py: distinguish review_failures (correct=None,
  reviewer infrastructure failed) from errors_found (correct=False,
  real factual error). Previously conflated under errors_found, which
  hid the difference between "model said something wrong" and "we
  couldn't review the response at all."
- prompt_testing/cli.py: tighten bare dict annotations to
  dict[str, Any] for consistency.
- CLAUDE.md: note redacted_thinking blocks (rare, encrypted reasoning
  when safety filters trip) and document the structured-error contract
  for empty responses.

Tests: 94 -> 96 passing (added two for the new Prompt validation, plus
updated the empty-text test to assert the structured error response
shape).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
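
A hedged sketch of the load-time guard from the app/prompt.py bullet above; the attribute names follow earlier excerpts, but the config shape and the constant are assumptions:

```python
THINKING_MIN_MAX_TOKENS = 4096  # below this, thinking can starve the visible text


class Prompt:
    def __init__(self, config: dict) -> None:
        model_cfg = config["model"]
        self.max_tokens: int = model_cfg["max_tokens"]
        self.thinking: dict | None = model_cfg.get("thinking")
        if self.thinking is not None and self.max_tokens < THINKING_MIN_MAX_TOKENS:
            # Fail at load time rather than silently producing empty responses.
            raise ValueError(
                f"thinking requires max_tokens >= {THINKING_MIN_MAX_TOKENS}, "
                f"got {self.max_tokens}"
            )
```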
This file is per-user session state and shouldn't be tracked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 8 out of 9 changed files in this pull request and generated 2 comments.

Comment thread app/cache.py
Comment on lines 141 to 150
cache_data = {
    "model": prompt_data["model"],
    "max_tokens": prompt_data["max_tokens"],
    "temperature": prompt_data["temperature"],
    # Thinking config affects the API response so it must affect the cache
    # key. Currently captured via `prompt_version` too, but listing it
    # explicitly keeps the contract clear if thinking ever gets set
    # outside the YAML (e.g. per-request override).
    "thinking": prompt_data.get("thinking"),
    "system": prompt_data["system"],
Contributor Author


Good catch on the cache invalidation cost. I initially gated this behind if thinking is not None (49df42b) to preserve existing entries, but the maintainer preferred the simpler unconditional include and is fine with the one-time flush. So df76c7a reverts to the original unconditional shape.

Comment thread prompt_testing/runner.py
Comment on lines 151 to 157
  successful = [r for r in results if r["success"]]
+ # Cost includes failures that consumed tokens (e.g. thinking exhausted
+ # max_tokens before any text was emitted) — those aren't free.
  total_cost = sum(
-     r["input_tokens"] * 3 / 1e6 + r["output_tokens"] * 15 / 1e6  # Sonnet pricing
-     for r in successful
+     r.get("input_tokens", 0) * 3 / 1e6 + r.get("output_tokens", 0) * 15 / 1e6  # Sonnet pricing
+     for r in results
  )
Contributor Author


Real bug — fixed in 49df42b. Now uses get_model_cost(prompt.model) to compute per-token rates, matching the pattern in cli.py's reviewer cost calc. Cost stays correct for any explainer model loaded via the runner.
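
A sketch of the corrected calculation; get_model_cost comes from this comment, but its return shape (per-million-token input and output rates) is an assumption:

```python
input_rate, output_rate = get_model_cost(prompt.model)  # USD per million tokens, assumed shape
total_cost = sum(
    r.get("input_tokens", 0) * input_rate / 1e6
    + r.get("output_tokens", 0) * output_rate / 1e6
    for r in results  # includes failed cases: their tokens still cost money
)
```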

mattgodbolt-molty and others added 2 commits May 6, 2026 17:45
- app/cache.py: only include `thinking` in cache_data when it's set.
  Always serialising the field (even as null) would silently invalidate
  every existing S3 cache entry on deploy, costing a wave of fresh API
  calls for cases where nothing about the prompt has changed.
- prompt_testing/runner.py: compute total_cost via get_model_cost(prompt.model)
  rather than hardcoded Sonnet rates. The runner can load arbitrary
  prompt YAMLs, so hardcoding misreports cost whenever the prompt
  model isn't Sonnet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per maintainer preference, accept the one-time S3 cache flush in
exchange for unconditionally including `thinking` in cache_data. Keeps
the cache-key contract aligned with the docstring (explicit field
list).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mattgodbolt merged commit a88f8a1 into main on May 6, 2026
2 checks passed
mattgodbolt deleted the add-thinking-support branch on May 6, 2026 at 22:46