Add extended-thinking support; default reviewer to adaptive thinking #17

mattgodbolt merged 7 commits into main
Conversation
Plumbs Anthropic's extended thinking through the prompt-testing pipeline
and the production explain handler. The production prompt is unchanged
(no `thinking` field set), so behaviour and cost are identical until a
config opts in.
Findings that motivated this change (21-case suite, full results in PR
description):
- Sonnet 4.6 explainer with adaptive thinking: errors flagged 5 -> 1.
Eliminates the recurring `imul eax, edi, edi` invention (2/5 -> 0/5
in focused 5-run testing). Cost +~$0.013/req, latency +11s. Held back
from production: too much wall-clock for an interactive endpoint.
- Opus 4.7 reviewer with adaptive thinking: catches different errors
and finds real issues the no-thinking reviewer misses (e.g. correctly
flags `factorial_beginner_assembly` mis-naming TCO). ~70% extra
reviewer cost, no user-facing latency since reviews are offline.
Worth defaulting on for eval rigor.
Implementation:
- `Prompt` reads an optional `model.thinking` block (e.g. `{type:
adaptive}` or `{type: enabled, budget_tokens: 2000}`) and surfaces it
via `generate_messages()`.
- `app/explain.py` and `prompt_testing/runner.py` both now construct
`api_kwargs` conditionally (sketched after this list): with thinking, drop
`temperature` (the API rejects it); pick the last text block from the
response (thinking blocks come first when enabled).
- `CorrectnessReviewer` accepts an optional `thinking` config and
applies the same multi-block extraction.
- `prompt-test run` and `prompt-test review` gain `--reviewer-thinking
/ --thinking` flags. Default is `adaptive` because the rigor win on
eval is worth the extra cost; pass `off` to compare or save money.
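For orientation, here is a minimal sketch of the conditional-kwargs-and-extraction pattern from the `app/explain.py` / `prompt_testing/runner.py` item above. It assumes an `anthropic.AsyncAnthropic` client and the dict returned by `generate_messages()`; the function name is illustrative and this is not the exact production code:

```python
from typing import Any

import anthropic


async def call_claude(client: anthropic.AsyncAnthropic, prompt_data: dict[str, Any]) -> str:
    # Build kwargs conditionally: the API rejects `temperature` when a
    # thinking config is present, so send one or the other, never both.
    api_kwargs: dict[str, Any] = {
        "model": prompt_data["model"],
        "max_tokens": prompt_data["max_tokens"],
        "system": prompt_data["system"],
        "messages": prompt_data["messages"],
    }
    if prompt_data.get("thinking"):
        api_kwargs["thinking"] = prompt_data["thinking"]
    else:
        api_kwargs["temperature"] = prompt_data["temperature"]

    message = await client.messages.create(**api_kwargs)

    # With thinking enabled, thinking blocks precede the final text block,
    # so pick the last text block rather than content[-1].
    text_blocks = [c for c in message.content if getattr(c, "type", None) == "text"]
    return text_blocks[-1].text.strip() if text_blocks else ""
```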
Important gotcha: with adaptive thinking, `max_tokens` must be large
enough to leave room for the visible response after the thinking
budget. Production prompt's 1536 starves the text output on complex
cases — bump to >=4096 if a future variant enables thinking.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a short section covering the non-obvious things this PR's session turned up that would burn time to rediscover: max_tokens-includes-thinking, the reviewer's temperature deprecation, the new default reviewer thinking, the deliberate hold on production explainer thinking, and the multi-block response pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Plumbs Anthropic “extended thinking” configuration through prompt generation, the production explain handler, and the prompt-testing pipeline, and enables adaptive thinking by default for the offline correctness reviewer (with production behavior unchanged unless thinking is set in prompt.yaml).
Changes:
- Add optional `model.thinking` support in `Prompt` and propagate it to message generation.
- Conditionally build Anthropic API kwargs: omit `temperature` when `thinking` is set; extract the final response from the last `type="text"` content block.
- Extend prompt-test CLI/reviewer to accept a `thinking` config (defaulting reviewer thinking to `adaptive`).
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| app/prompt.py | Reads optional model.thinking and includes it in generated prompt data. |
| app/explain.py | Builds API kwargs conditionally and extracts last text block from multi-block responses. |
| app/test_explain.py | Updates mock response blocks to include type="text". |
| prompt_testing/runner.py | Builds API kwargs conditionally and extracts last text block from multi-block responses. |
| prompt_testing/reviewer.py | Adds optional reviewer thinking config and last-text-block extraction for multi-block responses. |
| prompt_testing/cli.py | Adds CLI flags to control reviewer thinking (default adaptive) and wires config through review runs. |
```diff
@@ -92,16 +92,24 @@ async def _call_anthropic_api(
     # Call Claude API
     LOGGER.info("Using Anthropic client with model: %s", {prompt_data["model"]})
```
Good catch — pre-existing bug, fixed in acdb10d. Dropped the braces so the model id is logged as a string rather than a single-element set.
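In code terms the fix is just dropping the braces; a before/after sketch of the log call:

```python
# Before: the braces build a single-element set, so the log shows {'claude-...'}
LOGGER.info("Using Anthropic client with model: %s", {prompt_data["model"]})
# After (acdb10d): log the model id directly as a string
LOGGER.info("Using Anthropic client with model: %s", prompt_data["model"])
```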
```python
# Pick the last text block — when thinking is enabled the response
# contains thinking blocks before the final text block.
text_blocks = [c for c in message.content if getattr(c, "type", None) == "text"]
explanation = text_blocks[-1].text.strip() if text_blocks else ""
```
Agreed — and exactly the failure mode the CLAUDE.md gotcha calls out. Fixed in acdb10d: now raises RuntimeError with the stop_reason so we don't cache or serve an empty success. Added a test in app/test_explain.py for the no-text-block case.
```diff
+        text_blocks = [c for c in msg.content if getattr(c, "type", None) == "text"]
+        explanation = text_blocks[-1].text.strip() if text_blocks else ""
         return {
             "case_id": case_id,
             "success": True,
-            "explanation": msg.content[-1].text.strip(),
+            "explanation": explanation,
```
Yes — I literally hit this on 4/21 cases during the thinking experiments and it skewed the suite metrics. Fixed in acdb10d: returns success=False with the stop_reason and token counts so the eval clearly surfaces the failure.
```python
mock_content = MagicMock()
mock_content.type = "text"
mock_content.text = "This assembly code implements a simple square function..."
mock_message.content = [mock_content]
```
Added in acdb10d: test_picks_last_text_block_with_thinking (thinking block before text, asserts the text block is selected) and test_raises_when_no_text_block (thinking-only response raises). 94 tests passing.
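A rough sketch of the shape of those two tests; the helper here is illustrative (the real tests in `app/test_explain.py` exercise the explain handler itself):

```python
from unittest.mock import MagicMock

import pytest


def _extract_text(message) -> str:
    # Same last-text-block pattern as the PR; sketch only, not the repo's helper.
    text_blocks = [c for c in message.content if getattr(c, "type", None) == "text"]
    if not text_blocks:
        raise RuntimeError(f"Claude returned no text content (stop_reason={message.stop_reason}).")
    return text_blocks[-1].text.strip()


def test_picks_last_text_block_with_thinking():
    thinking_block = MagicMock(type="thinking")
    text_block = MagicMock(type="text", text="  the explanation  ")
    message = MagicMock(content=[thinking_block, text_block])
    assert _extract_text(message) == "the explanation"


def test_raises_when_no_text_block():
    message = MagicMock(content=[MagicMock(type="thinking")], stop_reason="max_tokens")
    with pytest.raises(RuntimeError, match="max_tokens"):
        _extract_text(message)
```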
- app/explain.py: drop the accidental set literal in the model log so it
prints the model id rather than `{'claude-...'}`.
- app/explain.py: raise RuntimeError when the response contains no text
block (e.g. extended thinking exhausting max_tokens), so we don't
cache or serve an empty success response.
- prompt_testing/runner.py: same empty-output guard, returning
success=False with stop_reason and token counts so suite metrics
reflect reality instead of silently absorbing empty responses.
- app/test_explain.py: add coverage for multi-block responses
(thinking + text picks the text block) and the empty-block case
(raises RuntimeError).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (1)
app/prompt.py:286
`Prompt.generate_messages()` now returns a `thinking` field, but the docstring's "Returns a dict with:" list doesn't mention it. Update the docstring to reflect the new return shape so callers and tests don't rely on outdated documentation.
```python
return {
    "model": self.model,
    "max_tokens": self.max_tokens,
    "temperature": self.temperature,
    "thinking": self.thinking,
    "system": system_prompt,
    "messages": messages,
    "structured_data": structured_data,  # Include for reference
}
```
```python
# When thinking is enabled the response contains thinking blocks
# before the final text block; pick the last text block.
text_blocks = [c for c in msg.content if getattr(c, "type", None) == "text"]
text = text_blocks[-1].text.strip() if text_blocks else ""
```
Fixed in acdb10d. The empty-text branch now returns a structured result with correct=None, summary containing stop_reason and token counts, and reviewer_input/output_tokens populated — no more falling through to the JSON parse failure path.
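The structured fallback described there might look roughly like this; the dict shape and any field not named in the reply above are illustrative, not the repo's actual reviewer result type:

```python
if not text_blocks:
    # Sketch: return a structured "couldn't review" result instead of
    # falling through to the JSON parse failure path.
    return {
        "correct": None,
        "summary": (
            f"reviewer returned no text (stop_reason={msg.stop_reason}, "
            f"input_tokens={msg.usage.input_tokens}, output_tokens={msg.usage.output_tokens})"
        ),
        "reviewer_input_tokens": msg.usage.input_tokens,
        "reviewer_output_tokens": msg.usage.output_tokens,
    }
```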
```python
api_kwargs: dict = {
    "model": prompt_data["model"],
    "max_tokens": prompt_data["max_tokens"],
    "system": prompt_data["system"],
    "messages": prompt_data["messages"],
}
```
Fixed in acdb10d — `api_kwargs: dict[str, Any]` (and `Any` imported from `typing`).
```python
if not explanation:
    # Can happen if extended thinking exhausts max_tokens before any
    # text block is emitted. Don't cache or return an empty response.
    raise RuntimeError(
        f"Claude returned no text content (stop_reason={message.stop_reason}). "
        f"If thinking is enabled, max_tokens may be too low."
    )
```
Fixed in 01d80c8 (different shape than your suggestion): instead of raising and converting to a 500, _call_anthropic_api now returns ExplainResponse(status="error") with usage populated, emits ClaudeExplainEmptyResponse, and process_request skips caching error responses. Token usage is included in the error message so it's diagnosable from logs.
```diff
 async def _run_reviews(project_root: Path, results: dict, model: str, thinking: dict | None = None) -> dict:
     """Run correctness reviews on all successful results."""
     from prompt_testing.reviewer import CorrectnessReviewer

-    reviewer = CorrectnessReviewer(model=model)
+    reviewer = CorrectnessReviewer(model=model, thinking=thinking)
     test_dir = project_root / "prompt_testing" / "test_cases"
```
Fixed in 01d80c8 — `_run_reviews` and `_print_review_summary` now use `dict[str, Any]` consistently (with `Any` imported).
Findings beyond Copilot's pass:

- app/explain.py: empty-text responses no longer raise (which surfaced to users as a generic 500 with no payload). They now return ExplainResponse(status="error") with usage populated, emit a ClaudeExplainEmptyResponse metric, and are not cached so retries hit the API.
- prompt_testing/runner.py: empty-response failures now capture input_tokens/output_tokens/elapsed_ms/model and contribute to total_cost_usd. Previously the spent tokens vanished from suite cost.
- app/cache.py: cache_data explicitly includes the thinking config. Currently captured via prompt_version's hash, but the documented cache contract is the explicit field list — listing thinking keeps that contract honest.
- app/prompt.py: Prompt.__init__ refuses thinking-enabled configs with max_tokens < 4096. Below this, thinking can consume the whole budget and starve the visible text. Fail at load time rather than silently producing empty responses.
- prompt_testing/cli.py: distinguish review_failures (correct=None, reviewer infrastructure failed) from errors_found (correct=False, real factual error). Previously conflated under errors_found, which hid the difference between "model said something wrong" and "we couldn't review the response at all."
- prompt_testing/cli.py: tighten bare dict annotations to dict[str, Any] for consistency.
- CLAUDE.md: note redacted_thinking blocks (rare, encrypted reasoning when safety filters trip) and document the structured-error contract for empty responses.

Tests: 94 -> 96 passing (added two for the new Prompt validation, plus updated the empty-text test to assert the structured error response shape).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
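A minimal sketch of the Prompt.__init__ guard from the findings above, assuming the check runs at config-load time (the function name and message wording are illustrative):

```python
def validate_thinking_budget(thinking: dict | None, max_tokens: int) -> None:
    # Illustrative sketch of the load-time check described above: below
    # ~4096 tokens, thinking can consume the whole budget and leave no
    # room for the visible text, so fail when the config is loaded.
    if thinking is not None and max_tokens < 4096:
        raise ValueError(
            f"thinking is enabled but max_tokens={max_tokens} < 4096; "
            "raise max_tokens or disable thinking"
        )
```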
This file is per-user session state and shouldn't be tracked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
```python
cache_data = {
    "model": prompt_data["model"],
    "max_tokens": prompt_data["max_tokens"],
    "temperature": prompt_data["temperature"],
    # Thinking config affects the API response so it must affect the cache
    # key. Currently captured via `prompt_version` too, but listing it
    # explicitly keeps the contract clear if thinking ever gets set
    # outside the YAML (e.g. per-request override).
    "thinking": prompt_data.get("thinking"),
    "system": prompt_data["system"],
```
```diff
     successful = [r for r in results if r["success"]]
+    # Cost includes failures that consumed tokens (e.g. thinking exhausted
+    # max_tokens before any text was emitted) — those aren't free.
     total_cost = sum(
-        r["input_tokens"] * 3 / 1e6 + r["output_tokens"] * 15 / 1e6  # Sonnet pricing
-        for r in successful
+        r.get("input_tokens", 0) * 3 / 1e6 + r.get("output_tokens", 0) * 15 / 1e6  # Sonnet pricing
+        for r in results
     )
```
Real bug — fixed in 49df42b. Now uses get_model_cost(prompt.model) to compute per-token rates, matching the pattern in cli.py's reviewer cost calc. Cost stays correct for any explainer model loaded via the runner.
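Under the assumption that get_model_cost returns per-token input/output rates for the given model (its real return shape lives in the repo), the corrected calculation would look roughly like:

```python
# Sketch only: assumes get_model_cost(model) -> (input_rate, output_rate) per token.
input_rate, output_rate = get_model_cost(prompt.model)
total_cost = sum(
    r.get("input_tokens", 0) * input_rate + r.get("output_tokens", 0) * output_rate
    for r in results  # include failed runs: their tokens were still billed
)
```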
- app/cache.py: only include `thinking` in cache_data when it's set. Always serialising the field (even as null) would silently invalidate every existing S3 cache entry on deploy, costing a wave of fresh API calls for cases where nothing about the prompt has changed.
- prompt_testing/runner.py: compute total_cost via get_model_cost(prompt.model) rather than hardcoded Sonnet rates. The runner can load arbitrary prompt YAMLs, so hardcoding misreports cost whenever the prompt model isn't Sonnet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per maintainer preference, accept the one-time S3 cache flush in exchange for unconditionally including `thinking` in cache_data. Keeps the cache-key contract aligned with the docstring (explicit field list).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Plumb Anthropic's extended thinking through the prompt-testing pipeline and the production explain handler, and turn it on by default for the eval reviewer. Production behaviour is unchanged — the production `prompt.yaml` does not set a `thinking` field, so the explainer path is identical to today.

What's wired up

- `Prompt` reads an optional `model.thinking` block (e.g. `{type: adaptive}` or `{type: enabled, budget_tokens: 2000}`) and surfaces it via `generate_messages()`.
- `app/explain.py` and `prompt_testing/runner.py` build `api_kwargs` conditionally: when thinking is set, drop `temperature` (the API rejects it) and pick the last text block from the response (thinking blocks precede text).
- `CorrectnessReviewer` accepts an optional `thinking` config and uses the same multi-block extraction.
- `prompt-test run` gains `--reviewer-thinking off|adaptive` and `prompt-test review` gains `--thinking off|adaptive`. Default is `adaptive` for both — the rigor win on eval is worth the cost; pass `off` to compare or save money.

Why default reviewer thinking on?
21-case eval suite, baseline (Sonnet 4.6 explainer, Opus 4.7 reviewer):
- Explainer (Sonnet 4.6) with adaptive thinking: errors flagged drop 5 → 1 (the recurring `imul eax, edi, edi` invention goes 2/5 → 0/5 in focused 5-run testing). Held back from production: +11s latency is too much for an interactive endpoint.
- Reviewer (Opus 4.7) with adaptive thinking: finds real issues the no-thinking reviewer misses (e.g. correctly flags the `factorial_beginner_assembly` mis-naming of TCO). +~70% reviewer cost, no user-facing latency since reviews are offline. Worth defaulting on.

Important gotcha for future explainer-thinking experiments
`max_tokens` must be large enough to leave room for the text response after the thinking budget. Production's `1536` starved the visible output entirely on 4 complex cases when adaptive thinking was enabled — bump to ≥4096 (I used 8192 in testing) before turning thinking on for any explainer prompt.

Test plan
- `uv run pytest` — 92 tests pass (mock updated to set `type="text"` for the new multi-block extraction).
- `uv run pre-commit run --all-files`.
- `prompt-test run --reviewer-thinking off` and `--reviewer-thinking adaptive` to confirm both code paths.

🤖 Generated with Claude Code