fix(llm): surface Azure/OpenAI gpt-5 thinking blocks via Responses API #9
Merged
The OpenAI-compatible streaming wrapper listened for the event type
"response.reasoning.delta", which the openai SDK never emits. The real
events are response.reasoning_summary_text.delta and
response.reasoning_text.delta, so every reasoning delta was silently
dropped. With reasoning.summary unset by default, the API also wouldn't
emit summary events at all, even when a reasoning model was reasoning
internally.
Separately, Azure unconditionally routed gpt-5/o3/o4 to /openai/responses,
which returns HTTP 400 BadRequest for api-version < 2025-03-01-preview.
Changes:
- openai_compatible_base: handle the real reasoning event names; default
reasoning.summary="auto" so summary deltas stream; add
_responses_api_supported() hook to gate routing on endpoint capability.
- azure: override _responses_api_supported() to require api-version date
>= 2025-03-01, falling back to Chat Completions on older versions instead
of crashing.
- tests: 23 new routing cases (version gate, prefix matching, config-flag
override) plus updated reasoning-delta test to assert real SDK event
names. 115/115 unit tests pass.
- scripts/verify-azure-thinking.py: live-Azure verification of streaming
thinking chunks and non-streaming reasoning_tokens.
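The two checks described above can be sketched roughly as follows. This is an illustration, not the PR's code: the function names `is_reasoning_delta` and `responses_api_supported` are hypothetical stand-ins, and treating an unparseable api-version as unsupported is an assumption.

```python
import re
from datetime import date
from typing import Optional

# Event types the commit message names as the ones the openai SDK emits.
REASONING_DELTA_EVENTS = frozenset({
    "response.reasoning_summary_text.delta",
    "response.reasoning_text.delta",
})

# Earliest api-version date the commit message says accepts /openai/responses.
MIN_RESPONSES_API_DATE = date(2025, 3, 1)


def is_reasoning_delta(event_type: str) -> bool:
    """Return True for reasoning delta events (hypothetical helper name)."""
    return event_type in REASONING_DELTA_EVENTS


def responses_api_supported(api_version: Optional[str]) -> bool:
    """Gate on the date prefix of an Azure api-version string.

    Assumption: a missing or unparseable version falls back to
    Chat Completions (returns False).
    """
    if not api_version:
        return False
    match = re.match(r"(\d{4})-(\d{2})-(\d{2})", api_version)
    if match is None:
        return False
    try:
        version_date = date(*map(int, match.groups()))
    except ValueError:  # e.g. month 13
        return False
    return version_date >= MIN_RESPONSES_API_DATE
```

Comparing parsed dates rather than raw strings keeps suffixes such as "-preview" from affecting the result.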
Mirrors the routing already used by stream(). When _should_use_responses_api()
is True (reasoning model + endpoint capability), non-streaming chat() now
calls client.responses.create instead of client.chat.completions.create,
and parses the heterogeneous output[] array to populate
LLMResponse.reasoning_content with the model's reasoning summary text —
previously always None on this path.
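As a rough sketch of what parsing the heterogeneous output[] array might look like, the helper below walks plain dicts shaped like Responses API output items. The helper name `extract_reasoning_and_text`, the use of dicts instead of SDK objects, and joining summary parts with a blank line are all assumptions made for illustration.

```python
from typing import Any, Dict, List, Optional, Tuple


def extract_reasoning_and_text(
    output: List[Dict[str, Any]],
) -> Tuple[Optional[str], str]:
    """Split output items into (reasoning summary, answer text)."""
    reasoning_parts: List[str] = []
    text_parts: List[str] = []
    for item in output:
        item_type = item.get("type")
        if item_type == "reasoning":
            # Reasoning items carry a list of summary parts.
            for part in item.get("summary") or []:
                if part.get("type") == "summary_text" and part.get("text"):
                    reasoning_parts.append(part["text"])
        elif item_type == "message":
            # Message items mix output_text and refusal parts; keep text only.
            for part in item.get("content") or []:
                if part.get("type") == "output_text":
                    text_parts.append(part.get("text", ""))
        # Other item types (e.g. function_call) are handled elsewhere.
    reasoning = "\n\n".join(reasoning_parts) or None
    return reasoning, "".join(text_parts)
```

Returning None rather than an empty string when no summary is present matches the "previously always None" behavior for callers that test for missing reasoning.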
Implementation:
- chat() becomes a thin dispatcher with shared error handling.
- _chat_via_chat_completions: existing path, extracted unchanged.
- _chat_via_responses: new path. Reuses _convert_to_responses_input and
_prepare_tools_for_responses helpers. Builds reasoning={"summary":"auto"}
by default so summary text actually streams back. Maps response.status
+ incomplete_details.reason to a Chat-Completions-style finish_reason
(stop/tool_calls/length/incomplete). Maps usage input/output_tokens to
prompt/completion_tokens for caller compatibility. Constructs
ChatCompletionMessageToolCall objects for function_call items so
downstream parsers (response_parser._to_tool_call_dicts) see the same
Pydantic shape regardless of API path. json_mode maps to text.format.
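The finish_reason and usage mappings listed above could look something like the sketch below. The commit message does not spell out the exact rules, so the specific conditions here (for example, treating "max_output_tokens" as "length") are assumptions, and both function names are hypothetical.

```python
from typing import Dict, Optional


def map_finish_reason(
    status: Optional[str],
    incomplete_reason: Optional[str],
    has_tool_calls: bool,
) -> str:
    """Map Responses API status fields to a Chat-Completions-style value."""
    if status == "incomplete":
        # Assumption: token-limit truncation maps to "length"; anything
        # else (e.g. a content filter) is reported as "incomplete".
        if incomplete_reason == "max_output_tokens":
            return "length"
        return "incomplete"
    return "tool_calls" if has_tool_calls else "stop"


def map_usage(usage: Dict[str, int]) -> Dict[str, int]:
    """Rename input/output token counts to the prompt/completion names."""
    prompt = usage.get("input_tokens", 0)
    completion = usage.get("output_tokens", 0)
    return {
        "prompt_tokens": prompt,
        "completion_tokens": completion,
        "total_tokens": usage.get("total_tokens", prompt + completion),
    }
```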
Tests:
- 11 new unit tests in test_chat_via_responses.py covering reasoning
extraction, content extraction (skipping refusals), tool-call shape,
finish_reason mapping (stop/tool_calls/length/incomplete), usage
mapping, dispatch routing (both directions), and json_mode mapping.
- 126/126 unit tests pass.
Live Azure gpt-5.2: chat() now returns reasoning_content with 1044+ chars
of summary text (was None). reasoning_tokens=455. Verified via
scripts/verify-azure-thinking.py.
Without an explicit effort, gpt-5* sometimes skips reasoning entirely on a
given call, leaving reasoning_content empty and firing no reasoning summary
deltas even with summary='auto'. Observed live with Azure gpt-5.2:
identical requests yielded 0 thinking chunks on some calls and 200+ on
others, purely model nondeterminism.
The wrapper is the right place to set this: it already routes
reasoning-model traffic to the Responses API, so it knows when reasoning is
the expected mode. Callers can still override (e.g. effort='low' for
cheaper turns).
Both _chat_via_responses and _stream_responses now setdefault:
- effort = "medium"
- summary = "auto" (already there)
Tests: 3 new cases covering default behavior, effort override, and summary
override. 129/129 unit tests pass.
Live verification (Azure gpt-5.2 via DanaCodingAgent): wrapper now
deterministically returns 967 chars of reasoning_content. Tracking confirms
the wrapper-side issue is fully fixed; remaining agent-side issue
(star_agent.py:750-757 drops reasoning on direct-answer turns) is a
separate concern not addressed here.
Adds scripts/verify-thinking-persisted-via-coding-agent.py for end-to-end
inspection of timeline persistence.
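A minimal sketch of the setdefault behavior described in this commit, assuming the caller's reasoning options arrive as an optional dict. The function name `with_reasoning_defaults` is hypothetical, and the "medium" default reflects this commit only.

```python
from typing import Any, Dict, Optional


def with_reasoning_defaults(
    reasoning: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Fill in effort and summary without overriding caller choices."""
    merged = dict(reasoning or {})  # copy so the caller's dict is untouched
    merged.setdefault("effort", "medium")
    merged.setdefault("summary", "auto")
    return merged
```

Because setdefault only writes missing keys, a caller passing effort='low' keeps that value while still picking up summary='auto'.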
When the model answered without invoking a tool, _record_think_results
took the no-tool-calls branch (star_agent.py:750) which only added
AGENT_RESPONSE — silently dropping the parsed reasoning. The else branch
(tool-calls path) already added AGENT_THOUGHTS for non-empty reasoning;
the no-tool-calls branch was the asymmetric outlier.
This dropped reasoning text on:
- direct-answer turns (model solves the puzzle without tools)
- the final turn of any tool-using session (after tools, model answers)
Affects all reasoning surfaces routed through codec_with_native_tool_use:
- LLMResponse.reasoning_content (gpt-5/o3/o4 via Responses API,
DeepSeek-R1, future Anthropic extended thinking)
- <thinking> XML tags
- JSON {"reasoning": ...} fields
Live verification (Azure gpt-5.2 via DanaCodingAgent):
- direct scenario: 1373 chars reasoning persisted as AGENT_THOUGHTS
(was 0)
- tool scenario: AGENT_THOUGHTS entry now appears even on the final
answer turn
For non-reasoning models, parsed.reasoning is None, the new block is
a no-op, behavior is unchanged.
129/129 unit tests pass.
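The no-tool-calls fix can be pictured with the simplified sketch below. It is not the code in star_agent.py: the `Parsed` dataclass, the list-of-tuples timeline, and the function name `record_direct_answer` are hypothetical stand-ins, and recording thoughts before the response is an assumed ordering.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Parsed:
    """Hypothetical stand-in for the parsed LLM response."""
    content: str
    reasoning: Optional[str] = None


def record_direct_answer(parsed: Parsed, timeline: List[Tuple[str, str]]) -> None:
    """No-tool-calls branch: keep reasoning instead of dropping it."""
    if parsed.reasoning:
        # New block: mirrors what the tool-calls branch already did.
        timeline.append(("AGENT_THOUGHTS", parsed.reasoning))
    timeline.append(("AGENT_RESPONSE", parsed.content))
```

For a non-reasoning model, `parsed.reasoning` is None, so only AGENT_RESPONSE is appended, as before.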
Add OPENAI_THINKING_EFFORT and AZURE_THINKING_EFFORT env vars (plus generic
LLM_REASONING_EFFORT fallback) so operators can dial reasoning cost/latency
without code changes.
Precedence:
1. caller's reasoning.effort kwarg
2. provider env var (AZURE_THINKING_EFFORT / OPENAI_THINKING_EFFORT)
3. LLM_REASONING_EFFORT
4. "low" (was "medium")
Default flipped to "low" so callers opt into deeper reasoning rather than
paying for medium implicitly. Invalid env values are logged and ignored.
Applies to both _chat_via_responses and _stream_responses.
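The precedence order above might be resolved along these lines. This is a sketch under assumptions: the function name `resolve_effort`, the `env` parameter (added so the lookup can be tested without touching the real environment), and the set of valid effort values are not taken from the PR. It also assumes an invalid provider value falls through to LLM_REASONING_EFFORT.

```python
import logging
import os
from typing import Mapping, Optional

logger = logging.getLogger(__name__)

# Assumption: the accepted effort values; the PR does not list them.
VALID_EFFORTS = frozenset({"minimal", "low", "medium", "high"})

PROVIDER_ENV_VARS = {
    "azure": "AZURE_THINKING_EFFORT",
    "openai": "OPENAI_THINKING_EFFORT",
}


def resolve_effort(
    provider: str,
    caller_effort: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Resolve effort: caller kwarg, provider env, generic env, then "low"."""
    env = os.environ if env is None else env
    if caller_effort:
        return caller_effort
    for name in (PROVIDER_ENV_VARS.get(provider), "LLM_REASONING_EFFORT"):
        if not name:
            continue  # unknown provider: only the generic variable applies
        raw = env.get(name, "").strip().lower()
        if not raw:
            continue
        if raw in VALID_EFFORTS:
            return raw
        logger.warning("Ignoring invalid %s=%r", name, raw)
    return "low"
```

For example, `resolve_effort("azure", env={"AZURE_THINKING_EFFORT": "high"})` returns "high", while an empty environment yields the "low" default.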
Summary
Two-commit fix making gpt-5/o3/o4 reasoning visible through both stream() and chat() on Azure and OpenAI.
2cf3b5d: streaming fix
- The wrapper listened for response.reasoning.delta (a string the openai SDK never emits); replaced with the real events response.reasoning_summary_text.delta and response.reasoning_text.delta, and default reasoning.summary="auto" so summary deltas actually stream.
- Azure unconditionally routed gpt-5/o3/o4 to /openai/responses, which 400s for api-version < 2025-03-01-preview. Added _responses_api_supported() hook; AzureProvider overrides it to gate on api-version date, falling back to Chat Completions instead of crashing.
cf3eb86: non-streaming chat() routed through Responses API
- chat() for gpt-5/o3/o4 (when api-version supports it) now calls client.responses.create and parses the heterogeneous output[] array, populating LLMResponse.reasoning_content with reasoning summary text (was always None on this path).
- chat() becomes a thin dispatcher with shared error handling. Old path extracted as _chat_via_chat_completions. New _chat_via_responses reuses existing helpers (_convert_to_responses_input, _prepare_tools_for_responses).
- response.status + incomplete_details.reason → Chat-Completions-style finish_reason (stop/tool_calls/length/incomplete).
- input_tokens/output_tokens → prompt_tokens/completion_tokens for caller compat.
- ChatCompletionMessageToolCall Pydantic objects for function_call items so response_parser._to_tool_call_dicts sees the same shape regardless of API path.
- json_mode=True maps to text={"format":{"type":"json_object"}}.
Routing matrix (same for stream() and chat())
(table not recoverable; surviving cells: 2025-03-01-preview+, use_responses_api=False)
Test plan
- tests/unit/llm/test_responses_api_routing.py: 23 cases covering the Azure version gate, routing decision (gpt-5+supported, gpt-5+old→fallback, non-reasoning, explicit-flag overrides), and prefix matching for gpt-5/o3/o4
- tests/unit/llm/test_chat_via_responses.py: 11 cases covering reasoning_content extraction, content-text extraction (skipping refusals), tool-call Pydantic shape, finish_reason mapping (stop/tool_calls/length/incomplete), usage mapping, dispatch routing both directions, and json_mode→text.format
- tests/unit/llm/test_openai_streaming.py::test_reasoning_delta_yields_thinking: parametrized over both real SDK event names
- tests/unit/llm/: 126/126 passing
- scripts/verify-azure-thinking.py:
  - reasoning_tokens: 363 (was ~118)
  - reasoning_content: 530+ chars of summary text (was always None)
Follow-ups (not in this PR)