fix(llm): surface Azure/OpenAI gpt-5 thinking blocks via Responses API by ngoclam9415 · Pull Request #9 · aitomatic/dana-runtime

ngoclam9415 · 2026-05-06T17:30:32Z

Summary

Two-commit fix making gpt-5/o3/o4 reasoning visible through both stream() and chat() on Azure and OpenAI.

`2cf3b5d` — streaming fix

The wrapper listened for response.reasoning.delta (a string the openai SDK never emits); replaced it with the real events response.reasoning_summary_text.delta and response.reasoning_text.delta, and defaulted reasoning.summary="auto" so summary deltas actually stream.
Azure unconditionally routed gpt-5/o3/o4 to /openai/responses, which 400s for api-version < 2025-03-01-preview. Added _responses_api_supported() hook; AzureProvider overrides it to gate on api-version date, falling back to Chat Completions instead of crashing.
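
A minimal sketch of the api-version gate described above, not the actual AzureProvider code. The function name, the constant, and the choice to treat missing or unparseable versions as unsupported are assumptions made for illustration; only the 2025-03-01 cutoff comes from this PR.

```python
from datetime import date
from typing import Optional

# Cutoff taken from the PR description; the constant name is hypothetical.
MIN_RESPONSES_API_DATE = date(2025, 3, 1)


def responses_api_supported(api_version: Optional[str]) -> bool:
    """Return True when the date part of api_version is on or after the cutoff."""
    if not api_version:
        return False
    # "2025-03-01-preview" -> "2025-03-01"
    date_part = api_version[:10]
    try:
        parsed = date.fromisoformat(date_part)
    except ValueError:
        # Assumption: anything we cannot parse falls back to Chat Completions.
        return False
    return parsed >= MIN_RESPONSES_API_DATE
```

Comparing parsed dates rather than raw strings keeps a suffix such as "-preview" from affecting the result.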

`cf3eb86` — non-streaming chat() routed through Responses API

chat() for gpt-5/o3/o4 (when api-version supports it) now calls client.responses.create and parses the heterogeneous output[] array — populates LLMResponse.reasoning_content with reasoning summary text (was always None on this path).
chat() becomes a thin dispatcher with shared error handling. Old path extracted as _chat_via_chat_completions. New _chat_via_responses reuses existing helpers (_convert_to_responses_input, _prepare_tools_for_responses).
Maps response.status + incomplete_details.reason → Chat-Completions-style finish_reason (stop/tool_calls/length/incomplete).
Maps usage input_tokens/output_tokens → prompt_tokens/completion_tokens for caller compat.
Constructs ChatCompletionMessageToolCall Pydantic objects for function_call items so response_parser._to_tool_call_dicts sees the same shape regardless of API path.
json_mode=True maps to text={"format":{"type":"json_object"}}.
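
The finish_reason and usage mappings listed above could look roughly like the sketch below. The helper names and the exact decision rules are assumptions; the PR only states which fields are mapped and which finish_reason values can result. Plain dicts stand in for the SDK's usage object.

```python
from typing import Optional


def map_finish_reason(status: str, incomplete_reason: Optional[str],
                      has_tool_calls: bool) -> str:
    """Translate Responses API status fields to a Chat-Completions-style finish_reason."""
    if status == "incomplete":
        # Assumption: hitting the output token cap is reported as "length";
        # any other incomplete reason is passed through as "incomplete".
        if incomplete_reason == "max_output_tokens":
            return "length"
        return "incomplete"
    if has_tool_calls:
        return "tool_calls"
    return "stop"


def map_usage(usage: dict) -> dict:
    """Rename Responses API token counters to the Chat Completions names."""
    prompt = usage.get("input_tokens", 0)
    completion = usage.get("output_tokens", 0)
    return {
        "prompt_tokens": prompt,
        "completion_tokens": completion,
        "total_tokens": usage.get("total_tokens", prompt + completion),
    }
```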

Routing matrix (same for `stream()` and `chat()`)

| Scenario | Path |
| --- | --- |
| Azure gpt-5.2 + api-version `2025-03-01-preview`+ | Responses API |
| Azure gpt-5.2 + older api-version | Chat Completions (fallback, no 400) |
| Azure gpt-4o (any version) | Chat Completions |
| OpenAI (non-Azure) gpt-5/o3/o4 | Responses API (always supported) |
| Explicit `use_responses_api=False` | Chat Completions (override) |
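
The routing matrix can be read as a small decision function. This is a sketch under assumptions, not the wrapper's _should_use_responses_api(): the function name, its signature, and the handling of flag values other than False are hypothetical. The prefixes and the override row come from this PR.

```python
from typing import Optional

# Model-name prefixes treated as reasoning models, per the PR description.
REASONING_PREFIXES = ("gpt-5", "o3", "o4")


def choose_api_path(model: str, endpoint_supports_responses: bool,
                    use_responses_api: Optional[bool] = None) -> str:
    """Return "responses" or "chat_completions" for a request."""
    # Explicit opt-out always wins (last row of the matrix).
    if use_responses_api is False:
        return "chat_completions"
    # Assumption: any other flag value falls through to auto-detection.
    is_reasoning_model = model.lower().startswith(REASONING_PREFIXES)
    if is_reasoning_model and endpoint_supports_responses:
        return "responses"
    return "chat_completions"
```

For non-Azure OpenAI the capability argument would always be True, which reproduces the "always supported" row.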

Test plan

tests/unit/llm/test_responses_api_routing.py — 23 cases: Azure version gate, routing decision (gpt-5+supported, gpt-5+old→fallback, non-reasoning, explicit-flag overrides), prefix matching for gpt-5/o3/o4
tests/unit/llm/test_chat_via_responses.py — 11 cases: reasoning_content extraction, content-text extraction (skipping refusals), tool-call Pydantic shape, finish_reason mapping (stop/tool_calls/length/incomplete), usage mapping, dispatch routing both directions, json_mode→text.format
tests/unit/llm/test_openai_streaming.py::test_reasoning_delta_yields_thinking — parametrized over both real SDK event names
Full tests/unit/llm/ — 126/126 passing
Live Azure gpt-5.2 via scripts/verify-azure-thinking.py:
- streaming: 182 thinking chunks, 837 chars (was 0)
- non-streaming reasoning_tokens: 363 (was ~118)
- non-streaming reasoning_content: 530+ chars of summary text (was always None)

Follow-ups (not in this PR)

Audit Anthropic and Gemini provider wrappers for similar thinking-surfacing gaps

The OpenAI-compatible streaming wrapper listened for the event type "response.reasoning.delta", which the openai SDK never emits. Real events are response.reasoning_summary_text.delta and response.reasoning_text.delta, so every reasoning delta was silently dropped. With reasoning.summary unset by default, the API also wouldn't emit summary events at all, even when a reasoning model was reasoning internally.

Separately, Azure unconditionally routed gpt-5/o3/o4 to /openai/responses, which returns HTTP 400 BadRequest for api-version < 2025-03-01-preview.

Changes:
- openai_compatible_base: handle the real reasoning event names; default reasoning.summary="auto" so summary deltas stream; add _responses_api_supported() hook to gate routing on endpoint capability.
- azure: override _responses_api_supported() to require api-version date >= 2025-03-01, falling back to Chat Completions on older versions instead of crashing.
- tests: 23 new routing cases (version gate, prefix matching, config-flag override) plus updated reasoning-delta test to assert real SDK event names. 115/115 unit tests pass.
- scripts/verify-azure-thinking.py: live-Azure verification of streaming thinking chunks and non-streaming reasoning_tokens.
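
To illustrate the event-name fix, here is a simplified sketch of a stream consumer that dispatches on event type. It is not the wrapper's real code: the function name and the ("thinking", text) tuple shape are hypothetical, and SimpleNamespace objects stand in for SDK event objects, which expose `type` and `delta` attributes.

```python
from types import SimpleNamespace

# The two reasoning delta events named in this PR.
REASONING_DELTA_EVENTS = {
    "response.reasoning_summary_text.delta",
    "response.reasoning_text.delta",
}


def iter_chunks(events):
    """Yield (kind, text) pairs for reasoning and answer-text deltas."""
    for event in events:
        if event.type in REASONING_DELTA_EVENTS:
            yield ("thinking", event.delta)
        elif event.type == "response.output_text.delta":
            yield ("text", event.delta)
        # Every other event type is ignored in this sketch.


fake_stream = [
    SimpleNamespace(type="response.reasoning_summary_text.delta", delta="Consider "),
    SimpleNamespace(type="response.reasoning.delta", delta="never emitted by the SDK"),
    SimpleNamespace(type="response.output_text.delta", delta="42"),
]
chunks = list(iter_chunks(fake_stream))
```

A handler keyed on the old name alone would produce no thinking chunks at all for such a stream, which matches the "was 0" observation in the test plan.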

Mirrors the routing already used by stream(). When _should_use_responses_api() is True (reasoning model + endpoint capability), non-streaming chat() now calls client.responses.create instead of client.chat.completions.create, and parses the heterogeneous output[] array to populate LLMResponse.reasoning_content with the model's reasoning summary text (previously always None on this path).

Implementation:
- chat() becomes a thin dispatcher with shared error handling.
- _chat_via_chat_completions: existing path, extracted unchanged.
- _chat_via_responses: new path. Reuses _convert_to_responses_input and _prepare_tools_for_responses helpers. Builds reasoning={"summary":"auto"} by default so summary text actually streams back. Maps response.status + incomplete_details.reason to a Chat-Completions-style finish_reason (stop/tool_calls/length/incomplete). Maps usage input/output_tokens to prompt/completion_tokens for caller compatibility. Constructs ChatCompletionMessageToolCall objects for function_call items so downstream parsers (response_parser._to_tool_call_dicts) see the same Pydantic shape regardless of API path. json_mode maps to text.format.

Tests:
- 11 new unit tests in test_chat_via_responses.py covering reasoning extraction, content extraction (skipping refusals), tool-call shape, finish_reason mapping (stop/tool_calls/length/incomplete), usage mapping, dispatch routing (both directions), and json_mode mapping.
- 126/126 unit tests pass.

Live Azure gpt-5.2: chat() now returns reasoning_content with 1044+ chars of summary text (was None). reasoning_tokens=455. Verified via scripts/verify-azure-thinking.py.
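
Parsing the heterogeneous output[] array might look like the sketch below. It works on plain dicts for self-containment, whereas the PR builds ChatCompletionMessageToolCall Pydantic objects; the function name, the return shape, and the newline join between summary parts are assumptions.

```python
def parse_output(output: list) -> dict:
    """Split Responses API output items into reasoning, text, and tool calls."""
    reasoning_parts, text_parts, tool_calls = [], [], []
    for item in output:
        kind = item.get("type")
        if kind == "reasoning":
            for part in item.get("summary") or []:
                if part.get("type") == "summary_text":
                    reasoning_parts.append(part.get("text", ""))
        elif kind == "message":
            for part in item.get("content") or []:
                # Only output_text is kept; refusal parts are skipped.
                if part.get("type") == "output_text":
                    text_parts.append(part.get("text", ""))
        elif kind == "function_call":
            # Chat-Completions-style tool-call shape, as a plain dict here.
            tool_calls.append({
                "id": item.get("call_id"),
                "type": "function",
                "function": {"name": item.get("name"),
                             "arguments": item.get("arguments", "")},
            })
    return {
        "reasoning_content": "\n".join(reasoning_parts) or None,
        "content": "".join(text_parts),
        "tool_calls": tool_calls,
    }
```

Returning None rather than an empty string when no summary is present keeps the result consistent with non-reasoning models.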

Without an explicit effort, gpt-5* sometimes skips reasoning entirely on a given call, leaving reasoning_content empty and reasoning summary deltas unfiring even with summary='auto'. Observed live with Azure gpt-5.2: identical requests yielded 0 thinking chunks on some calls and 200+ on others, purely model nondeterminism.

The wrapper is the right place to set this: it already routes reasoning-model traffic to the Responses API, so it knows when reasoning is the expected mode. Callers can still override (e.g. effort='low' for cheaper turns).

Both _chat_via_responses and _stream_responses now setdefault:
- effort = "medium"
- summary = "auto" (already there)

Tests: 3 new cases covering default behavior, effort override, and summary override. 129/129 unit tests pass.

Live verification (Azure gpt-5.2 via DanaCodingAgent): wrapper now deterministically returns 967 chars of reasoning_content. Tracking confirms the wrapper-side issue is fully fixed; remaining agent-side issue (star_agent.py:750-757 drops reasoning on direct-answer turns) is a separate concern not addressed here.

Adds scripts/verify-thinking-persisted-via-coding-agent.py for end-to-end inspection of timeline persistence.
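
The setdefault behaviour at this commit can be sketched as follows. The helper name and the kwargs-dict interface are assumptions; only the two default values come from the commit message.

```python
def apply_reasoning_defaults(kwargs: dict) -> dict:
    """Fill in reasoning defaults without overriding caller-supplied values."""
    # Copy so the caller's dict is not mutated.
    reasoning = dict(kwargs.get("reasoning") or {})
    reasoning.setdefault("effort", "medium")
    reasoning.setdefault("summary", "auto")
    return {**kwargs, "reasoning": reasoning}
```

Because setdefault only writes missing keys, a caller passing reasoning={"effort": "low"} keeps the cheaper setting and still receives summary="auto".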

When the model answered without invoking a tool, _record_think_results took the no-tool-calls branch (star_agent.py:750) which only added AGENT_RESPONSE, silently dropping the parsed reasoning. The else branch (tool-calls path) already added AGENT_THOUGHTS for non-empty reasoning; the no-tool-calls branch was the asymmetric outlier.

This dropped reasoning text on:
- direct-answer turns (model solves the puzzle without tools)
- the final turn of any tool-using session (after tools, model answers)

Affects all reasoning surfaces routed through codec_with_native_tool_use:
- LLMResponse.reasoning_content (gpt-5/o3/o4 via Responses API, DeepSeek-R1, future Anthropic extended thinking)
- <thinking> XML tags
- JSON {"reasoning": ...} fields

Live verification (Azure gpt-5.2 via DanaCodingAgent):
- direct scenario: 1373 chars reasoning persisted as AGENT_THOUGHTS (was 0)
- tool scenario: AGENT_THOUGHTS entry now appears even on the final answer turn

For non-reasoning models, parsed.reasoning is None, the new block is a no-op, behavior is unchanged. 129/129 unit tests pass.
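
The symmetry fix can be pictured with a toy timeline. This is an illustrative sketch only, not the code in star_agent.py: the function signature, the list-of-tuples timeline, the TOOL_CALL entry name, and the ordering of entries are all hypothetical. AGENT_THOUGHTS and AGENT_RESPONSE are the entry names used in this PR.

```python
from typing import List, Optional, Tuple


def record_think_results(timeline: List[Tuple[str, str]],
                         reasoning: Optional[str],
                         response_text: str,
                         tool_calls: List[str]) -> None:
    """Append one turn's entries, keeping reasoning on both branches."""
    if not tool_calls:
        # Previously this branch recorded only AGENT_RESPONSE.
        if reasoning:
            timeline.append(("AGENT_THOUGHTS", reasoning))
        timeline.append(("AGENT_RESPONSE", response_text))
    else:
        if reasoning:
            timeline.append(("AGENT_THOUGHTS", reasoning))
        for call in tool_calls:
            timeline.append(("TOOL_CALL", call))  # hypothetical entry type
```

With reasoning=None, as for non-reasoning models, the no-tool-calls branch records exactly what it did before the change.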

Add OPENAI_THINKING_EFFORT and AZURE_THINKING_EFFORT env vars (plus generic LLM_REASONING_EFFORT fallback) so operators can dial reasoning cost/latency without code changes.

Precedence:
1. caller's reasoning.effort kwarg
2. provider env var (AZURE_THINKING_EFFORT / OPENAI_THINKING_EFFORT)
3. LLM_REASONING_EFFORT
4. "low" (was "medium")

Default flipped to "low" so callers opt into deeper reasoning rather than paying for medium implicitly. Invalid env values are logged and ignored. Applies to both _chat_via_responses and _stream_responses.
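
The precedence order above could be resolved along these lines. This is a sketch under assumptions: the function name, the `provider` string argument, the injectable `env` mapping, and the set of accepted effort values are hypothetical; the env var names, the ordering, and the "low" default come from this PR.

```python
import logging
import os
from typing import Mapping, Optional

logger = logging.getLogger(__name__)

# Assumption: the set of effort values accepted by this sketch.
VALID_EFFORTS = {"minimal", "low", "medium", "high"}
PROVIDER_ENV_VARS = {
    "azure": "AZURE_THINKING_EFFORT",
    "openai": "OPENAI_THINKING_EFFORT",
}


def resolve_effort(provider: str, caller_effort: Optional[str] = None,
                   env: Optional[Mapping[str, str]] = None) -> str:
    """Pick reasoning effort: caller kwarg, provider env, generic env, then "low"."""
    env = os.environ if env is None else env
    if caller_effort:
        return caller_effort
    for name in (PROVIDER_ENV_VARS.get(provider), "LLM_REASONING_EFFORT"):
        if not name:
            continue
        raw = env.get(name)
        if raw is None or not raw.strip():
            continue
        value = raw.strip().lower()
        if value in VALID_EFFORTS:
            return value
        # Invalid values are logged and ignored, so the next source is tried.
        logger.warning("Ignoring invalid %s=%r", name, raw)
    return "low"
```

Passing `env` explicitly keeps the function testable without touching the process environment.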

ngoclam9415 added 5 commits May 7, 2026 00:20

ngoclam9415 mentioned this pull request May 10, 2026 in feat(llm,timeline): cross-turn reasoning state replay for gpt-5/o3/o4 #10 (merged, 4 tasks)

ngoclam9415 merged commit 68cbad3 into develop May 10, 2026
1 check failed

ngoclam9415 deleted the fix/azure-gpt5-thinking-blocks branch May 10, 2026 14:22

ngoclam9415 merged 5 commits into develop from fix/azure-gpt5-thinking-blocks
