fix(llm): surface Azure/OpenAI gpt-5 thinking blocks via Responses API #9

Merged: ngoclam9415 merged 5 commits into develop from fix/azure-gpt5-thinking-blocks on May 10, 2026

@ngoclam9415 ngoclam9415 commented May 6, 2026

Summary

Two-commit fix making gpt-5/o3/o4 reasoning visible through both stream() and chat() on Azure and OpenAI.

2cf3b5d — streaming fix

  • The wrapper listened for response.reasoning.delta (a string the openai SDK never emits); replaced it with the real events response.reasoning_summary_text.delta and response.reasoning_text.delta, and defaulted reasoning.summary to "auto" so summary deltas actually stream.
  • Azure unconditionally routed gpt-5/o3/o4 to /openai/responses, which returns HTTP 400 for api-version < 2025-03-01-preview. Added a _responses_api_supported() hook; AzureProvider overrides it to gate on the api-version date, falling back to Chat Completions instead of crashing.
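
The version gate described above could look roughly like the following. This is a minimal sketch, not the code from the PR: the function name `responses_api_supported`, the regex, and the choice to treat missing or unparseable versions as unsupported are all assumptions made for illustration.

```python
import re
from datetime import date
from typing import Optional

# Cutoff taken from the PR description: api-versions dated before
# 2025-03-01 reject /openai/responses with HTTP 400.
MIN_RESPONSES_API_DATE = date(2025, 3, 1)

# Azure api-versions look like "2025-03-01-preview" or "2024-10-21".
_DATE_PREFIX = re.compile(r"^(\d{4})-(\d{2})-(\d{2})")


def responses_api_supported(api_version: Optional[str]) -> bool:
    """Return True if the api-version date is new enough for the Responses API."""
    if not api_version:
        return False  # assumption: unknown version falls back to Chat Completions
    match = _DATE_PREFIX.match(api_version)
    if match is None:
        return False  # assumption: unparseable version is treated as unsupported
    year, month, day = (int(group) for group in match.groups())
    try:
        version_date = date(year, month, day)
    except ValueError:
        return False  # e.g. "2025-13-40" is not a real date
    return version_date >= MIN_RESPONSES_API_DATE
```

Comparing parsed dates rather than raw strings avoids surprises from suffixes such as `-preview`.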

cf3eb86 — non-streaming chat() routed through Responses API

  • chat() for gpt-5/o3/o4 (when api-version supports it) now calls client.responses.create and parses the heterogeneous output[] array — populates LLMResponse.reasoning_content with reasoning summary text (was always None on this path).
  • chat() becomes a thin dispatcher with shared error handling. Old path extracted as _chat_via_chat_completions. New _chat_via_responses reuses existing helpers (_convert_to_responses_input, _prepare_tools_for_responses).
  • Maps response.status + incomplete_details.reason → Chat-Completions-style finish_reason (stop/tool_calls/length/incomplete).
  • Maps usage input_tokens/output_tokens → prompt_tokens/completion_tokens for caller compat.
  • Constructs ChatCompletionMessageToolCall Pydantic objects for function_call items so response_parser._to_tool_call_dicts sees the same shape regardless of API path.
  • json_mode=True maps to text={"format":{"type":"json_object"}}.
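
As an illustration of the finish_reason and usage mapping listed above, here is a hedged sketch. The helper names `map_finish_reason` and `map_usage` are hypothetical, and the exact precedence (incomplete status checked before tool calls, unknown statuses reported as "incomplete") is an assumption; only the four target values and the token field names come from this PR.

```python
from types import SimpleNamespace
from typing import Any, Dict, Optional


def map_finish_reason(
    status: Optional[str],
    incomplete_reason: Optional[str],
    has_tool_calls: bool,
) -> str:
    """Translate Responses API status into a Chat-Completions-style finish_reason."""
    if status == "incomplete":
        # "max_output_tokens" is the Responses API reason for hitting the token cap.
        return "length" if incomplete_reason == "max_output_tokens" else "incomplete"
    if has_tool_calls:
        return "tool_calls"
    if status == "completed":
        return "stop"
    return "incomplete"  # assumption: any other status is reported as incomplete


def map_usage(usage: Any) -> Dict[str, int]:
    """Rename Responses API token counters to the Chat Completions names."""
    return {
        "prompt_tokens": getattr(usage, "input_tokens", 0) or 0,
        "completion_tokens": getattr(usage, "output_tokens", 0) or 0,
    }


# Usage with a stand-in usage object (not a real SDK type):
example_usage = map_usage(SimpleNamespace(input_tokens=12, output_tokens=34))
```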

Routing matrix (same for stream() and chat())

| Scenario | Path |
| --- | --- |
| Azure gpt-5.2 + api-version 2025-03-01-preview+ | Responses API |
| Azure gpt-5.2 + older api-version | Chat Completions (fallback, no 400) |
| Azure gpt-4o (any version) | Chat Completions |
| OpenAI (non-Azure) gpt-5/o3/o4 | Responses API (always supported) |
| Explicit use_responses_api=False | Chat Completions (override) |
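
The routing matrix can be read as a small decision function. The sketch below is an approximation under stated assumptions: `should_use_responses_api` here is a free function with hypothetical parameters, prefix matching is done on the lower-cased model name, and only the explicit `False` override from the matrix is modeled.

```python
from typing import Optional

# Model-name prefixes treated as reasoning models, per the PR description.
REASONING_MODEL_PREFIXES = ("gpt-5", "o3", "o4")


def should_use_responses_api(
    model: str,
    endpoint_supported: bool,
    use_responses_api: Optional[bool] = None,
) -> bool:
    """Decide between the Responses API (True) and Chat Completions (False)."""
    if use_responses_api is False:
        return False  # explicit override always wins
    if not model.lower().startswith(REASONING_MODEL_PREFIXES):
        return False  # non-reasoning models stay on Chat Completions
    # Reasoning model: use the Responses API only if the endpoint can serve it
    # (always true for OpenAI, api-version dependent for Azure).
    return endpoint_supported
```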

Test plan

  • tests/unit/llm/test_responses_api_routing.py — 23 cases: Azure version gate, routing decision (gpt-5+supported, gpt-5+old→fallback, non-reasoning, explicit-flag overrides), prefix matching for gpt-5/o3/o4
  • tests/unit/llm/test_chat_via_responses.py — 11 cases: reasoning_content extraction, content-text extraction (skipping refusals), tool-call Pydantic shape, finish_reason mapping (stop/tool_calls/length/incomplete), usage mapping, dispatch routing both directions, json_mode→text.format
  • tests/unit/llm/test_openai_streaming.py::test_reasoning_delta_yields_thinking — parametrized over both real SDK event names
  • Full tests/unit/llm/ suite: 126/126 passing
  • Live Azure gpt-5.2 via scripts/verify-azure-thinking.py:
    • streaming: 182 thinking chunks, 837 chars (was 0)
    • non-streaming reasoning_tokens: 363 (was ~118)
    • non-streaming reasoning_content: 530+ chars of summary text (was always None)

Follow-ups (not in this PR)

  • Audit Anthropic and Gemini provider wrappers for similar thinking-surfacing gaps

The OpenAI-compatible streaming wrapper listened for the event type
"response.reasoning.delta", which the openai SDK never emits. Real events
are response.reasoning_summary_text.delta and response.reasoning_text.delta,
so every reasoning delta was silently dropped. With reasoning.summary
unset by default, the API also wouldn't emit summary events at all, even
when a reasoning model was reasoning internally.
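
To make the event-name fix concrete, here is a minimal sketch of filtering a Responses API event stream for reasoning deltas. The generator name `iter_thinking_chunks` is hypothetical, and events are modeled as any objects with `type` and `delta` attributes rather than real SDK classes.

```python
from types import SimpleNamespace
from typing import Any, Iterable, Iterator

# The two event types named in this PR as the ones the openai SDK emits.
REASONING_DELTA_EVENTS = frozenset(
    {
        "response.reasoning_summary_text.delta",
        "response.reasoning_text.delta",
    }
)


def iter_thinking_chunks(events: Iterable[Any]) -> Iterator[str]:
    """Yield reasoning text deltas, ignoring every other event type."""
    for event in events:
        if getattr(event, "type", None) in REASONING_DELTA_EVENTS:
            delta = getattr(event, "delta", "")
            if delta:
                yield delta


# Stand-in events; the first uses the old, never-emitted name and is skipped.
sample_events = [
    SimpleNamespace(type="response.reasoning.delta", delta="dropped"),
    SimpleNamespace(type="response.reasoning_summary_text.delta", delta="Plan: "),
    SimpleNamespace(type="response.output_text.delta", delta="answer"),
    SimpleNamespace(type="response.reasoning_text.delta", delta="check inputs"),
]
thinking = list(iter_thinking_chunks(sample_events))
```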

Separately, Azure unconditionally routed gpt-5/o3/o4 to /openai/responses,
which returns HTTP 400 BadRequest for api-version < 2025-03-01-preview.

Changes:
- openai_compatible_base: handle the real reasoning event names; default
  reasoning.summary="auto" so summary deltas stream; add
  _responses_api_supported() hook to gate routing on endpoint capability.
- azure: override _responses_api_supported() to require api-version date
  >= 2025-03-01, falling back to Chat Completions on older versions
  instead of crashing.
- tests: 23 new routing cases (version gate, prefix matching, config-flag
  override) plus updated reasoning-delta test to assert real SDK event
  names. 115/115 unit tests pass.
- scripts/verify-azure-thinking.py: live-Azure verification of streaming
  thinking chunks and non-streaming reasoning_tokens.

Mirrors the routing already used by stream(). When _should_use_responses_api()
is True (reasoning model + endpoint capability), non-streaming chat() now
calls client.responses.create instead of client.chat.completions.create,
and parses the heterogeneous output[] array to populate
LLMResponse.reasoning_content with the model's reasoning summary text —
previously always None on this path.

Implementation:
- chat() becomes a thin dispatcher with shared error handling.
- _chat_via_chat_completions: existing path, extracted unchanged.
- _chat_via_responses: new path. Reuses _convert_to_responses_input and
  _prepare_tools_for_responses helpers. Builds reasoning={"summary":"auto"}
  by default so summary text is actually returned. Maps response.status
  + incomplete_details.reason to a Chat-Completions-style finish_reason
  (stop/tool_calls/length/incomplete). Maps usage input/output_tokens to
  prompt/completion_tokens for caller compatibility. Constructs
  ChatCompletionMessageToolCall objects for function_call items so
  downstream parsers (response_parser._to_tool_call_dicts) see the same
  Pydantic shape regardless of API path. json_mode maps to text.format.
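
The output[] parsing described above can be sketched as follows. This is an illustration only: it assumes output items have already been converted to plain dicts, returns tool calls as dicts instead of ChatCompletionMessageToolCall objects to stay dependency-free, and joins multiple summary parts with a newline, which is a guess. The function name `parse_output_items` is hypothetical.

```python
from typing import Any, Dict, List


def parse_output_items(output: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Split a heterogeneous Responses API output[] list into reasoning, text, and tool calls."""
    reasoning_parts: List[str] = []
    text_parts: List[str] = []
    tool_calls: List[Dict[str, Any]] = []

    for item in output:
        kind = item.get("type")
        if kind == "reasoning":
            # Reasoning items carry a list of summary parts.
            for part in item.get("summary") or []:
                if part.get("type") == "summary_text" and part.get("text"):
                    reasoning_parts.append(part["text"])
        elif kind == "message":
            for part in item.get("content") or []:
                # Refusal parts are skipped; only output_text counts as content.
                if part.get("type") == "output_text":
                    text_parts.append(part.get("text", ""))
        elif kind == "function_call":
            # Reshape into the Chat Completions tool-call layout.
            tool_calls.append(
                {
                    "id": item.get("call_id"),
                    "type": "function",
                    "function": {
                        "name": item.get("name"),
                        "arguments": item.get("arguments", ""),
                    },
                }
            )

    return {
        "reasoning_content": "\n".join(reasoning_parts) or None,
        "content": "".join(text_parts),
        "tool_calls": tool_calls,
    }
```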

Tests:
- 11 new unit tests in test_chat_via_responses.py covering reasoning
  extraction, content extraction (skipping refusals), tool-call shape,
  finish_reason mapping (stop/tool_calls/length/incomplete), usage
  mapping, dispatch routing (both directions), and json_mode mapping.
- 126/126 unit tests pass.

Live Azure gpt-5.2: chat() now returns reasoning_content with 1044+ chars
of summary text (was None). reasoning_tokens=455. Verified via
scripts/verify-azure-thinking.py.

Without an explicit effort, gpt-5* sometimes skips reasoning entirely on a
given call, leaving reasoning_content empty and reasoning summary deltas
unemitted even with summary='auto'. Observed live with Azure gpt-5.2:
identical requests yielded 0 thinking chunks on some calls and 200+ on
others, purely due to model nondeterminism.

The wrapper is the right place to set this: it already routes
reasoning-model traffic to the Responses API, so it knows when reasoning
is the expected mode. Callers can still override (e.g. effort='low' for
cheaper turns).

Both _chat_via_responses and _stream_responses now setdefault:
  effort = "medium"
  summary = "auto"  (already there)
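
A minimal sketch of the setdefault behavior at this commit, assuming request kwargs are a plain dict with an optional "reasoning" entry. The helper name `apply_reasoning_defaults` is hypothetical, and the "medium" default shown here is the one a later commit in this PR lowers to "low".

```python
from typing import Any, Dict


def apply_reasoning_defaults(request_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the kwargs with reasoning defaults filled in."""
    # Copy so the caller's dict (and its nested reasoning dict) is not mutated.
    reasoning = dict(request_kwargs.get("reasoning") or {})
    reasoning.setdefault("effort", "medium")  # caller-supplied effort is kept
    reasoning.setdefault("summary", "auto")   # caller-supplied summary is kept
    return {**request_kwargs, "reasoning": reasoning}
```

Because setdefault only fills missing keys, a caller passing `reasoning={"effort": "low"}` keeps the cheaper setting and still gets `summary="auto"`.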

Tests: 3 new cases covering default behavior, effort override, and
summary override. 129/129 unit tests pass.

Live verification (Azure gpt-5.2 via DanaCodingAgent): wrapper now
deterministically returns 967 chars of reasoning_content. Tracking
confirms the wrapper-side issue is fully fixed; remaining agent-side
issue (star_agent.py:750-757 drops reasoning on direct-answer turns)
is a separate concern not addressed here.

Adds scripts/verify-thinking-persisted-via-coding-agent.py for
end-to-end inspection of timeline persistence.

When the model answered without invoking a tool, _record_think_results
took the no-tool-calls branch (star_agent.py:750) which only added
AGENT_RESPONSE — silently dropping the parsed reasoning. The else branch
(tool-calls path) already added AGENT_THOUGHTS for non-empty reasoning;
the no-tool-calls branch was the asymmetric outlier.

This dropped reasoning text on:
  - direct-answer turns (model solves the puzzle without tools)
  - the final turn of any tool-using session (after tools, model answers)

Affects all reasoning surfaces routed through codec_with_native_tool_use:
  - LLMResponse.reasoning_content (gpt-5/o3/o4 via Responses API,
    DeepSeek-R1, future Anthropic extended thinking)
  - <thinking> XML tags
  - JSON {"reasoning": ...} fields

Live verification (Azure gpt-5.2 via DanaCodingAgent):
  - direct scenario: 1373 chars reasoning persisted as AGENT_THOUGHTS
    (was 0)
  - tool scenario: AGENT_THOUGHTS entry now appears even on the final
    answer turn

For non-reasoning models, parsed.reasoning is None, the new block is
a no-op, behavior is unchanged.
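
The symmetry fix can be pictured with the simplified sketch below. It is not the code in star_agent.py: the timeline is modeled as a list of (kind, text) tuples, the function signature is hypothetical, and placing AGENT_THOUGHTS before AGENT_RESPONSE is an assumption.

```python
from typing import List, Optional, Tuple

Timeline = List[Tuple[str, str]]


def record_think_results(
    timeline: Timeline,
    reasoning: Optional[str],
    response_text: str,
    has_tool_calls: bool,
) -> None:
    """Append timeline entries for one model turn."""
    if not has_tool_calls:
        # Previously this branch only recorded AGENT_RESPONSE.
        if reasoning:
            timeline.append(("AGENT_THOUGHTS", reasoning))
        timeline.append(("AGENT_RESPONSE", response_text))
    else:
        # The tool-calls branch already recorded non-empty reasoning.
        if reasoning:
            timeline.append(("AGENT_THOUGHTS", reasoning))
        # Recording of the tool calls themselves is omitted from this sketch.
```

With `reasoning=None`, as for non-reasoning models, the added block does nothing and only AGENT_RESPONSE is recorded.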

129/129 unit tests pass.

Add OPENAI_THINKING_EFFORT and AZURE_THINKING_EFFORT env vars (plus
generic LLM_REASONING_EFFORT fallback) so operators can dial reasoning
cost/latency without code changes.

Precedence:
  1. caller's reasoning.effort kwarg
  2. provider env var (AZURE_THINKING_EFFORT / OPENAI_THINKING_EFFORT)
  3. LLM_REASONING_EFFORT
  4. "low"  (was "medium")

Default flipped to "low" so callers opt into deeper reasoning rather
than paying for medium implicitly. Invalid env values are logged and
ignored. Applies to both _chat_via_responses and _stream_responses.
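
The precedence order above might be implemented along these lines. This is a sketch under assumptions: the function name `resolve_reasoning_effort`, the provider keys "azure" and "openai", and the set of accepted effort values are all illustrative; only the env var names, the ordering, and the "low" default come from this commit message.

```python
import logging
import os
from typing import Mapping, Optional

logger = logging.getLogger(__name__)

# Assumption: the set of values accepted from env vars; the PR does not list them.
VALID_EFFORTS = frozenset({"minimal", "low", "medium", "high"})

PROVIDER_ENV_VARS = {
    "azure": "AZURE_THINKING_EFFORT",
    "openai": "OPENAI_THINKING_EFFORT",
}


def resolve_reasoning_effort(
    provider: str,
    caller_effort: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Pick reasoning effort: caller kwarg, provider env var, generic env var, then 'low'."""
    if caller_effort:
        return caller_effort  # 1. the caller's explicit choice always wins
    env = os.environ if env is None else env
    candidates = [PROVIDER_ENV_VARS.get(provider), "LLM_REASONING_EFFORT"]
    for name in candidates:
        if name is None:
            continue  # unknown provider has no provider-specific variable
        raw = env.get(name)
        if raw is None or not raw.strip():
            continue
        value = raw.strip().lower()
        if value in VALID_EFFORTS:
            return value  # 2. provider env var, then 3. generic env var
        logger.warning("Ignoring invalid %s=%r", name, raw)
    return "low"  # 4. default
```

Passing `env` explicitly keeps the function easy to test without touching the process environment.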
@ngoclam9415 ngoclam9415 merged commit 68cbad3 into develop May 10, 2026
1 check failed
@ngoclam9415 ngoclam9415 deleted the fix/azure-gpt5-thinking-blocks branch May 10, 2026 14:22