Description
When using Agent.run(stream=True) with tools, the FunctionInvocationLayer's streaming loop discards the LLM's reasoning text from intermediate iterations. Only function_call content is yielded during tool-calling iterations; the assistant text that accompanies those tool calls is silently dropped. Text content only streams from the final iteration (when no tool calls are present).
Current behavior:

Iteration 1 (LLM returns text + tool_call):
  Streamed to consumer: function_call chunks only
  Discarded: assistant reasoning text
Iteration 2 (LLM returns text + tool_call):
  Streamed to consumer: function_call chunks only
  Discarded: assistant reasoning text
Iteration 3 (LLM returns text only, no tool calls):
  Streamed to consumer: text chunks ✅
Expected behavior:

Iteration 1 (LLM returns text + tool_call):
  Streamed: text chunks ("I'll search the data lake...")
  Streamed: function_call chunks
  → framework executes tool
Iteration 2 (LLM returns text + tool_call):
  Streamed: text chunks ("Found the dataset. Loading it now...")
  Streamed: function_call chunks
  → framework executes tool
Iteration 3 (LLM returns text only):
  Streamed: text chunks (final response)
Real-time reasoning traces would provide transparency into why each tool is being called, similar to how ChatGPT and Claude show "thinking" text.
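The difference between the two behaviors can be sketched with a minimal, self-contained simulation of the streaming loop. This is plain Python, not the actual FunctionInvocationLayer code; all names (`llm_iterations`, `current_behavior`, `expected_behavior`) are illustrative:

```python
def llm_iterations():
    """Simulated LLM output: two tool-calling iterations, then a final text-only one."""
    yield [("text", "I'll search the data lake..."), ("function_call", "search_data_lake")]
    yield [("text", "Found the dataset. Loading it now..."), ("function_call", "load_dataset")]
    yield [("text", "Here is the summary of the dataset.")]

def current_behavior():
    """Mirrors the reported bug: text chunks are dropped whenever tool calls are present."""
    streamed = []
    for chunks in llm_iterations():
        has_tool_call = any(kind == "function_call" for kind, _ in chunks)
        for kind, content in chunks:
            if has_tool_call and kind != "function_call":
                continue  # assistant reasoning text is silently discarded
            streamed.append((kind, content))
    return streamed

def expected_behavior():
    """Expected: every chunk streams to the consumer, in order, on every iteration."""
    streamed = []
    for chunks in llm_iterations():
        streamed.extend(chunks)
    return streamed
```

Running the two simulations side by side: `current_behavior()` yields only the two function_call chunks plus the final text, while `expected_behavior()` additionally yields the reasoning text from iterations 1 and 2.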
The ChatMiddleware documentation says it intercepts individual chat client calls. In practice, when used with AzureOpenAIChatClient + Agent.run(stream=True), the ChatMiddlewareLayer sits above FunctionInvocationLayer in the MRO, so it wraps the entire tool-calling loop as a single call rather than intercepting each individual LLM call within the loop.
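The consequence of that layer ordering can be shown with a schematic Python class hierarchy. This is not the framework's actual implementation; the class and method names only mirror the layers named above, and the three-call loop stands in for the tool-calling loop:

```python
class BaseClient:
    calls = []  # records the order of events for inspection

    def run(self):
        # Innermost layer: one "LLM call". The loop below invokes this repeatedly.
        BaseClient.calls.append("llm_call")

class FunctionInvocationLayer(BaseClient):
    def run(self):
        # The tool-calling loop: several inner LLM calls per agent run.
        for _ in range(3):
            super().run()

class ChatMiddlewareLayer(FunctionInvocationLayer):
    def run(self):
        # Because this layer sits above the loop in the MRO, it fires once
        # per agent run, not once per inner LLM call.
        BaseClient.calls.append("middleware_enter")
        super().run()
        BaseClient.calls.append("middleware_exit")

ChatMiddlewareLayer().run()
# calls == ["middleware_enter", "llm_call", "llm_call", "llm_call", "middleware_exit"]
```

If the middleware instead intercepted each individual chat client call, as the documentation suggests, the trace would show an enter/exit pair around every `llm_call`.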
Code Sample
Language/SDK
Both