Python: Emit tool call events in GitHubCopilotAgent streaming#4711
Python: Emit tool call events in GitHubCopilotAgent streaming#4711moonbox3 merged 4 commits intomicrosoft:mainfrom
Conversation
_stream_updates now yields FunctionCallContent for TOOL_EXECUTION_START and FunctionResultContent for TOOL_EXECUTION_COMPLETE events from the Copilot SDK session. This enables DevUI and other consumers to display tool calls during streaming agent execution. Previously only ASSISTANT_MESSAGE_DELTA, SESSION_IDLE, and SESSION_ERROR were handled — tool execution events were silently dropped. Signed-off-by: James Sturtevant <jsturtevant@gmail.com>
There was a problem hiding this comment.
Pull request overview
This PR extends the GitHub Copilot Python agent’s streaming event handling to surface tool execution start/complete events as AgentResponseUpdate chunks, enabling downstream consumers to observe tool calls/results during a streamed run.
Changes:
- Convert
SessionEventType.TOOL_EXECUTION_STARTintoContent.from_function_call(...)streaming updates. - Convert
SessionEventType.TOOL_EXECUTION_COMPLETEintoContent.from_function_result(...)streaming updates.
python/packages/github_copilot/agent_framework_github_copilot/_agent.py
Outdated
Show resolved
Hide resolved
Signed-off-by: James Sturtevant <jsturtevant@gmail.com>
Signed-off-by: James Sturtevant <jsturtevant@gmail.com>
|
@jsturtevant please link an issue to this PR. |
|
moonbox3
left a comment
There was a problem hiding this comment.
Automated Code Review
Reviewers: 4 | Confidence: 83%
✓ Correctness
The new event handlers for TOOL_EXECUTION_START and TOOL_EXECUTION_COMPLETE are structurally sound and follow the existing pattern for other event types. The fallback logic using getattr with defaults is reasonable. One potential correctness concern: if result_obj exists but its text_result_for_llm attribute is explicitly None, the result passed to Content.from_function_result will be None rather than the empty string the code appears to intend. The tests are thorough and cover the main paths including missing fields and failure cases. Assuming ToolResult and the new SessionEventType values are imported/defined elsewhere (not shown in the diff), the tests should work. The import of ToolResult should be verified in the test file.
✓ Security Reliability
The diff adds handling for TOOL_EXECUTION_START and TOOL_EXECUTION_COMPLETE session events, mapping them to function_call and function_result content updates. The implementation uses defensive getattr calls with sensible defaults, follows existing patterns in the codebase, and includes thorough test coverage for happy paths and edge cases. No security or reliability issues were identified.
✗ Test Coverage
The new tests cover the happy paths for TOOL_EXECUTION_START and TOOL_EXECUTION_COMPLETE, missing fields on START, None result on COMPLETE, failure result type, and a full interleaved sequence. Coverage is generally solid. However, there is likely a missing import for
ToolResultin the test file (the diff does not show it being added to imports), and there is no test for the TOOL_EXECUTION_COMPLETE handler whenevent.dataitself has missing fields (the missing-fields test only covers the START event type). Additionally, there is no test verifying that a success result with anerrorfield does NOT propagate the error as an exception, which is an important edge case of theexception = error if result_type == "failure" else Nonelogic.
✗ Design Approach
The TOOL_EXECUTION_COMPLETE handler is built on incorrect assumptions about the SDK's type system, making it silently broken in production. The
event.data.resultfield is typed ascopilot.generated.session_events.Result(a dataclass withcontent: str,contents, anddetailed_contentfields), notcopilot.types.ToolResult. The diff accesses.text_result_for_llm,.result_type, and.erroron this object — none of which exist onsession_events.Result— soresult_textwill always be""and failure detection will never fire. Failure information (success: Optional[bool],error: Union[ErrorClass, str, None]) lives directly onevent.data, not insideevent.data.result. The tests mask this by mockingtool_event_data.resultwith acopilot.types.ToolResultTypedDict and usinggetattron it; since TypedDicts are plain dicts and don't expose keys as attributes,getattr(result_obj, "text_result_for_llm", "")also always returns""— meaning the assertionassert content.result == "Sunny, 72°F"should actually fail, indicating the tests themselves are not passing against the real code path.
Flagged Issues
-
event.data.resultiscopilot.generated.session_events.Result(fields:content,contents,detailed_content), NOTcopilot.types.ToolResult. The handler accesses.text_result_for_llm,.result_type, and.error— none of which exist onResult— soresult_textwill always be""and failure detection will never fire. Useresult_obj.contentfor the result text, and read failure state fromevent.data.success(Optional[bool]) andevent.data.error(Union[ErrorClass, str, None]) directly. - The tests mock
tool_event_data.resultwithToolResult(...)(a TypedDict, i.e. a plain dict) and the implementation callsgetattr(result_obj, "text_result_for_llm", "")on it.getattron a dict does not expose keys as attributes, so this always returns"". The assertionassert content.result == "Sunny, 72°F"should fail, meaning the tests are not actually validating the implementation. Additionally,ToolResultis used (lines 514, 636, 688) but no import is shown in the diff — though the correct fix is to usesession_events.Resultinstead.
Suggestions
- Add a test for TOOL_EXECUTION_COMPLETE with missing fields on
event.data(analogous to the existing missing-fields test which only covers START), to exercise thegetattrfallback paths fortool_call_idandresultin the COMPLETE handler. - Add a test for TOOL_EXECUTION_COMPLETE where the result is successful but an
errorfield is present, to validate that the exception guard correctly returnsNone. - When correcting the field access, consider that
ToolResultTypeincludes"rejected"and"denied"in addition to"failure"— all are non-success states and should likely surface the error. With the correctedevent.data.successapproach, ensure all failure conditions are handled.
Automated review by moonbox3's agents
python/packages/github_copilot/agent_framework_github_copilot/_agent.py
Outdated
Show resolved
Hide resolved
python/packages/github_copilot/agent_framework_github_copilot/_agent.py
Outdated
Show resolved
Hide resolved
python/packages/github_copilot/tests/test_github_copilot_agent.py
Outdated
Show resolved
Hide resolved
- Read result text from session_events.Result.content (not ToolResult.text_result_for_llm) - Read failure state from event.data.success/error (not result_obj.result_type/error) - Handle ErrorClass.message and plain string errors - Update tests to use session_events.Result and ErrorClass - Add tests for string errors, success-with-error, and COMPLETE missing fields Signed-off-by: James Sturtevant <jsturtevant@gmail.com>
|
@moonbox3 I've addressed the comments |
Motivation and Context
_stream_updates now yields FunctionCallContent for TOOL_EXECUTION_START and FunctionResultContent for TOOL_EXECUTION_COMPLETE events from the Copilot SDK session. This enables DevUI and other consumers to display tool calls during streaming agent execution. Previously only ASSISTANT_MESSAGE_DELTA, SESSION_IDLE, and SESSION_ERROR were handled — tool execution events were silently dropped.
Description
Contribution Checklist