Skip to content

Python: Emit tool call events in GitHubCopilotAgent streaming#4711

Merged
moonbox3 merged 4 commits intomicrosoft:mainfrom
jsturtevant:copilot-tools
Mar 20, 2026
Merged

Python: Emit tool call events in GitHubCopilotAgent streaming#4711
moonbox3 merged 4 commits intomicrosoft:mainfrom
jsturtevant:copilot-tools

Conversation

@jsturtevant
Copy link
Contributor

@jsturtevant jsturtevant commented Mar 15, 2026

Motivation and Context

_stream_updates now yields FunctionCallContent for TOOL_EXECUTION_START and FunctionResultContent for TOOL_EXECUTION_COMPLETE events from the Copilot SDK session. This enables DevUI and other consumers to display tool calls during streaming agent execution. Previously only ASSISTANT_MESSAGE_DELTA, SESSION_IDLE, and SESSION_ERROR were handled — tool execution events were silently dropped.

Description

Contribution Checklist

  • The code builds clean without any errors or warnings
  • The PR follows the Contribution Guidelines
  • All unit tests pass, and I have added new tests where possible
  • Is this a breaking change? If yes, add "[BREAKING]" prefix to the title of the PR.

_stream_updates now yields FunctionCallContent for TOOL_EXECUTION_START
and FunctionResultContent for TOOL_EXECUTION_COMPLETE events from the
Copilot SDK session. This enables DevUI and other consumers to display
tool calls during streaming agent execution. Previously only ASSISTANT_MESSAGE_DELTA, SESSION_IDLE, and SESSION_ERROR
were handled — tool execution events were silently dropped.

Signed-off-by: James Sturtevant <jsturtevant@gmail.com>
Copilot AI review requested due to automatic review settings March 15, 2026 20:07
@github-actions github-actions bot changed the title Emit tool call events in GitHubCopilotAgent streaming Python: Emit tool call events in GitHubCopilotAgent streaming Mar 15, 2026
@markwallace-microsoft
Copy link
Member

markwallace-microsoft commented Mar 15, 2026

Python Test Coverage

Python Test Coverage Report •
FileStmtsMissCoverMissing
TOTAL24006264089% 
report-only-changed-files is enabled. No files were changed during this commit :)

Python Unit Test Overview

Tests Skipped Failures Errors Time
5245 20 💤 0 ❌ 0 🔥 1m 23s ⏱️

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR extends the GitHub Copilot Python agent’s streaming event handling to surface tool execution start/complete events as AgentResponseUpdate chunks, enabling downstream consumers to observe tool calls/results during a streamed run.

Changes:

  • Convert SessionEventType.TOOL_EXECUTION_START into Content.from_function_call(...) streaming updates.
  • Convert SessionEventType.TOOL_EXECUTION_COMPLETE into Content.from_function_result(...) streaming updates.

Signed-off-by: James Sturtevant <jsturtevant@gmail.com>
Signed-off-by: James Sturtevant <jsturtevant@gmail.com>
@moonbox3
Copy link
Contributor

@jsturtevant please link an issue to this PR.

@jsturtevant
Copy link
Contributor Author

@jsturtevant please link an issue to this PR.

#4734

Copy link
Contributor

@moonbox3 moonbox3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated Code Review

Reviewers: 4 | Confidence: 83%

✓ Correctness

The new event handlers for TOOL_EXECUTION_START and TOOL_EXECUTION_COMPLETE are structurally sound and follow the existing pattern for other event types. The fallback logic using getattr with defaults is reasonable. One potential correctness concern: if result_obj exists but its text_result_for_llm attribute is explicitly None, the result passed to Content.from_function_result will be None rather than the empty string the code appears to intend. The tests are thorough and cover the main paths including missing fields and failure cases. Assuming ToolResult and the new SessionEventType values are imported/defined elsewhere (not shown in the diff), the tests should work. The import of ToolResult should be verified in the test file.

✓ Security Reliability

The diff adds handling for TOOL_EXECUTION_START and TOOL_EXECUTION_COMPLETE session events, mapping them to function_call and function_result content updates. The implementation uses defensive getattr calls with sensible defaults, follows existing patterns in the codebase, and includes thorough test coverage for happy paths and edge cases. No security or reliability issues were identified.

✗ Test Coverage

The new tests cover the happy paths for TOOL_EXECUTION_START and TOOL_EXECUTION_COMPLETE, missing fields on START, None result on COMPLETE, failure result type, and a full interleaved sequence. Coverage is generally solid. However, there is likely a missing import for ToolResult in the test file (the diff does not show it being added to imports), and there is no test for the TOOL_EXECUTION_COMPLETE handler when event.data itself has missing fields (the missing-fields test only covers the START event type). Additionally, there is no test verifying that a success result with an error field does NOT propagate the error as an exception, which is an important edge case of the exception = error if result_type == "failure" else None logic.

✗ Design Approach

The TOOL_EXECUTION_COMPLETE handler is built on incorrect assumptions about the SDK's type system, making it silently broken in production. The event.data.result field is typed as copilot.generated.session_events.Result (a dataclass with content: str, contents, and detailed_content fields), not copilot.types.ToolResult. The diff accesses .text_result_for_llm, .result_type, and .error on this object — none of which exist on session_events.Result — so result_text will always be "" and failure detection will never fire. Failure information (success: Optional[bool], error: Union[ErrorClass, str, None]) lives directly on event.data, not inside event.data.result. The tests mask this by mocking tool_event_data.result with a copilot.types.ToolResult TypedDict and using getattr on it; since TypedDicts are plain dicts and don't expose keys as attributes, getattr(result_obj, "text_result_for_llm", "") also always returns "" — meaning the assertion assert content.result == "Sunny, 72°F" should actually fail, indicating the tests themselves are not passing against the real code path.

Flagged Issues

  • event.data.result is copilot.generated.session_events.Result (fields: content, contents, detailed_content), NOT copilot.types.ToolResult. The handler accesses .text_result_for_llm, .result_type, and .error — none of which exist on Result — so result_text will always be "" and failure detection will never fire. Use result_obj.content for the result text, and read failure state from event.data.success (Optional[bool]) and event.data.error (Union[ErrorClass, str, None]) directly.
  • The tests mock tool_event_data.result with ToolResult(...) (a TypedDict, i.e. a plain dict) and the implementation calls getattr(result_obj, "text_result_for_llm", "") on it. getattr on a dict does not expose keys as attributes, so this always returns "". The assertion assert content.result == "Sunny, 72°F" should fail, meaning the tests are not actually validating the implementation. Additionally, ToolResult is used (lines 514, 636, 688) but no import is shown in the diff — though the correct fix is to use session_events.Result instead.

Suggestions

  • Add a test for TOOL_EXECUTION_COMPLETE with missing fields on event.data (analogous to the existing missing-fields test which only covers START), to exercise the getattr fallback paths for tool_call_id and result in the COMPLETE handler.
  • Add a test for TOOL_EXECUTION_COMPLETE where the result is successful but an error field is present, to validate that the exception guard correctly returns None.
  • When correcting the field access, consider that ToolResultType includes "rejected" and "denied" in addition to "failure" — all are non-success states and should likely surface the error. With the corrected event.data.success approach, ensure all failure conditions are handled.

Automated review by moonbox3's agents

- Read result text from session_events.Result.content (not ToolResult.text_result_for_llm)
- Read failure state from event.data.success/error (not result_obj.result_type/error)
- Handle ErrorClass.message and plain string errors
- Update tests to use session_events.Result and ErrorClass
- Add tests for string errors, success-with-error, and COMPLETE missing fields

Signed-off-by: James Sturtevant <jsturtevant@gmail.com>
@jsturtevant
Copy link
Contributor Author

@moonbox3 I've addressed the comments

@moonbox3 moonbox3 added this pull request to the merge queue Mar 20, 2026
Merged via the queue into microsoft:main with commit b4c4f50 Mar 20, 2026
31 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Python: [Bug]: GitHubCopilotAgent streaming silently drops tool execution events

5 participants