
Python: avoid duplicate agent response telemetry#4685

Merged
eavanvalkenburg merged 4 commits into microsoft:main from eavanvalkenburg:copilot/issue-4675-duplicate-telemetry
Mar 20, 2026

Conversation

@eavanvalkenburg
Member

Motivation and Context

Nested agent runs currently record the same response ID and token usage on both the outer invoke_agent span and the inner chat completion span. That duplicates telemetry for a single response and makes span-level metrics noisier than they should be. Fixes #4675.

Description

  • stop AgentTelemetryLayer from attaching response_id and token usage to agent spans
  • keep response telemetry ownership on the inner chat span while preserving the rest of the agent span metadata
  • add regression coverage for nested agent/chat telemetry in streaming and non-streaming paths, plus helper coverage for suppressing response_id

Contribution Checklist

  • The code builds clean without any errors or warnings
  • The PR follows the Contribution Guidelines
  • All unit tests pass, and I have added new tests where possible
  • Is this a breaking change? If yes, add "[BREAKING]" prefix to the title of the PR.

Copilot AI review requested due to automatic review settings March 13, 2026 11:07
@markwallace-microsoft
Member

markwallace-microsoft commented Mar 13, 2026

Python Test Coverage

Python Test Coverage Report

| File | Stmts | Miss | Cover | Missing |
| --- | --- | --- | --- | --- |
| packages/core/agent_framework/observability.py | 738 | 30 | 95% | 388–389, 416, 418–420, 423–425, 430–431, 437–438, 444–445, 735, 935–936, 1098, 1344–1345, 1597–1601, 1799, 1997, 2215, 2217 |
| TOTAL | 27297 | 3225 | 88% | |

Python Unit Test Overview

| Tests | Skipped | Failures | Errors | Time |
| --- | --- | --- | --- | --- |
| 5335 | 20 💤 | 0 ❌ | 0 🔥 | 1m 26s ⏱️ |

Copilot AI (Contributor) left a comment

Pull request overview

This PR updates Python observability to prevent duplicate response telemetry when an Agent run produces both an outer agent span and an inner chat completion span, ensuring response ownership (response id + token usage) stays on the chat span.

Changes:

  • Suppress gen_ai.response.id and gen_ai.usage.* attributes on invoke_agent spans.
  • Extend _get_response_attributes with a capture_response_id switch (in addition to capture_usage).
  • Add regression tests covering nested agent/chat telemetry for streaming and non-streaming paths.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| python/packages/core/agent_framework/observability.py | Stops agent spans from capturing response id and token usage; adds capture_response_id option to response attribute extraction. |
| python/packages/core/tests/core/test_observability.py | Updates agent-span assertions and adds regression tests to ensure response telemetry is only attached to chat spans. |


@TaoChenOSU
Contributor
Is this the correct thing to do? What if customers do want the response IDs in the agent spans?

The span names are different for the agent and LLM layers: invoke_agent and chat. Is that not enough to dedup the data? @sphenry

@eavanvalkenburg
Member Author

> Is this the correct thing to do? What if customers do want the response IDs in the agent spans?
>
> The span names are different for the agent and LLM layers: invoke_agent and chat. Is that not enough to dedup the data? @sphenry

The issue here is that 1) you can have two response ids in a single agent span when you have function calling, which means only the last one gets set at the agent level, so that value is already wrong; and 2) they apparently use response_id to do token counting, and with the id set in two places the usage counts double. You are right that they should be able to filter to only count at the chat level and not the invoke_agent level, but then the response id at the agent level is still wrong (I think it sums up the token count of all underlying responses). And in a mixed setup, where some agents are chat agents (with a response id on both the chat and invoke_agent spans) and some agents only have invoke_agent spans, like A2A, it makes their lives a lot easier if they do not have to worry about this: they can just add up all usages wherever there is a response_id and go from there.
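The counting approach described in that comment can be sketched as downstream consumer code (hypothetical, not part of this PR): once gen_ai.response.id only appears on the span that owns the response, a single filter over exported spans avoids double counting.

```python
def total_tokens(spans: list[dict]) -> int:
    """Sum token usage across exported spans, counting each response once.

    A span owns a response when it carries gen_ai.response.id: the chat span
    for chat agents, or the invoke_agent span for agents without an inner
    chat span (e.g. A2A). Spans without the id are skipped.
    """
    return sum(
        span.get("gen_ai.usage.input_tokens", 0)
        + span.get("gen_ai.usage.output_tokens", 0)
        for span in spans
        if "gen_ai.response.id" in span
    )
```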

@sphenry
Member

sphenry commented Mar 18, 2026

@TaoChenOSU Is there a scenario where they would want it in both?

@TaoChenOSU
Contributor

> Is this the correct thing to do? What if customers do want the response IDs in the agent spans?
> The span names are different for the agent and LLM layers: invoke_agent and chat. Is that not enough to dedup the data? @sphenry
>
> The issue here is that 1) you can have two response ids in a single agent span when you have function calling, which means only the last one gets set at the agent level, so that value is already wrong; and 2) they apparently use response_id to do token counting, and with the id set in two places the usage counts double. You are right that they should be able to filter to only count at the chat level and not the invoke_agent level, but then the response id at the agent level is still wrong (I think it sums up the token count of all underlying responses). And in a mixed setup, where some agents are chat agents (with a response id on both the chat and invoke_agent spans) and some agents only have invoke_agent spans, like A2A, it makes their lives a lot easier if they do not have to worry about this: they can just add up all usages wherever there is a response_id and go from there.

Could you explain the first scenario further? That does sound like a bug.

For 2) why couldn't they just use data from the agent spans and not worry about the chat spans at all?

@TaoChenOSU
Contributor

> @TaoChenOSU Is there a scenario where they would want it in both?

I don't have a particular scenario, but I think we should record as much data as we can at each layer because customers rely on the traces to monitor applications. It's generally bad if we selectively drop things.

We should only care about recording the data, i.e. creating the spans and giving them the expected attributes. The application should take care of sending the data to a monitoring backend. Then the consumer of the data can decide how they want to use or parse the data.

@eavanvalkenburg eavanvalkenburg added this pull request to the merge queue Mar 19, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 19, 2026
@eavanvalkenburg eavanvalkenburg added this pull request to the merge queue Mar 19, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 19, 2026
@eavanvalkenburg eavanvalkenburg force-pushed the copilot/issue-4675-duplicate-telemetry branch from 55cc6e8 to 95dbff7 on March 19, 2026 at 15:38
@eavanvalkenburg eavanvalkenburg force-pushed the copilot/issue-4675-duplicate-telemetry branch from 95dbff7 to 2f506a1 on March 20, 2026 at 08:55
@eavanvalkenburg eavanvalkenburg force-pushed the copilot/issue-4675-duplicate-telemetry branch from 2f506a1 to c12300c on March 20, 2026 at 09:00
The invoke_agent span now carries the aggregated input/output token
counts from all inner chat completion spans that occur during an agent
run. Previously, when inner ChatTelemetryLayer spans captured usage,
the outer AgentTelemetryLayer skipped setting usage entirely to avoid
duplication. Now a new INNER_ACCUMULATED_USAGE context variable tracks
cumulative usage across all inner completions, and the agent span
always reports the total.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@eavanvalkenburg eavanvalkenburg added this pull request to the merge queue Mar 20, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 20, 2026
@eavanvalkenburg eavanvalkenburg added this pull request to the merge queue Mar 20, 2026
Merged via the queue into microsoft:main with commit 81e2336 Mar 20, 2026
31 checks passed


Development

Successfully merging this pull request may close these issues.

Python: [Bug]: Duplicate LLM Telemetry Emission

7 participants