[Bug]: @trace_class on EventQueue generates excessive spans during LLM streaming (1500+/session) #1034

@tomoyukiy0

What happened?

Problem

The @trace_class(kind=SpanKind.SERVER) decorator is applied to EventQueue and related high-frequency classes without an exclude_list:

  • a2a/server/events/event_queue.py: EventQueue (v0.3.x) / EventQueueLegacy (v1.0.x)
  • a2a/server/events/event_consumer.py: EventConsumer
  • a2a/server/events/in_memory_queue_manager.py: InMemoryQueueManager

As a result, a span is created for every call to high-frequency methods such as:

  • EventQueue.enqueue_event — called once per streamed LLM token
  • EventQueue.dequeue_event — called once per streamed LLM token
  • EventQueue.task_done — called once per streamed LLM token

For a typical LLM streaming response of ~500 tokens, these three calls per token generate 1500+ spans per session (3 × 500). Most of them provide no actionable observability value, since they represent fine-grained internal queue operations rather than meaningful request-level events.

Impact

  1. Breaks span-quota-limited systems — AWS Bedrock AgentCore Online Evaluation has a hard limit of 1000 spans per evaluated session. Sessions exceeding this limit are silently skipped, leaving evaluations unusable for any A2A-based agent doing non-trivial LLM streaming.

  2. Increased observability costs — CloudWatch Logs storage, network bandwidth, and memory overhead for spans that are mostly noise.

  3. Approaches span size quotas — A single session with many internal spans can approach the 15 MB/session span data limit.
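To make the quota problem concrete, here is a rough budget calculation using the figures quoted in this issue. The baseline of ~53 non-queue spans is an assumption borrowed from the verification numbers further down, and the linear model (three queue spans per token) is a simplification, so treat the result as an estimate only.

```python
# Back-of-the-envelope span budget, using figures quoted in this issue.
SPANS_PER_TOKEN = 3        # enqueue_event + dequeue_event + task_done
BASELINE_SPANS = 53        # assumption: approximate non-queue spans per session
SESSION_SPAN_LIMIT = 1000  # AgentCore Online Evaluation limit cited above


def estimated_spans(tokens: int) -> int:
    """Estimate total spans for a session that streams `tokens` tokens."""
    return BASELINE_SPANS + SPANS_PER_TOKEN * tokens


def max_tokens_within_limit() -> int:
    """Largest token count whose estimated span total stays within the limit."""
    return (SESSION_SPAN_LIMIT - BASELINE_SPANS) // SPANS_PER_TOKEN


print(estimated_spans(500))       # 1553
print(max_tokens_within_limit())  # 315
```

Under these assumptions, any streamed response longer than roughly 315 tokens would push a session over the 1000-span limit.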

Environment variable OTEL_INSTRUMENTATION_A2A_SDK_ENABLED=false is too coarse

The existing environment variable disables all A2A tracing, including the useful RequestHandler-level spans. There is no way to selectively disable only the high-frequency internal spans.

Current workaround

Users can apply a runtime monkey-patch at application startup to unwrap the @trace_class decorator on the high-frequency classes, restoring the original methods via __wrapped__ (which is preserved by functools.wraps in trace_function). This is fragile and requires knowledge of internal SDK structure.
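A minimal sketch of that workaround is shown here. The helper name unwrap_traced_methods and the DemoQueue / fake_trace stand-ins are hypothetical and exist only to make the example self-contained; in a real application the helper would be called on the SDK classes (for example EventQueue from a2a.server.events.event_queue) before any requests are served. It assumes the tracing wrapper exposes the original function as __wrapped__, as described above.

```python
import functools
import inspect


def unwrap_traced_methods(cls: type, method_names: list[str]) -> None:
    """Replace each named method on `cls` with its undecorated original.

    Relies on the wrapper exposing the original as `__wrapped__`,
    which functools.wraps sets automatically. Only plain functions
    defined directly on the class are handled.
    """
    for name in method_names:
        method = cls.__dict__.get(name)
        if not inspect.isfunction(method):
            continue
        original = getattr(method, "__wrapped__", None)
        if original is not None:
            setattr(cls, name, original)


# --- Stand-ins for demonstration only (not part of the SDK) ---
calls: list[str] = []


def fake_trace(func):
    """Toy decorator that records each call, mimicking span creation."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        calls.append(func.__name__)
        return func(*args, **kwargs)
    return wrapper


class DemoQueue:
    @fake_trace
    def enqueue_event(self, event):
        return event


unwrap_traced_methods(DemoQueue, ["enqueue_event"])
result = DemoQueue().enqueue_event("token")
print(calls)  # [] because the wrapper is no longer invoked
```

The sketch also shows why the approach is fragile: it depends on method names and on the decorator's internals, both of which can change between SDK releases.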

Proposed fix

Add an exclude_list (or equivalent) to the @trace_class application on the high-frequency classes. For example:

# a2a/server/events/event_queue.py

@trace_class(
    kind=SpanKind.SERVER,
    exclude_list=['enqueue_event', 'dequeue_event', 'task_done', 'clear_events'],
)
class EventQueue:
    ...

Similar changes would apply to EventConsumer and InMemoryQueueManager. The high-frequency internal methods would no longer generate spans, while the class-level tracing decorator would be preserved for any other methods that might be added in the future.
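The effect of an exclude list can be illustrated with a toy class decorator. This is not the SDK's trace_class implementation; toy_trace_class, ToyQueue, and span_log are hypothetical names, the methods are synchronous for brevity, and a list append stands in for span creation.

```python
import functools
import inspect

span_log: list[str] = []  # stand-in for exported spans


def toy_trace_class(exclude_list=None):
    """Toy class decorator: record a 'span' per public method call,
    skipping any method named in exclude_list."""
    excluded = set(exclude_list or [])

    def decorator(cls):
        for name, attr in list(vars(cls).items()):
            if name.startswith("_") or name in excluded:
                continue
            if inspect.isfunction(attr):
                setattr(cls, name, _traced(cls.__name__, attr))
        return cls

    return decorator


def _traced(class_name, func):
    """Wrap func so that each call is logged before it runs."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        span_log.append(f"{class_name}.{func.__name__}")
        return func(*args, **kwargs)
    return wrapper


@toy_trace_class(exclude_list=["enqueue_event", "dequeue_event", "task_done"])
class ToyQueue:
    def __init__(self):
        self._items = []

    def enqueue_event(self, event):
        self._items.append(event)

    def dequeue_event(self):
        return self._items.pop(0)

    def task_done(self):
        pass

    def close(self):
        self._items.clear()


queue = ToyQueue()
for token in range(500):  # simulate a 500-token stream
    queue.enqueue_event(token)
    queue.dequeue_event()
    queue.task_done()
queue.close()

print(span_log)  # ['ToyQueue.close']
```

With an empty exclude list the same loop would record 1501 entries, which mirrors the span explosion described in this issue.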

Verification

I have verified locally that:

  • The trace_class mechanism already supports exclude_list
  • Applying the fix reduces spans from 1500+ to ~53 per session (97% reduction)
  • Useful RequestHandler traces (DefaultRequestHandler, JSONRPCHandler, RESTHandler) and client transport traces are preserved

Happy to submit a PR with the proposed changes if this direction is acceptable.

Labels: component: server (issues related to frameworks for agent execution, HTTP/event handling, database persistence logic)
