feat(anthropic): conform instrumentation to OTel GenAI semantic conventions#3808
Conversation
📝 Walkthrough

Switches Anthropic instrumentation from legacy flat `gen_ai.prompt`/`gen_ai.completion` attributes to structured GenAI semantic attributes (`gen_ai.input.messages`, `gen_ai.output.messages`, `gen_ai.system_instructions`, `gen_ai.tool.definitions`, `gen_ai.response.finish_reasons`), uses GenAi enum values for provider/system/operation attributes, and updates tests and version.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (warning)
Actionable comments posted: 2
🧹 Nitpick comments (2)
packages/opentelemetry-instrumentation-anthropic/tests/test_thinking.py (1)
52-56: Consider handling missing role gracefully.

Using `next(m for m in output_messages if m.get("role") == "thinking")` will raise `StopIteration` if no thinking message exists. While this is likely intentional to fail the test, adding a default or an explicit assertion message would improve test diagnostics:

```python
thinking_msg = next((m for m in output_messages if m.get("role") == "thinking"), None)
assert thinking_msg is not None, "Expected thinking message not found in output"
```

This is a minor improvement for test clarity.
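The difference between the two failure modes can be seen in a tiny standalone sketch (the message list below is a made-up stand-in for the parsed `gen_ai.output.messages` attribute, not real span data):

```python
# Stand-in data: a parsed output_messages list with no "thinking" entry.
output_messages = [{"role": "assistant", "content": "Hello"}]

# Without a default, next() raises StopIteration when nothing matches,
# which pytest reports as an opaque error rather than a failed assertion.
try:
    next(m for m in output_messages if m.get("role") == "thinking")
    found = True
except StopIteration:
    found = False

# With a default, the lookup degrades to None and the subsequent assert
# can carry a descriptive failure message instead.
thinking_msg = next((m for m in output_messages if m.get("role") == "thinking"), None)

print(found, thinking_msg)  # False None
```

Either way the test fails when the message is missing; the default-based form just fails with a readable message.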
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@packages/opentelemetry-instrumentation-anthropic/tests/test_thinking.py` around lines 52-56: replace the use of `next(...)` without a default when locating the thinking message to avoid `StopIteration`; in the test reading `output_messages` from `anthropic_span.attributes[GenAIAttributes.GEN_AI_OUTPUT_MESSAGES]`, use `next((m for m in output_messages if m.get("role") == "thinking"), None)` to assign `thinking_msg` and then assert `thinking_msg is not None` with a clear message (e.g., "Expected thinking message not found in output") before checking `thinking_msg["content"]`; apply the same defensive pattern or clear assertions for `assistant_msg` if desired.

packages/opentelemetry-instrumentation-anthropic/tests/test_messages.py (1)
92-97: Consider extracting repeated span-message parsing/assertion helpers.

The same parse/assert pattern appears in many tests, which increases maintenance cost for future semconv shape updates.

Refactor sketch:

```python
def assert_io_messages(span, expected_input_role, expected_input_content,
                       expected_output_role, expected_output_content):
    input_messages = json.loads(span.attributes[GenAIAttributes.GEN_AI_INPUT_MESSAGES])
    output_messages = json.loads(span.attributes[GenAIAttributes.GEN_AI_OUTPUT_MESSAGES])
    assert input_messages[0]["role"] == expected_input_role
    assert input_messages[0]["content"] == expected_input_content
    assert output_messages[-1]["role"] == expected_output_role
    assert output_messages[-1]["content"] == expected_output_content
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/opentelemetry-instrumentation-anthropic/tests/test_messages.py` around lines 92 - 97, Extract a small test helper to centralize parsing and assertions for span message attributes used across tests: create a utility (e.g., parse_span_messages or assert_span_message) that accepts a span and an attribute key (GenAIAttributes.GEN_AI_INPUT_MESSAGES or GenAIAttributes.GEN_AI_OUTPUT_MESSAGES) and returns the parsed JSON list or asserts the expected content/role against a provided expected message; then replace the repeated json.loads and assertions in tests referencing anthropic_span and response with calls to this helper to reduce duplication and make future semconv shape updates easier to change in one place.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 55b633d3-e07f-4128-9a6c-7ee9773fb151
⛔ Files ignored due to path filters (1)
`packages/opentelemetry-instrumentation-anthropic/uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (9)
- packages/opentelemetry-instrumentation-anthropic/opentelemetry/instrumentation/anthropic/__init__.py
- packages/opentelemetry-instrumentation-anthropic/opentelemetry/instrumentation/anthropic/span_utils.py
- packages/opentelemetry-instrumentation-anthropic/tests/test_bedrock_with_raw_response.py
- packages/opentelemetry-instrumentation-anthropic/tests/test_completion.py
- packages/opentelemetry-instrumentation-anthropic/tests/test_messages.py
- packages/opentelemetry-instrumentation-anthropic/tests/test_prompt_caching.py
- packages/opentelemetry-instrumentation-anthropic/tests/test_semconv_span_attrs.py
- packages/opentelemetry-instrumentation-anthropic/tests/test_structured_outputs.py
- packages/opentelemetry-instrumentation-anthropic/tests/test_thinking.py
```diff
 set_span_attribute(
-    span, GenAIAttributes.GEN_AI_REQUEST_MAX_TOKENS, kwargs.get("max_tokens_to_sample")
+    span, GenAIAttributes.GEN_AI_REQUEST_MAX_TOKENS, kwargs.get("max_tokens_to_sample") or kwargs.get("max_tokens")
 )
```
max_tokens precedence is inverted from PR description.
The PR description states: "gen_ai.request.max_tokens now reads max_tokens (Messages API) with fallback to max_tokens_to_sample (legacy Completions API)". However, the code uses max_tokens_to_sample or max_tokens, which prioritizes the legacy parameter.
For the Messages API (which is the primary/modern API), max_tokens should take precedence:
🔧 Proposed fix

```diff
 set_span_attribute(
-    span, GenAIAttributes.GEN_AI_REQUEST_MAX_TOKENS, kwargs.get("max_tokens_to_sample") or kwargs.get("max_tokens")
+    span, GenAIAttributes.GEN_AI_REQUEST_MAX_TOKENS, kwargs.get("max_tokens") or kwargs.get("max_tokens_to_sample")
 )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@packages/opentelemetry-instrumentation-anthropic/opentelemetry/instrumentation/anthropic/span_utils.py`
around lines 78 - 80, The GEN_AI_REQUEST_MAX_TOKENS attribute currently prefers
the legacy parameter by calling kwargs.get("max_tokens_to_sample") or
kwargs.get("max_tokens"); change this to prefer the Messages API by using
kwargs.get("max_tokens") or kwargs.get("max_tokens_to_sample") when calling
set_span_attribute (referencing set_span_attribute and
GenAIAttributes.GEN_AI_REQUEST_MAX_TOKENS and the kwargs keys) so max_tokens
takes precedence with max_tokens_to_sample as the fallback.
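The precedence question above comes down to Python's short-circuiting `or`. A minimal sketch, assuming the fix is applied (`resolve_max_tokens` is a hypothetical helper, not part of the package):

```python
def resolve_max_tokens(kwargs):
    # Messages API key first, legacy Completions API key as fallback.
    # Caveat: `or` also skips falsy values, so an explicit max_tokens=0
    # would fall through to the legacy key; production code may prefer
    # an explicit `is not None` check.
    return kwargs.get("max_tokens") or kwargs.get("max_tokens_to_sample")

both = resolve_max_tokens({"max_tokens": 1024, "max_tokens_to_sample": 256})
legacy_only = resolve_max_tokens({"max_tokens_to_sample": 256})
neither = resolve_max_tokens({})
print(both, legacy_only, neither)  # 1024 256 None
```

With the operands in this order, a caller using the modern Messages API always wins, and legacy Completions API callers still get their value recorded.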
```python
input_messages = json.loads(anthropic_span.attributes[GenAIAttributes.GEN_AI_INPUT_MESSAGES])
assert input_messages[0]["content"] == "What is the weather and current time in San Francisco?"
assert input_messages[0]["role"] == "user"
msg1_content = json.loads(input_messages[1]["content"])
assert msg1_content[0]["text"] == "I'll help you get the weather and current time in San Francisco."
assert input_messages[1]["role"] == "assistant"
assert json.loads(input_messages[2]["content"]) == [
```
Add an explicit regression assertion for non-duplicated tool-use in input history.
This block validates assistant history content but does not explicitly assert that tool-use is not duplicated between content and tool_calls in input messages (one of the PR bugfix goals). Add a direct guard here.
Suggested assertion add-on:

```diff
 msg1_content = json.loads(input_messages[1]["content"])
 assert msg1_content[0]["text"] == "I'll help you get the weather and current time in San Francisco."
+tool_use_blocks = [b for b in msg1_content if b.get("type") == "tool_use"]
+assert len(tool_use_blocks) == 1
+assert "tool_calls" not in input_messages[1]
 assert input_messages[1]["role"] == "assistant"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/opentelemetry-instrumentation-anthropic/tests/test_messages.py`
around lines 1567 - 1573, The test validates assistant history but misses an
explicit regression guard to ensure tool-use isn't duplicated between message
content and tool_calls; update the test around anthropic_span and
GenAIAttributes.GEN_AI_INPUT_MESSAGES to parse input_messages and the associated
tool_calls (from
anthropic_span.attributes[GenAIAttributes.GEN_AI_INPUT_MESSAGES] or nearby
attribute used in the test) and add an assertion that no assistant message's
content string(s) (e.g., msg1_content or input_messages[n]["content"]) appear
again in the corresponding tool_calls entries—i.e., explicitly assert that for
assistant-role messages (input_messages[i]["role"] == "assistant") the textual
content is not duplicated in the parsed tool_calls list.
♻️ Duplicate comments (1)
packages/opentelemetry-instrumentation-anthropic/tests/test_messages.py (1)
1567-1579: ⚠️ Potential issue | 🟡 Minor

Add regression assertion for tool_calls on assistant message with history.

The test validates the assistant message's `content` is correctly extracted as text, but per the PR's bug fix (tool-use blocks no longer duplicated), the test should also verify that `tool_calls` exists on `input_messages[1]` with the extracted tool_use data.

Suggested assertion add-on:

```diff
 assert input_messages[1]["content"] == "I'll help you get the weather and current time in San Francisco."
 assert input_messages[1]["role"] == "assistant"
+# Verify tool_use block is captured in tool_calls (not duplicated in content)
+assert "tool_calls" in input_messages[1]
+assert len(input_messages[1]["tool_calls"]) == 1
+assert input_messages[1]["tool_calls"][0]["id"] == "call_1"
+assert input_messages[1]["tool_calls"][0]["name"] == "get_weather"
 assert json.loads(input_messages[2]["content"]) == [
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/opentelemetry-instrumentation-anthropic/tests/test_messages.py` around lines 1567 - 1579, Add an assertion that the assistant message in input_messages (the second element, input_messages[1]) contains a tool_calls field with the extracted tool use data from the tool-use block; locate where input_messages is derived from anthropic_span.attributes[GenAIAttributes.GEN_AI_INPUT_MESSAGES] and add a check that input_messages[1]["tool_calls"] (or its JSON-decoded equivalent if stored as a string) matches the expected tool_use object (e.g., type/tool_use_id/content matching the previous tool result).
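To make the non-duplication guard concrete, here is a small self-contained sketch of the shape being asserted (the message dict and tool-call fields are illustrative stand-ins, not the package's exact output):

```python
import json

# Illustrative assistant history message after the fix: the tool-use block
# lives in tool_calls, and content holds only the extracted text.
assistant_msg = {
    "role": "assistant",
    "content": "I'll help you get the weather and current time in San Francisco.",
    "tool_calls": [
        {
            "id": "call_1",
            "name": "get_weather",
            "arguments": json.dumps({"city": "San Francisco"}),
        }
    ],
}

# Regression guard: exactly one tool call, and the textual content is not
# repeated inside the serialized tool_calls entries.
assert len(assistant_msg["tool_calls"]) == 1
assert assistant_msg["content"] not in json.dumps(assistant_msg["tool_calls"])
```

Asserting on the serialized `tool_calls` catches the duplication bug regardless of which nested field the text might leak into.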
🧹 Nitpick comments (1)
packages/opentelemetry-instrumentation-anthropic/tests/test_semconv_span_attrs.py (1)
439-495: Consider extracting repeated span capture pattern (optional refactor).

The span identity tests (`test_gen_ai_system_value_is_lowercase_anthropic`, `test_gen_ai_provider_name_is_set`, `test_gen_ai_operation_name_*`, `test_awrap_*`) share identical mocking boilerplate. A helper or fixture could reduce duplication.

Example helper extraction:

```python
def capture_span_attributes(to_wrap, kwargs, use_awrap=False):
    """Helper to capture span attributes from _wrap/_awrap calls."""
    from unittest.mock import patch, MagicMock, AsyncMock
    from opentelemetry.instrumentation.anthropic import _wrap, _awrap

    tracer = MagicMock()
    captured = {}

    def fake_start_span(name, kind, attributes):
        captured["attributes"] = attributes
        span = MagicMock()
        span.is_recording.return_value = False
        return span

    tracer.start_span.side_effect = fake_start_span
    wrapped_fn = AsyncMock(return_value=None) if use_awrap else MagicMock(return_value=None)
    with patch("opentelemetry.context.get_value", return_value=False):
        fn = (_awrap if use_awrap else _wrap)(tracer, None, None, None, None, None, to_wrap)
        if use_awrap:
            import asyncio
            asyncio.run(fn(wrapped_fn, MagicMock(), [], kwargs))
        else:
            fn(wrapped_fn, MagicMock(), [], kwargs)
    return captured["attributes"]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/opentelemetry-instrumentation-anthropic/tests/test_semconv_span_attrs.py` around lines 439 - 495, Several tests repeat the same tracer/mock/start_span boilerplate; extract that into a single helper (e.g. capture_span_attributes) and call it from test_gen_ai_system_value_is_lowercase_anthropic, test_gen_ai_provider_name_is_set and the related test_gen_ai_operation_name_* and test_awrap_* tests to remove duplication; the helper should create the MagicMock tracer, install the fake_start_span side_effect to capture attributes, patch opentelemetry.context.get_value, invoke _wrap or _awrap with the provided to_wrap and kwargs, and return the captured attributes so each test just asserts on the returned dict.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b50caf51-6c87-4c24-be9d-cac9bcb0ec37
📒 Files selected for processing (2)
- packages/opentelemetry-instrumentation-anthropic/tests/test_messages.py
- packages/opentelemetry-instrumentation-anthropic/tests/test_semconv_span_attrs.py
Force-push history:
- 47d522d → 547b275
- 547b275 → 324da4c
- e430fd6 → 0141f4c
- 8fe73f1 → 0e5f2d7
- 740ee81 → 6401889
- 7d1116a → cb0feb9
```diff
-    GenAIAttributes.GEN_AI_SYSTEM: "Anthropic",
-    SpanAttributes.LLM_REQUEST_TYPE: LLMRequestTypeValues.COMPLETION.value,
+    GenAIAttributes.GEN_AI_SYSTEM: GenAiSystemValues.ANTHROPIC.value,
+    GenAIAttributes.GEN_AI_PROVIDER_NAME: GenAiSystemValues.ANTHROPIC.value,
```
@max-deygin-traceloop and @OzBenSimhonTraceloop, here is a reference for both sets (system and provider name).
Commit messages:
- …→ GEN_AI_
  - Import GEN_AI_REQUEST_FREQUENCY/PRESENCE_PENALTY from upstream gen_ai_attributes
  - Use SpanAttributes.GEN_AI_USAGE_TOTAL_TOKENS/IS_STREAMING/RESPONSE_FINISH_REASON/RESPONSE_STOP_REASON (renamed from LLM_*)
  - Remove duplicate LLM_REQUEST_TYPE dict entry (operation_name var already handles it)
  - Update test_messages.py and test_bedrock_with_raw_response.py
- …pan_attrs.py
- Token usage attribute renames:
  - llm.usage.total_tokens → gen_ai.usage.total_tokens
  - gen_ai.usage.cache_creation_input_tokens → gen_ai.usage.cache_creation.input_tokens
  - gen_ai.usage.cache_read_input_tokens → gen_ai.usage.cache_read.input_tokens

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Force-pushed: cb0feb9 → 53ba7f1
What

Refactors the Anthropic instrumentation package to emit span attributes that comply with the OpenTelemetry GenAI semantic conventions spec.

Changes

New attributes (replaces legacy flat gen_ai.prompt.* / gen_ai.completion.* keys):
- … setting

Bug fixes:
- gen_ai.request.max_tokens now reads max_tokens (Messages API) with fallback to max_tokens_to_sample (legacy Completions API)
- … inputs without double-encoding
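For readers unfamiliar with the structured convention, a rough sketch of how the new attributes are produced and consumed; the message shapes below are illustrative only, and the exact schema is defined by the OTel GenAI semantic conventions spec:

```python
import json

# Illustrative message lists; the convention serializes them as JSON strings
# into single span attributes instead of flat indexed gen_ai.prompt.N.* keys.
input_messages = [{"role": "user", "content": "What is the weather in San Francisco?"}]
output_messages = [{"role": "assistant", "content": "It is sunny."}]

span_attributes = {
    "gen_ai.input.messages": json.dumps(input_messages),
    "gen_ai.output.messages": json.dumps(output_messages),
    "gen_ai.response.finish_reasons": ["end_turn"],
}

# Consumers (like the tests in this PR) decode them with json.loads:
decoded = json.loads(span_attributes["gen_ai.input.messages"])
assert decoded[0]["role"] == "user"
```

Serializing once at the attribute boundary is also what makes the double-encoding fix observable: a consumer should need exactly one `json.loads` to recover the message list.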
Checklist
I have added tests that cover my changes.
If adding a new instrumentation or changing an existing one, I've added screenshots from some observability platform showing the change.
PR name follows conventional commits format: feat(instrumentation): ... or fix(instrumentation): ....
(If applicable) I have updated the documentation accordingly.
Validated locally against a running agent using ConsoleSpanExporter.
Confirmed all appear correctly on anthropic.chat spans.

Summary by CodeRabbit
- New Features
- Tests
- Chores