Skip to content

feat(ARCH-341): add OpenTelemetry tracing and reduce INFO logging#41

Merged
emsearcy merged 9 commits intomainfrom
feat/ARCH-341-otel-tracing
Mar 25, 2026
Merged

feat(ARCH-341): add OpenTelemetry tracing and reduce INFO logging#41
emsearcy merged 9 commits intomainfrom
feat/ARCH-341-otel-tracing

Conversation

@emsearcy
Copy link
Contributor

@emsearcy emsearcy commented Mar 24, 2026

Summary

Implements ARCH-341: OTel instrumentation for the LFX MCP server.

Changes

internal/otel/tracing.go (new)

  • SetupSDK bootstraps OTLP trace/metric/log providers from standard OTEL_* env vars.
  • Falls back to a no-op provider when OTEL_TRACES_EXPORTER is unset — zero overhead in stdio/local-dev.
  • endpointURL helper normalises bare host:port values to http://host:port before passing to the SDK.
  • Custom OTEL error handler routes internal errors to slog.Error.

cmd/lfx-mcp-server/main.go

  • Wrap slog.NewJSONHandler with slogotel.OtelHandler so every log record within an active span carries trace_id and span_id (standard OTel hex fields; Datadog log-to-trace correlation uses a remapper rule on these directly — no dd.* conversion needed).
  • Initialise OTel SDK after logger setup, with graceful defer shutdown.
  • Wrap inbound HTTP mux with otelhttp.NewHandler for a root span + W3C TraceContext propagation per request.
  • Wrap outbound LFX API and token-exchange http.Client transports with otelhttp.NewTransport so upstream calls appear as child spans.
  • Rename mcpLoggingMiddlewaremcpOTelMiddleware to reflect its expanded role.
  • Add MCP semantic convention span attributes to mcpOTelMiddleware:
    • mcp.method.name, mcp.session.id — on every call.
    • network.transport"pipe" for stdio, "tcp" for HTTP (detected via RequestExtra.Header).
    • mcp.protocol.version — from the client's initialize params.
    • gen_ai.tool.name + gen_ai.operation.name = "execute_tool" — on tools/call spans.
    • error.type — error string on JSON-RPC-level failures; "tool_error" when a tool returns IsError = true.
  • Demote mcpOTelMiddleware successful completions from InfoDebug (reads are observable via APM spans, not log lines).

internal/tools/*.go

  • Add HTTPClient *http.Client field to all tool config structs (ProjectConfig, CommitteeConfig, MailingListConfig, MemberConfig, MeetingConfig) and thread the shared otelhttp-wrapped client through to lfxv2.ClientConfig.HTTPClient.
  • Convert all partial-result logger.Warn calls to in-band TextContent warnings prepended to the tool response Content slice:
    • get_project: settings unavailable warning.
    • get_committee: settings unavailable warning.
    • get_mailing_list_service / get_mailing_list: settings unavailable warnings.
    • All search tools (search_committees, search_committee_members, search_mailing_lists, search_mailing_list_members, search_meetings, search_meeting_registrants, search past meeting resources): page-truncation warnings.
    • Server-side Warn/Error retained for operator observability.

Helm chart (charts/lfx-mcp/)

  • deployment.yaml: add podAnnotations pass-through, app.otel env stanza (OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_EXPORTER_OTLP_INSECURE, OTEL_TRACES_EXPORTER, OTEL_TRACES_SAMPLE_RATIO), and app.extraEnv pass-through.
  • values.yaml: add podAnnotations: {}, app.otel stanza with empty defaults, and app.extraEnv: [].

go.mod

  • Add: slog-otel, otelhttp, full OTel SDK + OTLP exporters, jaeger propagator.
  • Bump: lfx-v2-committee-service → v0.2.23.

ArgoCD values

See companion draft PR in lfx-v2-argocd (ARCH-341) for HOST_IP extraEnv, otel endpoint, and Datadog pod log annotations.

Jira

ARCH-341

- Add internal/otel/tracing.go: SetupSDK initialises OTLP trace/metric/log
  providers from standard OTEL_* env vars; installs no-op provider when
  OTEL_TRACES_EXPORTER is unset (zero cost in stdio/local-dev).
- Wire slog-otel OtelHandler around slog.NewJSONHandler in main() so every
  log record emitted within an active span carries trace_id and span_id.
  Datadog log-to-trace correlation reads these standard OTEL hex fields
  directly (no dd.trace_id/dd.span_id conversion needed).
- Initialise OTel SDK in main() with graceful shutdown defer.
- Wrap inbound HTTP mux with otelhttp.NewHandler for root span per request
  with W3C TraceContext propagation.
- Wrap outbound LFX API and token-exchange HTTP clients with
  otelhttp.NewTransport so upstream calls appear as child spans.
- Demote mcpLoggingMiddleware success completions from INFO to DEBUG;
  these are observable via APM spans and INFO violates ADR-0002.
- Convert all logger.Warn partial-result calls to in-band TextContent
  warnings prepended to the tool response Content slice (get_project,
  get_committee, get_mailing_list_service, get_mailing_list, and all
  search-page truncation warnings). Server-side Warn/Error retained for
  operator observability.
- Add HTTPClient field to all tool config structs and thread the shared
  otelhttp-wrapped client through to lfxv2.ClientConfig.HTTPClient.
- Helm chart: add podAnnotations pass-through, app.otel stanza
  (endpoint/insecure/tracesExporter/tracesSampleRatio), and app.extraEnv
  pass-through to deployment.yaml and values.yaml.
- Bump go.mod: add slog-otel, otelhttp, otel SDK/exporters/propagators;
  bump lfx-v2-committee-service to v0.2.23.

🤖 Generated with [GitHub Copilot](https://github.com/features/copilot) (via Zed)

Signed-off-by: Eric Searcy <eric@linuxfoundation.org>
Copilot AI review requested due to automatic review settings March 24, 2026 22:18
Copy link

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This repository is configured for manual code reviews. Comment @claude review to trigger a review.

Tip: disable this comment in your organization's Code Review settings.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds OpenTelemetry (OTLP) tracing/export setup and log-to-trace correlation, wires OTel HTTP instrumentation into inbound/outbound requests, and reduces INFO-level noise by moving partial-result warnings into tool responses and demoting routine logs.

Changes:

  • Add internal/otel SDK bootstrap (env-driven OTLP exporters, propagators, resource config) and initialize it from main.
  • Instrument HTTP server and shared outbound http.Client transports with otelhttp, and wrap JSON logging with slog-otel for trace_id/span_id.
  • Helm chart: add pod annotations passthrough and OTEL env var stanzas; tools now accept an injected HTTPClient.

Reviewed changes

Copilot reviewed 12 out of 13 changed files in this pull request and generated 16 comments.

Show a summary per file
File Description
internal/otel/tracing.go New OTEL SDK bootstrap from env vars (trace/metric/log providers, propagators, endpoint normalization).
cmd/lfx-mcp-server/main.go Wrap slog handler for trace correlation, initialize/shutdown OTEL, instrument inbound/outbound HTTP, demote success logging to DEBUG.
internal/tools/project.go Inject shared HTTP client into LFX client config; return partial-result warning in tool response content.
internal/tools/committee.go Inject shared HTTP client; prepend in-band warnings for missing settings and page truncation cases.
internal/tools/committee_write.go Pass shared HTTP client into LFX client construction for write paths.
internal/tools/mailing_list.go Inject shared HTTP client; prepend in-band warnings for missing settings and page truncation cases.
internal/tools/member.go Add HTTP client to tool config and pass through to LFX clients.
internal/tools/member_write.go Pass shared HTTP client into LFX client construction for write paths.
internal/tools/meeting.go Inject shared HTTP client; prepend in-band page truncation warnings for search tools.
charts/lfx-mcp/values.yaml Add podAnnotations, app.otel, and app.extraEnv values.
charts/lfx-mcp/templates/deployment.yaml Render pod annotations; add OTEL-related env vars and extra env passthrough.
go.mod Add OTEL + slog-otel dependencies; bump committee service SDK.
go.sum Add checksums for new dependencies and bumped modules.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

The previous defer otelShutdown(context.Background()) had two problems:
- os.Exit(1) calls inside runHTTPServer would skip the defer entirely,
  losing any buffered spans.
- It used an unbounded context.Background() rather than the same
  10-second shutdown context already in scope.

Pass otelShutdown into runHTTPServer and runStdioServer. In both
functions call it explicitly after httpServer.Shutdown / server.Run
returns, using the existing bounded context so the flush honours the
same timeout as HTTP draining.

🤖 Generated with [GitHub Copilot](https://github.com/features/copilot) (via Zed)

Signed-off-by: Eric Searcy <eric@linuxfoundation.org>
Settings unavailability is a real error (network failure, decode
failure, unexpected server error) not merely a permissions notice.
Operator logs should reflect that. The in-band WARNING TextContent
in the tool response already informs the caller; the server-side log
level should be Error for operator alerting.

🤖 Generated with [GitHub Copilot](https://github.com/features/copilot) (via Zed)

Signed-off-by: Eric Searcy <eric@linuxfoundation.org>
…ogs and spans

Four fixes in this commit:

1. Context logger propagation: store a child logger with mcp.session.id
   and mcp.method.name pre-bound into ctx via tools.WithLogger so every
   newToolLogger(ctx, req) call inherits both fields on all log records,
   not just the two completion lines in mcpLoggingMiddleware.

2. Span attributes: use MCP semconv per the OTel MCP spec
   (https://opentelemetry.io/docs/specs/semconv/gen-ai/mcp/) instead of
   RPC semconv — semconv.McpMethodNameKey.String(method) for
   mcp.method.name and semconv.McpSessionID(sessionID) for mcp.session.id.
   Drop the attribute and RPCSystem imports that are no longer needed.

3. Log field names: align with semconv — "mcp.session.id" and
   "mcp.method.name" instead of "session_id"/"rpc.method".

4. OTel context injection: convert every logger.Xxx call in every tool
   file to logger.XxxContext(ctx, ...) so OtelHandler.Handle receives
   the span context and injects trace_id/span_id into each record.
   Also name hello_world's previously-blank context parameter.

🤖 Generated with [GitHub Copilot](https://github.com/features/copilot) (via Zed)

Signed-off-by: Eric Searcy <eric@linuxfoundation.org>
Rename mcpLoggingMiddleware → mcpOTelMiddleware to reflect that the
function now handles both OTel instrumentation and structured logging.

Add the following span attributes per the OpenTelemetry MCP semantic
conventions (https://opentelemetry.io/docs/specs/semconv/gen-ai/mcp/):

- network.transport: "pipe" for stdio, "tcp" for HTTP (detected via
  presence of HTTP headers in RequestExtra).
- mcp.protocol.version: negotiated version from the client's initialize
  params (Recommended).
- gen_ai.tool.name + gen_ai.operation.name = "execute_tool": set on
  tools/call spans (Conditionally Required / Recommended).
- error.type: set to the error string on JSON-RPC-level failures, or to
  "tool_error" when a tools/call result carries IsError = true
  (Conditionally Required).

🤖 Generated with [GitHub Copilot](https://github.com/features/copilot) (via Zed)

Signed-off-by: Eric Searcy <eric@linuxfoundation.org>
The SDK wraps regular tool errors into IsError=true results; only
*jsonrpc.Error (protocol-level failures such as unknown tool name)
surfaces as a non-nil err from the MethodHandler. Update the comment
to reflect this accurately.

🤖 Generated with [GitHub Copilot](https://github.com/features/copilot) (via Zed)

Signed-off-by: Eric Searcy <eric@linuxfoundation.org>
@emsearcy emsearcy requested a review from bramwelt March 24, 2026 23:47
- Use spec-correct OTEL_EXPORTER_OTLP_PROTOCOL values: replace the
  non-spec OTelProtocolHTTP ("http") constant with OTelProtocolHTTPProtobuf
  ("http/protobuf") and OTelProtocolHTTPJSON ("http/json"). Accept
  "http/json" with a warning and fall back to "http/protobuf" since the
  Go SDK does not implement it. Reject other unknown values.

- Wrap TraceIDRatioBased sampler with trace.ParentBased so incoming
  W3C TraceContext sampling decisions are respected across service
  boundaries.

- Drop OTEL_EXPORTER_OTLP_INSECURE entirely: it is not part of the
  standard OTEL SDK configuration spec. WithEndpointURL already derives
  TLS vs. plaintext from the URL scheme (Insecure = scheme != "https"),
  so passing http:// or https:// in the endpoint is sufficient.
  endpointURL() now always prepends http:// to bare host:port values;
  users requiring TLS supply an explicit https:// scheme. Remove the
  Insecure field from Config, all WithInsecure() calls from provider
  constructors, the OTEL_EXPORTER_OTLP_INSECURE env block from the Helm
  chart, and the insecure key from values.yaml.

- Validate OTEL_TRACES_EXPORTER, OTEL_METRICS_EXPORTER,
  OTEL_LOGS_EXPORTER: only "otlp" and "none" are supported; any other
  value (e.g. "jaeger", "zipkin") emits a warning and defaults to
  "none". Extracted into normaliseExporter() helper.

- Add tools.ParseEnvBool to internal/tools/helpers.go: accepts "true",
  "t", "1", "y", "yes" (case-insensitive), consistent with other boolean
  env var parsing in the codebase. Remove the now-unused parseEnvBool
  from tracing.go.

🤖 Generated with [GitHub Copilot](https://github.com/features/copilot) (via Zed)

Signed-off-by: Eric Searcy <eric@linuxfoundation.org>
ParseEnvBool was added in anticipation of use in tracing.go, but the
OTEL_EXPORTER_OTLP_INSECURE env var was removed entirely rather than
parsed. No callers exist; remove the function and the now-unused
strings import.

🤖 Generated with [GitHub Copilot](https://github.com/features/copilot) (via Zed)

Signed-off-by: Eric Searcy <eric@linuxfoundation.org>
- Move extraEnv block before OTEL env vars in deployment.yaml so that
  HOST_IP (injected via extraEnv from pod status.hostIP) is declared
  before OTEL_EXPORTER_OTLP_ENDPOINT, which references $(HOST_IP).
  Kubernetes resolves $(VAR_NAME) in env declaration order; placing
  extraEnv after the OTEL block left $(HOST_IP) as a literal string.

- Replace non-standard OTEL_TRACES_SAMPLE_RATIO with the spec-correct
  OTEL_TRACES_SAMPLER (enum) and OTEL_TRACES_SAMPLER_ARG (ratio).
  Add newSampler() which constructs the appropriate trace.Sampler from
  the configured OTEL_TRACES_SAMPLER value, supporting all standard
  sampler names: always_on, always_off, traceidratio,
  parentbased_always_on, parentbased_always_off,
  parentbased_traceidratio (default). Unrecognised values warn and fall
  back to parentbased_traceidratio. Rename TracesSampleRatio ->
  TracesSamplerArg in Config. Update Helm chart values keys and
  deployment template accordingly.

🤖 Generated with [GitHub Copilot](https://github.com/features/copilot) (via Zed)

Signed-off-by: Eric Searcy <eric@linuxfoundation.org>
@emsearcy emsearcy merged commit 89daa8c into main Mar 25, 2026
6 checks passed
@emsearcy emsearcy deleted the feat/ARCH-341-otel-tracing branch March 25, 2026 21:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants