Architectural review: OTel pipeline scaling observations for enterprise deployments

## Context

I was invited to review the traceAI OTel pipeline before launch, specifically to evaluate how it holds up under enterprise-scale workloads. I run OTel collectors and custom receivers in production Kubernetes clusters processing millions of spans/day, so these observations come from that perspective.

Overall the architecture is solid — wrapping standard OTel SDK components and supporting multiple semantic conventions is the right call. But there are a few things in `fi_instrumentation/otel.py` that will bite platform teams running this at scale.

## Findings

### 1. Synchronous HTTP call blocks tracer initialization

`register()` calls `check_custom_eval_config_exists()` at line 134, which does a blocking `requests.post()` (line 677) with no timeout parameter. In a Kubernetes pod where the container startup budget is tight, this adds uncontrolled latency to the critical path. If the FI API is temporarily unreachable (network partition, cold start, DNS resolution lag), the whole application hangs at init.

Suggestion: add a `timeout=` to the requests call (even 2-3s would help), or defer the check to a background thread so the tracer can start accepting spans immediately.

### 2. Exporter `shutdown()` doesn't call `super().shutdown()`

Both `GRPCSpanExporter.shutdown()` (line 564) and `HTTPSpanExporter.shutdown()` (line 605) only close `self._session` but never call `super().shutdown()`. The parent class (`OTLPSpanExporter`) has its own shutdown logic that flushes any buffered data. Skipping it means pending spans are silently dropped on process exit. This is a data loss bug in any deployment that uses graceful shutdown (which is every Kubernetes deployment via SIGTERM).

Fix: call `super().shutdown()` after closing the session.

### 3. Signal handler calls `sys.exit(0)` on SIGTERM

`setup_signal_handlers()` at line 332 registers a handler that calls `sys.exit(0)` when SIGTERM is received. In Kubernetes, SIGTERM is the standard graceful shutdown signal. Calling `sys.exit(0)` from a signal handler skips `finally` blocks and cleanup code in the user's application. The tracer should flush and shut itself down, but it shouldn't kill the process — that's the application's responsibility.

Suggestion: have the signal handler call `self.shutdown()` without `sys.exit()`, or at minimum document this behavior so users know the tracer takes over process lifecycle.

### 4. `_active_spans` dict has no thread safety

`SimpleSpanProcessor._active_spans` (line 403) is a plain `dict` that gets written on `on_start()` and read/deleted on `on_end()`. In async frameworks (FastAPI, aiohttp) or multi-threaded applications, concurrent access to this dict is a race condition. The OTel SDK's own span processor uses a thread-safe queue internally, but this custom tracking dict sits outside that protection.

Suggestion: use a `threading.Lock` around `_active_spans` access, or switch to a `collections.OrderedDict` with a lock if ordering matters for the leak detection.

### 5. `print()` statements in production code paths

Lines 190, 255, 326, 442, 570, 611, and 687 all use `print()` for operational output. In production Kubernetes pods, stdout may or may not be captured depending on the logging driver. The module already has `logger = logging.getLogger(__name__)` defined at line 55 — switching these to `logger.info()` / `logger.warning()` would let platform teams route traceAI operational logs through their existing structured logging pipeline (EFK, Loki, etc.) instead of relying on stdout capture.

### 6. Hardcoded 5s flush timeout on shutdown

`TracerProvider.shutdown()` at line 323 uses `timeout_millis=5000`. For batch workloads that generate thousands of spans in a burst (think: LLM evaluation pipelines), 5 seconds may not be enough to flush the export queue. Kubernetes sends SIGTERM and waits `terminationGracePeriodSeconds` (default 30s) before SIGKILL — there's usually budget to spare. Making this configurable (env var or constructor param) would be a small but meaningful improvement.

### 7. Go SDK gap

The roadmap mentions Go support, and there's currently no `go/` directory. A lot of enterprise teams are running Go services that call LLM APIs (either directly or through gateways). I'm happy to contribute a Go SDK following the same instrumentor pattern as Python/TypeScript — I maintain an OpenTelemetry GPU receiver in Go and have worked with the OTel Go SDK extensively. Will open a separate PR for this.

---

Happy to submit PRs for any of the above. Prioritizing the exporter shutdown bug (#2) and the Go SDK (#7) since those have the most immediate impact.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Architectural review: OTel pipeline scaling observations for enterprise deployments #164

Context

Findings

1. Synchronous HTTP call blocks tracer initialization

2. Exporter `shutdown()` doesn't call `super().shutdown()`

3. Signal handler calls `sys.exit(0)` on SIGTERM

4. `_active_spans` dict has no thread safety

5. `print()` statements in production code paths

6. Hardcoded 5s flush timeout on shutdown

7. Go SDK gap

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Architectural review: OTel pipeline scaling observations for enterprise deployments #164

Description

Context

Findings

1. Synchronous HTTP call blocks tracer initialization

2. Exporter shutdown() doesn't call super().shutdown()

3. Signal handler calls sys.exit(0) on SIGTERM

4. _active_spans dict has no thread safety

5. print() statements in production code paths

6. Hardcoded 5s flush timeout on shutdown

7. Go SDK gap

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

2. Exporter `shutdown()` doesn't call `super().shutdown()`

3. Signal handler calls `sys.exit(0)` on SIGTERM

4. `_active_spans` dict has no thread safety

5. `print()` statements in production code paths