Context
I was invited to review the traceAI OTel pipeline before launch, specifically to evaluate how it holds up under enterprise-scale workloads. I run OTel collectors and custom receivers in production Kubernetes clusters processing millions of spans/day, so these observations come from that perspective.
Overall the architecture is solid — wrapping standard OTel SDK components and supporting multiple semantic conventions is the right call. But there are a few things in fi_instrumentation/otel.py that will bite platform teams running this at scale.
Findings
1. Synchronous HTTP call blocks tracer initialization
register() calls check_custom_eval_config_exists() at line 134, which does a blocking requests.post() (line 677) with no timeout parameter. In a Kubernetes pod where the container startup budget is tight, this adds uncontrolled latency to the critical path. If the FI API is temporarily unreachable (network partition, cold start, DNS resolution lag), the whole application hangs at init.
Suggestion: add a timeout= to the requests call (even 2-3s would help), or defer the check to a background thread so the tracer can start accepting spans immediately.
2. Exporter shutdown() doesn't call super().shutdown()
Both GRPCSpanExporter.shutdown() (line 564) and HTTPSpanExporter.shutdown() (line 605) only close self._session but never call super().shutdown(). The parent class (OTLPSpanExporter) has its own shutdown logic that flushes any buffered data. Skipping it means pending spans are silently dropped on process exit. This is a data loss bug in any deployment that uses graceful shutdown (which is every Kubernetes deployment via SIGTERM).
Fix: call super().shutdown() after closing the session.
3. Signal handler calls sys.exit(0) on SIGTERM
setup_signal_handlers() at line 332 registers a handler that calls sys.exit(0) when SIGTERM is received. In Kubernetes, SIGTERM is the standard graceful shutdown signal. Calling sys.exit(0) from a signal handler skips finally blocks and cleanup code in the user's application. The tracer should flush and shut itself down, but it shouldn't kill the process — that's the application's responsibility.
Suggestion: have the signal handler call self.shutdown() without sys.exit(), or at minimum document this behavior so users know the tracer takes over process lifecycle.
4. _active_spans dict has no thread safety
SimpleSpanProcessor._active_spans (line 403) is a plain dict that gets written on on_start() and read/deleted on on_end(). In async frameworks (FastAPI, aiohttp) or multi-threaded applications, concurrent access to this dict is a race condition. The OTel SDK's own span processor uses a thread-safe queue internally, but this custom tracking dict sits outside that protection.
Suggestion: use a threading.Lock around _active_spans access, or switch to a collections.OrderedDict with a lock if ordering matters for the leak detection.
5. print() statements in production code paths
Lines 190, 255, 326, 442, 570, 611, and 687 all use print() for operational output. In production Kubernetes pods, stdout may or may not be captured depending on the logging driver. The module already has logger = logging.getLogger(__name__) defined at line 55 — switching these to logger.info() / logger.warning() would let platform teams route traceAI operational logs through their existing structured logging pipeline (EFK, Loki, etc.) instead of relying on stdout capture.
6. Hardcoded 5s flush timeout on shutdown
TracerProvider.shutdown() at line 323 uses timeout_millis=5000. For batch workloads that generate thousands of spans in a burst (think: LLM evaluation pipelines), 5 seconds may not be enough to flush the export queue. Kubernetes sends SIGTERM and waits terminationGracePeriodSeconds (default 30s) before SIGKILL — there's usually budget to spare. Making this configurable (env var or constructor param) would be a small but meaningful improvement.
7. Go SDK gap
The roadmap mentions Go support, and there's currently no go/ directory. A lot of enterprise teams are running Go services that call LLM APIs (either directly or through gateways). I'm happy to contribute a Go SDK following the same instrumentor pattern as Python/TypeScript — I maintain an OpenTelemetry GPU receiver in Go and have worked with the OTel Go SDK extensively. Will open a separate PR for this.
Happy to submit PRs for any of the above. Prioritizing the exporter shutdown bug (#2) and the Go SDK (#7) since those have the most immediate impact.
Context
I was invited to review the traceAI OTel pipeline before launch, specifically to evaluate how it holds up under enterprise-scale workloads. I run OTel collectors and custom receivers in production Kubernetes clusters processing millions of spans/day, so these observations come from that perspective.
Overall the architecture is solid — wrapping standard OTel SDK components and supporting multiple semantic conventions is the right call. But there are a few things in
fi_instrumentation/otel.pythat will bite platform teams running this at scale.Findings
1. Synchronous HTTP call blocks tracer initialization
register()callscheck_custom_eval_config_exists()at line 134, which does a blockingrequests.post()(line 677) with no timeout parameter. In a Kubernetes pod where the container startup budget is tight, this adds uncontrolled latency to the critical path. If the FI API is temporarily unreachable (network partition, cold start, DNS resolution lag), the whole application hangs at init.Suggestion: add a
timeout=to the requests call (even 2-3s would help), or defer the check to a background thread so the tracer can start accepting spans immediately.2. Exporter
shutdown()doesn't callsuper().shutdown()Both
GRPCSpanExporter.shutdown()(line 564) andHTTPSpanExporter.shutdown()(line 605) only closeself._sessionbut never callsuper().shutdown(). The parent class (OTLPSpanExporter) has its own shutdown logic that flushes any buffered data. Skipping it means pending spans are silently dropped on process exit. This is a data loss bug in any deployment that uses graceful shutdown (which is every Kubernetes deployment via SIGTERM).Fix: call
super().shutdown()after closing the session.3. Signal handler calls
sys.exit(0)on SIGTERMsetup_signal_handlers()at line 332 registers a handler that callssys.exit(0)when SIGTERM is received. In Kubernetes, SIGTERM is the standard graceful shutdown signal. Callingsys.exit(0)from a signal handler skipsfinallyblocks and cleanup code in the user's application. The tracer should flush and shut itself down, but it shouldn't kill the process — that's the application's responsibility.Suggestion: have the signal handler call
self.shutdown()withoutsys.exit(), or at minimum document this behavior so users know the tracer takes over process lifecycle.4.
_active_spansdict has no thread safetySimpleSpanProcessor._active_spans(line 403) is a plaindictthat gets written onon_start()and read/deleted onon_end(). In async frameworks (FastAPI, aiohttp) or multi-threaded applications, concurrent access to this dict is a race condition. The OTel SDK's own span processor uses a thread-safe queue internally, but this custom tracking dict sits outside that protection.Suggestion: use a
threading.Lockaround_active_spansaccess, or switch to acollections.OrderedDictwith a lock if ordering matters for the leak detection.5.
print()statements in production code pathsLines 190, 255, 326, 442, 570, 611, and 687 all use
print()for operational output. In production Kubernetes pods, stdout may or may not be captured depending on the logging driver. The module already haslogger = logging.getLogger(__name__)defined at line 55 — switching these tologger.info()/logger.warning()would let platform teams route traceAI operational logs through their existing structured logging pipeline (EFK, Loki, etc.) instead of relying on stdout capture.6. Hardcoded 5s flush timeout on shutdown
TracerProvider.shutdown()at line 323 usestimeout_millis=5000. For batch workloads that generate thousands of spans in a burst (think: LLM evaluation pipelines), 5 seconds may not be enough to flush the export queue. Kubernetes sends SIGTERM and waitsterminationGracePeriodSeconds(default 30s) before SIGKILL — there's usually budget to spare. Making this configurable (env var or constructor param) would be a small but meaningful improvement.7. Go SDK gap
The roadmap mentions Go support, and there's currently no
go/directory. A lot of enterprise teams are running Go services that call LLM APIs (either directly or through gateways). I'm happy to contribute a Go SDK following the same instrumentor pattern as Python/TypeScript — I maintain an OpenTelemetry GPU receiver in Go and have worked with the OTel Go SDK extensively. Will open a separate PR for this.Happy to submit PRs for any of the above. Prioritizing the exporter shutdown bug (#2) and the Go SDK (#7) since those have the most immediate impact.