Bound OTel traces emitted by long-running subscription handlers#63
Bound OTel traces emitted by long-running subscription handlers#63danielorbach wants to merge 12 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
Removes the long-lived per-lifecycle OTel span (issue #56) so that spans started from L.Context() are roots of independent traces, and propagates lifecycle identity via context-stored attributes that a new NewSpanProcessor stamps onto every span. Adds bounded framework spans for runCleanup and Component.Bootstrap, a process-stable process.nonce, and Context-suffixed variants of Error/Fatal so per-handler spans can be the recording target.
Changes:
- New
tracing.go(tracer,processNonce,WithAttributes/WithForkAttributes,NewSpanProcessor) plus package doc;lifecycle.gono longer opens a lifecycle span and wraps cleanup in a bounded<name>.cleanupspan. Error/Fatal/WithSpan/WithForkSpanNamedeprecated; newErrorContext/FatalContext(+fvariants) record onto caller-supplied span; in-repo callers migrated.loaderback-linksClaim.Footprint, forks components with identity attributes viaclaimIdentity, emits a bounded<component>.bootstrapspan, and warns atLoadwith the process nonce;kafkaloader/fileloaderupdated accordingly.
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| tracing.go | New tracer, process nonce, identity-attribute context plumbing, and NewSpanProcessor. |
| tracing_test.go | Coverage for attribute propagation/replacement, nonce stamping, cleanup span, root spans, and context error/fatal recording. |
| lifecycle.go | Drops lifecycle span, stashes identity attrs on ctx, wraps cleanup in bounded span, deprecates Error/Fatal, adds *Context variants. |
| procedure.go | Removes span field and SpanName, deprecates WithSpan/WithForkSpanName, adds attrs to lifecycleOptions. |
| doc.go | New package doc describing the tracing contract, identity propagation, and operator anchor. |
| loader/loader.go | Back-links Claim.Footprint, logs process nonce, forks with WithForkAttributes, bounded <component>.bootstrap span, FatalfContext for bootstrap errors. |
| loader/example_test.go | Migrates Errorf → ErrorfContext. |
| fileloader/jsonloader.go | Fatalf → FatalfContext. |
| fileloader/yamlloader.go | Fatal → FatalfContext. |
| fileloader/fileloader.go | WithSpan("loader") → WithName("loader"). |
| kafkaloader/kafkaloader.go | WithSpan→WithName, error/fatal migrated to *Context variants, span ctx threaded into handler. |
| examples/direct/{main,ping,pong}.go | Migrate to *Context variants; pong.go adds bounded pong.echo consumer span. |
| examples/embed/{ping,pong,probe}.go | Migrate to ErrorfContext. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
The framework's plan for issue #56 is to drop the long-lived lifecycle span and identify spans by attributes propagated through l.Context() instead. This commit lands the propagation primitive: WithAttributes and WithForkAttributes stash a slice in the context; NewSpanProcessor's OnStart reads it and applies the attributes to every span started under that context. No caller invokes the new options yet; subsequent commits rewire the loader and reference examples onto this primitive. The slice-replacement semantics fall out of plain context.WithValue: a fork that sets its own WithForkAttributes shadows the parent's value via the most-nested-wins rule, and a fork without the option inherits naturally. The framework does not merge; callers that want to extend the inherited slice must include the inherited values explicitly. OpenTelemetry Resource is not used because lifecycle identity has to be per-context (multiple components per process need distinct identifiers); SpanProcessor reading context values is the in-process equivalent.
Adds ErrorContext, ErrorfContext, FatalContext, and FatalfContext on L, mirroring slog.Logger.ErrorContext. Each variant logs via the lifecycle and, if ctx carries a recording span, records the error onto that span (Fatal additionally sets the span status to Error and exits the lifecycle goroutine). The motivation is the next commit, which detaches l.Context() from the long-lived lifecycle span. After that change, L.Error's trace.SpanFromContext(l.ctx) call has nothing recording to record onto, and the value is in the call site's handler span — which L itself cannot see. The ctx parameter is how the call site surfaces it. Existing Error and Fatal remain operational against the long-lived lifecycle span; their deprecation lands with the long-lived span's removal, where their behavior change is the immediate consequence.
Issue #56 is rooted in the framework opening one OTel span per L and holding it open for the lifecycle's full duration. A subscription Procedure that handled tens of thousands of messages per run produced a single trace whose span count grew with uptime; downstream Tempo ingesters dropped batches once per-tenant live-trace limits were hit, causing collateral failures for unrelated services on the same trace pipeline. This commit removes the lifecycle span from execute and drops the asynchronous reaper goroutine that closed it. Spans started from l.Context() are now roots of independent traces, bounded per invocation; the SpanProcessor introduced earlier ties them to the lifecycle via attribute filtering rather than parent-child nesting. The old Error/Errorf/Fatal/Fatalf methods used trace.SpanFromContext on the lifecycle context to record errors; with no span there to record onto, the SpanFromContext call returns a no-op span. Rather than silently dropping the trace integration, the methods are now log-only and marked Deprecated, pointing callers to the *Context variants that can record onto a handler-scope span at the call site. The package-level tracer var moves out of procedure.go in this commit because no caller in component remains. The next commit reintroduces it for the bounded cleanup span around runCleanup.
The previous commit dropped the long-lived lifecycle span. Cleanup timings and panics during runCleanup were observable through that span; with it gone they no longer surface in trace data. This commit restores that visibility with a bounded <name>.cleanup span scoped to runCleanup alone — open at the moment cleanup begins, closed when the last cleanup func returns or a panic unwinds the deferred frame. The span is always opened (no option) because the cost of an empty cleanup span is negligible and the alternative — silent cleanups — removes the only observability for shutdown-path errors. The wrapping defer sits between close(l.done) and l.wg.Wait so cleanup runs after forked children complete but before completion is signalled to external listeners, matching the existing runCleanup ordering. The package-level tracer var reintroduced in this commit lives in tracing.go alongside the SpanProcessor it pairs with; the framework's own span emission is colocated. Test setup moves to TestMain so the SpanProcessor binds once for the entire test binary. OTel's global TracerProvider delegate is set by sync.Once on the first SetTracerProvider call, so the per-test install helper used in earlier commits left subsequent tests reading from the first test's stale provider. A single binary-scoped recorder removes the stale-delegate gotcha; tests disambiguate by unique span names.
WithSpan and WithForkSpanName existed to label the long-lived
lifecycle span, which is gone after the previous commits. The options
remain in the API surface for v1 source compatibility but their bodies
are no-ops; both are marked Deprecated, pointing callers at WithName /
WithForkName for the L-naming concern the framework still cares about.
The lifecycleOptions.span field is removed alongside the SpanName
helper that read it — the user pushed to remove the field rather than
leave a dangling unused member. The Fork code path no longer sets it.
fileloader and kafkaloader were the two in-repo callers of
WithSpan("loader"). They are migrated to WithName("loader") so the
loader L keeps its log prefix, pprof label, and cleanup span name
under "loader"; without the migration those would have collapsed to
empty strings.
The next commit makes Claim.Exec build the lifecycle's identity attributes from the footprint it belongs to (footprint.identifier, footprint.revision, footprint.solution). Today Claim sees only itself and its Component descriptor; the Footprint that allocated it is not reachable from inside Exec, so the wiring would either duplicate every caller's footprint plumbing or change Exec's signature. A back-pointer set during Load — symmetric to the existing Component-on-Claim field — gives Exec the footprint it needs without either disruption. Users that build Claims by hand for tests may leave Footprint nil; the next commit's Exec degrades to component-only identity when Footprint is absent.
forkGroup.Fork now attaches the component's identity attributes (component.name plus footprint.identifier / revision / solution) to the forked L via WithForkAttributes. The framework's SpanProcessor reads those values from l.Context() and stamps them onto every span the Component (or any of its sub-procedures) starts, so operators can filter trace data by component or footprint without each Component plumbing attributes through its own instrumentation. Claim.Exec opens its own short-lived <component>.bootstrap span around Component.Bootstrap. The span replaces the role of the long-lived lifecycle span that issue #56 removed: a bounded record of bootstrap timing and outcome, with bootstrap-specific attributes (binding, options, locations, footprint name/description) on the span itself and identity attributes auto-applied by the SpanProcessor. Bootstrap failures set Error status on the span before FatalfContext exits the goroutine. A comment on forkGroup.Fork notes the gap operators need to know about: the effective lifecycle anchor in trace data is the pair (footprint.identifier, component.name); two processes loading the exact same footprint produce attribute-indistinguishable spans. A later commit closes the gap with a process nonce.
Issue #56 asked for the pong example to demonstrate the right per-message instrumentation shape, since the silent example was the shape readers copied when reaching for a reference. This commit adds a pong.echo span around the per-message send work, started from l.Context() with SpanKindConsumer. l.Context() is detached from any long-lived span after the framework rework, so the span is a root — bounded to one message, regardless of how long the subscription runs. component.NewSpanProcessor stamps it with the lifecycle's identity attributes (component.name=pong, footprint.identifier=..., footprint.revision=..., footprint.solution =...) by reading them from the context value the framework set when the lifecycle was created. The receive-error and send-error paths move to ErrorfContext so the error records onto whatever span is active at the call site rather than going trace-blind. Sub-procedures of bootstrapping Components inherit identity through the ctx-value chain, so the "sub" L this code runs under sees the same identity attributes that the bootstrap span carries — no re-declaration here.
A new reader reaching the package needs to understand four things about tracing before writing a handler: l.Context() is span-free, spans started from it are roots, identity travels as attributes via the SpanProcessor, and (footprint.identifier, component.name) is the effective anchor in trace data. doc.go states each plainly and shows the SDK setup that activates identity propagation. The cross-process collision risk is documented as it stands today: two processes loading the same Footprint produce attribute- indistinguishable spans, and operators relying on the framework's identity alone will see them conflated. The doc points at process- level resource attributes (service.instance.id, host.name) as the escape hatch operators have today. The deprecation of L.Error and L.Fatal lands in the doc here too — the *Context variants are the right call for handler-scope error recording now that there is no long-lived span for the unsuffixed methods to record onto.
The framework's identity attributes (component.name, footprint.*) are not unique across unix processes: two replicas loading the same Footprint would otherwise produce attribute-indistinguishable spans in trace data, and operators triaging an incident would have no attribute pivot to separate the two feeds. A package-level uuid generated at init becomes the process.nonce attribute on every span the framework's SpanProcessor sees. The nonce is stable per-process so attribute filters by process.nonce isolate one replica's spans across the full trace timeline. ProcessNonce() exposes it for callers that want to surface the value alongside log data. The loader emits a WARNING log line at the start of Load carrying the footprint identifier, revision, and the nonce, giving operators a pivot from log lines to trace data. The doc package overview reflects the new behaviour — the cross- process collision caveat the previous commit documented is now closed by the nonce.
The L.Error / L.Errorf / L.Fatal / L.Fatalf methods were deprecated when the long-lived lifecycle span was removed — they have no span to record onto and only log. The replacements take a context so they can record onto whatever span is active at the call site (typically a per-handler span the caller has opened). Every in-repo caller migrates to the ctx-bearing form. Where no handler-scope span is active (most subscription loops, bootstrap error paths in fileloader / kafkaloader), the call passes l.Context() — the *Context method still tolerates a span-free context, just records nothing in that case. Where a handler span is in scope (the handleFootprint span in kafkaloader, the per-probe ctx in examples/embed/probe), the call passes that ctx instead so the error lands on the relevant span. examples/direct/main.go drops the now-unused "fmt" import that the migration left behind. After this commit, golangci-lint reports no SA1019 warnings across the module.
The YAML footprint() constructor populated Name, Metadata, Identifier, and Locations but silently dropped Solution and Revision, even though FootprintYAML carries both and the JSON loader copies them. With the loader now deriving span identity attributes from these fields, every YAML-loaded lifecycle stamped footprint.solution="" and footprint.revision=0, breaking the operator-facing trace anchor.
| // (bounded by runCleanup) so it does not reintroduce the unbounded | ||
| // lifecycle span that issue #56 removed. | ||
| defer func() { | ||
| _, span := tracer.Start(l.ctx, l.name+".cleanup") |
There was a problem hiding this comment.
[High] Cleanup span captures only duration — the documented error/panic observability is not delivered.
The span's context is discarded (_, span :=), so runCleanup and the cleanup funcs never see it. CleanupError now routes failures through the deprecated log-only l.Error (no RecordError), and a cleanup panic unwinds through defer span.End() with no recover()+RecordError, so the span ends with Unset status and zero events.
This contradicts doc.go ("cleanup-time durations and panics remain observable"), the inline comment here ("durations and errors are observable"), and the commit message's claim that the span "restores that visibility". It is also a regression: cleanup errors used to land on the lifecycle span via l.Error→RecordError; now they land on no span at all.
Right-depth fix: thread the span's ctx into runCleanup(ctx) so CleanupContext/CleanupError can record, and add recover()+RecordError/SetStatus before the re-panic.
| if err := pub.Send(ctx, &pubsub.Message{Body: []byte(echo)}); err != nil { | ||
| l.ErrorfContext(ctx, "send pong: %w", err) | ||
| } | ||
| span.End() |
There was a problem hiding this comment.
[Medium] Reference pong.echo span: End() not deferred (leak on panic) + no SetStatus on send error.
span.End() is called inline, so if pub.Send panics the span never ends/exports and the long-running loop accumulates un-ended spans. On a non-panic send error, ErrorfContext records an exception event but the span keeps status Unset, so a backend computing error rate from span status reports 0% errors while every echo is dropped.
The framework's own kafkaloader.handleFootprint calls span.SetStatus(codes.Error, ...) on its error path — this example (the shape issue #56 designated as the per-message reference) contradicts it. Use defer span.End() and set Error status in the failure branch.
| err := pub.Send(l.Context(), &pubsub.Message{Body: options.Message}) | ||
| if err != nil { | ||
| l.Errorf("send: %w", err) | ||
| l.ErrorfContext(l.Context(), "send: %w", err) |
There was a problem hiding this comment.
[Medium] The loader-side feature has no executing test coverage. (re: loader/example_test.go:21, Example_loader)
Example_loader has no // Output: line, so go test compiles but never runs it, and loader/ contains zero func Test…. That leaves claimIdentity, the claim.Footprint = &footprint back-link, g.Fork's WithForkAttributes wiring, and Claim.Exec's bootstrap span/attributes entirely unexercised — tracing_test.go only covers the component-package primitives directly.
A regression in an attribute key, the back-link ordering, a dropped fork option, or a nil-deref on Footprint would pass CI green. Adding an // Output: line (or a real func Test… that forks a Claim through Load and asserts on recorded spans) would close the gap.
| // component name plus the footprint identifiers reachable through | ||
| // Claim.Footprint. Footprint may be nil for Claims built outside Load (e.g. | ||
| // tests); identity falls back to component name alone. | ||
| func claimIdentity(c *Claim) []attribute.KeyValue { |
There was a problem hiding this comment.
[Medium] Footprint identity attribute schema is hand-listed at five sites with no shared constructor.
The same footprint.* keys are restated in claimIdentity (here), the slog.Warn startup log using raw string keys (loader.go:109), Claim.Exec's bootstrap span (loader.go:390-400), loadFootprintComponents' skip/load span events (loader.go:211-223), and kafkaloader.handleFootprint (kafkaloader.go:100-107). The key sets and encodings already diverge — e.g. footprint.solution is treated as identity here but as descriptive metadata in kafkaloader.
Renaming or retyping a dimension (or adding footprint.solution to the span events to match) means editing five spots across two packages plus doc.go; miss one and the (footprint.identifier, component.name) anchor the docs advertise silently drifts. A single footprintIdentity(fp) helper would define it once.
| sub, err := binding.LinkInterest(l.Context(), PongAspect) | ||
| if err != nil { | ||
| l.Fatal(err) | ||
| l.FatalContext(l.Context(), err) |
There was a problem hiding this comment.
[Medium-low] The example never registers NewSpanProcessor() that its own comments rely on. (re: EntrypointProc at main.go:42 and the comment at pong.go:56)
No non-test file calls otel.SetTracerProvider/NewSpanProcessor(), yet pong.go:56 asserts "component.NewSpanProcessor stamps the span with the lifecycle's identity attributes (component.name=pong, ...)". As shipped, the direct example produces spans with no identity attributes and no process.nonce, so a reader copying it — or following doc.go's "read the loader's process.nonce log line, then pivot to spans" workflow — gets nothing.
Issue #56 asked the example to demonstrate the right instrumentation shape; it shows span creation but not the identity-propagation half. Wiring up a TracerProvider with NewSpanProcessor() in main (even with a stdout exporter) would make the example match its comments.
| // waitForSpan polls sr.Ended() until a span with name appears, accommodating | ||
| // the framework's asynchronous span reaper. | ||
| func waitForSpan(t *testing.T, sr *tracetest.SpanRecorder, name string) sdktrace.ReadOnlySpan { | ||
| t.Helper() |
There was a problem hiding this comment.
[Low-medium] waitForSpan polls for an "asynchronous span reaper" that this PR removed.
The doc comment says it polls "accommodating the framework's asynchronous span reaper" — but the reaper goroutine was deleted in this PR. All spans (cleanup, bootstrap, handler) now end synchronously before RunProc returns, so the poll loop, deadline, and SyncTimeout dependency are dead complexity referencing a mechanism the same PR deleted. A direct sr.Ended() lookup after RunProc returns now suffices and would fail loudly on a real regression instead of timing out.
| // Back-link each Claim to the Footprint it belongs to so Claim.Exec can | ||
| // build identity attributes (component.name + footprint.*) without | ||
| // threading the Footprint through every call site. | ||
| for _, claim := range footprint.Allocations { |
There was a problem hiding this comment.
[Low-medium, latent] claim.Footprint = &footprint aliases the by-value Load parameter.
This takes the address of Load's local Footprint copy and writes it into every *Claim in the caller's shared Allocations slice: all claims alias one heap-escaped snapshot (decoupled from the caller's Footprint), and the caller's Claim objects are mutated as a side effect.
Safe today only because loaded.Swap(true) forbids a second Load. The package TODOs plan multi-footprint support; once that guard relaxes (or a caller reuses a []*Claim), a second Load clobbers every claim's Footprint pointer and claimIdentity emits the wrong footprint.identifier/revision/solution.
| } | ||
| // process.nonce disambiguates spans emitted by different processes that | ||
| // share the same identity attributes (e.g., two replicas loading the | ||
| // same Footprint). The value is stable per-process. |
There was a problem hiding this comment.
[Low-medium] OnStart hot path: per-span process.nonce construction + two SetAttributes calls.
OnStart runs on every span start in the process. It builds attribute.String("process.nonce", processNonce) fresh each time and issues a second SetAttributes call. Since the nonce is process-constant, precompute a package-level attribute.KeyValue and fold it into a single SetAttributes.
Also: appending the nonce last makes it the first attribute dropped under a tight SpanLimits AttributeCountLimit, so replica disambiguation silently fails on attribute-heavy bootstrap spans — exactly where it's needed.
| // every span as the process.nonce attribute. Two unix processes with the same | ||
| // component identity (component.name, footprint.identifier, ...) would | ||
| // otherwise produce attribute-indistinguishable spans; the nonce gives | ||
| // operators a stable per-process dimension to disambiguate them. |
There was a problem hiding this comment.
[Low, design] process.nonce is a per-span attribute rather than an OTel Resource attribute.
doc.go itself names service.instance.id/host.name as the standard replica disambiguators, and no sdktrace.Resource is set anywhere. A random per-span UUID duplicates a value OTel models once at Resource scope and can't be correlated to the host/pod the way a Resource detector's host.name/k8s.pod.name can — operators still can't answer "which replica" without grepping the one startup log line. The right altitude is a documented WithResource(service.instance.id) on the caller's TracerProvider.
| // SpanProcessor returned by NewSpanProcessor can apply them to spans started | ||
| // under l.Context(). A nil slice leaves any inherited attributes from the | ||
| // parent context visible; a non-nil slice replaces them. | ||
| if options.attrs != nil { |
There was a problem hiding this comment.
[Low, API edge] No way to clear inherited identity via WithForkAttributes().
A zero-arg WithForkAttributes() passes a nil slice (Go variadic semantics), so this if options.attrs != nil guard skips re-stashing and the child inherits the parent's attributes. There's no public way to express "replace inherited identity with nothing", which contradicts a strict reading of the documented "a fork that sets its own WithForkAttributes fully replaces the inherited slice".
Non-triggering for the loader (claimIdentity always returns ≥1 attr) and the nil-inherits branch has no test, but it's a latent trap for future callers.
| var spanRecorder *tracetest.SpanRecorder | ||
|
|
||
| func TestMain(m *testing.M) { | ||
| spanRecorder = tracetest.NewSpanRecorder() |
There was a problem hiding this comment.
[Low, test isolation] TestMain installs one process-global recorder that is never reset.
Every RunProc across the package (lifecycle_test.go has ~30) now emits a real <name>.cleanup span into this single, ever-growing spanRecorder, and waitForSpan returns the first span matching a name. It works only because names are currently unique — the two WithName("") tests in TestL_Name already each emit a span named .cleanup into the shared recorder. A future or reordered test that reuses a span name, or asserts on span counts/absence, would read an unrelated earlier test's span, giving order-dependent flakiness.
| defer span.End() | ||
| l.FatalContext(handlerCtx, fmt.Errorf("handler fatal")) | ||
| aborted = false // unreachable | ||
| _ = aborted |
There was a problem hiding this comment.
[Low, test smell] TestL_FatalContext/RecordsAndAborts never verifies the abort.
aborted = false // unreachable re-assigns the already-zero value and _ = aborted is never read after RunProc returns, so nothing checks that FatalContext actually called runtime.Goexit(). A regression where FatalContext returned normally instead of aborting would still pass, because RecordError/SetStatus run before Goexit regardless. Set a flag after the FatalContext call and assert it stayed false to actually cover the abort.
|
A component built with this framework produced one OTel trace per process rather than per message: the lifecycle held a single span open across
Procedure.Execand every handler-derived span became its child. Tempo on the receiving side accumulated tens of thousands of spans under one trace ID, hit per-tenant live-trace limits, and dropped exports for unrelated services on the same pipeline. Tail-based sampling did not help because the trace never completed.The framework now opens no span across
Procedure.Exec. Spans started froml.Context()are roots of independent traces, bounded by the work done in one handler invocation. Lifecycle identity travels as attributes (component.name,footprint.identifier,footprint.revision,footprint.solution) propagated through a context value and stamped onto every span by aSpanProcessorusers register on theirTracerProvider. The pair (footprint.identifier,component.name) is the effective anchor operators query when filtering trace data for one component, and a process-stableprocess.nonceattribute closes the gap between two replicas loading the same footprint.The cost of dropping the long-lived span is the loss of clickable link navigation between a handler span and "its" lifecycle. A synthesized anchor
SpanContext, and a perpetually-zero-duration anchor span, were considered and rejected: backends without good link resolution would see dangling links, the value-add over attribute filtering was marginal, and both would have reintroduced retention-drift fragility. Operators pivot via attribute filters; the loader emits aWARNINGlog atLoadcarrying the nonce so log-to-trace correlation works without changing exporter setup.L.Error,L.Errorf,L.Fatal, andL.Fatalfbecome log-only and are deprecated; they previously recorded onto the long-lived span and have no replacement target without a call-site context.ErrorContextandFatalContextvariants takectxso they can record onto whatever span the caller has active.WithSpanandWithForkSpanNamesimilarly deprecate to no-ops since their consumer is gone. All in-repo callers migrate;staticcheckreports zeroSA1019warnings after the final commit.Three things this PR explicitly does not address. Optional clickable linking from handler spans to a bootstrap-or-anchor span can land later via the same
ctxvalue theSpanProcessoralready reads, with no API breakage. The deprecated method bodies stay log-only for v1 compatibility; removal is a future major. Merge semantics on nestedWithForkAttributesintentionally match plaincontext.WithValuewith no special-case code, so callers that need to extend an inherited slice include the inherited values explicitly.The earlier approach — detach the long-lived span from
l.Context()but keep it open off-context — is preserved at #64 for reference. The stronger no-long-lived-spans invariant landed after that approach was evaluated.Resolves #56.