Skip to content

JSRPC tracing instrumentation#6734

Draft
Ankcorn wants to merge 6 commits intomainfrom
tankcorn/jsrpc-session-spans
Draft

JSRPC tracing instrumentation#6734
Ankcorn wants to merge 6 commits intomainfrom
tankcorn/jsrpc-session-spans

Conversation

@Ankcorn
Copy link
Copy Markdown
Member

@Ankcorn Ankcorn commented May 6, 2026

On main, the only RPC-related user trace span was a single jsRpcSession created at the dispatch site in Fetcher::getClientForOneCall. That one span was doing three jobs simultaneously

  1. Per-RPC-call dispatch observability on the caller — "I dispatched an RPC at this point, here's the latency."
  2. Callee parent context. Its spanId flowed via metadata.userSpanParent and became the trigger context for the callee's onset.
  3. Server-side execution wrapper — the callee's user spans transitively rooted under it via (2).

This PR splits this into 3 spans, each in its proper home. This PR implements that split.

Changes

Span Where Job Fires for
jsRpcCall:client callImpl() per-call dispatch lifetime; metadata.userSpanParent token every RPC dispatch (in-process & cross-process, DO & service binding)
jsRpcSession (client) JsRpcSessionCustomEvent::sendRpc wire round-trip cross-process only
jsRpcSession (server) JsRpcSessionCustomEvent::run callee execution wrapper every RPC reception

Trace shape after this PR

In-process service-binding RPC:


caller invocation                     callee invocation
  enterSpan                             enterSpan
    jsRpcCall:client                      jsRpcSession (server, run())
                                            [user code]

In-process DO actor RPC:


caller invocation                     callee invocation
  enterSpan                             enterSpan
    jsRpcCall:client                      jsRpcSession (server, run())
      durable_object_subrequest             [user code]

Cross-process service-binding RPC:


caller invocation                     callee invocation (other process)
  enterSpan                             enterSpan
    jsRpcCall:client                      jsRpcSession (server, run())
      jsRpcSession (client, sendRpc())      [user code]

Supporting changes

  • getClient(... kj::None) overload (Fetcher): lets getClientForOneCall opt out of the dispatch-site span, since jsRpcCall:client upstream in callImpl is now the per-call parent.
  • TraceContext::getUserSpanParent() fallback: when userSpan is unobserved, falls back to IoContext::current().getCurrentUserTraceSpan(). Implementation moved from header to io-context.c++ (where IoContext is fully defined).
  • makeUserAsyncTraceScope is used in callImpl to push jsRpcCall:client onto the async-context frame so synchronous descendants (getClientbuildClientstartRequest) capture it as metadata.userSpanParent.

Test changes

tail-worker-test.js fixtures updated:

  • Callee event streams (myActorJsrpc, jsrpcGetCounter, jsrpcNonFunction) gain spanOpen/spanClose for the new server-side jsRpcSession.
  • jsrpcDoSubrequest event stream gains six jsRpcCall:client spans (one per RPC dispatch in the test) with durable_object_subrequest nested inside one of them.
  • expectedWithPropagation tree shape: callees now correctly parent under jsrpcDoSubrequest. One mis-link (nonFunctionProperty under getCounter) is documented as a workerd-standalone SequentialSpanSubmitter artifact — both server-side jsRpcSessions land on sequential id 0000000000000001 within the
    same trace, and the buildTree heuristic resolves to whichever invocation re-indexes that id first. Production span IDs are random 64-bit; no collision occurs there.

Reviewer notes

  • This is the corrected version of an earlier draft that only moved the client-side jsRpcSession into sendRpc(). That alone broke trace continuity for service-binding callers because nothing replaced job (2). Adding jsRpcCall:client in callImpl is the missing piece

Ankcorn added 3 commits May 6, 2026 10:07
…:sendRpc

Adds the client half of Dan's jsRpcSession placement for cross-process
dispatch: both an internal trace span and a user trace span on the
caller, wrapping the wire round-trip from sending the jsRpcSessionRequest
until the session completes (membrane drained).

In-process service bindings reach the callee via
WorkerEntrypoint::customEvent -> event->run() and never enter sendRpc();
for those calls the only jsRpcSession spans come from the server-side
run(). Cross-process callers get both: the wire-round-trip view here
and the membrane-lifetime view on the callee side.

Internal span goes into the caller's local request-internal trace; user
span is parented to the current user trace span (the caller's enclosing
user span). Both guarded by IoContext::hasCurrent() so capnp dispatch
contexts that lack an IoContext don't crash; in those cases the spans
are constructed unobserved.
Now that JsRpcSessionCustomEvent owns the user-visible jsRpcSession span
on both sides (run() server-side, sendRpc() client-side for cross-process),
Fetcher::getClientForOneCall doesn't want a span synthesised at the
dispatch site. Rather than inlining the channel switch there, add a new
Fetcher::getClient overload that takes nullptr in the operation-name
slot:

  getClient(ioContext, cfStr, "foo"_kjc)  // creates a 'foo' span here
  getClient(ioContext, cfStr, nullptr)    // no span here; caller emits it

The no-span overload still propagates the caller's current user span as
metadata.userSpanParent so cross-process trigger context flows
correctly. This is plumbed via a new TraceContext constructor that
takes a borrowed SpanParent for the user-side parent without owning a
SpanBuilder for it; getUserSpanParent() returns the borrowed parent
when present.

The two getClient overloads share a private buildClient() helper that
takes an already-constructed TraceContext, so the channel switch lives
in exactly one place.

Fetcher::getClientForOneCall becomes a clean three-line dispatch:

  auto worker = getClient(ioContext, kj::none, nullptr);
  auto event = kj::heap<JsRpcSessionCustomEvent>(...);
  ioContext.addTask(worker->customEvent(kj::mv(event))...);
  return event->getCap();

Tail-worker-test fixtures updated:
  - jsrpcDoSubrequest fixture loses the dispatch-site jsRpcSession spans
    (the caller's only client-side user span is now durable_object_subrequest).
  - expectedWithPropagation tree shape adjusted: the buildTree heuristic
    cross-links sibling callees because all callee invocations now have
    a colliding sequential id 0000000000000001 (the new server-side
    jsRpcSession). Reflects test-infra limitation, not misparenting --
    traceIds and triggerSpanIds remain correct.
…:run

Adds the server half of Dan's jsRpcSession placement: both an internal
trace span and a user trace span on the callee, wrapping the membrane
lifetime from incomingRequest->delivered() through to all caps being
dropped.

The internal span is a child of the callee's local request-internal
trace (no cross-process internal-trace propagation today, by design).
The user span is parented to the caller's enclosing user span via
USER_SPAN_CONTEXT_PROPAGATION (the metadata.userSpanParent set on the
client side flows here as the IoContext's current user span). Created
after delivered() so the IncomingRequest is registered before the spans
start; both close when the SpanBuilders leave scope at function exit.

Tail-worker-test fixtures updated:
  - The three callee event streams (myActorJsrpc, jsrpcNonFunction,
    jsrpcGetCounter) gain spanOpen/spanClose for the new server-side
    user jsRpcSession (only user spans surface in the streaming tail,
    so the internal half doesn't appear in fixtures).
  - expectedWithPropagation tree shape adjusted: every callee invocation
    now has a span with sequential id 0000000000000001 which collides
    across invocations in the same trace; the buildTree heuristic
    cross-links sibling callees as a result. Reflects that test-infra
    limitation, not actual misparenting.
@ask-bonk
Copy link
Copy Markdown
Contributor

ask-bonk Bot commented May 6, 2026

Good. This is fine in callImpl since we're already inside a js.tryCatch with a jsg::Lock& js, meaning the V8 isolate lock is held, so getCurrentLock() will succeed.

Let me now review the overall analysis systematically:

Analysis complete. Here's what I found:

  1. callImpl span lifetime (worker-rpc.c++:479-482): jsRpcCallSpan is created inside the tryCatch success lambda. On success, it's moved into the awaitIo callback lambda (line 595), where it will be destroyed when the RPC resolves or the IoContext is canceled. On error (tryCatch error branch at line 617, or catch blocks at line 627/630), it's destroyed when the lambda stack unwinds — correctly ending the span. The userTraceScope is an RAII scope guard that also properly unwinds.

  2. run() span lifetime (worker-rpc.c++:2193-2194): Both jsRpcSessionInternalSpan and jsRpcSessionSpan live on the coroutine frame. They're destroyed when the coroutine completes (via co_return at line 2219 or the exception path at line 2225). This correctly scopes the spans to the RPC session lifetime.

  3. sendRpc() span lifetime (worker-rpc.c++:2240-2246): Same coroutine pattern, spans destroyed at co_return (line 2311) or exception (line 2308). The IoContext::hasCurrent() guard is appropriate since this can be called outside an IoContext.

  4. TraceContext::getUserSpanParent() fallback (io-context.c++:1126-1134): The fallback from the header inline to a method in io-context.c++ is necessary because IoContext is only forward-declared in trace.h. The logic is sound: when the TraceContext has no observed userSpan, it falls back to the IoContext's current user span.

  5. getClient(... kj::None) overload (http.c++:2413-2421): Creates a default TraceContext() with no spans, so getUserSpanParent() will fall back to the IoContext's current user span (which is jsRpcCall:client pushed via makeUserAsyncTraceScope). This correctly preserves propagation.

  6. buildClient refactoring (http.c++:2429-2466): Clean extraction. The kj::FunctionParam<TraceContext()> callback pattern ensures span creation is skipped for OutgoingFactory variants.

  7. Test updates: Test expectations correctly reflect the new span structure. The comment about the nonFunctionProperty mis-link under getCounter being a test-infra artifact (sequential span IDs in workerd-standalone) is well-documented and not a production concern.

I don't see any actionable issues — the span lifetimes are correct, the API overload is clean, there are no backward-compatibility concerns (this is internal tracing, not user-facing API surface), and the code follows KJ conventions properly.

LGTM

github run

@Ankcorn Ankcorn force-pushed the tankcorn/jsrpc-session-spans branch from 12d2419 to 14126c4 Compare May 6, 2026 12:23
Ankcorn added 2 commits May 6, 2026 12:59
Restores the per-RPC-call dispatch span that previously lived implicitly at
the dispatch site as 'jsRpcSession' in Fetcher::getClientForOneCall. This
PR's earlier refactor moved the client-side jsRpcSession into sendRpc()
(cross-process only) per Dan's spec, but that left in-process and DO-bound
RPC dispatches without any client-side per-call span -- and therefore
nothing meaningful to forward as metadata.userSpanParent to the callee.

The fix follows the structure Dan originally described: the dispatch span
belongs in callImpl, where the call lifetime is unambiguous (open the span
before resolving the stub via getClientForOneCall, close when the call's
awaitIo continuation runs or is dropped). Pushing the span via
makeUserAsyncTraceScope makes it visible to synchronous code beneath
callImpl -- in particular, getClient -> buildClient -> startRequest, which
captures the span as metadata.userSpanParent.

This restores trace continuity across the RPC boundary for every dispatch
path, in-process and cross-process, on both DO and service-binding stubs:
durable_object_subrequest now nests under jsRpcCall:client as a routing
detail, and the callee's server-side jsRpcSession (in run()) parents under
the caller's jsRpcCall:client.

Tail-worker-test fixture updates:
  - jsrpcDoSubrequest gains six jsRpcCall:client spans (one per RPC
    dispatch in the test invocation), with durable_object_subrequest
    nested as a child of one of them.
  - expectedWithPropagation tree: callees now correctly link under
    jsrpcDoSubrequest. nonFunctionProperty mis-links under getCounter
    because of workerd-standalone's SequentialSpanSubmitter assigning
    sequential id 0000000000000001 to every callee invocation's
    server-side jsRpcSession; comment documents this as a test-infra
    limitation, not actual misparenting.
…push

Two review items:

1. Rename 'jsRpcCall:client' to 'jsRpcCall'. The ':client' separator was
   from Dan's design doc, but the existing span-name namespace uses
   camelCase (jsRpcSession), snake_case (durable_object_subrequest), and
   plain words (fetch, connect) -- there's no convention for ':' as a
   namespace separator, and committing to one in fixtures locks downstream
   tail consumers into a parser they don't have. Plain 'jsRpcCall' is
   unambiguous (the only callImpl-emitted span on the caller side) and
   stays in step with 'jsRpcSession'.

2. Guard the async-context push in callImpl. The previous call to
   ioContext.getCurrentLock() asserts via KJ_REQUIRE_NONNULL when no
   jsg::Lock is held, which can happen on capnp dispatch paths that
   re-enter from the network. Add IoContext::tryGetCurrentLock() returning
   kj::Maybe<Worker::Lock&> and use it to scope the StorageScope to the
   stub-resolution block when a lock is available; without one, the
   jsRpcCall span still records but we don't push it as the current user
   span (the dispatch then forwards whatever the caller's enclosing user
   span happens to be, same as before this PR's parenting work).
Comment thread src/workerd/io/trace.h Outdated
SpanParent getUserSpanParent() {
return SpanParent(userSpan);
}
// Returns the user-span parent to forward to the callee as metadata.userSpanParent.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, no, this is not right, TraceContext::getUserSpanParent has no business falling back to anything, it should only return the TraceContext's own user span parent

Comment thread src/workerd/io/io-context.c++ Outdated
return getMetrics().getSpan();
}

SpanParent TraceContext::getUserSpanParent() {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nope

Comment thread src/workerd/api/worker-rpc.c++ Outdated
kj::Vector<kj::StringPtr> path;
auto resolveClient = [&] { return parent.getClientForOneCall(js, path); };
auto client = [&] {
KJ_IF_SOME(lock, ioContext.tryGetCurrentLock()) {
Copy link
Copy Markdown
Collaborator

@danlapid danlapid May 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you already have the lock, it's js, just use it, no need to define a new tryGetCurrentLock for it

Comment thread src/workerd/api/worker-rpc.c++ Outdated
// metadata.userSpanParent set on the client side flows here as the IoContext's
// current user span.
auto jsRpcSessionInternalSpan = ioctx.makeTraceSpan("jsRpcSession"_kjc);
auto jsRpcSessionSpan = ioctx.getCurrentUserTraceSpan().newChild("jsRpcSession"_kjc, ioctx.now());
Copy link
Copy Markdown
Collaborator

@danlapid danlapid May 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makeUserTraceSpan is built especially to create spans 😄
Creates both a user and an internal span

- Remove TraceContext::getUserSpanParent fallback; revert to inline accessor
- Remove IoContext::tryGetCurrentLock; use getCurrentLock() in callImpl since the
  jsg::Lock is definitionally held there
- Move ambient-parent fallback into getSubrequestChannelImpl and buildClient's
  IoOwn<SubrequestChannel> branch, which is where SubrequestMetadata is built

Note: switching run()/sendRpc() to makeUserTraceSpan regressed tail tests under
the streaming-tail rollback path; kept the manual two-SpanBuilder pattern there.
Worth flagging.
@Ankcorn Ankcorn force-pushed the tankcorn/jsrpc-session-spans branch from d2500f1 to 1446623 Compare May 7, 2026 10:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants