Introduce topSpanId to disambiguate incoming requests that share traceid #352

JacekLach · 2019-11-14T17:10:50Z

Before this PR

Logs from concurrent incoming requests with the same traceid are currently difficult
to tell apart (one can look at thread names etc, but not across executors).

After this PR

==COMMIT_MSG==
This introduces a new optional property on a Trace, topSpanId, that can be read to disambiguate span stacks that belong to the same traceid but had different entry points.

Use that in tracing-undertow and tracing-jersey to tell apart requests with the same traceid.
==COMMIT_MSG==

Possible downsides?

The unsampled span has to keep a bit more state (not just the parent of the top span as per originating span id, but also the top span id).
We have to generate a new span id in unsampled spans if the stack is empty.

changelog-app · 2019-11-14T17:10:54Z

Generate changelog in `changelog/@unreleased`

Type

Description

Introduce a new property on a Trace, topSpanId, that can be read to disambiguate span stacks that belong to the same traceid but had different entry points.

Check the box to generate changelog(s)

Generate changelog entry

JacekLach · 2019-11-14T17:13:41Z

Followups are to include this new property in request and service logs

…raceid Logs from concurrent incoming requests with the same traceid are currently difficult to tell apart (one can look at thread names etc, but not across executors). This introduces a new optional property on a Trace, localTraceId, that we set to a new unique identifier if an incoming request already has a traceid assigned.

iamdanfox · 2019-11-14T18:11:39Z

I am not super comfortable about diverging from the concepts that zipkin already defines: https://zipkin.io/pages/architecture.html. We've already got james' X-B3-Originating-Span-Id, which we already put on the wire:

tracing-java/tracing-okhttp3/src/test/java/com/palantir/tracing/okhttp3/OkhttpTraceInterceptorTest.java

Lines 75 to 78 in 2150bd2

    
    assertThat(intercepted.headers(TraceHttpHeaders.SPAN_ID)).hasSize(1);
    assertThat(intercepted.headers(TraceHttpHeaders.TRACE_ID)).hasSize(1);
    assertThat(intercepted.headers(TraceHttpHeaders.ORIGINATING_SPAN_ID)).isEmpty();
    assertThat(intercepted.headers(TraceHttpHeaders.PARENT_SPAN_ID)).isEmpty();

If normal service logs had james' originating span id as a param somehow, wouldn't you be able to achieve the desired outcome (differentiating concurrent requests to one service from the same caller, even if they aren't sampled)?

Edit: nvm, seems like these would be the same

JacekLach · 2019-11-14T18:57:22Z

Yeah; ORIGINATING_SPAN_ID relates together requests made from the same caller within a traceid (and disambiguates them from requests hitting current service from a different caller, within the traceid).

However, what we're trying to do here is tell apart two requests from the same caller, so we need to generate this id within the current service (or rely on the caller to never call us with the same x-b3-spanid).

JacekLach · 2019-11-14T18:58:06Z

tracing-java is pretty much the only place we can do this in, because we want all the usual trace propagation (executors, thread changes, etc) to apply to this property.
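To illustrate the propagation point: a thread-local identifier is not visible on an executor thread unless the submitted task is wrapped to capture and restore it, which is the job the tracing executors already do for trace state. The sketch below is a minimal stand-in under that assumption; `RequestScope`, `wrap`, and `runOnExecutor` are hypothetical names, not tracing-java API.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-in for trace state held in a thread-local; not tracing-java API.
final class RequestScope {
    private static final ThreadLocal<String> CURRENT_ID = new ThreadLocal<>();

    private RequestScope() {}

    static void set(String id) {
        CURRENT_ID.set(id);
    }

    static String get() {
        return CURRENT_ID.get();
    }

    static void clear() {
        CURRENT_ID.remove();
    }

    // Captures the submitting thread's id and restores it around the task on the worker thread.
    static <T> Callable<T> wrap(Callable<T> task) {
        String captured = CURRENT_ID.get();
        return () -> {
            String previous = CURRENT_ID.get();
            CURRENT_ID.set(captured);
            try {
                return task.call();
            } finally {
                if (previous == null) {
                    CURRENT_ID.remove();
                } else {
                    CURRENT_ID.set(previous);
                }
            }
        };
    }

    // Runs one task on a fresh single-thread executor and returns its result.
    static String runOnExecutor(Callable<String> task) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            return executor.submit(task).get();
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            executor.shutdown();
        }
    }
}
```

Submitting an unwrapped task from a thread that has set an id sees no id on the worker thread, while the wrapped task sees the submitter's id. Any identifier kept outside the trace would need its own copy of this wrapping.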

carterkozak · 2019-11-14T19:09:44Z

I'm not sure this is quite what we want. In the past we have discussed service-local request identifiers, which we can use to associate request data with all service logging for that request. These are not necessarily connected to tracing, though there are definitely parallels.
That way we could build queries for service logging from product Foo for requests to endpoint GET /api/bar.

JacekLach · 2019-11-14T19:18:44Z

Well, if we don't connect this to tracing then it has to be propagated manually when you, say, submit something to an executor while processing a request, right?

I don't think there's a point to having multiple thread-locals that we carry around to mark 'logically still the same unit of work'. Or is that not what you meant?

> That way we could build queries for service logging from product Foo for requests to endpoint GET /api/bar.

This seems like a different question - you want to see what the request path for a service log line was? Sure, seems possibly useful, but unrelated to my problem :P

carterkozak · 2019-11-14T19:22:18Z

> unrelated to my problem :P

Wouldn't it allow us to associate a span ID to the request, and the request ID to all request logging, solving your problem?

JacekLach · 2019-11-14T19:23:21Z

That assumes that the incoming span id is unique, which is the case for conjure clients currently but that seems like an implementation detail.
(unless you mean we'd generate a fresh id anyway? but then that seems to reduce to this pr, just bigger)

sfackler · 2019-11-14T21:09:03Z

The server generates a new span for the requests parented to the span sent by the client, so that one will always be unique.

JacekLach · 2019-11-15T14:35:42Z

> The server generates a new span for the requests parented to the span sent by the client, so that one will always be unique.

Yeah, but the Trace can't really know that the second-to-oldest span in the stack is the special local-unique one.

Also, it's totally possible to close the request serving span (return a value) but still want to log things related to that request with the same traceid / localTraceId. For example you might be submitting an asynchronous cleanup task to run after your request completes, which is related to the trace/request, but not contained within it.

sfackler · 2019-11-15T14:39:46Z

You can parent a span to one that's already closed.

JacekLach · 2019-11-15T15:08:27Z

Hm, actually, I thought that the originating span id was the id of the top span in the stack, but actually it's the parent of the top span in the stack. So this can be done without the extra field / constructors. Will change

JacekLach · 2019-11-15T15:24:20Z

e57846e - we no longer maintain a separate id for the local trace, and instead expose the id of the topmost span in the stack.
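A rough model of the relationship described above, assuming a simplified span stack: the originating span id is the parent of the outermost local span (it comes from the caller), while the top span id is the id of that outermost span itself (generated locally). `SpanStackSketch` and its methods are hypothetical and do not mirror the real `Trace` class; `UUID` stands in for whatever id generator the library uses.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Optional;
import java.util.UUID;

// Hypothetical model of one span stack; names do not mirror the real Trace class.
final class SpanStackSketch {
    private final String traceId;
    private final Deque<String> openSpanIds = new ArrayDeque<>();
    private Optional<String> originatingSpanId = Optional.empty();
    private Optional<String> topSpanId = Optional.empty();

    SpanStackSketch(String traceId) {
        this.traceId = traceId;
    }

    // Opens the first span of the stack, parented to the span id sent by the caller.
    String startRootSpan(String callerSpanId) {
        String spanId = UUID.randomUUID().toString(); // stand-in id generator
        originatingSpanId = Optional.of(callerSpanId); // parent of the outermost local span
        topSpanId = Optional.of(spanId);               // id of the outermost local span
        openSpanIds.push(spanId);
        return spanId;
    }

    // Opens a nested span; the two recorded ids are left untouched.
    String startChildSpan() {
        String spanId = UUID.randomUUID().toString();
        openSpanIds.push(spanId);
        return spanId;
    }

    String traceId() {
        return traceId;
    }

    Optional<String> originatingSpanId() {
        return originatingSpanId;
    }

    Optional<String> topSpanId() {
        return topSpanId;
    }
}
```

Two stacks created for the same traceid and the same caller span share an originating span id but get distinct top span ids, which is the property needed to tell the two requests apart.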

tracing/src/main/java/com/palantir/tracing/Trace.java

JacekLach · 2019-11-15T15:46:34Z

(check failures are just the TODOs, which I'm planning to remove when I know if the current behaviour is a bug or not)

dansanduleac · 2019-11-18T18:56:43Z

All in all I think we should wire up propagating the originatingSpanId (and your topSpanId) through the Trace, and remove the originatingSpanId from the OpenSpan as it doesn't belong there (as I explained in more detail in #352 (comment))

Previously cloning an unsampled trace would lose track of originatingSpanId, which mainly mattered when recovering the original trace after a `withTrace` call.

dansanduleac · 2019-11-19T21:25:35Z

I just noticed this, but since Trace.top() already exists and gets you the innermost span, which is the opposite of what you're introducing, we should probably call your topSpanId something else - maybe outermostSpanId?

carterkozak

It doesn't look like this uses the top-span identifier. If we implement this approach, how do we take advantage of the value without requiring all products to make code changes?

carterkozak · 2019-11-19T21:33:21Z

tracing/src/main/java/com/palantir/tracing/Trace.java

-            startSpan(Optional.of(parentSpanId));
+            if (numberOfSpans == 0) {
+                originatingSpanId = Optional.of(parentSpanId);
+                topSpanId = Optional.of(Tracers.randomId());


Do we ever want to generate a new random ID if the SpanType is not SERVER_INCOMING? In all other cases I think the traceId we already have should be sufficient.

JacekLach · 2019-11-19T22:47:15Z

> It doesn't look like this uses the top-span identifier. If we implement this approach, how do we take advantage of the value without requiring all products to make code changes?

Followup PR is to update sls-logging to inject the identifier as an extra param to service logs (that's what the jersey interceptor change facilitates).

No opinion on filtering to only the SERVER_INCOMING span types; I assume CLIENT_OUTGOING spans should ~never be the root span, so the question is whether this might be useful for local spans? Which would require someone starting multiple span stacks with the same traceid.

carterkozak · 2019-11-22T17:42:58Z

Notes from a sync with @JacekLach @iamdanfox @dansanduleac and myself, tagging @ferozco for context as we chatted previously:

This is a divergence from standard zipkin, so we should be careful that we don't implement it in a way that prevents us from future features or refactors. However, tracing state is correctly propagated through most applications, and adding additional state would require a heavy investment. In particular utilities for asynchronous tracing (executors, detached traces, etc) would need to be duplicated across every codebase to additionally handle requestId, which is clearly not desirable.

We should implement this without any change to public API. That way we're not locked into supporting request identifiers in the tracing library in perpetuity; we only commit to maintaining the functionality to associate logging with a request.

We can achieve this by setting a known MDC key, _requestId, the same way we set traceId and _sampled. This value should be initialized only when a new root span is created with span type SERVER_INCOMING.
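A minimal sketch of that rule, assuming a plain `Map` in place of the logging MDC so the example has no dependencies (real code would presumably go through the logging framework's MDC). `RequestIdMdcSketch`, its `SpanType` enum, and `startSpan` are hypothetical names for illustration only; the only detail taken from the notes is the `_requestId` key and the condition under which it is set.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch; the Map stands in for a logging MDC and none of these names are tracing-java API.
final class RequestIdMdcSketch {
    enum SpanType { SERVER_INCOMING, CLIENT_OUTGOING, LOCAL }

    static final String REQUEST_ID_KEY = "_requestId";

    private final Map<String, String> mdc = new HashMap<>();
    private int openSpans = 0;

    // Starts a span; only a new root span of type SERVER_INCOMING initializes the request id.
    void startSpan(SpanType type) {
        boolean isRoot = openSpans == 0;
        if (isRoot && type == SpanType.SERVER_INCOMING) {
            mdc.put(REQUEST_ID_KEY, UUID.randomUUID().toString());
        }
        openSpans++;
    }

    // Returns the current request id, or null if none has been set.
    String requestId() {
        return mdc.get(REQUEST_ID_KEY);
    }
}
```

Nested spans, including nested SERVER_INCOMING spans, leave the value alone, so every log line emitted while handling one incoming request would carry the same `_requestId`.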

iamdanfox · 2019-11-26T17:13:25Z

linking to the in-progress PR #364

JacekLach requested a review from carterkozak November 14, 2019 17:10

policy-bot bot requested a review from iamdanfox November 14, 2019 17:11

JacekLach force-pushed the jl/trace-local-id branch from 07f9af5 to 67b55a6 Compare November 14, 2019 17:12

JacekLach force-pushed the jl/trace-local-id branch from 67b55a6 to 50ad80c Compare November 14, 2019 17:20

JacekLach force-pushed the jl/trace-local-id branch from 50ad80c to 20aae2c Compare November 14, 2019 17:32

Use top span id as the local id

c72b86f

JacekLach force-pushed the jl/trace-local-id branch from e57846e to c72b86f Compare November 15, 2019 15:37

iamdanfox changed the title ~~Introduce localTraceId to disambiguate incoming requests that share traceid~~ Introduce topSpanId to disambiguate incoming requests that share traceid Nov 18, 2019

JacekLach added 2 commits November 19, 2019 12:23

Preserve trace metadata in Unsampled#deepClone

bbb2376

Previously cloning an unsampled trace would lose track of originatingSpanId, which mainly mattered when recovering the original trace after a `withTrace` call.

Add generated changelog entries

1694b66

JacekLach force-pushed the jl/trace-local-id branch from d785629 to bbb2376 Compare November 19, 2019 12:23

JacekLach requested a review from dansanduleac November 19, 2019 12:23

carterkozak reviewed Nov 19, 2019

topSpanId -> outermostSpanId

13a3238

iamdanfox closed this Nov 26, 2019

6 participants
