Fix Pekko HTTP async test exception flakiness #10799
Open
The async handler's exception path caused a failed Future whose span completion depended on Scala continuation cleanup. With strict trace writes enabled in tests, if the root span finished while continuations were still pending, the trace was enqueued to a discarding buffer and never written, causing a 20-second timeout in waitForTraces.

Fix by recovering from exceptions in the async handler to return a proper 500 HTTP response instead of a failed Future. This routes span completion through the success path of the DatadogAsyncHandlerWrapper transform callback, avoiding the problematic continuation cleanup race.

Also remove the @Flaky annotation from the "test exception" test since the root cause is now fixed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Benchmarks
Startup
Summary
Found 0 performance improvements and 0 performance regressions! Performance is the same for 62 metrics, 9 unstable metrics.
Startup time reports for insecure-bank
gantt
title insecure-bank - global startup overhead: candidate=1.61.0-SNAPSHOT~c5a177c1ef, baseline=1.61.0-SNAPSHOT~cc122288e5
dateFormat X
axisFormat %s
section tracing
Agent [baseline] (1.057 s) : 0, 1056781
Total [baseline] (8.801 s) : 0, 8800684
Agent [candidate] (1.074 s) : 0, 1073676
Total [candidate] (8.822 s) : 0, 8822176
section iast
Agent [baseline] (1.225 s) : 0, 1224849
Total [baseline] (9.552 s) : 0, 9551565
Agent [candidate] (1.225 s) : 0, 1224817
Total [candidate] (9.569 s) : 0, 9568988
gantt
title insecure-bank - break down per module: candidate=1.61.0-SNAPSHOT~c5a177c1ef, baseline=1.61.0-SNAPSHOT~cc122288e5
dateFormat X
axisFormat %s
section tracing
crashtracking [baseline] (1.187 ms) : 0, 1187
crashtracking [candidate] (1.204 ms) : 0, 1204
BytebuddyAgent [baseline] (626.62 ms) : 0, 626620
BytebuddyAgent [candidate] (636.446 ms) : 0, 636446
AgentMeter [baseline] (29.1 ms) : 0, 29100
AgentMeter [candidate] (29.456 ms) : 0, 29456
GlobalTracer [baseline] (256.517 ms) : 0, 256517
GlobalTracer [candidate] (259.975 ms) : 0, 259975
AppSec [baseline] (31.553 ms) : 0, 31553
AppSec [candidate] (31.978 ms) : 0, 31978
Debugger [baseline] (58.702 ms) : 0, 58702
Debugger [candidate] (59.53 ms) : 0, 59530
Remote Config [baseline] (585.489 µs) : 0, 585
Remote Config [candidate] (596.694 µs) : 0, 597
Telemetry [baseline] (8.629 ms) : 0, 8629
Telemetry [candidate] (8.738 ms) : 0, 8738
Flare Poller [baseline] (7.845 ms) : 0, 7845
Flare Poller [candidate] (9.424 ms) : 0, 9424
section iast
crashtracking [baseline] (1.193 ms) : 0, 1193
crashtracking [candidate] (1.196 ms) : 0, 1196
BytebuddyAgent [baseline] (794.943 ms) : 0, 794943
BytebuddyAgent [candidate] (794.639 ms) : 0, 794639
AgentMeter [baseline] (11.304 ms) : 0, 11304
AgentMeter [candidate] (11.297 ms) : 0, 11297
GlobalTracer [baseline] (246.819 ms) : 0, 246819
GlobalTracer [candidate] (247.111 ms) : 0, 247111
AppSec [baseline] (26.251 ms) : 0, 26251
AppSec [candidate] (26.341 ms) : 0, 26341
Debugger [baseline] (62.978 ms) : 0, 62978
Debugger [candidate] (62.675 ms) : 0, 62675
Remote Config [baseline] (526.64 µs) : 0, 527
Remote Config [candidate] (540.89 µs) : 0, 541
Telemetry [baseline] (14.799 ms) : 0, 14799
Telemetry [candidate] (14.957 ms) : 0, 14957
Flare Poller [baseline] (4.869 ms) : 0, 4869
Flare Poller [candidate] (4.734 ms) : 0, 4734
IAST [baseline] (25.103 ms) : 0, 25103
IAST [candidate] (25.105 ms) : 0, 25105
Startup time reports for petclinic
gantt
title petclinic - global startup overhead: candidate=1.61.0-SNAPSHOT~c5a177c1ef, baseline=1.61.0-SNAPSHOT~cc122288e5
dateFormat X
axisFormat %s
section tracing
Agent [baseline] (1.06 s) : 0, 1060192
Total [baseline] (11.05 s) : 0, 11050247
Agent [candidate] (1.058 s) : 0, 1057937
Total [candidate] (11.006 s) : 0, 11006049
section appsec
Agent [baseline] (1.253 s) : 0, 1252653
Total [baseline] (11.132 s) : 0, 11132372
Agent [candidate] (1.246 s) : 0, 1245647
Total [candidate] (11.143 s) : 0, 11142587
section iast
Agent [baseline] (1.23 s) : 0, 1230122
Total [baseline] (11.31 s) : 0, 11310378
Agent [candidate] (1.23 s) : 0, 1230379
Total [candidate] (11.415 s) : 0, 11414710
section profiling
Agent [baseline] (1.179 s) : 0, 1178558
Total [baseline] (10.987 s) : 0, 10986922
Agent [candidate] (1.196 s) : 0, 1196197
Total [candidate] (11.18 s) : 0, 11180329
gantt
title petclinic - break down per module: candidate=1.61.0-SNAPSHOT~c5a177c1ef, baseline=1.61.0-SNAPSHOT~cc122288e5
dateFormat X
axisFormat %s
section tracing
crashtracking [baseline] (1.187 ms) : 0, 1187
crashtracking [candidate] (1.199 ms) : 0, 1199
BytebuddyAgent [baseline] (627.6 ms) : 0, 627600
BytebuddyAgent [candidate] (627.2 ms) : 0, 627200
AgentMeter [baseline] (29.162 ms) : 0, 29162
AgentMeter [candidate] (29.042 ms) : 0, 29042
GlobalTracer [baseline] (257.197 ms) : 0, 257197
GlobalTracer [candidate] (256.979 ms) : 0, 256979
AppSec [baseline] (31.56 ms) : 0, 31560
AppSec [candidate] (31.522 ms) : 0, 31522
Debugger [baseline] (59.427 ms) : 0, 59427
Debugger [candidate] (59.416 ms) : 0, 59416
Remote Config [baseline] (590.4 µs) : 0, 590
Remote Config [candidate] (583.054 µs) : 0, 583
Telemetry [baseline] (8.638 ms) : 0, 8638
Telemetry [candidate] (8.595 ms) : 0, 8595
Flare Poller [baseline] (8.797 ms) : 0, 8797
Flare Poller [candidate] (7.311 ms) : 0, 7311
section appsec
crashtracking [baseline] (1.202 ms) : 0, 1202
crashtracking [candidate] (1.19 ms) : 0, 1190
BytebuddyAgent [baseline] (662.019 ms) : 0, 662019
BytebuddyAgent [candidate] (657.268 ms) : 0, 657268
AgentMeter [baseline] (12.054 ms) : 0, 12054
AgentMeter [candidate] (12.097 ms) : 0, 12097
GlobalTracer [baseline] (259.946 ms) : 0, 259946
GlobalTracer [candidate] (258.269 ms) : 0, 258269
AppSec [baseline] (178.057 ms) : 0, 178057
AppSec [candidate] (177.685 ms) : 0, 177685
Debugger [baseline] (65.774 ms) : 0, 65774
Debugger [candidate] (64.812 ms) : 0, 64812
Remote Config [baseline] (586.874 µs) : 0, 587
Remote Config [candidate] (577.331 µs) : 0, 577
Telemetry [baseline] (8.935 ms) : 0, 8935
Telemetry [candidate] (9.887 ms) : 0, 9887
Flare Poller [baseline] (3.557 ms) : 0, 3557
Flare Poller [candidate] (3.666 ms) : 0, 3666
IAST [baseline] (24.112 ms) : 0, 24112
IAST [candidate] (23.904 ms) : 0, 23904
section iast
crashtracking [baseline] (1.2 ms) : 0, 1200
crashtracking [candidate] (1.198 ms) : 0, 1198
BytebuddyAgent [baseline] (799.226 ms) : 0, 799226
BytebuddyAgent [candidate] (799.033 ms) : 0, 799033
AgentMeter [baseline] (11.342 ms) : 0, 11342
AgentMeter [candidate] (11.367 ms) : 0, 11367
GlobalTracer [baseline] (247.276 ms) : 0, 247276
GlobalTracer [candidate] (247.368 ms) : 0, 247368
AppSec [baseline] (26.321 ms) : 0, 26321
AppSec [candidate] (26.459 ms) : 0, 26459
Debugger [baseline] (63.206 ms) : 0, 63206
Debugger [candidate] (64.046 ms) : 0, 64046
Remote Config [baseline] (531.28 µs) : 0, 531
Remote Config [candidate] (524.036 µs) : 0, 524
Telemetry [baseline] (14.818 ms) : 0, 14818
Telemetry [candidate] (14.321 ms) : 0, 14321
Flare Poller [baseline] (4.707 ms) : 0, 4707
Flare Poller [candidate] (4.743 ms) : 0, 4743
IAST [baseline] (25.151 ms) : 0, 25151
IAST [candidate] (25.144 ms) : 0, 25144
section profiling
crashtracking [baseline] (1.178 ms) : 0, 1178
crashtracking [candidate] (1.185 ms) : 0, 1185
BytebuddyAgent [baseline] (680.4 ms) : 0, 680400
BytebuddyAgent [candidate] (691.21 ms) : 0, 691210
AgentMeter [baseline] (8.587 ms) : 0, 8587
AgentMeter [candidate] (8.705 ms) : 0, 8705
GlobalTracer [baseline] (215.174 ms) : 0, 215174
GlobalTracer [candidate] (217.782 ms) : 0, 217782
AppSec [baseline] (31.743 ms) : 0, 31743
AppSec [candidate] (32.375 ms) : 0, 32375
Debugger [baseline] (65.143 ms) : 0, 65143
Debugger [candidate] (63.451 ms) : 0, 63451
Remote Config [baseline] (581.973 µs) : 0, 582
Remote Config [candidate] (600.213 µs) : 0, 600
Telemetry [baseline] (8.185 ms) : 0, 8185
Telemetry [candidate] (10.575 ms) : 0, 10575
Flare Poller [baseline] (3.451 ms) : 0, 3451
Flare Poller [candidate] (3.547 ms) : 0, 3547
ProfilingAgent [baseline] (93.404 ms) : 0, 93404
ProfilingAgent [candidate] (95.146 ms) : 0, 95146
Profiling [baseline] (93.977 ms) : 0, 93977
Profiling [candidate] (95.716 ms) : 0, 95716
Load
Summary
Found 0 performance improvements and 6 performance regressions! Performance is the same for 14 metrics, 16 unstable metrics.
Request duration reports for petclinic
gantt
title petclinic - request duration [CI 0.99] : candidate=1.61.0-SNAPSHOT~c5a177c1ef, baseline=1.61.0-SNAPSHOT~cc122288e5
dateFormat X
axisFormat %s
section baseline
no_agent (19.654 ms) : 19448, 19860
. : milestone, 19654,
appsec (18.571 ms) : 18385, 18757
. : milestone, 18571,
code_origins (17.623 ms) : 17450, 17795
. : milestone, 17623,
iast (17.575 ms) : 17402, 17748
. : milestone, 17575,
profiling (18.602 ms) : 18419, 18785
. : milestone, 18602,
tracing (17.88 ms) : 17702, 18058
. : milestone, 17880,
section candidate
no_agent (19.207 ms) : 19010, 19404
. : milestone, 19207,
appsec (19.544 ms) : 19339, 19750
. : milestone, 19544,
code_origins (17.835 ms) : 17657, 18014
. : milestone, 17835,
iast (17.78 ms) : 17603, 17957
. : milestone, 17780,
profiling (18.374 ms) : 18188, 18561
. : milestone, 18374,
tracing (17.436 ms) : 17263, 17609
. : milestone, 17436,
Request duration reports for insecure-bank
gantt
title insecure-bank - request duration [CI 0.99] : candidate=1.61.0-SNAPSHOT~c5a177c1ef, baseline=1.61.0-SNAPSHOT~cc122288e5
dateFormat X
axisFormat %s
section baseline
no_agent (1.161 ms) : 1150, 1172
. : milestone, 1161,
iast (3.159 ms) : 3117, 3200
. : milestone, 3159,
iast_FULL (5.792 ms) : 5734, 5849
. : milestone, 5792,
iast_GLOBAL (3.43 ms) : 3376, 3485
. : milestone, 3430,
profiling (2.147 ms) : 2127, 2167
. : milestone, 2147,
tracing (1.745 ms) : 1731, 1759
. : milestone, 1745,
section candidate
no_agent (1.179 ms) : 1167, 1191
. : milestone, 1179,
iast (3.338 ms) : 3297, 3379
. : milestone, 3338,
iast_FULL (5.937 ms) : 5876, 5998
. : milestone, 5937,
iast_GLOBAL (3.587 ms) : 3529, 3645
. : milestone, 3587,
profiling (2.092 ms) : 2073, 2111
. : milestone, 2092,
tracing (1.76 ms) : 1745, 1775
. : milestone, 1760,
Dacapo
Summary
Found 0 performance improvements and 0 performance regressions! Performance is the same for 11 metrics, 1 unstable metrics.
Execution time for biojava
gantt
title biojava - execution time [CI 0.99] : candidate=1.61.0-SNAPSHOT~c5a177c1ef, baseline=1.61.0-SNAPSHOT~cc122288e5
dateFormat X
axisFormat %s
section baseline
no_agent (15.603 s) : 15603000, 15603000
. : milestone, 15603000,
appsec (15.18 s) : 15180000, 15180000
. : milestone, 15180000,
iast (18.442 s) : 18442000, 18442000
. : milestone, 18442000,
iast_GLOBAL (18.077 s) : 18077000, 18077000
. : milestone, 18077000,
profiling (15.017 s) : 15017000, 15017000
. : milestone, 15017000,
tracing (14.965 s) : 14965000, 14965000
. : milestone, 14965000,
section candidate
no_agent (15.503 s) : 15503000, 15503000
. : milestone, 15503000,
appsec (15.112 s) : 15112000, 15112000
. : milestone, 15112000,
iast (17.739 s) : 17739000, 17739000
. : milestone, 17739000,
iast_GLOBAL (17.771 s) : 17771000, 17771000
. : milestone, 17771000,
profiling (15.329 s) : 15329000, 15329000
. : milestone, 15329000,
tracing (14.965 s) : 14965000, 14965000
. : milestone, 14965000,
Execution time for tomcat
gantt
title tomcat - execution time [CI 0.99] : candidate=1.61.0-SNAPSHOT~c5a177c1ef, baseline=1.61.0-SNAPSHOT~cc122288e5
dateFormat X
axisFormat %s
section baseline
no_agent (1.476 ms) : 1465, 1488
. : milestone, 1476,
appsec (3.864 ms) : 3642, 4087
. : milestone, 3864,
iast (2.268 ms) : 2199, 2337
. : milestone, 2268,
iast_GLOBAL (2.318 ms) : 2248, 2389
. : milestone, 2318,
profiling (2.112 ms) : 2056, 2168
. : milestone, 2112,
tracing (2.094 ms) : 2040, 2149
. : milestone, 2094,
section candidate
no_agent (1.482 ms) : 1470, 1493
. : milestone, 1482,
appsec (3.825 ms) : 3603, 4046
. : milestone, 3825,
iast (2.266 ms) : 2196, 2335
. : milestone, 2266,
iast_GLOBAL (2.314 ms) : 2244, 2385
. : milestone, 2314,
profiling (2.11 ms) : 2053, 2167
. : milestone, 2110,
tracing (2.074 ms) : 2020, 2128
. : milestone, 2074,
What Does This Do
Fixes the root cause of flaky PekkoHttpServerInstrumentationAsyncTest and PekkoHttpServerInstrumentationAsyncHttp2Test "test exception" failures by recovering from exceptions in the async handler to return a proper HTTP 500 response instead of a failed Future.
Removes the @Flaky annotation from the "test exception" test since the root cause is now addressed.
Motivation
The "test exception" test was failing intermittently (~2x in 30 days) with TimeoutException from ListWriter.waitForTraces. The root cause: when the async handler throws an exception, the Future fails, and span completion flows through the DatadogAsyncHandlerWrapper's error transform callback. With strictTraceWrites=true (used in tests), the PendingTraceBuffer is a discarding buffer. If the root span finishes while Scala Future continuation references are still pending (count > 0), the trace is enqueued to the discarding buffer and never written, causing the 20-second timeout.
By adding .recover to the asyncHandler's Future, exceptions are converted to a proper HttpResponse(500) response. This routes span completion through the success path of the transform callback, which avoids the problematic failed-Future continuation cleanup race entirely. The server span is still correctly marked as errored because the HTTP 500 status triggers the server error check in the decorator.
Relates to #9396
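The .recover pattern described under Motivation can be sketched with the Scala standard library alone. This is a minimal illustration, not the test code from this PR: Response is a hypothetical stand-in for Pekko HTTP's HttpResponse, and handle, handleRecovered, and the "/exception" path are assumed names chosen for the example.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.control.NonFatal

// Hypothetical stand-in for HttpResponse so the sketch needs no dependencies.
final case class Response(status: Int, body: String = "")

object RecoverSketch {
  // Hypothetical async handler: the Future fails when the route throws.
  def handle(path: String): Future[Response] = Future {
    if (path == "/exception") throw new IllegalStateException("controller exception")
    Response(200, "ok")
  }

  // Same handler wrapped with .recover: a non-fatal failure becomes a
  // successful Future that carries a 500 response.
  def handleRecovered(path: String): Future[Response] =
    handle(path).recover { case NonFatal(e) => Response(500, e.getMessage) }

  def main(args: Array[String]): Unit = {
    // Without recover, the Future completes as a Failure.
    val failed = Await.ready(handle("/exception"), 5.seconds)
    println(s"unrecovered isFailure: ${failed.value.get.isFailure}")

    // With recover, callers only ever observe the success path.
    val recovered = Await.result(handleRecovered("/exception"), 5.seconds)
    println(s"recovered status: ${recovered.status}")
  }
}
```

Any callback chained onto handleRecovered sees a successful Future even when the route throws, which is the property the fix relies on to keep span completion on the success path. Matching on NonFatal is a choice made for this sketch; the PR text does not say which exceptions the actual .recover handles.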
Additional Notes
- asyncHandler, not the production instrumentation code
- expectedExtraErrorInformation permits null values for error tags
- syncHandler's exception behavior is unchanged; the .recover only affects the Future wrapper used by BindAndHandleAsync and BindAndHandleAsyncHttp2
- latestDepTest variants automatically benefit from this fix since they share the same source files
Contributor Checklist
- type: and (comp: or inst:) labels in addition to any other useful labels
- close, fix, or any linking keywords when referencing an issue
  Use solves instead, and assign the PR milestone to the issue
Jira ticket: N/A