fix(container-runtime): Fix stop timeout race in SummaryManager (0x263, 0x264)#26557
Conversation
…rts 0x263, 0x264) PR microsoft#26271 added a 2-minute stop timeout that calls cleanupAfterSummarizerStop() if the summarizer hangs. However, the .finally() block in startSummarization() also calls cleanupAfterSummarizerStop() unconditionally. When the timeout fires before the promise chain completes, the .finally() performs a redundant cleanup that either hits assert 0x264 (state is already Off) or cascades into assert 0x263 (double-cleanup corrupts a new summarizer cycle's state). Fix: add a summarizerGeneration counter incremented in cleanupAfterSummarizerStop. The promise chain captures the generation at start; .finally() skips cleanup if the generation has changed, meaning another path already performed cleanup. AB#59694 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Fixes a race in SummaryManager where the 2-minute stop-timeout cleanup could run before the startSummarization() promise chain’s .finally() cleanup, causing double-cleanup and state-machine corruption (asserts 0x263 / 0x264) observed in stress tests.
Changes:
- Add a monotonically increasing
summarizerGenerationcounter to detect stale.finally()cleanup and skip redundant cleanup when another path already cleaned up. - Emit
SummarizerCleanupAlreadyDonetelemetry when stale cleanup is detected. - Add two unit tests that reproduce the timeout/double-cleanup races and validate the fix.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| packages/runtime/container-runtime/src/summary/summaryManager.ts | Prevents double-cleanup by gating .finally() cleanup on a generation counter; adds telemetry for the stale-cleanup case. |
| packages/runtime/container-runtime/src/test/summary/summaryManager.spec.ts | Adds regression tests covering stop-timeout firing before the summarizer run promise chain completes (and a restart scenario). |
shlevari
left a comment
There was a problem hiding this comment.
@frankmueller-msft is there urgency to completing this PR? I'd like to wait for @kian-thompson to get to take a look at this, and he's out until Thursday. While this does in fact seem like it would prevent the assert being hit, I'm worried that we're just covering up a deeper issue that that assert is there to make sure we see. I want to make sure his original PR is actually fixing what it intended and just missing this case, as opposed to this issue revealing something deeper about the original fix.
|
Sure, we wait for @kian-thompson to sign off on this PR. |
kian-thompson
left a comment
There was a problem hiding this comment.
LGTM. This handling doesn't cover up any deeper issue. There was just an overlap in the cleanup and that code assumed no one else ran it beforehand prior to my change.
Summary
Fixes assert failures 0x263 ("Expected: starting") and 0x264 ("Expected: Not Off") in
SummaryManagerthat occur during Real Service Stress Tests.cleanupAfterSummarizerStop()when the summarizer hangs. But the.finally()block instartSummarization()also callscleanupAfterSummarizerStop()unconditionally when the promise chain completes. When the timeout fires first, the subsequent.finally()performs a redundant cleanup that either hits assert 0x264 (state is alreadyOff) or cascades into assert 0x263 (double-cleanup corrupts a new summarizer cycle's state).summarizerGenerationcounter that is incremented each timecleanupAfterSummarizerStop()runs. The promise chain captures the generation at the start of eachstartSummarization()call;.finally()skips cleanup if the generation has changed, meaning another path already performed it. Logs aSummarizerCleanupAlreadyDonetelemetry event when this occurs.Race scenario for 0x264
startSummarization(): state → Starting → Runningsummarizer.run()startsstop()→ state = Stopping, 2-minute timeout setcleanupAfterSummarizerStop()→ state = Offsummarizer.run()promise finally resolves.finally()→assert(state !== Off)→ 0x264Race scenario for 0x263
5b. Timeout → cleanup → state = Off →
shouldSummarizetrue →startSummarization()#2 → state = Starting6. Old
.finally()fires → passes 0x264 (Starting ≠ Off) → callscleanupAfterSummarizerStop()again → state = Off7. #2's delay resolves →
assert(state === Starting)→ 0x263 (state is Off)Change details
summaryManager.ts:summarizerGenerationfield (monotonically increasing counter)startSummarization()captures generation at start.finally()checks generation before cleanup; skips with telemetry if stalecleanupAfterSummarizerStop()increments generation on entrysummaryManager.spec.ts:Test plan
AB#59694
🤖 Generated with Claude Code