Skip to content

fix(container-runtime): Fix stop timeout race in SummaryManager (0x263, 0x264)#26557

Merged
frankmueller-msft merged 1 commit intomicrosoft:mainfrom
frankmueller-msft:fix/summary-manager-stop-timeout-race
Mar 3, 2026
Merged

fix(container-runtime): Fix stop timeout race in SummaryManager (0x263, 0x264)#26557
frankmueller-msft merged 1 commit intomicrosoft:mainfrom
frankmueller-msft:fix/summary-manager-stop-timeout-race

Conversation

@frankmueller-msft
Copy link
Copy Markdown
Contributor

@frankmueller-msft frankmueller-msft commented Feb 26, 2026

Summary

Fixes assert failures 0x263 ("Expected: starting") and 0x264 ("Expected: Not Off") in SummaryManager that occur during Real Service Stress Tests.

  • Root cause: PR Add timeouts to summarizer stopping #26271 added a 2-minute stop timeout that calls cleanupAfterSummarizerStop() when the summarizer hangs. But the .finally() block in startSummarization() also calls cleanupAfterSummarizerStop() unconditionally when the promise chain completes. When the timeout fires first, the subsequent .finally() performs a redundant cleanup that either hits assert 0x264 (state is already Off) or cascades into assert 0x263 (double-cleanup corrupts a new summarizer cycle's state).
  • Fix: Adds a summarizerGeneration counter that is incremented each time cleanupAfterSummarizerStop() runs. The promise chain captures the generation at the start of each startSummarization() call; .finally() skips cleanup if the generation has changed, meaning another path already performed it. Logs a SummarizerCleanupAlreadyDone telemetry event when this occurs.
  • Tests: Two new tests reproduce both race scenarios (timeout fires while disconnected; timeout fires, reconnects, starts new cycle, then old chain completes).

Race scenario for 0x264

  1. startSummarization(): state → Starting → Running
  2. summarizer.run() starts
  3. Disconnect → stop() → state = Stopping, 2-minute timeout set
  4. Summarizer hangs under stress test load
  5. Timeout firescleanupAfterSummarizerStop() → state = Off
  6. summarizer.run() promise finally resolves
  7. .finally()assert(state !== Off)0x264

Race scenario for 0x263

5b. Timeout → cleanup → state = Off → shouldSummarize true → startSummarization() #2 → state = Starting
6. Old .finally() fires → passes 0x264 (Starting ≠ Off) → calls cleanupAfterSummarizerStop() again → state = Off
7. #2's delay resolves → assert(state === Starting)0x263 (state is Off)

Change details

summaryManager.ts:

  • New summarizerGeneration field (monotonically increasing counter)
  • startSummarization() captures generation at start
  • .finally() checks generation before cleanup; skips with telemetry if stale
  • cleanupAfterSummarizerStop() increments generation on entry

summaryManager.spec.ts:

  • "Should not assert when stop timeout fires before promise chain completes" — reproduces 0x264
  • "Should not corrupt new cycle when stop timeout fires and shouldSummarize is true" — reproduces 0x263 cascade

Test plan

  • All 14 Summary Manager unit tests pass (ESM + CJS), including 2 new tests
  • CI passes
  • Stress tests no longer hit 0x263/0x264

AB#59694

🤖 Generated with Claude Code

…rts 0x263, 0x264)

PR microsoft#26271 added a 2-minute stop timeout that calls cleanupAfterSummarizerStop()
if the summarizer hangs. However, the .finally() block in startSummarization()
also calls cleanupAfterSummarizerStop() unconditionally. When the timeout fires
before the promise chain completes, the .finally() performs a redundant cleanup
that either hits assert 0x264 (state is already Off) or cascades into assert
0x263 (double-cleanup corrupts a new summarizer cycle's state).

Fix: add a summarizerGeneration counter incremented in cleanupAfterSummarizerStop.
The promise chain captures the generation at start; .finally() skips cleanup if
the generation has changed, meaning another path already performed cleanup.

AB#59694

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings February 26, 2026 05:05
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a race in SummaryManager where the 2-minute stop-timeout cleanup could run before the startSummarization() promise chain’s .finally() cleanup, causing double-cleanup and state-machine corruption (asserts 0x263 / 0x264) observed in stress tests.

Changes:

  • Add a monotonically increasing summarizerGeneration counter to detect stale .finally() cleanup and skip redundant cleanup when another path already cleaned up.
  • Emit SummarizerCleanupAlreadyDone telemetry when stale cleanup is detected.
  • Add two unit tests that reproduce the timeout/double-cleanup races and validate the fix.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
packages/runtime/container-runtime/src/summary/summaryManager.ts Prevents double-cleanup by gating .finally() cleanup on a generation counter; adds telemetry for the stale-cleanup case.
packages/runtime/container-runtime/src/test/summary/summaryManager.spec.ts Adds regression tests covering stop-timeout firing before the summarizer run promise chain completes (and a restart scenario).

Copy link
Copy Markdown

@shlevari shlevari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@frankmueller-msft is there urgency to completing this PR? I'd like to wait for @kian-thompson to get to take a look at this, and he's out until Thursday. While this does in fact seem like it would prevent the assert being hit, I'm worried that we're just covering up a deeper issue that that assert is there to make sure we see. I want to make sure his original PR is actually fixing what it intended and just missing this case, as opposed to this issue revealing something deeper about the original fix.

@frankmueller-msft
Copy link
Copy Markdown
Contributor Author

Sure, we wait for @kian-thompson to sign off on this PR.

Copy link
Copy Markdown
Contributor

@kian-thompson kian-thompson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. This handling doesn't cover up any deeper issue. There was just an overlap in the cleanup and that code assumed no one else ran it beforehand prior to my change.

@frankmueller-msft frankmueller-msft merged commit bd2fbd3 into microsoft:main Mar 3, 2026
42 checks passed
@frankmueller-msft frankmueller-msft deleted the fix/summary-manager-stop-timeout-race branch March 3, 2026 20:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants