containers: Fix alarm scheduler spinloop under concurrent callers #191
Open
quantizor wants to merge 2 commits into cloudflare:main
Force-pushed from 04dc96e to d096ae6.
The `alarm()` handler paired an in-memory `setTimeout` sleep (up to 3 minutes) with an unconditional `setAlarm(Date.now())` on exit. Any external call to `scheduleNextAlarm()` during the sleep resolved the handler's internal Promise via `this.resolve()`, letting the handler proceed to the exit path and overwrite the caller's future alarm timestamp with one scheduled for "now". The runtime would then refire the alarm immediately, the new handler would enter another 3-minute sleep, and the next external caller would short-circuit that one too, producing a self-sustaining cascade of sub-second alarm fires.

In production this manifested under `startAndWaitForPorts` retry loops (27 retries at INSTANCE_POLL_INTERVAL_MS=300ms per unhealthy `containerFetch`) and partysocket reconnect storms: alarm cadence fell to ~280ms mean with 7ms minima, matching the 300ms poll interval almost exactly. The saturated DO event loop canceled WebSocket upgrades at 0ms wallclock while HTTP traffic on the same DO continued to succeed.

Replace the Promise/setTimeout pattern with direct re-arming on storage: `alarm()` completes its work and sets the storage alarm to the earliest of the next due schedule, `sleepAfter` expiration, or a 3-minute heartbeat, floored at MIN_ALARM_REARM_MS (100ms) so concurrent callers cannot drive cadence below the floor. `scheduleNextAlarm()` is idempotent: if an existing alarm is already set to fire sooner, the call is a no-op, which converges concurrent callers on the earliest requested time instead of having them clobber each other.

Adds unit tests in `src/tests/alarm-scheduler.test.ts` covering the regression, durability invariants, and concurrent-caller settlement. Jest config now maps `cloudflare:workers` to a local test stub and uses `clearMocks: true` so the suite exercises the full alarm state machine without a workerd runtime.

No behavior change for activity renewal, connection handling, or `onStart`/`onStop` lifecycle hooks.
Force-pushed from d096ae6 to 0b63489.
…racking

Pins the invariant from cloudflare#147 now that it's fixed on main by commits 45ca64a / c88e29f / 828e8c8: an open WebSocket (or any in-flight proxied request) must keep the container alive via the `inflightRequests` counter even after the `sleepAfter` window has elapsed. The reverse path is covered too: once the counter drops to zero and `sleepAfterMs` is past, `onActivityExpired` fires.

Red-checked by reverting the `inflightRequests > 0` branch in `isActivityExpired`: the positive test fails, the negative test passes, exactly what you'd want from a regression net for this class of bug. No library code changes.
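The invariant reads naturally as a pair of assertions; a sketch of the shape, where `makeContainer()` is a hypothetical harness and the test name is illustrative, not the actual test:

```ts
// Sketch only: makeContainer() is a hypothetical fixture; the point is
// the pinned invariant, not the real harness code.
jest.useFakeTimers();

test('an in-flight request keeps the container alive past sleepAfter', () => {
  const container = makeContainer({ sleepAfterMs: 1_000 });

  container.inflightRequests = 1;     // e.g. an open WebSocket
  jest.advanceTimersByTime(60_000);   // well past the sleepAfter window
  expect(container.isActivityExpired()).toBe(false);

  container.inflightRequests = 0;     // the socket finally closes
  expect(container.isActivityExpired()).toBe(true);
});
```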
Fixes #189.
Symptom
Under load, the `alarm()` handler on `Container` fires at sub-second cadence instead of the designed 3-minute cadence. The Durable Object event loop saturates, and WebSocket upgrades queued behind the alarm get canceled at 0ms wallclock while HTTP traffic to the same DO continues to succeed.

Observed on a single production DO over 30 minutes: 50 `eventType=alarm` events with inter-arrival times ranging from 7ms to 589ms (mean ~280ms, matching `INSTANCE_POLL_INTERVAL_MS` almost exactly). 100% of the 1,000 fetch events in the window had `outcome=canceled`, of which 975/988 `/ws` events had `wallTime=0ms`. `/api/health` through the same DO's `fetch()` method during the storm succeeded in ~225ms. No deploy churn (a single `scriptVersion.id` across the window).

Root cause
The `alarm()` handler paired an in-memory `setTimeout` sleep (up to 3 minutes) with an unconditional `setAlarm(Date.now())` on exit:
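(A reconstructed sketch of the pattern, not the literal code on main; `doScheduledWork()` is a placeholder for the handler's real work.)

```ts
// Sketch of the old handler (excerpt, reconstructed from the description).
async alarm() {
  await this.doScheduledWork(); // placeholder for the real per-alarm work

  // In-memory sleep, up to 3 minutes; scheduleNextAlarm() can cut it
  // short by calling this.resolve().
  await new Promise<void>((resolve) => {
    this.resolve = resolve;
    this.timeout = setTimeout(resolve, 3 * 60 * 1000);
  });

  // Unconditional on exit: clobbers whatever alarm a concurrent caller
  // just wrote and tells the runtime to fire again immediately.
  await this.ctx.storage.setAlarm(Date.now());
}
```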
Any external call to `scheduleNextAlarm()` during the sleep resolved the handler's internal Promise via `this.resolve()` and set the storage alarm to `Date.now() + 1000`:
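(Again a reconstructed sketch; the real method differs.)

```ts
// Sketch of the old scheduleNextAlarm() (excerpt).
scheduleNextAlarm() {
  if (this.timeout !== undefined) clearTimeout(this.timeout);
  this.resolve?.(); // wakes the in-flight alarm() handler
  void this.ctx.storage.setAlarm(Date.now() + 1000);
}
```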
The handler then resumed past its Promise, ran `setAlarm(Date.now())` at the exit path, and overwrote the 1-second future alarm with "fire now". The runtime refired the alarm immediately, the new handler entered another 3-minute sleep, and the next external caller short-circuited that one too: a self-sustaining cascade.

The most reliable external trigger is `startContainerIfNotRunning`'s retry loop, which calls `scheduleNextAlarm()` on every iteration (27 retries × `INSTANCE_POLL_INTERVAL_MS` = 300ms). One stuck `containerFetch` against an unhealthy container produces ~27 `scheduleNextAlarm()` calls at 300ms intervals, which matches the observed ~280ms mean cadence.

Fix
Replace the Promise/setTimeout pattern with direct storage-backed re-arming:
- `alarm()` completes its work and sets the storage alarm to `min(next due schedule, sleepAfterMs, Date.now() + MAX_ALARM_REARM_MS)`, floored at `Date.now() + MIN_ALARM_REARM_MS` (100ms). No in-memory `setTimeout` sleep. No `this.resolve`/`this.timeout` coordination.
- `scheduleNextAlarm()` is idempotent: it reads `getAlarm()` and no-ops if an existing alarm is already sooner than the request. Concurrent callers converge on the earliest requested time instead of clobbering each other.
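A sketch of the combined shape, with `nextScheduleDueAt()` and `sleepAfterExpiresAt()` as placeholder accessors for however the real code reads those values:

```ts
// Sketch of the new re-arming logic (excerpt; names are placeholders
// except the two constants named in the description).
const MIN_ALARM_REARM_MS = 100;
const MAX_ALARM_REARM_MS = 3 * 60 * 1000;

async alarm() {
  await this.doScheduledWork(); // placeholder for the real per-alarm work

  const next = Math.min(
    this.nextScheduleDueAt(),   // placeholder: next due schedule
    this.sleepAfterExpiresAt(), // placeholder: sleepAfter expiry
    Date.now() + MAX_ALARM_REARM_MS,
  );
  // The floor keeps concurrent callers from driving cadence below 100ms.
  await this.ctx.storage.setAlarm(Math.max(next, Date.now() + MIN_ALARM_REARM_MS));
}

async scheduleNextAlarm(at = Date.now() + MIN_ALARM_REARM_MS) {
  const existing = await this.ctx.storage.getAlarm();
  if (existing !== null && existing <= at) return; // sooner alarm already set
  await this.ctx.storage.setAlarm(at);
}
```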
Tests

Adds 13 unit tests in `src/tests/alarm-scheduler.test.ts`. 7 fail on main, including the direct spinloop regression ("scheduleNextAlarm during an in-progress alarm does not cause immediate re-fire on exit") and a cadence-invariant test that times out at Jest's 5-second limit because the handler spins. All 13 pass with the fix.

The existing `src/tests/container.test.ts` was broken on main (it couldn't resolve `cloudflare:workers`) and never ran in CI; the `npm test` script only iterates `examples/*/test/`. Jest config now maps `cloudflare:workers` to a local test stub (`src/tests/__mocks__/cloudflare-workers.ts`) and uses `clearMocks: true` for between-test spy hygiene. The stale `container.test.ts` is untouched in this PR.
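The mapping itself is small; a sketch of the relevant slice of the Jest config (the actual file may differ):

```ts
// jest.config.ts (sketch): route the workerd-only module to a Node stub.
import type { Config } from 'jest';

const config: Config = {
  clearMocks: true, // reset spy state between tests
  moduleNameMapper: {
    '^cloudflare:workers$': '<rootDir>/src/tests/__mocks__/cloudflare-workers.ts',
  },
};

export default config;
```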
Repro

The bug is deterministic. To reproduce without production traffic (a worker sketch follows the list):
1. Subclass `Container` with `sleepAfter = '1h'` and `defaultPort` set to a port the container won't listen on.
2. Call `await container.fetch(...)` in a loop.
3. Watch `ctx.storage.getAlarm()` snapshots or CF Workers Observability on `eventType=alarm`.
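A minimal worker along those lines, assuming the public `Container` fields (`defaultPort`, `sleepAfter`) and a hypothetical `SPIN_REPRO` binding:

```ts
import { Container } from '@cloudflare/containers';

// Nothing listens on defaultPort, so every proxied request drives the
// retry loop that hammers scheduleNextAlarm().
export class SpinRepro extends Container {
  defaultPort = 9999;
  sleepAfter = '1h';
}

interface Env {
  SPIN_REPRO: DurableObjectNamespace; // hypothetical binding name
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const stub = env.SPIN_REPRO.get(env.SPIN_REPRO.idFromName('repro'));
    return stub.fetch(request); // loop requests against this endpoint
  },
};
```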
Expected on main: ~300ms cadence. Expected after the fix: no more frequent than `MIN_ALARM_REARM_MS` (100ms), and in practice much slower, since the schedule and `sleepAfterMs` don't change that frequently.

Follow-ups not included
- Dropping the `MAX_ALARM_REARM_MS` heartbeat in favor of explicit `scheduleNextAlarm()` calls from `setupMonitorCallbacks` after container state transitions. Clean, but it changes when `onStop` fires after a crash, so it deserves its own review.
- A subclass's `sleepAfter` override applies after the base class's constructor calls `renewActivityTimeout()`, so the initial `sleepAfterMs` is always computed from `DEFAULT_SLEEP_AFTER = '10m'`. This self-corrects on the first `containerFetch`, but the 10-minute initial window can surprise users with a longer `sleepAfter`. Orthogonal to the spinloop; worth a separate fix.
- `inflightRequests` leak hardening. If a WebSocket's `close` never fires (the client disconnects abruptly without a close frame), the counter is permanently elevated and `isActivityExpired()` returns false forever. Not the storm bug, but worth capping or sweeping in a follow-up.