containers: Fix alarm scheduler spinloop under concurrent callers #191
Open
quantizor wants to merge 2 commits into cloudflare:main
Force-pushed from 04dc96e to d096ae6.
The `alarm()` handler paired an in-memory `setTimeout` sleep (up to 3 minutes) with an unconditional `setAlarm(Date.now())` on exit. Any external call to `scheduleNextAlarm()` during the sleep resolved the handler's internal Promise via `this.resolve()`, letting the handler proceed to the exit path and overwrite the caller's future alarm timestamp with one scheduled for "now". The runtime would then refire the alarm immediately, the new handler would enter another 3-minute sleep, and the next external caller would short-circuit that one too, producing a self-sustaining cascade of sub-second alarm fires.

In production this manifested under `startAndWaitForPorts` retry loops (27 retries at INSTANCE_POLL_INTERVAL_MS=300ms per unhealthy `containerFetch`) and partysocket reconnect storms: alarm cadence fell to ~280ms mean with 7ms minima, matching the 300ms poll interval almost exactly. The saturated DO event loop canceled WebSocket upgrades at 0ms wallclock while HTTP traffic on the same DO continued to succeed.

Replace the Promise/setTimeout pattern with direct re-arming on storage: `alarm()` completes its work and sets the storage alarm to the earliest of the next due schedule, `sleepAfter` expiration, or a 3-minute heartbeat, floored at MIN_ALARM_REARM_MS (100ms) so concurrent callers cannot drive cadence below the floor. `scheduleNextAlarm()` is idempotent: if an existing alarm is already set to fire sooner, the call is a no-op, which converges concurrent callers on the earliest requested time instead of having them clobber each other.

Adds unit tests in `src/tests/alarm-scheduler.test.ts` covering the regression, durability invariants, and concurrent-caller settlement. Jest config now maps `cloudflare:workers` to a local test stub and uses `clearMocks: true` so the suite exercises the full alarm state machine without a workerd runtime.

No behavior change for activity renewal, connection handling, or `onStart`/`onStop` lifecycle hooks.
Force-pushed from d096ae6 to 0b63489.
…racking

Pins the invariant from cloudflare#147 now that it's fixed on main by commits 45ca64a / c88e29f / 828e8c8: an open WebSocket (or any in-flight proxied request) must keep the container alive via the `inflightRequests` counter even after the `sleepAfter` window has elapsed. The reverse path is covered too: once the counter drops to zero and `sleepAfterMs` is past, `onActivityExpired` fires.

Red-checked by reverting the `inflightRequests > 0` branch in `isActivityExpired`: the positive test fails, the negative test passes, exactly what you'd want from a regression net for this class of bug. No library code changes.
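The invariant reads naturally as a pair of assertions; a sketch of the shape, where `makeContainer()` is a hypothetical harness and the test name is illustrative, not the actual test:

```ts
// Sketch only: makeContainer() is a hypothetical fixture; the point is
// the pinned invariant, not the real harness code.
jest.useFakeTimers();

test('an in-flight request keeps the container alive past sleepAfter', () => {
  const container = makeContainer({ sleepAfterMs: 1_000 });

  container.inflightRequests = 1;     // e.g. an open WebSocket
  jest.advanceTimersByTime(60_000);   // well past the sleepAfter window
  expect(container.isActivityExpired()).toBe(false);

  container.inflightRequests = 0;     // the socket finally closes
  expect(container.isActivityExpired()).toBe(true);
});
```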
Fixes #189.
Symptom
Under load, the `alarm()` handler on `Container` fires at sub-second cadence instead of the designed 3-minute cadence. The Durable Object event loop saturates, and WebSocket upgrades queued behind the alarm get canceled at 0ms wallclock while HTTP traffic to the same DO continues to succeed.

Observed on a single production DO over 30 minutes: 50 `eventType=alarm` events with inter-arrival times ranging from 7ms to 589ms (mean ~280ms, matching `INSTANCE_POLL_INTERVAL_MS` almost exactly). 100% of the 1,000 fetch events in the window had `outcome=canceled`, of which 975/988 `/ws` events had `wallTime=0ms`. `/api/health` through the same DO's `fetch()` method during the storm succeeded in ~225ms. No deploy churn (a single `scriptVersion.id` across the window).

Root cause
The `alarm()` handler paired an in-memory `setTimeout` sleep (up to 3 minutes) with an unconditional `setAlarm(Date.now())` on exit:
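(A reconstructed sketch of the pattern, not the literal code on main; `doScheduledWork()` is a placeholder for the handler's real work.)

```ts
// Sketch of the old handler (excerpt, reconstructed from the description).
async alarm() {
  await this.doScheduledWork(); // placeholder for the real per-alarm work

  // In-memory sleep, up to 3 minutes; scheduleNextAlarm() can cut it
  // short by calling this.resolve().
  await new Promise<void>((resolve) => {
    this.resolve = resolve;
    this.timeout = setTimeout(resolve, 3 * 60 * 1000);
  });

  // Unconditional on exit: clobbers whatever alarm a concurrent caller
  // just wrote and tells the runtime to fire again immediately.
  await this.ctx.storage.setAlarm(Date.now());
}
```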
Any external call to `scheduleNextAlarm()` during the sleep resolved the handler's internal Promise via `this.resolve()` and set the storage alarm to `Date.now() + 1000`:
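(Again a reconstructed sketch; the real method differs.)

```ts
// Sketch of the old scheduleNextAlarm() (excerpt).
scheduleNextAlarm() {
  if (this.timeout !== undefined) clearTimeout(this.timeout);
  this.resolve?.(); // wakes the in-flight alarm() handler
  void this.ctx.storage.setAlarm(Date.now() + 1000);
}
```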
The handler then resumed past its Promise, ran `setAlarm(Date.now())` at the exit path, and overwrote the 1-second future alarm with "fire now". The runtime refired the alarm immediately, the new handler entered another 3-minute sleep, and the next external caller short-circuited that one too: a self-sustaining cascade.

The most reliable external trigger is `startContainerIfNotRunning`'s retry loop, which calls `scheduleNextAlarm()` on every iteration (27 retries × `INSTANCE_POLL_INTERVAL_MS` = 300ms). One stuck `containerFetch` against an unhealthy container produces ~27 `scheduleNextAlarm()` calls at 300ms intervals, which matches the observed ~280ms mean cadence.

Fix
Replace the Promise/setTimeout pattern with direct storage-backed re-arming:
- `alarm()` completes its work and sets the storage alarm to `min(next due schedule, sleepAfterMs, Date.now() + MAX_ALARM_REARM_MS)`, floored at `Date.now() + MIN_ALARM_REARM_MS` (100ms). No in-memory `setTimeout` sleep. No `this.resolve`/`this.timeout` coordination.
- `scheduleNextAlarm()` is idempotent: it reads `getAlarm()` and no-ops if an existing alarm is already sooner than the request. Concurrent callers converge on the earliest requested time instead of clobbering each other.
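A sketch of the combined shape, with `nextScheduleDueAt()` and `sleepAfterExpiresAt()` as placeholder accessors for however the real code reads those values:

```ts
// Sketch of the new re-arming logic (excerpt; names are placeholders
// except the two constants named in the description).
const MIN_ALARM_REARM_MS = 100;
const MAX_ALARM_REARM_MS = 3 * 60 * 1000;

async alarm() {
  await this.doScheduledWork(); // placeholder for the real per-alarm work

  const next = Math.min(
    this.nextScheduleDueAt(),   // placeholder: next due schedule
    this.sleepAfterExpiresAt(), // placeholder: sleepAfter expiry
    Date.now() + MAX_ALARM_REARM_MS,
  );
  // The floor keeps concurrent callers from driving cadence below 100ms.
  await this.ctx.storage.setAlarm(Math.max(next, Date.now() + MIN_ALARM_REARM_MS));
}

async scheduleNextAlarm(at = Date.now() + MIN_ALARM_REARM_MS) {
  const existing = await this.ctx.storage.getAlarm();
  if (existing !== null && existing <= at) return; // sooner alarm already set
  await this.ctx.storage.setAlarm(at);
}
```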
Tests

Adds 13 unit tests in `src/tests/alarm-scheduler.test.ts`. 7 fail on main, including the direct spinloop regression ("scheduleNextAlarm during an in-progress alarm does not cause immediate re-fire on exit") and a cadence-invariant test that times out at Jest's 5-second limit because the handler spins. All 13 pass with the fix.

The existing `src/tests/container.test.ts` was broken on main (it couldn't resolve `cloudflare:workers`) and never ran in CI; the `npm test` script only iterates `examples/*/test/`. Jest config now maps `cloudflare:workers` to a local test stub (`src/tests/__mocks__/cloudflare-workers.ts`) and uses `clearMocks: true` for between-test spy hygiene. The stale `container.test.ts` is untouched in this PR.
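The mapping itself is small; a sketch of the relevant slice of the Jest config (the actual file may differ):

```ts
// jest.config.ts (sketch): route the workerd-only module to a Node stub.
import type { Config } from 'jest';

const config: Config = {
  clearMocks: true, // reset spy state between tests
  moduleNameMapper: {
    '^cloudflare:workers$': '<rootDir>/src/tests/__mocks__/cloudflare-workers.ts',
  },
};

export default config;
```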
Repro

The bug is deterministic. To reproduce without production traffic (a worker sketch follows the list):
1. Subclass `Container` with `sleepAfter = '1h'` and `defaultPort` set to a port the container won't listen on.
2. Call `await container.fetch(...)` in a loop.
3. Watch `ctx.storage.getAlarm()` snapshots or CF Workers Observability on `eventType=alarm`.
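A minimal worker along those lines, assuming the public `Container` fields (`defaultPort`, `sleepAfter`) and a hypothetical `SPIN_REPRO` binding:

```ts
import { Container } from '@cloudflare/containers';

// Nothing listens on defaultPort, so every proxied request drives the
// retry loop that hammers scheduleNextAlarm().
export class SpinRepro extends Container {
  defaultPort = 9999;
  sleepAfter = '1h';
}

interface Env {
  SPIN_REPRO: DurableObjectNamespace; // hypothetical binding name
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const stub = env.SPIN_REPRO.get(env.SPIN_REPRO.idFromName('repro'));
    return stub.fetch(request); // loop requests against this endpoint
  },
};
```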
Expected on main: ~300ms cadence. Expected after the fix: no more frequent than `MIN_ALARM_REARM_MS` (100ms), and in practice much slower, since the schedule and `sleepAfterMs` don't change that frequently.

Follow-ups not included
- Dropping the `MAX_ALARM_REARM_MS` heartbeat in favor of explicit `scheduleNextAlarm()` calls from `setupMonitorCallbacks` after container state transitions. Clean, but it changes when `onStop` fires after a crash, so it deserves its own review.
- A subclass's `sleepAfter` override applies after the base class's constructor calls `renewActivityTimeout()`, so the initial `sleepAfterMs` is always computed from `DEFAULT_SLEEP_AFTER = '10m'`. This self-corrects on the first `containerFetch`, but the 10-minute initial window can surprise users with a longer `sleepAfter`. Orthogonal to the spinloop; worth a separate fix.
- `inflightRequests` leak hardening. If a WebSocket's `close` never fires (the client disconnects abruptly without a close frame), the counter is permanently elevated and `isActivityExpired()` returns false forever. Not the storm bug, but worth capping or sweeping in a follow-up.