fix(core): wait for replay propagation before suspending #1961

karthikscale3 wants to merge 13 commits into main from
Conversation
Adds an e2e regression test that exercises 50 concurrent items each running two sequential steps (search → addResult). The pattern produces the timing skew that triggers the `scheduleWhenIdle` premature-suspension race seen in production runs (e.g. `wrun_01KQ05J17ZJHGZFRYZ20QM1DBS`, where 250 steps completed cleanly server-side but the workflow still failed with `WorkflowRuntimeError` "Unconsumed event in event log" due to `scheduleWhenIdle` firing `WorkflowSuspension` before the `addResult` callback could be registered). The race only manifests reliably when flow handlers run across separate function invocations, so this test should be evaluated against a real Vercel deployment (which CI does for the nextjs-turbopack matrix entry).

Co-authored-by: Cursor <cursoragent@cursor.com>
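The load shape of that regression test can be sketched in plain TypeScript. This is a hedged stand-in, not the actual workflow DSL: `searchStep` and `addResultStep` are local stubs rather than the workbench step implementations, and the jitter merely imitates I/O latency.

```typescript
// Plain-TS sketch of the regression test's load shape: 50 concurrent
// items, each running two sequential steps. Random jitter makes step
// completions from different items interleave, which is the condition
// the premature-suspension race needs. The step bodies are stubs.
const completed: string[] = [];
const jitter = () => new Promise<void>((r) => setTimeout(r, Math.random() * 10));

async function searchStep(item: number): Promise<string> {
  await jitter(); // stands in for the real search step's I/O
  return `result-${item}`;
}

async function addResultStep(result: string): Promise<void> {
  await jitter(); // this second step only registers after the first resolves
  completed.push(result);
}

async function runPattern(items: number): Promise<number> {
  await Promise.all(
    Array.from({ length: items }, async (_, i) => {
      const found = await searchStep(i); // step 1
      await addResultStep(found);        // step 2
    }),
  );
  return completed.length;
}
```

In the real e2e test each step crosses a function-invocation boundary on Vercel; the sketch only reproduces the concurrency shape, which is why the PR notes the race manifests reliably only against a real deployment.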
🦋 Changeset detected — latest commit: f9d27d4. The changes in this PR will be included in the next version bump. This PR includes changesets to release 18 packages.
📊 Benchmark Results

(Collapsed benchmark tables omitted: workflows with 0–50 sequential steps, Promise.all and Promise.race with 10–50 concurrent steps, 10KB data-payload runs, and stream benchmarks including TTFB metrics, each measured for 💻 Local Development and ▲ Production (Vercel) with Express and Next.js (Turbopack) observability links, plus summary tables of the fastest framework per world.)

❌ Some benchmark jobs failed: check the workflow run for details.
🧪 E2E Test Results — ❌ Some tests failed

❌ Failed Tests — ▲ Vercel Production (3 failed): nextjs-turbopack (3 failed)

Details by Category

❌ ▲ Vercel Production
✅ 💻 Local Development
✅ 📦 Local Production
✅ 🐘 Local Postgres
✅ 🪟 Windows
✅ 📋 Other

❌ Some E2E test jobs failed: check the workflow run for details.
Replaces the simple 50-item × 2-step pattern with the production failure pattern: 80 concurrent items each running 5 nested waves of steps (parallel search reps → sequential addResult → sequential getProjectResults → parallel exa-source loop → sequential getToday + parallel fetchStatus). A few wave-1 items are stragglers whose `searchStep` lags 10-15s behind the rest of the batch, mirroring the T97/T9T/T9V pattern from production run `wrun_01KQ05J17ZJHGZFRYZ20QM1DBS`. This timing skew is what triggers `scheduleWhenIdle` to fire `WorkflowSuspension` in the gap between fast hydrations completing (`pendingDeliveries` → 0) and the next `useStep` callback registering, leaving the next-wave step's `step_created` event unclaimed → `WorkflowRuntimeError`.

Co-authored-by: Cursor <cursoragent@cursor.com>
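The straggler skew can be illustrated with a small deterministic sketch. The delays and names below are illustrative only (two waves instead of five, millisecond delays instead of 10-15s); the point is that fast branches finish all their waves while a straggler is still in wave 1, which is exactly the window where the runtime can look momentarily idle.

```typescript
// Sketch of the timing skew: most branches finish wave 1 quickly while
// a few stragglers lag far behind, so fast branches reach their next
// wave during a window in which no deliveries appear to be pending.
const events: string[] = [];
const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

async function branch(i: number, straggler: boolean): Promise<void> {
  await sleep(straggler ? 50 : 1); // wave 1: stragglers lag well behind
  events.push(`search:${i}`);
  await sleep(1);                  // wave 2 registers only after wave 1 resolves
  events.push(`addResult:${i}`);
}

async function runSkewed(items: number, stragglers: number): Promise<string[]> {
  await Promise.all(
    Array.from({ length: items }, (_, i) => branch(i, i < stragglers)),
  );
  return events;
}
```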
Adds focused unit tests for scheduleWhenIdle's fast and deferred paths, plus a local REPLAY_PROPAGATION_DELAY_MS alias and a doc note explaining why the deferred timer does not need cancellation when a follow-up useStep registers mid-wait. Co-authored-by: Cursor <cursoragent@cursor.com>
Move `96_many_steps.ts` to the canonical `workbench/example/workflows` location and add the matching symlinks in every workbench whose workflows directory is a real directory (nextjs-turbopack, nextjs-webpack, nitro-v3, sveltekit). Adapters whose workflows directory is itself a symlink chain (express/fastify/hono/nitro/nitro-v2/nuxt/tanstack-start/vite via nitro-v3, astro via sveltekit, nest via example) pick the file up automatically through the chain. Drop the `test.skipIf(APP_NAME !== 'nextjs-turbopack')` guard so every adapter in the e2e matrix exercises the regression workflow.

Co-authored-by: Cursor <cursoragent@cursor.com>
Switches `private.test.ts` to `vi.useFakeTimers` + `advanceTimersByTimeAsync` so the state-machine assertions are deterministic and instant rather than wall-clock dependent. Uses a getter-based `pendingDeliveries` stub to avoid sinon-fake-timers' `loopLimit` when polling iterations would otherwise persist across many 0ms re-polls. Also adds a TODO next to `REPLAY_PROPAGATION_DELAY_MS` pointing at a future deterministic VM-resumption-in-flight counter so the 100ms heuristic isn't load-bearing forever.

Co-authored-by: Cursor <cursoragent@cursor.com>
Keep the high-concurrency replay stress case on the representative Next.js Turbopack lane after CI showed the full adapter matrix timing out or failing under the added load. Co-authored-by: Cursor <cursoragent@cursor.com>
Limit the high-concurrency replay regression to the Vercel-backed Next.js Turbopack lane so local and Windows matrices do not time out under the stress workload. Co-authored-by: Cursor <cursoragent@cursor.com>
Summary
Fixes a false-positive `WorkflowRuntimeError: Unconsumed event in event log` that can happen during high-concurrency replay when many parallel branches advance into follow-up sequential steps.

Problem
`pendingDeliveries` can briefly drop to `0` after a step result finishes hydrating, before the workflow VM has resumed across the VM boundary and registered callbacks for the next wave of `useStep` calls. `scheduleWhenIdle` treated that transient `0` as truly idle and fired `WorkflowSuspension` immediately. In that case replay could abort before the next-wave step callback was registered, leaving an existing `step_created` event unclaimed. The deferred unconsumed-event check then failed the run with `WorkflowRuntimeError` even though the event log and step execution were valid.

Fix
Update `scheduleWhenIdle` so idle suspension waits for replay delivery propagation before firing:

- Wait for the `promiseQueue` to drain.
- `setTimeout(0)` so cross-VM promise continuations can run.
- Drain the `promiseQueue` again.
- Re-check `pendingDeliveries` and loop if new deliveries appeared.
- If deliveries were observed, wait a short propagation window (aligned with `EventsConsumer`'s `DEFERRED_CHECK_DELAY_MS`) before suspending, then re-check `pendingDeliveries` one more time.

The non-zero deferred delay only applies after `scheduleWhenIdle` has seen replay deliveries in flight. That gives follow-up `useStep` callbacks time to register after hydrated results cross the VM boundary, while preserving the fast suspension path for ordinary new-work scheduling where no replay delivery was active.

Latency note
The added wait is bounded to "idle cycles that observed deliveries", which in practice is at most one per replay round per scheduling site, not per step. Cold-start single-step workflows and ordinary new-work suspensions are unaffected. Worth watching CI/prod data, but it should not be a meaningful added cost relative to the I/O involved in writing `step_created` events.

Why no cancellation hook
Unlike `EventsConsumer`'s deferred unconsumed-event check, this propagation timer is intentionally not cancelled when a follow-up `useStep`/hook/sleep registers during the wait. If a callback arrives mid-wait and consumes the pending `*_created` event, the suspension still fires after the delay, but it is harmless: the matching invocation already has `hasCreatedEvent=true`, so the suspension handler does not re-create the step and the run simply continues replay from the persisted log.

Tests
- Unit tests (`packages/core/src/private.test.ts`) for the `scheduleWhenIdle` state machine: covers (a) the fast path firing immediately when no deliveries are observed, (b) the deferred path firing after the propagation window when deliveries were observed, (c) re-looping when `pendingDeliveries` reappears mid-wait, and (d) continued polling while deliveries persist.
- E2E regression workflow `96_many_steps.ts::concurrentMultiWaveWorkflow` with many concurrent branches and multiple sequential waves. The workflow lives in `workbench/example/workflows/` (canonical) with symlinks across the rest of the matrix.
- The e2e test runs only on the `nextjs-turbopack` lane (`APP_NAME=nextjs-turbopack` with `WORKFLOW_VERCEL_ENV` set), which is the representative adapter/environment for the production-shaped replay race. What we observed in CI when it ran more broadly: local dev/prod/postgres and Windows lanes timed out at the 600s test limit with the workflow still running, while some full-matrix Vercel adapter lanes failed inside the stress workflow with `Cannot read properties of undefined (reading 'map')`. That makes broader execution a CI blocker and suggests a separate high-load stress behavior to investigate outside this targeted runtime fix.
- `pnpm --filter @workflow/core build` and `pnpm --filter @workflow/core test` pass locally.
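The propagation-aware idle loop described in the Fix section can be sketched as follows. This is a hypothetical reconstruction: `RuntimeState`, the local `REPLAY_PROPAGATION_DELAY_MS` constant, and `suspend` are stand-ins that mirror the PR description, not the actual `@workflow/core` internals or API.

```typescript
// Hypothetical sketch of the fixed scheduleWhenIdle state machine.
const REPLAY_PROPAGATION_DELAY_MS = 100; // heuristic window from the PR

interface RuntimeState {
  pendingDeliveries: number; // step results still hydrating
}

const macrotask = () => new Promise<void>((r) => setTimeout(r, 0));
const delay = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

async function scheduleWhenIdleSketch(
  rt: RuntimeState,
  suspend: () => void,
): Promise<void> {
  let sawDeliveries = false;
  for (;;) {
    await macrotask(); // let cross-VM promise continuations run
    if (rt.pendingDeliveries > 0) {
      sawDeliveries = true; // replay deliveries in flight: keep polling
      continue;
    }
    if (sawDeliveries) {
      // Deliveries were recently in flight: give follow-up useStep
      // callbacks a propagation window instead of trusting a transient
      // pendingDeliveries === 0.
      sawDeliveries = false;
      await delay(REPLAY_PROPAGATION_DELAY_MS);
      if (rt.pendingDeliveries > 0) continue; // new deliveries: re-loop
    }
    suspend(); // truly idle (or fast path: no deliveries ever observed)
    return;
  }
}
```

A branch whose delivery lands mid-wait re-enters the loop instead of suspending, while a run that never observed deliveries still takes the fast path and suspends on the first idle check.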