fix(rivetkit): drain shutdown work before sleep#4902

Closed
NathanFlurry wants to merge 1 commit into driver-test-complaints/wire-remote-sqlite-exec from driver-test-complaints/fix-shutdown-drain

Conversation

@NathanFlurry
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Member Author

NathanFlurry commented May 3, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Contributor

claude Bot commented May 3, 2026

Review: fix(rivetkit): drain shutdown work before sleep

Overview

This PR correctly addresses a lifecycle gap where an actor entering sleep could finalize before in-flight waitUntil / shutdown-task work finished. The main changes are:

  1. Abort signal reset between generations: abort_signal and queue_abort_signal are now Mutex<CancellationToken> so a fresh token can be swapped in when an actor restarts without recreating the context.
  2. Sleep vs. destroy finalization distinction: can_finalize_sleep() is replaced by can_finalize_shutdown(reason: ShutdownKind), adding the condition !self.run_handler_active() for sleep only so the run handler drains naturally instead of being cut off.
  3. Drain-before-cancel teardown: teardown_sleep_state now loops until tracked work finishes (or until the grace deadline fires and shutdown_deadline_reached is set), then switches to cancellation.
  4. Wasm shutdown task tracking: LocalShutdownTask adds an AbortHandle + oneshot::Receiver<()> per wasm task so teardown can either drain or cancel with the same policy as native.
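
The sleep vs. destroy distinction in item 2 can be illustrated with a minimal sketch. This is not the PR's code: the `Lifecycle` struct, its two fields, and the `Sleep`/`Destroy` variant names are assumptions made for illustration, and the real check likely consults more state than a single counter.

```rust
/// Hypothetical variants; the PR only names the `ShutdownKind` type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ShutdownKind {
    Sleep,
    Destroy,
}

/// Hypothetical snapshot of the state a finalization check might read.
pub struct Lifecycle {
    /// Number of in-flight waitUntil / shutdown tasks still being tracked.
    pub tracked_work: usize,
    /// Whether the actor's run handler is still executing.
    pub run_handler_active: bool,
}

impl Lifecycle {
    pub fn can_finalize_shutdown(&self, reason: ShutdownKind) -> bool {
        let drained = self.tracked_work == 0;
        match reason {
            // Sleep additionally waits for the run handler to finish on its own.
            ShutdownKind::Sleep => drained && !self.run_handler_active,
            // Destroy does not apply the run-handler condition.
            ShutdownKind::Destroy => drained,
        }
    }
}
```

Matching on the enum, rather than passing a boolean, means a future third shutdown reason fails to compile until its policy is spelled out.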

Issues

Missing test coverage for the new sleep-drain path

The existing tests cover destroy timeout (destroy_shutdown_times_out_at_deadline_and_aborts_stuck_shutdown_task) but there is no test verifying:

  • A waitUntil started before sleep blocks finalization until it completes.
  • An actor that restarts after sleep correctly resets abort_signal (i.e. queue waits in the second generation are not immediately aborted).
  • Sleep with a stuck waitUntil eventually cancels it after the grace deadline, mirroring the destroy deadline test.

These scenarios are the direct motivation for the PR and are the most likely regression sites.

Non-wasm teardown_sleep_state loop inconsistency under abort_remaining

In the non-wasm path, when abort_remaining = true, after shutdown_tasks.shutdown().await returns the outer loop iterates again — this is correct. However, unlike the wasm path (which returns after aborting), the native path does not early-return. Since teardown_started is set at the top before the #[cfg] blocks, new spawns abort immediately and the loop does terminate. But the asymmetry with the wasm path is a readability hazard. Consider adding an explicit return after the abort sweep in the native path to match wasm's structure.
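
A toy model of the suggested structure, under heavy assumptions: the real loop is async and awaits tracked tasks, while this sketch is synchronous and represents each task as a count of remaining steps. `teardown_model` and `TeardownOutcome` are hypothetical names, and `abort_remaining` is passed in as a parameter here purely to keep the model deterministic.

```rust
/// Result of the toy teardown loop.
#[derive(Debug, PartialEq, Eq)]
pub enum TeardownOutcome {
    Drained,
    Aborted,
}

/// Synchronous stand-in for the teardown loop. Each entry in `tasks` is the
/// number of steps a tracked task still needs before it completes.
pub fn teardown_model(tasks: &mut Vec<u32>, abort_remaining: bool) -> TeardownOutcome {
    loop {
        if tasks.is_empty() {
            return TeardownOutcome::Drained;
        }
        if abort_remaining {
            // Abort sweep: drop everything that is still tracked.
            tasks.clear();
            // Explicit return, mirroring the wasm path, instead of relying on
            // the next iteration to observe the empty set.
            return TeardownOutcome::Aborted;
        }
        // Drain step: every task makes one unit of progress.
        for remaining in tasks.iter_mut() {
            *remaining = remaining.saturating_sub(1);
        }
        tasks.retain(|remaining| *remaining > 0);
    }
}
```

With the explicit return, a reader does not have to reason about `teardown_started` blocking new spawns to see that the abort branch terminates.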

runtime.kind === "wasm" branch in native.ts breaks runtime abstraction

if (runtime.kind === "wasm") {
    await runtime.actorSaveState(ctx, actorCtx.serializeForTick("save"));
} else {
    await actorCtx.saveState({ immediate: true });
}

This leaks runtime knowledge into the onSleep callback. If a third runtime variant is added, this silently falls through to the native path. A saveStateForSleep(ctx, actorCtx) method on the CoreRuntime interface (with per-variant implementations) would isolate the branching behind the existing abstraction boundary.

reset_abort_signal_for_start acquires two locks in sequence

pub(crate) fn reset_abort_signal_for_start(&self) {
    let mut abort_signal = self.0.abort_signal.lock();     // lock 1
    ...
    *self.0.queue_abort_signal.lock() = next_signal;       // lock 2
}

No other code path acquires both locks simultaneously, so there is no current deadlock risk. A // lock order: abort_signal then queue_abort_signal comment would prevent future inversions.
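
A minimal sketch of what the documented lock order could look like, using only the standard library. `ToyToken` is a hypothetical stand-in for `CancellationToken`, the `Signals` struct and `cancel_all` are invented for illustration, and replacing each signal with an independent fresh token is an assumption (the PR's elided body may derive `next_signal` differently).

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};

/// Toy cancellation token (hypothetical stand-in for CancellationToken).
#[derive(Clone, Default)]
pub struct ToyToken(Arc<AtomicBool>);

impl ToyToken {
    pub fn cancel(&self) {
        self.0.store(true, Ordering::SeqCst);
    }
    pub fn is_cancelled(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

#[derive(Default)]
pub struct Signals {
    abort_signal: Mutex<ToyToken>,
    queue_abort_signal: Mutex<ToyToken>,
}

impl Signals {
    /// Clones of the current-generation tokens.
    pub fn abort_signal(&self) -> ToyToken {
        self.abort_signal.lock().unwrap().clone()
    }
    pub fn queue_abort_signal(&self) -> ToyToken {
        self.queue_abort_signal.lock().unwrap().clone()
    }

    /// Cancels both signals for the current generation.
    pub fn cancel_all(&self) {
        // lock order: abort_signal then queue_abort_signal
        let abort = self.abort_signal.lock().unwrap();
        let queue = self.queue_abort_signal.lock().unwrap();
        abort.cancel();
        queue.cancel();
    }

    /// Swaps in fresh tokens when a new generation starts.
    pub fn reset_abort_signal_for_start(&self) {
        // lock order: abort_signal then queue_abort_signal
        let mut abort = self.abort_signal.lock().unwrap();
        if !abort.is_cancelled() {
            // First start: nothing to reset.
            return;
        }
        *abort = ToyToken::default();
        *self.queue_abort_signal.lock().unwrap() = ToyToken::default();
    }
}
```

Tokens handed out before the reset stay cancelled, so work from the previous generation still observes its abort while the new generation starts clean.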

Comment style (minor)

Per project convention, comments should explain WHY, not WHAT. In task.rs:

// A restarted actor is a new generation. Clear shutdown state that was
// only meant to stop the previous generation.

"Clear shutdown state" describes the action. Consider: "Shutdown state from the previous generation must not carry into the new one."


Correct patterns worth calling out

  • Guard drop before reset_sleep_timer in both native and wasm track_shutdown_task is the right ordering — the timer re-check sees the post-completion counter.
  • shutdown_deadline_reached.swap(false) (not just a load) correctly consumes the flag atomically, so a later generation's teardown does not inherit the previous generation's abort decision.
  • Early-return in reset_abort_signal_for_start when the signal is not yet cancelled avoids churn on first start.
  • The Abortable + oneshot pattern for wasm tasks is a clean workaround for JoinSet unavailability in the wasm runtime.
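
The difference between consuming the deadline flag with `swap(false)` and merely loading it can be sketched with a standard-library `AtomicBool`. The `ShutdownFlags` wrapper and its method names are hypothetical; only the `shutdown_deadline_reached` field name comes from the review.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical holder for the deadline flag.
#[derive(Default)]
pub struct ShutdownFlags {
    shutdown_deadline_reached: AtomicBool,
}

impl ShutdownFlags {
    /// Called when the grace deadline fires.
    pub fn mark_deadline_reached(&self) {
        self.shutdown_deadline_reached.store(true, Ordering::SeqCst);
    }

    /// Reads the flag without clearing it; a later teardown would still see `true`.
    pub fn peek_deadline_reached(&self) -> bool {
        self.shutdown_deadline_reached.load(Ordering::SeqCst)
    }

    /// Reads and clears the flag in one atomic step, so the decision is
    /// observed exactly once.
    pub fn take_deadline_reached(&self) -> bool {
        self.shutdown_deadline_reached.swap(false, Ordering::SeqCst)
    }
}
```

A teardown that calls `take_deadline_reached` leaves the flag cleared for the next generation, whereas one that only peeks would pass the stale abort decision along.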

Summary

The core logic is sound and addresses a real correctness issue. The main asks before merging are: (1) add tests for the sleep-drain scenarios described above, and (2) consider the runtime.kind abstraction leak in native.ts. The loop asymmetry and comment style issues are minor.

NathanFlurry force-pushed the driver-test-complaints/fix-shutdown-drain branch from 2f3266d to 977011c on May 3, 2026 07:12
NathanFlurry marked this pull request as draft on May 3, 2026 07:19
NathanFlurry force-pushed the driver-test-complaints/wire-remote-sqlite-exec branch from 3dad87c to 96d2371 on May 3, 2026 07:34
NathanFlurry force-pushed the driver-test-complaints/fix-shutdown-drain branch from 977011c to 8ce7fbe on May 3, 2026 07:34
NathanFlurry force-pushed the driver-test-complaints/fix-shutdown-drain branch from 8ce7fbe to ec8b474 on May 3, 2026 21:03
NathanFlurry force-pushed the driver-test-complaints/wire-remote-sqlite-exec branch from 96d2371 to 86c11aa on May 3, 2026 21:03
NathanFlurry marked this pull request as ready for review on May 3, 2026 23:21
NathanFlurry force-pushed the driver-test-complaints/fix-shutdown-drain branch from ec8b474 to 52b2be3 on May 3, 2026 23:36
This was referenced May 4, 2026
@NathanFlurry
Member Author

Landed in main via stack-merge fast-forward push. Commits are in main; closing to match.
