fix(gateway): close stopping actor websockets #4916

Closed
NathanFlurry wants to merge 1 commit into 05-02-fix_pegboard_restore_hibernating_request_ids from mock-agentic-loop/gateway-websocket-stopping-close

Conversation

@NathanFlurry
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Member Author

NathanFlurry commented May 4, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Contributor

claude Bot commented May 4, 2026

Review: fix(gateway): close stopping actor websockets

Overview

This PR adds proper WebSocket close frames (instead of hard errors) when connecting to stopping or sleeping actors. Changes span three layers:

  1. Guard routing (pegboard_gateway/mod.rs): Detects sleeping actors that time out while waiting to become ready and routes them to a StoppingWebSocket handler.
  2. Pegboard gateways (pegboard-gateway, pegboard-gateway2): Converts WebSocketServiceUnavailable errors into graceful close frames when an actor stops or the tunnel closes before the WebSocket opens.
  3. rivetkit-core websocket dispatcher (registry/websocket.rs): Adds early-exit checks for stopping/destroying actors before and during WebSocket handling.

The intent is sound - browser clients cannot surface HTTP error codes on failed WebSocket upgrades and need close codes/reasons instead. This aligns with the CLAUDE.md WebSocket rejection policy.


Issues

1. Inconsistent close code: CloseCode::Error (1011) vs semantics

In pegboard-gateway and pegboard-gateway2, actor_stopping_close_frame() uses CloseCode::Error (RFC 6455 code 1011, "Internal Error"). In rivetkit-core/websocket.rs, the raw code 1011 is also used directly. RFC 6455 code 1001 ("Going Away") is semantically more accurate for a server/actor shutting down. CLAUDE.md documents 1008 for auth policy violations, so 1011 stands out here. At minimum, document why 1011 was chosen over 1001.
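To make the 1001 vs 1011 distinction concrete, here is a minimal, stdlib-only sketch of how a close code could be chosen per cause and encoded into a close frame payload. The `CloseCause` enum, `close_code_for`, and `close_payload` are hypothetical names for illustration and are not taken from the PR. The mapping of a stopping actor to 1001 reflects the suggestion above, not what the PR currently does (it uses 1011). The numeric codes and the payload layout (2-byte big-endian code followed by a UTF-8 reason, within the 125-byte control frame limit) come from RFC 6455.

```rust
/// Close codes discussed in this review (numeric values from RFC 6455, section 7.4.1).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CloseCode {
    /// 1001: the endpoint is going away, e.g. a server shutting down.
    GoingAway,
    /// 1008: the message violated the endpoint's policy.
    PolicyViolation,
    /// 1011: the server hit an unexpected condition.
    InternalError,
}

impl CloseCode {
    pub fn as_u16(self) -> u16 {
        match self {
            CloseCode::GoingAway => 1001,
            CloseCode::PolicyViolation => 1008,
            CloseCode::InternalError => 1011,
        }
    }
}

/// Hypothetical reasons a gateway might close a socket (not from the PR).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CloseCause {
    ActorStopping,
    AuthRejected,
    UnexpectedFailure,
}

/// Suggested mapping: a planned stop is "going away", not an internal error.
pub fn close_code_for(cause: CloseCause) -> CloseCode {
    match cause {
        CloseCause::ActorStopping => CloseCode::GoingAway,
        CloseCause::AuthRejected => CloseCode::PolicyViolation,
        CloseCause::UnexpectedFailure => CloseCode::InternalError,
    }
}

/// Encode a close frame payload: 2-byte big-endian code, then the UTF-8 reason.
/// Control frame payloads are capped at 125 bytes, so the reason is truncated
/// to at most 123 bytes on a character boundary.
pub fn close_payload(code: CloseCode, reason: &str) -> Vec<u8> {
    let mut end = reason.len().min(123);
    while !reason.is_char_boundary(end) {
        end -= 1;
    }
    let mut out = Vec::with_capacity(2 + end);
    out.extend_from_slice(&code.as_u16().to_be_bytes());
    out.extend_from_slice(reason[..end].as_bytes());
    out
}
```

If 1011 is kept deliberately, the same table is a natural place to record why, so the choice is documented next to the code that makes it.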

2. on_message callback missing destroy_requested() check

In rivetkit-core/src/registry/websocket.rs around line 587, the on_message handler guards only:

if !is_hibernatable && (!callback_ctx.started() || callback_ctx.sleep_requested())

But the on_open handler around line 684 checks both sleep_requested() and destroy_requested(). If destroy is triggered between on_open and a subsequent message, incoming messages will not trigger a close frame. The on_message check should also test callback_ctx.destroy_requested(), or add a comment explaining why destroying actors are handled differently at the message layer.
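One way to keep the two handlers from drifting apart is to route both through a single predicate. The sketch below is an assumption-heavy illustration: `ActorFlags` is a hypothetical stand-in for the real callback context (whose type is not shown in this review), and it assumes on_open applies the same `is_hibernatable` exemption as on_message, which the PR may or may not do.

```rust
/// Hypothetical snapshot of the lifecycle flags read from the callback context.
/// The real context type in rivetkit-core is not shown here.
#[derive(Debug, Clone, Copy, Default)]
pub struct ActorFlags {
    pub started: bool,
    pub sleep_requested: bool,
    pub destroy_requested: bool,
}

/// Returns true when a non-hibernatable socket should receive a close frame
/// instead of being serviced. Calling this from both on_open and on_message
/// means a destroy requested between the two is still caught.
pub fn should_close(is_hibernatable: bool, flags: ActorFlags) -> bool {
    !is_hibernatable
        && (!flags.started || flags.sleep_requested || flags.destroy_requested)
}
```

With a shared helper, adding or removing a lifecycle condition is a one-line change that applies to every call site.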

3. Guard-level StoppingWebSocket only activates for sleeping actors on timeout

In handle_actor_v2 and handle_actor_v1, the graceful WebSocket close path fires only when actor.sleeping is true at the ready timeout. If a non-sleeping actor fails to become ready within the timeout (e.g. slow cold-start), the code falls through to Err(errors::ActorReadyTimeout) - a hard error rather than a WebSocket close frame. This may be intentional (non-sleeping timeouts indicate unexpected failures rather than graceful stops), but a comment explaining the distinction would clarify intent.
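The branch described above can be sketched as follows, with the kind of comment that would make the intent explicit. `Route`, `GatewayError`, and `route_after_ready_timeout` are hypothetical names; only the `StoppingWebSocket` and `ActorReadyTimeout` identifiers come from the PR, and the stated rationale is this review's guess at the intent rather than something confirmed by the author.

```rust
/// Hypothetical routing target; only the StoppingWebSocket name comes from the PR.
#[derive(Debug, PartialEq, Eq)]
pub enum Route {
    StoppingWebSocket,
}

/// Hypothetical error type; only the ActorReadyTimeout name comes from the PR.
#[derive(Debug, PartialEq, Eq)]
pub enum GatewayError {
    ActorReadyTimeout,
}

/// Decide what to do once the ready timeout has elapsed.
pub fn route_after_ready_timeout(actor_sleeping: bool) -> Result<Route, GatewayError> {
    if actor_sleeping {
        // Assumed intent: a sleeping actor that did not wake in time is an
        // expected lifecycle event, so the client gets a WebSocket close frame.
        Ok(Route::StoppingWebSocket)
    } else {
        // Assumed intent: a non-sleeping actor missing the deadline is an
        // unexpected failure, so it surfaces as a hard error.
        Err(GatewayError::ActorReadyTimeout)
    }
}
```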

4. Duplicate actor_stopping_close_frame function

The function is defined identically in both engine/packages/pegboard-gateway/src/lib.rs and engine/packages/pegboard-gateway2/src/lib.rs. Minor maintainability concern - if these crates share a utility module, consider de-duplicating.

5. StoppingWebSocket::handle_request returns HTTP 503

CLAUDE.md says not to reject WebSocket upgrades with HTTP status codes before the upgrade. The 503 path in StoppingWebSocket::handle_request is only reached for non-WebSocket HTTP requests to an actor endpoint, so it is technically correct. A brief comment clarifying it is for plain-HTTP (not WS upgrade) requests would prevent a future reader from flagging it as a convention violation.
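A rough sketch of the split being described, with the clarifying comment in place. The `Incoming` and `Response` types and the `respond_for_stopping_actor` function are hypothetical and do not mirror the real handler trait; the 503 status and the 1011 close code are the values this review attributes to the PR, while the reason string is invented for illustration.

```rust
/// Hypothetical classification of an incoming request to a stopping actor.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Incoming {
    PlainHttp,
    WebSocketUpgrade,
}

/// Hypothetical response shape for illustration only.
#[derive(Debug, PartialEq, Eq)]
pub enum Response {
    /// Ordinary HTTP response with a status code.
    Http { status: u16 },
    /// Complete the upgrade, then immediately send a close frame.
    AcceptThenClose { code: u16, reason: &'static str },
}

pub fn respond_for_stopping_actor(req: Incoming) -> Response {
    match req {
        // Plain HTTP only: a status code is visible to the caller, so 503 is fine.
        // This arm is never taken for WebSocket upgrades.
        Incoming::PlainHttp => Response::Http { status: 503 },
        // Browsers cannot read the status of a failed upgrade, so accept the
        // upgrade and close with a code and reason instead.
        Incoming::WebSocketUpgrade => Response::AcceptThenClose {
            code: 1011,
            reason: "actor.stopping", // illustrative reason string, not from the PR
        },
    }
}
```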


Minor observations

  • The log message "sleeping actor did not become ready before websocket wait timeout" fires after the generic actor_ready_timeout, not a WS-specific timer. Consider "actor did not become ready before ready timeout" to avoid implying a separate WS timer.
  • The three-check pattern in rivetkit-core (before handler setup + on_open + on_message) provides good defense in depth but is verbose without explanation. A comment like "actor state may change between dispatch and connection open" would help future readers.
  • No tests are provided or checked in the PR checklist. Consider adding integration test coverage for the WebSocket-to-stopping-actor scenario.

@NathanFlurry
Member Author

Closing again — going to redo this as a smaller change at the top of the stack that only fires on shutdown finalize (not during onSleep / onDestroy).
