Skip to content

feat(rivetkit): expose runtime diagnostics#4909

Closed
NathanFlurry wants to merge 1 commit into05-03-chore_depot-client_migrate_to_worker_threadfrom
sqlite-soak/runtime-diagnostics
Closed

feat(rivetkit): expose runtime diagnostics#4909
NathanFlurry wants to merge 1 commit into05-03-chore_depot-client_migrate_to_worker_threadfrom
sqlite-soak/runtime-diagnostics

Conversation

@NathanFlurry
Copy link
Copy Markdown
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@railway-app
Copy link
Copy Markdown

railway-app Bot commented May 3, 2026

🚅 Deployed to the rivet-pr-4909 environment in rivet-frontend

Service Status Web Updated (UTC)
kitchen-sink ❌ Build Failed (View Logs) Web May 4, 2026 at 4:13 am
website 😴 Sleeping (View Logs) Web May 3, 2026 at 11:03 pm
ladle ❌ Build Failed (View Logs) Web May 3, 2026 at 10:54 pm
frontend-inspector ❌ Build Failed (View Logs) Web May 3, 2026 at 10:49 pm
frontend-cloud ❌ Build Failed (View Logs) Web May 3, 2026 at 10:48 pm
mcp-hub ✅ Success (View Logs) Web May 3, 2026 at 10:48 pm

Copy link
Copy Markdown
Member Author

NathanFlurry commented May 3, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

@NathanFlurry NathanFlurry changed the base branch from sqlite-soak/depot-vfs-cache-metrics to graphite-base/4909 May 3, 2026 22:50
@NathanFlurry NathanFlurry force-pushed the sqlite-soak/runtime-diagnostics branch from e602008 to 6728e22 Compare May 3, 2026 22:50
@NathanFlurry NathanFlurry force-pushed the graphite-base/4909 branch from 3d41f6a to 5aa51ce Compare May 3, 2026 22:50
@NathanFlurry NathanFlurry changed the base branch from graphite-base/4909 to 05-03-chore_depot-client_migrate_to_worker_thread May 3, 2026 22:50
@railway-app railway-app Bot temporarily deployed to rivet-frontend / rivet-pr-4909 May 3, 2026 22:50 Destroyed
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 3, 2026

Code Review: feat(rivetkit): expose runtime diagnostics

Overview

This PR adds runtime observability tooling across the RivetKit stack:

  • envoy-client (Rust): actors_notify: Arc<Notify> to signal actor registration/deregistration; new active_actor_count() and wait_actor_registered_then_stopped() on EnvoyHandle; refactored decode_serverless_actor_start_payload as a free function
  • SSE lifecycle events: New SSE_STOPPING_FRAME sent when an actor stops, giving serverless callers a signal to exit the SSE loop
  • rivetkit-core (Rust): malloc_trim(0) after each actor shutdown; action dispatch now holds a keep-awake region until the reply resolves
  • NAPI/TypeScript: SQLite VFS metrics (JsSqliteVfsMetrics), registry diagnostics (diagnostics()), actorClearRuntimeState, and actorSqlMetrics plumbed through all runtime layers
  • native.ts: onSleep and onDestroy callbacks are now always registered (no longer conditional)
  • kitchen-sink example: /debug/memory and /debug/heap-snapshot endpoints; process signal logging; configurable PORT; HTTP timeout tuning

Issues

1. Race condition in wait_actor_registered_then_stopped — notification must be armed before state check

// handle.rs:122–154
let notified = self.shared.actors_notify.notified();
// ... check actor_is_registered ...
tokio::select! {
    _ = notified => {}
    _ = self.wait_stopped() => return,
}

notify_waiters() does not store a permit. From Tokio's docs: "if a Notified was not being polled at the time of the notify_waiters() call, it will miss the notification." Between creating notified and actually polling it inside select!, an actor can register and deregister. Both notify_waiters() calls fire, find no waiting futures, and the signal is lost. The loop then blocks until the next unrelated actor event.

CLAUDE.md is explicit: "Waiters must arm the notification before re-checking the counter so decrement-to-zero cannot race past them."

Fix: call notified.enable() before the state check, then use notified as a &mut reference inside the select:

let mut notified = self.shared.actors_notify.notified();
notified.enable(); // subscribe BEFORE checking state
// ... check actor_is_registered ...
tokio::select! {
    _ = &mut notified => {}
    _ = self.wait_stopped() => return,
}

2. Path traversal in /debug/heap-snapshot

// examples/kitchen-sink/src/server.ts:361–373
const path = c.req.query("path");
if (!path) {
    return c.json({ error: "missing path" }, 400);
}
const writtenPath = v8.writeHeapSnapshot(path);

The caller-controlled path is passed directly to v8.writeHeapSnapshot(), allowing an attacker to write a heap snapshot to any filesystem path (e.g., ?path=/etc/cron.d/backdoor). The SQLITE_MEMORY_SOAK_DIAGNOSTICS guard helps, but the endpoint is still deployed in the binary. At a minimum, validate that path is an absolute path inside a known safe directory with no .. components, or accept only a directory name and let V8 choose the filename.

3. State lock held across active_envoy_actor_count().await in diagnostics()

// rivetkit-napi/src/registry.rs:255–286
let guard = self.state.lock().await;
let diagnostics = match &*guard {
    RegistryState::Serverless(runtime) => JsRegistryDiagnostics {
        envoy_active_actor_count: runtime
            .active_envoy_actor_count()  // locks envoy, then actors
            .await
            ...
    },
    ...
};

diagnostics() holds self.state (a tokio::Mutex) across an .await that itself acquires two more locks (envoy then actors). Every other operation that locks state — including every incoming serverless request — is blocked for the duration. Consider cloning the runtime reference, dropping the state guard, and only then calling active_envoy_actor_count().

4. onSleep now closes the database — undocumented semantic behavior change

// native.ts — new onSleep finally block
} finally {
    await actorCtx.closeDatabase(); // NEW: was not here before
    clearNativeRuntimeState(runtime, ctx);
    await actorCtx.dispose();
}

Previously closeDatabase was called only from onDestroy. It is now called on every sleep. Actors that hold a native SQLite database and sleep frequently will now close and re-open the connection on each sleep/wake cycle, potentially involving remote flush and re-initialization overhead. If intentional, a comment explaining why is warranted; if accidental, it should be reverted.

5. malloc_trim(0) called per-actor shutdown — noisy under load

// rivetkit-core/src/actor/task.rs
fn trim_native_allocator_after_shutdown(actor_id: &str, reason: &str) {
    unsafe extern "C" { fn malloc_trim(pad: usize) -> i32; }
    let rc = unsafe { malloc_trim(0) };
    ...
}

malloc_trim is a process-global glibc operation. Under concurrent actor shutdowns, multiple threads will race to call it, each doing redundant work. Consider rate-limiting this (e.g., at most once per N seconds using an AtomicU64 generation counter) rather than invoking it on every single shutdown.

6. Two impl EnvoyHandle blocks in the same file

After extracting decode_serverless_actor_start_payload as a free function, a second impl EnvoyHandle block opens immediately after it (containing send_kv_request and related private methods). This is valid Rust but a style inconsistency. Move the free function above the first impl block, or merge both impl blocks.


Minor Notes

  • active_actor_count counts by !handle.is_closed(): entries may linger briefly in the registry after deregistration. A comment clarifying whether this means "currently running" vs "registered" would help.
  • normalizeNativeMetrics dual-case lookup supports both camelCase and snake_case fields. A brief comment explaining why both may be present (NAPI conversion timing) would help future readers.
  • /debug/memory has no authentication: acceptable in an example, but worth a note so users don't accidentally ship it open.
  • httpServer.timeout = 0: intentional to avoid premature request kills during long actor operations, but a brief comment would clarify it is deliberate rather than an oversight.
  • New test action_dispatch_blocks_idle_sleep_until_reply: good coverage. The for _ in 0..10 { yield_now().await } polling loop is the flake-prone pattern CLAUDE.md flags — pairing the keep-awake counter drop with a Notify would make the assertion deterministic.

Summary

The feature direction is solid. The two items that need attention before merge are the notification race in wait_actor_registered_then_stopped (correctness, per CLAUDE.md's explicit Notify-arming requirement) and the path traversal in the heap-snapshot endpoint (security). The diagnostics lock-chain is a performance concern worth addressing. The onSleep database-close change should be documented if intentional, or reverted if accidental.

@NathanFlurry NathanFlurry changed the base branch from 05-03-chore_depot-client_migrate_to_worker_thread to graphite-base/4909 May 3, 2026 23:40
@NathanFlurry NathanFlurry force-pushed the sqlite-soak/runtime-diagnostics branch from 6728e22 to 3066210 Compare May 4, 2026 00:04
@railway-app railway-app Bot temporarily deployed to rivet-frontend / rivet-pr-4909 May 4, 2026 00:04 Destroyed
@NathanFlurry NathanFlurry marked this pull request as ready for review May 4, 2026 00:31
@NathanFlurry NathanFlurry force-pushed the graphite-base/4909 branch from 5aa51ce to b5f770f Compare May 4, 2026 00:32
@NathanFlurry NathanFlurry force-pushed the sqlite-soak/runtime-diagnostics branch from 3066210 to 134c537 Compare May 4, 2026 00:32
@NathanFlurry NathanFlurry changed the base branch from graphite-base/4909 to 05-03-chore_depot-client_migrate_to_worker_thread May 4, 2026 00:32
@railway-app railway-app Bot temporarily deployed to rivet-frontend / rivet-pr-4909 May 4, 2026 00:32 Destroyed
@NathanFlurry NathanFlurry force-pushed the 05-03-chore_depot-client_migrate_to_worker_thread branch from b5f770f to 8ffd1e8 Compare May 4, 2026 04:13
@NathanFlurry NathanFlurry force-pushed the sqlite-soak/runtime-diagnostics branch from 134c537 to 0205cf8 Compare May 4, 2026 04:13
@railway-app railway-app Bot temporarily deployed to rivet-frontend / rivet-pr-4909 May 4, 2026 04:13 Destroyed
@NathanFlurry
Copy link
Copy Markdown
Member Author

Landed in main via stack-merge fast-forward push. Commits are in main; closing to match.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant