Skip to content

Conversation

@MasterPtato
Copy link
Contributor

No description provided.

@MasterPtato MasterPtato force-pushed the 01-08-fix_remove_pending_actors_metric_fix_actor_error_tracker_engine_runner_error_print branch from d0fc897 to 3f0e751 Compare January 9, 2026 00:00
Copy link
Contributor Author

MasterPtato commented Jan 9, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@MasterPtato MasterPtato force-pushed the 01-08-fix_gas_dont_include_error_iters_in_metric branch from ce9397d to a6a1a86 Compare January 9, 2026 00:00
@claude
Copy link

claude bot commented Jan 9, 2026

Code Review

This PR makes several small but important fixes. Overall the changes look good, but I have a few observations and suggestions:

✅ Positive Changes

  1. Actor Error Tracking Fix (actor/mod.rs:1061-1065) - The change to only set crash failure reason for StopCode::Error is correct. Previously, the code checked code != StopCode::Ok, which was redundant since StopCode only has two variants (Ok and Error). The new pattern match is clearer and more explicit about the intent.

  2. Removed Pending Actors Metric - Removing the ACTOR_PENDING_ALLOCATION metric tracking makes sense if it wasn't being used. The metric isn't referenced in any Grafana dashboards, so this cleanup is safe.

  3. TypeScript Error Logging (runner/src/mod.ts:876) - Using stringifyError() instead of direct string interpolation is good practice for consistent error formatting across the codebase.

  4. Added Instrumentation (pegboard-runner/src/conn.rs:257) - Adding #[tracing::instrument(skip_all)] to handle_init improves observability.

  5. Removed Noisy Log (runner2.rs:848-855) - Removing the drain_older_versions activity log makes sense if it was too verbose and not providing value.

🔍 Observations

  1. Metric Definition Still Exists - While the metric is no longer being set, the ACTOR_PENDING_ALLOCATION definition remains in engine/packages/pegboard/src/metrics.rs:4-9. Consider removing the metric definition entirely if it's no longer used anywhere.

  2. Test May Need Review - There's a test named actor_pending_allocation_no_runners in engine/packages/engine/tests/actors_lifecycle.rs:786. While the test itself doesn't appear to check the metric value, the test name suggests it was related to this metric functionality. Verify that the test still provides value without the metric.

  3. Duplicate Changes - The Grafana dashboard title change ("Rivet Guard" → "Guard") appears in 6 files. Per CLAUDE.md, these should be regenerated from the template. Consider verifying that the template at docker/template/ was updated and the configs were regenerated properly rather than manually editing each file.

🐛 Potential Issue

Error in runner2.rs Metric Name (metrics.rs:26-31) - This appears to be a pre-existing bug, but worth noting:

pub static ref SERVERLESS_OUTBOUND_REQ_TOTAL: IntCounterVec = register_int_counter_vec_with_registry!(
    "pegboard_runner_version_upgrade_drain_total",  // ❌ Wrong metric name
    "Count of serverless outbound requests made.",
    &["namespace_id", "runner_name"],
    *REGISTRY
).unwrap();

The metric name is duplicated from RUNNER_VERSION_UPGRADE_DRAIN_TOTAL. Should be pegboard_serverless_outbound_req_total.

📝 Suggestions

  1. Remove the unused ACTOR_PENDING_ALLOCATION metric definition from metrics.rs
  2. Consider renaming or updating the test actor_pending_allocation_no_runners to better reflect what it's testing (pending actor allocation behavior, not the metric)
  3. Fix the SERVERLESS_OUTBOUND_REQ_TOTAL metric name bug (separate PR if needed)

Summary

The changes are well-targeted and improve code clarity. The error tracking fix is particularly good as it makes the crash handling logic more explicit. Just consider the cleanup suggestions above to fully remove traces of the unused metric.

@claude
Copy link

claude bot commented Jan 9, 2026

Code Review - PR #3797

Summary

This PR makes three focused improvements:

  1. Removes the ACTOR_PENDING_ALLOCATION metric and its tracking logic
  2. Fixes the actor error tracking to only set crash failure reasons for actual errors
  3. Improves error message formatting in the TypeScript runner

Detailed Feedback

✅ Strengths

1. Metric Removal (runner.rs & runner2.rs)

  • Clean removal of the pending_actor_count tracking logic
  • Simplifies the allocation transaction by removing unnecessary state tracking
  • The metric is properly removed from all usage sites

2. Actor Error Tracking Fix (actor/mod.rs:1061-1065)

  • Good fix! The previous logic was checking for non-Ok codes but then failing with ensure!, which would cause crashes
  • The new pattern match specifically targets StopCode::Error, which is the correct condition for setting crash failure reasons
  • This aligns with the comment "Runner failure reasons are already set at the start of handle_stopped"

3. TypeScript Error Formatting (runner/src/mod.ts:876)

  • Using stringifyError ensures proper error serialization instead of potentially getting [object Object]
  • Good consistency with other error logging in the file

4. Instrumentation Addition (pegboard-runner/src/conn.rs:257)

  • Adding #[tracing::instrument(skip_all)] to handle_init improves observability

⚠️ Concerns & Suggestions

1. Orphaned Metric Definition

The ACTOR_PENDING_ALLOCATION metric is still defined in engine/packages/pegboard/src/metrics.rs:4-9 but is no longer used anywhere. This should be removed entirely.

// Remove this from metrics.rs
pub static ref ACTOR_PENDING_ALLOCATION: IntGaugeVec = register_int_gauge_vec_with_registry!(
    "pegboard_actor_pending_allocation",
    "Total actors waiting for availability.",
    &["namespace_id", "runner_name"],
    *REGISTRY
).unwrap();

2. Removed Log Statement Context

In runner2.rs:848-855, the drain_older_versions activity had its tracing info statement removed. While removing unnecessary logging can be good, consider whether this was providing useful debugging information. If this was just noisy logging, the removal is fine. If it was helpful for debugging draining operations, consider keeping it or documenting why it was removed.

3. Grafana Dashboard Changes

The title change from "Rivet Guard" to "Guard" is applied across multiple environments. This appears to be an unrelated change to the main PR purpose. Consider:

  • Is this intentional or accidental?
  • Should this be in a separate PR for clarity?
  • Based on CLAUDE.md, these files should be regenerated from docker/template/. Did you regenerate them properly?

4. Missing Test Coverage

The PR doesn't include tests for:

  • The fixed actor error tracking logic (verifying that only StopCode::Error triggers crash failure reasons)
  • The removal of the pending actor count metric (ensuring metrics are still correctly tracked)

Security & Performance

No security concerns identified.

Performance: The removal of pending actor count tracking should provide a minor performance improvement by reducing state management in the allocation loop.

Code Style & Best Practices

Minor issues:

  • The orphaned metric definition should be cleaned up
  • Consider whether the Grafana changes belong in this PR

Verdict

Approve with minor changes recommended:

  1. Remove the unused ACTOR_PENDING_ALLOCATION metric definition from metrics.rs
  2. Verify the Grafana dashboard changes are intentional and properly regenerated
  3. Consider adding tests for the actor error tracking fix

The core fixes are solid and improve the codebase. The actor error tracking fix in particular addresses what appears to be a real bug.

@pkg-pr-new
Copy link

pkg-pr-new bot commented Jan 9, 2026

More templates

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3797

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3797

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3797

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3797

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3797

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3797

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3797

@rivetkit/virtual-websocket

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/virtual-websocket@3797

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3797

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3797

commit: 5d2fb4a

@claude
Copy link

claude bot commented Jan 9, 2026

PR Review: Fix remove pending actors metric, fix actor error tracker, engine runner error print

Summary

This PR makes three focused improvements:

  1. Removes inaccurate pending actors metric - Eliminates ACTOR_PENDING_ALLOCATION metric that was counting incorrectly
  2. Fixes actor error tracking - Corrects the condition for tracking crashed actors to only track actual errors
  3. Improves error logging - Better error serialization in TypeScript runner WebSocket error handler

Detailed Analysis

✅ Pending Actors Metric Removal (runner.rs & runner2.rs)

Changes:

  • Removed pending_actor_count tracking variable and associated metric updates
  • Simplified transaction return to just allocations

Assessment:

  • Correct fix: The metric was being calculated inside a transaction that doesnt see the full queue state, making it inherently inaccurate
  • Good simplification: Removes ~20 lines of code that were providing misleading metrics
  • No functional impact: Metric was observability-only; removal doesnt affect allocation logic

Potential concern:

  • Consider if this metric was being used in Grafana dashboards or alerts. If so, those dashboards will show missing data (though the metric was inaccurate anyway)

✅ Actor Error Tracking Fix (actor/mod.rs)

Changes:

// Before (simplified):
if let StoppedVariant::Normal { code, message } = &variant {
    ensure!(
        *code != protocol::mk2::StopCode::Ok,
        "expected non-Ok stop code in crash handler, got Ok"
    );
    // Set failure reason...
}

// After:
if let StoppedVariant::Normal {
    code: protocol::mk2::StopCode::Error,
    message,
} = &variant {
    // Set failure reason...
}

Assessment:

  • Excellent fix: The old code had a logic error - it used ensure!() to validate the stop code wasnt Ok, which would panic if violated
  • More robust: New code uses pattern matching to only execute failure tracking for actual errors (StopCode::Error)
  • Cleaner: Removes the confusing ensure!() macro usage in favor of explicit pattern matching
  • Since StopCode only has two variants (Ok and Error), this correctly identifies actual crashes

Why this matters:

  • Prevents potential panics if a graceful stop somehow reaches this code path
  • More idiomatic Rust - uses pattern matching instead of runtime assertions
  • Follows the comment above the code: "Set Crashed failure reason for actual crashes"

✅ TypeScript Runner Error Logging (mod.ts)

Changes:

// Before:
msg: `WebSocket error: ${ev.error}`

// After:
msg: `WebSocket error: ${stringifyError(ev.error)}`

Assessment:

  • Good improvement: ev.error might not serialize well to string (could be [object Object])
  • Consistent with codebase: stringifyError utility is designed for this purpose
  • Better debugging: Will provide more useful error information in logs

🔧 Minor Issues Found

1. Unused Metric Still Defined (metrics.rs:4-9)
The ACTOR_PENDING_ALLOCATION metric is still defined but no longer used anywhere:

pub static ref ACTOR_PENDING_ALLOCATION: IntGaugeVec = register_int_gauge_vec_with_registry!(
    "pegboard_actor_pending_allocation",
    "Total actors waiting for availability.",
    &["namespace_id", "runner_name"],
    *REGISTRY
).unwrap();

Recommendation: Remove this metric definition entirely since its no longer used and was providing inaccurate data.

2. Duplicate Metric Name (metrics.rs:26-31)
Unrelated to this PR, but I noticed:

pub static ref SERVERLESS_OUTBOUND_REQ_TOTAL: IntCounterVec = register_int_counter_vec_with_registry!(
    "pegboard_runner_version_upgrade_drain_total",  // ⚠️ Wrong name!
    "Count of serverless outbound requests made.",
    &["namespace_id", "runner_name"],
    *REGISTRY
).unwrap();

This metric has the wrong Prometheus name (should be pegboard_serverless_outbound_req_total).

🎯 Additional Observations

Instrumentation Addition (conn.rs:257)

  • Added #[tracing::instrument(skip_all)] to handle_init function
  • Good practice for observability
  • Follows structured logging patterns from CLAUDE.md

Log Removal (runner2.rs:856-863)

  • Removed verbose log statement from drain_older_versions
  • Reasonable cleanup if this was debug logging that is no longer needed
  • All information is already in the span context from the activity

Grafana Dashboard Title (multiple guard.json files)

  • Changed from "Rivet Guard" to "Guard"
  • Minor cosmetic change, no functional impact
  • Possibly for consistency with other dashboard naming

Security Considerations

✅ No security concerns identified. Changes are focused on metrics, logging, and error handling logic.

Performance Considerations

Minor improvement: Removing the pending actor count tracking reduces computation in the hot path (allocation transaction). The metric calculation was adding overhead for data that was inaccurate anyway.

Test Coverage

⚠️ No tests included - Consider adding:

  1. Test case for actor crash handling with StopCode::Error
  2. Test case verifying StopCode::Ok does not trigger crash failure tracking
  3. These would prevent regression of the error tracking logic

Code Quality

Good:

  • Follows Rust idioms (pattern matching over runtime checks)
  • Simplifies code by removing unused metric tracking
  • Improves error handling robustness

⚠️ Could improve:

  • Remove unused ACTOR_PENDING_ALLOCATION metric definition
  • Fix duplicate metric name for SERVERLESS_OUTBOUND_REQ_TOTAL (separate PR)

Final Recommendation

✅ APPROVE with minor suggestions

This PR makes focused, valuable improvements:

  • Fixes a real bug in actor error tracking that could cause panics
  • Removes misleading metrics
  • Improves error logging

Action items before merge:

  1. Remove the unused ACTOR_PENDING_ALLOCATION metric definition from metrics.rs
  2. Consider adding test coverage for the actor crash handling logic

Optional follow-up:

  • Fix the duplicate metric name for SERVERLESS_OUTBOUND_REQ_TOTAL in a separate PR

Great work on identifying and fixing these issues! The actor error tracking fix in particular is an important correctness improvement.

@MasterPtato MasterPtato changed the base branch from 01-08-fix_gas_dont_include_error_iters_in_metric to graphite-base/3797 January 9, 2026 00:38
@MasterPtato MasterPtato force-pushed the 01-08-fix_remove_pending_actors_metric_fix_actor_error_tracker_engine_runner_error_print branch from 3f0e751 to 1ba3107 Compare January 9, 2026 00:38
@MasterPtato MasterPtato changed the base branch from graphite-base/3797 to main January 9, 2026 00:38
@MasterPtato MasterPtato changed the base branch from main to graphite-base/3797 January 10, 2026 02:00
@MasterPtato MasterPtato force-pushed the 01-08-fix_remove_pending_actors_metric_fix_actor_error_tracker_engine_runner_error_print branch from 1ba3107 to 22e1cee Compare January 10, 2026 02:00
@vercel
Copy link

vercel bot commented Jan 10, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Review Updated (UTC)
rivet-cloud Ready Ready Preview, Comment Jan 10, 2026 2:02am
rivetkit-serverless Error Error Jan 10, 2026 2:02am
2 Skipped Deployments
Project Deployment Review Updated (UTC)
rivet-inspector Ignored Ignored Preview Jan 10, 2026 2:02am
rivet-site Ignored Ignored Preview Jan 10, 2026 2:02am

@MasterPtato MasterPtato changed the base branch from graphite-base/3797 to 01-08-chore_engine_add_support_for_epoxy_debugging January 10, 2026 02:00
@MasterPtato MasterPtato force-pushed the 01-08-chore_engine_add_support_for_epoxy_debugging branch from d789061 to df64d01 Compare January 12, 2026 19:08
@MasterPtato MasterPtato force-pushed the 01-08-fix_remove_pending_actors_metric_fix_actor_error_tracker_engine_runner_error_print branch from 22e1cee to 2701da3 Compare January 12, 2026 19:08
@MasterPtato MasterPtato force-pushed the 01-08-fix_remove_pending_actors_metric_fix_actor_error_tracker_engine_runner_error_print branch from 2701da3 to 7057252 Compare January 13, 2026 00:44
@MasterPtato MasterPtato force-pushed the 01-08-chore_engine_add_support_for_epoxy_debugging branch from df64d01 to 0a56839 Compare January 13, 2026 00:44
@MasterPtato MasterPtato force-pushed the 01-08-chore_engine_add_support_for_epoxy_debugging branch from 0a56839 to e11706d Compare January 14, 2026 03:06
@MasterPtato MasterPtato force-pushed the 01-08-fix_remove_pending_actors_metric_fix_actor_error_tracker_engine_runner_error_print branch 2 times, most recently from 373f6cf to ff1afac Compare January 14, 2026 19:47
@MasterPtato MasterPtato force-pushed the 01-08-chore_engine_add_support_for_epoxy_debugging branch from e11706d to fbe3ebc Compare January 14, 2026 19:47
@MasterPtato MasterPtato force-pushed the 01-08-chore_engine_add_support_for_epoxy_debugging branch from fbe3ebc to d3bc9c1 Compare January 14, 2026 23:02
@MasterPtato MasterPtato force-pushed the 01-08-fix_remove_pending_actors_metric_fix_actor_error_tracker_engine_runner_error_print branch from ff1afac to 5d2fb4a Compare January 14, 2026 23:02
@MasterPtato MasterPtato force-pushed the 01-08-chore_engine_add_support_for_epoxy_debugging branch from d3bc9c1 to 31d3705 Compare January 14, 2026 23:39
@MasterPtato MasterPtato force-pushed the 01-08-fix_remove_pending_actors_metric_fix_actor_error_tracker_engine_runner_error_print branch from 5d2fb4a to 6415162 Compare January 14, 2026 23:39
@graphite-app
Copy link
Contributor

graphite-app bot commented Jan 14, 2026

Merge activity

  • Jan 14, 11:40 PM UTC: MasterPtato added this pull request to the Graphite merge queue.
  • Jan 14, 11:41 PM UTC: CI is running for this pull request on a draft pull request (#3908) due to your merge queue CI optimization settings.
  • Jan 14, 11:42 PM UTC: Merged by the Graphite merge queue via draft PR: #3908.

graphite-app bot pushed a commit that referenced this pull request Jan 14, 2026
@graphite-app graphite-app bot closed this Jan 14, 2026
@graphite-app graphite-app bot deleted the 01-08-fix_remove_pending_actors_metric_fix_actor_error_tracker_engine_runner_error_print branch January 14, 2026 23:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants