
Conversation

@MasterPtato
Contributor

No description provided.

@vercel

vercel bot commented Jan 10, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Review | Updated (UTC) |
| --- | --- | --- | --- |
| rivet-cloud | Ready | Preview, Comment | Jan 10, 2026 2:51am |
| rivetkit-serverless | Error | | Jan 10, 2026 2:51am |

2 Skipped Deployments

| Project | Deployment | Review | Updated (UTC) |
| --- | --- | --- | --- |
| rivet-inspector | Ignored | Preview | Jan 10, 2026 2:51am |
| rivet-site | Ignored | Preview | Jan 10, 2026 2:51am |

Contributor Author

MasterPtato commented Jan 10, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@pkg-pr-new

pkg-pr-new bot commented Jan 10, 2026

More templates

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3815

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3815

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3815

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3815

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3815

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3815

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3815

@rivetkit/virtual-websocket

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/virtual-websocket@3815

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3815

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3815

commit: 3da23f7

@MasterPtato force-pushed the 01-09-feat_add_metrics_to_pb_guard branch from 649f9ae to a21ab86 on January 10, 2026 02:50
@MasterPtato marked this pull request as ready for review on January 10, 2026 02:50
@MasterPtato force-pushed the 01-09-feat_add_metrics_to_pb_guard branch from a21ab86 to 9297e74 on January 12, 2026 19:08
@MasterPtato force-pushed the 01-08-fix_clear_all_pending_signals_after_workflow_complete branch from 270fad1 to d77de78 on January 12, 2026 19:08
@MasterPtato force-pushed the 01-09-feat_add_metrics_to_pb_guard branch from 9297e74 to 442651f on January 12, 2026 19:11
@MasterPtato force-pushed the 01-09-feat_add_metrics_to_pb_guard branch from 442651f to 7915d6f on January 13, 2026 00:27
@MasterPtato force-pushed the 01-08-fix_clear_all_pending_signals_after_workflow_complete branch from d77de78 to 4ce772a on January 13, 2026 00:44
@MasterPtato force-pushed the 01-09-feat_add_metrics_to_pb_guard branch 3 times, most recently from 42cfc6d to 506742d, on January 13, 2026 01:29
@MasterPtato force-pushed the 01-09-feat_add_metrics_to_pb_guard branch from 506742d to ca4cbfc on January 14, 2026 02:05
@MasterPtato force-pushed the 01-08-fix_clear_all_pending_signals_after_workflow_complete branch from 4ce772a to d7548da on January 14, 2026 02:05
@MasterPtato force-pushed the 01-09-feat_add_metrics_to_pb_guard branch from ca4cbfc to 8102256 on January 14, 2026 03:06
@MasterPtato force-pushed the 01-08-fix_clear_all_pending_signals_after_workflow_complete branch from d7548da to a129485 on January 14, 2026 03:06
@MasterPtato force-pushed the 01-08-fix_clear_all_pending_signals_after_workflow_complete branch from a129485 to 45ec780 on January 14, 2026 19:47
@MasterPtato force-pushed the 01-09-feat_add_metrics_to_pb_guard branch 2 times, most recently from 4b907cb to dc985c0, on January 14, 2026 20:01
@MasterPtato force-pushed the 01-09-feat_add_metrics_to_pb_guard branch from dc985c0 to f3f7d72 on January 14, 2026 22:52
@MasterPtato force-pushed the 01-08-fix_clear_all_pending_signals_after_workflow_complete branch from 45ec780 to 8041bfa on January 14, 2026 23:02
@MasterPtato force-pushed the 01-09-feat_add_metrics_to_pb_guard branch from f3f7d72 to ae1c40c on January 14, 2026 23:02
@claude

claude bot commented Jan 14, 2026

PR Review: Add metrics to pb, guard

This PR adds comprehensive metrics instrumentation to pegboard, guard, and related components. Overall, the implementation is solid and follows good observability practices. Here are my findings:

✅ Strengths

  1. Well-structured metrics aggregator: The new metrics-aggregator package provides a clean, dedicated service for scanning database state and aggregating metrics
  2. Proper metric types: Using IntGauge for gauge metrics (pending actors, active connections) and IntCounter for counters (totals, events)
  3. Timeout handling: The aggregator includes EARLY_TXN_TIMEOUT to prevent long-running transactions from blocking
  4. Consistent naming: Metrics follow a clear naming convention with proper prefixes (pegboard_, guard_, gasoline_, udb_)
  5. Good use of labels: Metrics include useful labels like namespace_id, runner_name, protocol_version for filtering

🔍 Issues & Recommendations

1. Inconsistent router label casing (engine/packages/guard/src/routing/mod.rs:127-144)

Issue: Mixing uppercase and lowercase for the same router labels will create separate time series in Prometheus, making queries confusing. This appears to be a typo.

Lines 72 and 82 use lowercase gateway and api, but lines 127 and 144 use uppercase GATEWAY and API.

Fix: Use consistent casing (recommend lowercase) throughout.
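One way to make the fix stick is to route every call site through shared label constants, so "gateway" vs "GATEWAY" drift becomes impossible. A minimal sketch (the constant names are illustrative, not from the PR):

```rust
// Hypothetical shared router-label constants. Call sites would read
//   metrics::ROUTE_TOTAL.with_label_values(&[ROUTER_GATEWAY]).inc();
// instead of repeating string literals with inconsistent casing.
pub const ROUTER_GATEWAY: &str = "gateway";
pub const ROUTER_API: &str = "api";
pub const ROUTER_RUNNER: &str = "runner";
```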

2. Commented-out aggregate_actor_active function (engine/packages/metrics-aggregator/src/lib.rs:116-189)

The aggregate_actor_active function is completely commented out but not removed. This adds ~70 lines of dead code.

Recommendation: Either remove it entirely if not needed, or add a comment explaining why it's disabled and when it will be enabled.

3. Missing integration tests

The new metrics-aggregator package has no tests. Given it scans database state and performs pagination, edge cases should be tested.

Recommendation: Add integration tests, especially for the pagination logic with the last_key tracking.

4. Hard-coded interval (engine/packages/metrics-aggregator/src/lib.rs:23)

The 15-second aggregation interval is hard-coded.

Recommendation: Make this configurable through rivet_config::Config to allow tuning based on scale and performance needs.
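A hypothetical shape for such a setting (the struct and field names are illustrative; the real value would live under rivet_config::Config):

```rust
use std::time::Duration;

// Hypothetical config struct, sketched standalone for illustration.
pub struct MetricsAggregatorConfig {
    /// How often the aggregator scans database state.
    pub aggregation_interval: Duration,
}

impl Default for MetricsAggregatorConfig {
    fn default() -> Self {
        Self {
            // Defaults to the currently hard-coded 15 seconds.
            aggregation_interval: Duration::from_secs(15),
        }
    }
}
```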

5. Metric reset pattern concern (engine/packages/metrics-aggregator/src/lib.rs:44, 193)

Resetting gauges before aggregation can cause brief reporting gaps where metrics show 0 between reset and repopulation. If Prometheus scrapes during this window, it will record incorrect data.

Recommendation: Build new values in local state, then update all metrics atomically.
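The "build locally, publish once" pattern can be sketched as follows (simplified: the real aggregator pages over database rows, not an in-memory slice):

```rust
use std::collections::HashMap;

// Accumulate gauge values in a local map while scanning, so the live
// gauge is never observed half-reset mid-scan by a Prometheus scrape.
fn aggregate_pending(rows: &[(String, String)]) -> HashMap<(String, String), i64> {
    let mut counts = HashMap::new();
    for (namespace_id, runner_name) in rows {
        *counts
            .entry((namespace_id.clone(), runner_name.clone()))
            .or_insert(0) += 1;
    }
    counts
}
```

Once the scan finishes, the map is applied to the gauge in one tight reset-and-set pass, shrinking the zero window from seconds to microseconds; a truly atomic swap would need a custom collector.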

6. Potential memory issue with unbounded label cardinality

Metrics use namespace_id and runner_name as labels. If the number of unique combinations grows unbounded, this can cause memory issues in Prometheus.

Monitoring: Ensure you have alerting on metric cardinality growth.

7. Error handling in tick function (engine/packages/metrics-aggregator/src/lib.rs:28)

If tick returns an error, the entire aggregator service crashes. This might be too aggressive for metric collection failures.

Recommendation: Log errors and continue rather than crashing the service.
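The "log and continue" shape, sketched as a synchronous stand-in for the tokio loop (`tick` and the error type are illustrative):

```rust
// A failed tick is logged and the loop keeps running, instead of
// propagating the error and taking down the whole aggregator service.
fn run_ticks(n: u32, mut tick: impl FnMut(u32) -> Result<(), String>) -> u32 {
    let mut succeeded = 0;
    for i in 0..n {
        match tick(i) {
            Ok(()) => succeeded += 1,
            // In the real service: tracing::error!(?err, "metrics tick failed");
            Err(err) => eprintln!("metrics tick failed: {err}"),
        }
    }
    succeeded
}
```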

🔒 Security

No security concerns identified. The aggregator uses read-only snapshot isolation for database queries.

⚡ Performance

  1. Good: Using StreamingMode::WantAll for efficient batch fetching
  2. Good: Pagination with last_key prevents loading entire keyspace
  3. Good: Using Snapshot isolation level avoids blocking writes
  4. Concern: Running 3 full keyspace scans every 15 seconds could be expensive at scale. Monitor database load.

📝 Code Style

Minor style issues:

  • Cargo.toml formatting change appears automated (switching to TOML formatting) - this is fine but creates a large diff
  • Consistent with CLAUDE.md guidelines
  • Proper use of structured logging with tracing

📊 Test Coverage

Missing: No tests for the new metrics-aggregator package

Summary

This is a solid metrics instrumentation PR. The main issues are:

  1. Critical: Inconsistent router label casing (will break metrics)
  2. Important: Metric reset pattern can cause data gaps
  3. Recommended: Add tests, make interval configurable, improve error handling

The new metrics will provide valuable observability into pegboard and guard systems.

@claude

claude bot commented Jan 14, 2026

PR Review: Add Metrics to Pegboard and Guard

Overview

This PR adds comprehensive metrics instrumentation to the Pegboard runner infrastructure and Guard routing system, along with a new metrics-aggregator service that periodically scans database state to compute aggregate metrics.

Code Quality & Best Practices

Strengths ✅

  1. Consistent metrics naming conventions - All metrics follow Prometheus best practices with appropriate prefixes (pegboard_, guard_, gasoline_, udb_)
  2. Good use of lazy_static - Metrics are properly defined as static globals using lazy_static!
  3. Appropriate metric types - Correct use of counters, gauges, and histograms for different metric types
  4. Structured logging - Good use of tracing with structured fields (e.g., ?x pattern)
  5. Proper error handling - Consistent use of anyhow::Result and context enrichment

Issues to Address ⚠️

1. Inconsistent label casing in guard/src/routing/mod.rs

On lines 127 and 144, there's inconsistent label casing:

metrics::ROUTE_TOTAL.with_label_values(&["GATEWAY"]).inc(); // Line 127 - UPPERCASE
metrics::ROUTE_TOTAL.with_label_values(&["API"]).inc();     // Line 144 - UPPERCASE

But earlier uses lowercase:

metrics::ROUTE_TOTAL.with_label_values(&["gateway"]).inc(); // Line 72
metrics::ROUTE_TOTAL.with_label_values(&["runner"]).inc();  // Line 82, 111

Recommendation: Use consistent lowercase labels throughout. Prometheus labels should be lowercase by convention.

2. Missing imports organization

Several files add lazy_static to dependencies but the import pattern varies:

  • pegboard-runner/src/metrics.rs:3 uses lazy_static::lazy_static!
  • guard/src/metrics.rs:1 uses use lazy_static::lazy_static;

Recommendation: Standardize on importing at the top as per CLAUDE.md guidelines.

3. Cargo.toml formatting

The Cargo.toml has been completely reformatted from compact to expanded TOML format. While this is valid, it creates a massive diff (421 additions, 352 deletions) that makes the actual dependency changes harder to review.

Observation: The only substantive change is adding rivet-metrics-aggregator to the workspace. The reformatting obscures this.

Potential Bugs 🐛

1. Metrics aggregator transaction timeout handling

In metrics-aggregator/src/lib.rs, the aggregation functions use an early timeout mechanism:

if start.elapsed() > EARLY_TXN_TIMEOUT {
    tracing::warn!("timed out processing pending actors metrics");
    break;
}

Issue: When a timeout occurs mid-scan, the function breaks and returns new_last_key = Vec::new(), which causes the outer loop to exit. This means if the scan times out, it will never resume from where it left off in subsequent ticks - it will always restart from the beginning.

Expected behavior: The code appears to intend resumable scans (note the last_key tracking), but the timeout handling prevents this from working correctly.

Recommendation: Either:

  • Return new_last_key with the actual last processed key before timeout
  • Document that timeouts force a full rescan on the next tick
  • Consider if 2.5 seconds is appropriate for these scans
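The first option can be sketched with plain slices (illustrative only; the real code pages over database key ranges with `last_key`): on hitting the budget, return the last key actually processed so the next tick resumes rather than restarts.

```rust
// Returns the processed page plus a resume key. `None` means the scan
// finished; `Some(k)` means the next tick should continue after `k`.
fn scan_page(keys: &[u64], start_after: Option<u64>, budget: usize) -> (Vec<u64>, Option<u64>) {
    let begin = match start_after {
        Some(k) => keys.iter().position(|&x| x > k).unwrap_or(keys.len()),
        None => 0,
    };
    // `budget` stands in for the EARLY_TXN_TIMEOUT cutoff.
    let end = (begin + budget).min(keys.len());
    let page = keys[begin..end].to_vec();
    let resume = if end < keys.len() { page.last().copied() } else { None };
    (page, resume)
}
```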

2. Commented out aggregate_actor_active function

Lines 116-189 in metrics-aggregator/src/lib.rs contain a fully implemented but commented-out aggregate_actor_active function.

Recommendation: Either:

  • Remove the dead code if not needed
  • Add a TODO/FIXME comment explaining why it's disabled and when it will be enabled
  • Enable it if it's ready for production

Performance Considerations ⚡

1. Metrics aggregator scan frequency

The aggregator runs every 15 seconds with a 2.5 second timeout per transaction:

let mut interval = tokio::time::interval(Duration::from_secs(15));
const EARLY_TXN_TIMEOUT: Duration = Duration::from_millis(2500);

Concerns:

  • Large namespaces could have millions of pending actors - can they be scanned in 2.5s?
  • The aggregator does a full table scan every 15 seconds which could be expensive at scale
  • No backpressure mechanism if scans consistently timeout

Recommendations:

  • Add metrics to track scan duration and completion rate
  • Consider incremental/sampling strategies for very large datasets
  • Add alerts if scans consistently timeout

2. Metrics cardinality

Several metrics use namespace_id and runner_name as labels:

.with_label_values(&[&pending_actor_key.namespace_id.to_string(), &pending_actor_key.runner_name_selector])

Concern: High cardinality if there are many namespaces × runner names. UUID-based namespace_ids are particularly problematic for cardinality.

Recommendation:

  • Monitor cardinality in production
  • Consider aggregating metrics at a higher level if cardinality becomes problematic
  • Document expected cardinality in metrics descriptions

3. Metric reset pattern

The aggregator calls .reset() on gauges before scanning:

pegboard::metrics::ACTOR_PENDING_ALLOCATION.reset();

Issue: There's a brief window where metrics are 0 during the scan, which could trigger false alerts or create gaps in dashboards.

Better pattern: Build up a new set of values, then update atomically, or use a separate gauge that gets swapped.

Security Concerns 🔒

No significant security issues identified. The metrics don't expose sensitive data beyond aggregate counts and UUIDs.

Minor note: Runner keys and tokens are handled correctly (not logged or exposed in metrics).

Test Coverage 🧪

Missing:

  • No tests for the new metrics-aggregator package
  • No tests verifying metrics are incremented correctly
  • No tests for timeout/pagination behavior in aggregator

Recommendations:

  1. Add unit tests for parse_actor_path in guard routing (this has complex path parsing logic)
  2. Add integration tests for metrics-aggregator that verify:
    • Correct metric values after scanning
    • Timeout handling behavior
    • Resume behavior with last_key
  3. Add tests for metric increment calls in critical paths

Additional Observations

1. Version downgrade in Cargo.lock

The workspace version went from 2.0.37 to 2.0.33, which is unexpected for a feature addition.

Question: Is this intentional? This could indicate a rebase issue or merge conflict resolution.

2. Gasoline metrics additions

New operation metrics were added to gasoline:

  • OPERATION_PENDING, OPERATION_TOTAL, OPERATION_DURATION, OPERATION_ERRORS
  • CPU_USAGE, LOAD_SHEDDING_RATIO

These are excellent additions for observability but aren't mentioned in the PR description.

3. UniversalDB metrics

Added transaction tracking metrics:

  • TRANSACTION_TOTAL, TRANSACTION_PENDING

Good addition for monitoring database transaction behavior.

Summary

Approve with requested changes

This PR adds valuable observability to critical infrastructure components. The metrics implementation follows good practices overall, but there are several issues that should be addressed:

Must fix:

  1. Fix inconsistent label casing in guard routing metrics (guard/src/routing/mod.rs:127,144)
  2. Fix or document the timeout behavior in metrics-aggregator scans
  3. Resolve the version downgrade from 2.0.37 to 2.0.33

Should fix:
4. Remove or document the commented-out aggregate_actor_active function
5. Add test coverage for the new metrics-aggregator package
6. Consider performance implications of full scans every 15 seconds

Nice to have:
7. Standardize Cargo.toml formatting in a separate PR to reduce diff noise
8. Add metrics to track aggregator scan performance
9. Add cardinality monitoring for namespace_id × runner_name combinations

The core functionality is sound, but addressing the timeout handling and test coverage issues will make this more production-ready.

@MasterPtato force-pushed the 01-08-fix_clear_all_pending_signals_after_workflow_complete branch from 8041bfa to 01b117b on January 14, 2026 23:07
@MasterPtato force-pushed the 01-09-feat_add_metrics_to_pb_guard branch from ae1c40c to 3da23f7 on January 14, 2026 23:07
@MasterPtato force-pushed the 01-08-fix_clear_all_pending_signals_after_workflow_complete branch from 01b117b to 821194e on January 14, 2026 23:39
@MasterPtato force-pushed the 01-09-feat_add_metrics_to_pb_guard branch from 3da23f7 to aea6392 on January 14, 2026 23:39
@graphite-app
Contributor

graphite-app bot commented Jan 14, 2026

Merge activity

  • Jan 14, 11:40 PM UTC: MasterPtato added this pull request to the Graphite merge queue.
  • Jan 14, 11:41 PM UTC: CI is running for this pull request on a draft pull request (#3908) due to your merge queue CI optimization settings.
  • Jan 14, 11:42 PM UTC: Merged by the Graphite merge queue via draft PR: #3908.

graphite-app bot pushed a commit that referenced this pull request Jan 14, 2026
@graphite-app bot closed this on Jan 14, 2026
@graphite-app bot deleted the 01-09-feat_add_metrics_to_pb_guard branch on January 14, 2026 23:42