
Conversation

@MasterPtato
Contributor

No description provided.

@vercel

vercel bot commented Jan 10, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Review | Updated (UTC) |
| --- | --- | --- | --- |
| rivet-cloud | Ready | Preview, Comment | Jan 10, 2026 2:51am |
| rivetkit-serverless | Error | | Jan 10, 2026 2:51am |

2 Skipped Deployments

| Project | Deployment | Review | Updated (UTC) |
| --- | --- | --- | --- |
| rivet-inspector | Ignored | Preview | Jan 10, 2026 2:51am |
| rivet-site | Ignored | Preview | Jan 10, 2026 2:51am |

Contributor Author

MasterPtato commented Jan 10, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@pkg-pr-new

pkg-pr-new bot commented Jan 10, 2026

More templates

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3815

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3815

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3815

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3815

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3815

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3815

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3815

@rivetkit/virtual-websocket

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/virtual-websocket@3815

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3815

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3815

commit: 3da23f7

@MasterPtato force-pushed the 01-09-feat_add_metrics_to_pb_guard branch from 649f9ae to a21ab86 on January 10, 2026 02:50
@MasterPtato marked this pull request as ready for review on January 10, 2026 02:50
@MasterPtato force-pushed the 01-09-feat_add_metrics_to_pb_guard branch from a21ab86 to 9297e74 on January 12, 2026 19:08
@MasterPtato force-pushed the 01-08-fix_clear_all_pending_signals_after_workflow_complete branch from 270fad1 to d77de78 on January 12, 2026 19:08
@MasterPtato force-pushed the 01-09-feat_add_metrics_to_pb_guard branch from 9297e74 to 442651f on January 12, 2026 19:11
@MasterPtato force-pushed the 01-09-feat_add_metrics_to_pb_guard branch from 442651f to 7915d6f on January 13, 2026 00:27
@MasterPtato force-pushed the 01-08-fix_clear_all_pending_signals_after_workflow_complete branch from d77de78 to 4ce772a on January 13, 2026 00:44
@MasterPtato force-pushed the 01-09-feat_add_metrics_to_pb_guard branch 3 times, most recently from 42cfc6d to 506742d, on January 13, 2026 01:29
@MasterPtato force-pushed the 01-09-feat_add_metrics_to_pb_guard branch from 506742d to ca4cbfc on January 14, 2026 02:05
@MasterPtato force-pushed the 01-08-fix_clear_all_pending_signals_after_workflow_complete branch from 4ce772a to d7548da on January 14, 2026 02:05
@MasterPtato force-pushed the 01-09-feat_add_metrics_to_pb_guard branch from ca4cbfc to 8102256 on January 14, 2026 03:06
@MasterPtato force-pushed the 01-08-fix_clear_all_pending_signals_after_workflow_complete branch from d7548da to a129485 on January 14, 2026 03:06
@MasterPtato force-pushed the 01-08-fix_clear_all_pending_signals_after_workflow_complete branch from a129485 to 45ec780 on January 14, 2026 19:47
@MasterPtato force-pushed the 01-09-feat_add_metrics_to_pb_guard branch 2 times, most recently from 4b907cb to dc985c0, on January 14, 2026 20:01
@MasterPtato force-pushed the 01-09-feat_add_metrics_to_pb_guard branch from dc985c0 to f3f7d72 on January 14, 2026 22:52
@MasterPtato force-pushed the 01-08-fix_clear_all_pending_signals_after_workflow_complete branch from 45ec780 to 8041bfa on January 14, 2026 23:02
@MasterPtato force-pushed the 01-09-feat_add_metrics_to_pb_guard branch from f3f7d72 to ae1c40c on January 14, 2026 23:02
@claude

claude bot commented Jan 14, 2026

PR Review: Add metrics to pb, guard

This PR adds comprehensive metrics instrumentation to pegboard, guard, and related components. Overall, the implementation is solid and follows good observability practices. Here are my findings:

✅ Strengths

  1. Well-structured metrics aggregator: The new metrics-aggregator package provides a clean, dedicated service for scanning database state and aggregating metrics
  2. Proper metric types: Using IntGauge for gauge metrics (pending actors, active connections) and IntCounter for counters (totals, events)
  3. Timeout handling: The aggregator includes EARLY_TXN_TIMEOUT to prevent long-running transactions from blocking
  4. Consistent naming: Metrics follow a clear naming convention with proper prefixes (pegboard_, guard_, gasoline_, udb_)
  5. Good use of labels: Metrics include useful labels like namespace_id, runner_name, protocol_version for filtering

🔍 Issues & Recommendations

1. Inconsistent router label casing (engine/packages/guard/src/routing/mod.rs:127-144)

Issue: Mixing uppercase and lowercase for the same router labels will create separate time series in Prometheus, making queries confusing. This appears to be a typo.

Lines 72 and 82 use lowercase gateway and api, but lines 127 and 144 use uppercase GATEWAY and API.

Fix: Use consistent casing (recommend lowercase) throughout.
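One way to make the fix stick is to route every call site through shared label constants, so "gateway" vs "GATEWAY" drift becomes impossible. A minimal sketch (the constant names are illustrative, not from the PR):

```rust
// Hypothetical shared router-label constants. Call sites would read
//   metrics::ROUTE_TOTAL.with_label_values(&[ROUTER_GATEWAY]).inc();
// instead of repeating string literals with inconsistent casing.
pub const ROUTER_GATEWAY: &str = "gateway";
pub const ROUTER_API: &str = "api";
pub const ROUTER_RUNNER: &str = "runner";
```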

2. Commented-out aggregate_actor_active function (engine/packages/metrics-aggregator/src/lib.rs:116-189)

The aggregate_actor_active function is completely commented out but not removed. This adds ~70 lines of dead code.

Recommendation: Either remove it entirely if not needed, or add a comment explaining why it's disabled and when it will be enabled.

3. Missing integration tests

The new metrics-aggregator package has no tests. Given it scans database state and performs pagination, edge cases should be tested.

Recommendation: Add integration tests, especially for the pagination logic with the last_key tracking.

4. Hard-coded interval (engine/packages/metrics-aggregator/src/lib.rs:23)

The 15-second aggregation interval is hard-coded.

Recommendation: Make this configurable through rivet_config::Config to allow tuning based on scale and performance needs.
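A hypothetical shape for such a setting (the struct and field names are illustrative; the real value would live under rivet_config::Config):

```rust
use std::time::Duration;

// Hypothetical config struct, sketched standalone for illustration.
pub struct MetricsAggregatorConfig {
    /// How often the aggregator scans database state.
    pub aggregation_interval: Duration,
}

impl Default for MetricsAggregatorConfig {
    fn default() -> Self {
        Self {
            // Defaults to the currently hard-coded 15 seconds.
            aggregation_interval: Duration::from_secs(15),
        }
    }
}
```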

5. Metric reset pattern concern (engine/packages/metrics-aggregator/src/lib.rs:44, 193)

Resetting gauges before aggregation can cause brief reporting gaps where metrics show 0 between reset and repopulation. If Prometheus scrapes during this window, it will record incorrect data.

Recommendation: Build new values in local state, then update all metrics atomically.
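The "build locally, publish once" pattern can be sketched as follows (simplified: the real aggregator pages over database rows, not an in-memory slice):

```rust
use std::collections::HashMap;

// Accumulate gauge values in a local map while scanning, so the live
// gauge is never observed half-reset mid-scan by a Prometheus scrape.
fn aggregate_pending(rows: &[(String, String)]) -> HashMap<(String, String), i64> {
    let mut counts = HashMap::new();
    for (namespace_id, runner_name) in rows {
        *counts
            .entry((namespace_id.clone(), runner_name.clone()))
            .or_insert(0) += 1;
    }
    counts
}
```

Once the scan finishes, the map is applied to the gauge in one tight reset-and-set pass, shrinking the zero window from seconds to microseconds; a truly atomic swap would need a custom collector.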

6. Potential memory issue with unbounded label cardinality

Metrics use namespace_id and runner_name as labels. If the number of unique combinations grows unbounded, this can cause memory issues in Prometheus.

Monitoring: Ensure you have alerting on metric cardinality growth.

7. Error handling in tick function (engine/packages/metrics-aggregator/src/lib.rs:28)

If tick returns an error, the entire aggregator service crashes. This might be too aggressive for metric collection failures.

Recommendation: Log errors and continue rather than crashing the service.
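The "log and continue" shape, sketched as a synchronous stand-in for the tokio loop (`tick` and the error type are illustrative):

```rust
// A failed tick is logged and the loop keeps running, instead of
// propagating the error and taking down the whole aggregator service.
fn run_ticks(n: u32, mut tick: impl FnMut(u32) -> Result<(), String>) -> u32 {
    let mut succeeded = 0;
    for i in 0..n {
        match tick(i) {
            Ok(()) => succeeded += 1,
            // In the real service: tracing::error!(?err, "metrics tick failed");
            Err(err) => eprintln!("metrics tick failed: {err}"),
        }
    }
    succeeded
}
```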

🔒 Security

No security concerns identified. The aggregator uses read-only snapshot isolation for database queries.

⚡ Performance

  1. Good: Using StreamingMode::WantAll for efficient batch fetching
  2. Good: Pagination with last_key prevents loading entire keyspace
  3. Good: Using Snapshot isolation level avoids blocking writes
  4. Concern: Running 3 full keyspace scans every 15 seconds could be expensive at scale. Monitor database load.

📝 Code Style

Minor style issues:

  • Cargo.toml formatting change appears automated (switching to TOML formatting) - this is fine but creates a large diff
  • Consistent with CLAUDE.md guidelines
  • Proper use of structured logging with tracing

📊 Test Coverage

Missing: No tests for the new metrics-aggregator package

Summary

This is a solid metrics instrumentation PR. The main issues are:

  1. Critical: Inconsistent router label casing (will break metrics)
  2. Important: Metric reset pattern can cause data gaps
  3. Recommended: Add tests, make interval configurable, improve error handling

The new metrics will provide valuable observability into pegboard and guard systems.

@claude

claude bot commented Jan 14, 2026

PR Review: Add Metrics to Pegboard and Guard

Overview

This PR adds comprehensive metrics instrumentation to the Pegboard runner infrastructure and Guard routing system, along with a new metrics-aggregator service that periodically scans database state to compute aggregate metrics.

Code Quality & Best Practices

Strengths ✅

  1. Consistent metrics naming conventions - All metrics follow Prometheus best practices with appropriate prefixes (pegboard_, guard_, gasoline_, udb_)
  2. Good use of lazy_static - Metrics are properly defined as static globals using lazy_static!
  3. Appropriate metric types - Correct use of counters, gauges, and histograms for different metric types
  4. Structured logging - Good use of tracing with structured fields (e.g., ?x pattern)
  5. Proper error handling - Consistent use of anyhow::Result and context enrichment

Issues to Address ⚠️

1. Inconsistent label casing in guard/src/routing/mod.rs

On lines 127 and 144, there's inconsistent label casing:

metrics::ROUTE_TOTAL.with_label_values(&["GATEWAY"]).inc(); // Line 127 - UPPERCASE
metrics::ROUTE_TOTAL.with_label_values(&["API"]).inc();     // Line 144 - UPPERCASE

But earlier uses lowercase:

metrics::ROUTE_TOTAL.with_label_values(&["gateway"]).inc(); // Line 72
metrics::ROUTE_TOTAL.with_label_values(&["runner"]).inc();  // Line 82, 111

Recommendation: Use consistent lowercase labels throughout. Prometheus labels should be lowercase by convention.

2. Missing imports organization

Several files add lazy_static to dependencies but the import pattern varies:

  • pegboard-runner/src/metrics.rs:3 uses lazy_static::lazy_static!
  • guard/src/metrics.rs:1 uses use lazy_static::lazy_static;

Recommendation: Standardize on importing at the top as per CLAUDE.md guidelines.

3. Cargo.toml formatting

The Cargo.toml has been completely reformatted from compact to expanded TOML format. While this is valid, it creates a massive diff (421 additions, 352 deletions) that makes the actual dependency changes harder to review.

Observation: The only substantive change is adding rivet-metrics-aggregator to the workspace. The reformatting obscures this.

Potential Bugs 🐛

1. Metrics aggregator transaction timeout handling

In metrics-aggregator/src/lib.rs, the aggregation functions use an early timeout mechanism:

if start.elapsed() > EARLY_TXN_TIMEOUT {
    tracing::warn!("timed out processing pending actors metrics");
    break;
}

Issue: When a timeout occurs mid-scan, the function breaks and returns new_last_key = Vec::new(), which causes the outer loop to exit. This means if the scan times out, it will never resume from where it left off in subsequent ticks - it will always restart from the beginning.

Expected behavior: The code appears to intend resumable scans (note the last_key tracking), but the timeout handling prevents this from working correctly.

Recommendation: Either:

  • Return new_last_key with the actual last processed key before timeout
  • Document that timeouts force a full rescan on the next tick
  • Consider if 2.5 seconds is appropriate for these scans
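The first option can be sketched with plain slices (illustrative only; the real code pages over database key ranges with `last_key`): on hitting the budget, return the last key actually processed so the next tick resumes rather than restarts.

```rust
// Returns the processed page plus a resume key. `None` means the scan
// finished; `Some(k)` means the next tick should continue after `k`.
fn scan_page(keys: &[u64], start_after: Option<u64>, budget: usize) -> (Vec<u64>, Option<u64>) {
    let begin = match start_after {
        Some(k) => keys.iter().position(|&x| x > k).unwrap_or(keys.len()),
        None => 0,
    };
    // `budget` stands in for the EARLY_TXN_TIMEOUT cutoff.
    let end = (begin + budget).min(keys.len());
    let page = keys[begin..end].to_vec();
    let resume = if end < keys.len() { page.last().copied() } else { None };
    (page, resume)
}
```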

2. Commented out aggregate_actor_active function

Lines 116-189 in metrics-aggregator/src/lib.rs contain a fully implemented but commented-out aggregate_actor_active function.

Recommendation: Either:

  • Remove the dead code if not needed
  • Add a TODO/FIXME comment explaining why it's disabled and when it will be enabled
  • Enable it if it's ready for production

Performance Considerations ⚡

1. Metrics aggregator scan frequency

The aggregator runs every 15 seconds with a 2.5 second timeout per transaction:

let mut interval = tokio::time::interval(Duration::from_secs(15));
const EARLY_TXN_TIMEOUT: Duration = Duration::from_millis(2500);

Concerns:

  • Large namespaces could have millions of pending actors - can they be scanned in 2.5s?
  • The aggregator does a full table scan every 15 seconds which could be expensive at scale
  • No backpressure mechanism if scans consistently timeout

Recommendations:

  • Add metrics to track scan duration and completion rate
  • Consider incremental/sampling strategies for very large datasets
  • Add alerts if scans consistently timeout

2. Metrics cardinality

Several metrics use namespace_id and runner_name as labels:

.with_label_values(&[&pending_actor_key.namespace_id.to_string(), &pending_actor_key.runner_name_selector])

Concern: High cardinality if there are many namespaces × runner names. UUID-based namespace_ids are particularly problematic for cardinality.

Recommendation:

  • Monitor cardinality in production
  • Consider aggregating metrics at a higher level if cardinality becomes problematic
  • Document expected cardinality in metrics descriptions

3. Metric reset pattern

The aggregator calls .reset() on gauges before scanning:

pegboard::metrics::ACTOR_PENDING_ALLOCATION.reset();

Issue: There's a brief window where metrics are 0 during the scan, which could trigger false alerts or create gaps in dashboards.

Better pattern: Build up a new set of values, then update atomically, or use a separate gauge that gets swapped.

Security Concerns 🔒

No significant security issues identified. The metrics don't expose sensitive data beyond aggregate counts and UUIDs.

Minor note: Runner keys and tokens are handled correctly (not logged or exposed in metrics).

Test Coverage 🧪

Missing:

  • No tests for the new metrics-aggregator package
  • No tests verifying metrics are incremented correctly
  • No tests for timeout/pagination behavior in aggregator

Recommendations:

  1. Add unit tests for parse_actor_path in guard routing (this has complex path parsing logic)
  2. Add integration tests for metrics-aggregator that verify:
    • Correct metric values after scanning
    • Timeout handling behavior
    • Resume behavior with last_key
  3. Add tests for metric increment calls in critical paths

Additional Observations

1. Version downgrade in Cargo.lock

The workspace version went from 2.0.37 to 2.0.33, which is unexpected for a feature addition.

Question: Is this intentional? This could indicate a rebase issue or merge conflict resolution.

2. Gasoline metrics additions

New operation metrics were added to gasoline:

  • OPERATION_PENDING, OPERATION_TOTAL, OPERATION_DURATION, OPERATION_ERRORS
  • CPU_USAGE, LOAD_SHEDDING_RATIO

These are excellent additions for observability but aren't mentioned in the PR description.

3. UniversalDB metrics

Added transaction tracking metrics:

  • TRANSACTION_TOTAL, TRANSACTION_PENDING

Good addition for monitoring database transaction behavior.

Summary

Approve with requested changes

This PR adds valuable observability to critical infrastructure components. The metrics implementation follows good practices overall, but there are several issues that should be addressed:

Must fix:

  1. Fix inconsistent label casing in guard routing metrics (guard/src/routing/mod.rs:127,144)
  2. Fix or document the timeout behavior in metrics-aggregator scans
  3. Resolve the version downgrade from 2.0.37 to 2.0.33

Should fix:
4. Remove or document the commented-out aggregate_actor_active function
5. Add test coverage for the new metrics-aggregator package
6. Consider performance implications of full scans every 15 seconds

Nice to have:
7. Standardize Cargo.toml formatting in a separate PR to reduce diff noise
8. Add metrics to track aggregator scan performance
9. Add cardinality monitoring for namespace_id × runner_name combinations

The core functionality is sound, but addressing the timeout handling and test coverage issues will make this more production-ready.

@MasterPtato force-pushed the 01-08-fix_clear_all_pending_signals_after_workflow_complete branch from 8041bfa to 01b117b on January 14, 2026 23:07
@MasterPtato force-pushed the 01-09-feat_add_metrics_to_pb_guard branch from ae1c40c to 3da23f7 on January 14, 2026 23:07
@MasterPtato force-pushed the 01-08-fix_clear_all_pending_signals_after_workflow_complete branch from 01b117b to 821194e on January 14, 2026 23:39
@MasterPtato force-pushed the 01-09-feat_add_metrics_to_pb_guard branch from 3da23f7 to aea6392 on January 14, 2026 23:39
@graphite-app
Contributor

graphite-app bot commented Jan 14, 2026

Merge activity

  • Jan 14, 11:40 PM UTC: MasterPtato added this pull request to the Graphite merge queue.
  • Jan 14, 11:41 PM UTC: CI is running for this pull request on a draft pull request (#3908) due to your merge queue CI optimization settings.
  • Jan 14, 11:42 PM UTC: Merged by the Graphite merge queue via draft PR: #3908.

graphite-app bot pushed a commit that referenced this pull request Jan 14, 2026
@graphite-app bot closed this on Jan 14, 2026
@graphite-app bot deleted the 01-09-feat_add_metrics_to_pb_guard branch on January 14, 2026 23:42