perf(erpc:PLA-1058): reduce metric cardinality #59

Draft
0x666c6f wants to merge 1 commit into morpho-main from feature/pla-1058-audit-erpc-metric-cardinality-and-histogram-budget

0x666c6f (Collaborator) commented Apr 2, 2026

Summary

  • Audit and reduce Prometheus series pressure from eRPC latency histograms and cache error labels.
  • Coarsen erpc_network_request_duration_seconds from project,network,vendor,upstream,category,finality,user to project,network,category,finality,outcome.
  • Drop user from erpc_upstream_request_duration_seconds; normalize cache get/set error labels to ErrorFingerprint(err).
  • Update bundled Grafana and Datadog queries, monitoring docs, and add an audit note plus regression tests.
  • Validated with go test ./telemetry ./health ./architecture/evm ./erpc ./upstream -run 'TestUpstreamRequestDurationOmitsUserLabel|TestNetworkRequestDurationUsesOutcomeInsteadOfHighCardinalityLabels', make build, pnpm build, and jq empty monitoring/grafana/dashboards/erpc.json monitoring/datadog/dashboard.json. Broader make test-fast remains blocked locally by unrelated untracked paths in the worktree (sqd-simplify/, auth/authorizer_test.go).

Changes

  • Emit outcome=success|cache|error for network latency instead of vendor, upstream, and user.
  • Keep detailed counters as-is; only coarsen the highest-cost histograms.
  • Switch cache error metrics from ErrorSummary(err) to ErrorFingerprint(err).
  • Refresh bundled dashboards and monitoring docs for the new histogram shapes.
  • Add monitoring/cardinality-audit-2026-04.md with prod snapshot, expected impact, and follow-up tickets PLA-1064 and PLA-1065.
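
The outcome label in the list above folds three high-cardinality labels into a single three-valued one. A minimal sketch of that classification follows, assuming a hypothetical helper `classifyOutcome` with hypothetical inputs `err` and `servedFromCache`. The actual logic in erpc/projects.go is not shown in this description and may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// errUpstreamTimeout is a sample error used only for this illustration.
var errUpstreamTimeout = errors.New("upstream timeout")

// classifyOutcome is a hypothetical helper, not taken from erpc/projects.go.
// It maps a finished request onto the three-valued outcome label.
func classifyOutcome(err error, servedFromCache bool) string {
	switch {
	case err != nil:
		return "error"
	case servedFromCache:
		return "cache"
	default:
		return "success"
	}
}

func main() {
	fmt.Println(classifyOutcome(nil, false))
	fmt.Println(classifyOutcome(nil, true))
	fmt.Println(classifyOutcome(errUpstreamTimeout, false))
}
```

Because the label can only take three values, it multiplies the series count by at most three, however many vendors, upstreams, or users are active.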

Metrics Diff Summary

  • erpc_network_request_duration_seconds: project,network,vendor,upstream,category,finality,user -> project,network,category,finality,outcome.
  • erpc_upstream_request_duration_seconds: project,vendor,network,upstream,category,composite,finality,user -> project,vendor,network,upstream,category,composite,finality.
  • erpc_cache_get_error_duration_seconds / erpc_cache_set_error_duration_seconds: ErrorSummary(err) -> ErrorFingerprint(err).
  • Prod incident snapshot behind this change: about 328k active series overall; top app-side families included erpc_upstream_request_duration_seconds_bucket at about 51k series and erpc_network_request_duration_seconds_bucket at about 41k.
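
To see why dropping labels from a histogram matters more than dropping them from a counter, the worst-case series count can be sketched as the product of per-label value counts times the number of bucket bounds. The label value counts and bucket count below are illustrative assumptions, not figures from the prod snapshot, and the helper name `bucketSeries` is hypothetical.

```go
package main

import "fmt"

// bucketSeries returns the worst-case number of _bucket series for a classic
// Prometheus histogram: one series per label combination per bucket bound,
// plus the implicit +Inf bucket.
func bucketSeries(labelValues map[string]int, explicitBuckets int) int {
	combos := 1
	for _, n := range labelValues {
		combos *= n
	}
	return combos * (explicitBuckets + 1)
}

func main() {
	// Illustrative counts only; NOT measurements from the prod snapshot.
	before := map[string]int{
		"project": 2, "network": 5, "vendor": 4, "upstream": 8,
		"category": 10, "finality": 4, "user": 3,
	}
	after := map[string]int{
		"project": 2, "network": 5, "category": 10, "finality": 4, "outcome": 3,
	}
	const buckets = 12 // hypothetical number of explicit bucket bounds

	fmt.Println("before:", bucketSeries(before, buckets))
	fmt.Println("after: ", bucketSeries(after, buckets))
}
```

This is an upper bound: real series counts are lower because not every label combination occurs (for example, each upstream belongs to a single vendor).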

0x666c6f self-assigned this Apr 2, 2026
Copilot AI review requested due to automatic review settings April 2, 2026 13:11

Copilot AI left a comment

Pull request overview

Reduces Prometheus histogram series cardinality in eRPC by coarsening high-cost latency histograms and stabilizing cache error labels, with accompanying dashboard/doc/test updates.

Changes:

  • Coarsen erpc_network_request_duration_seconds labels to project,network,category,finality,outcome and remove user from erpc_upstream_request_duration_seconds.
  • Normalize cache get/set error label values to ErrorFingerprint(err) to reduce churn.
  • Update Grafana/Datadog dashboards, add a cardinality audit note, and add regression tests for histogram label schemas.
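
The idea behind a fingerprint-style error label is that variable parts of a message (block numbers, hashes) should not each create a new label value. The sketch below illustrates that idea only. `stableErrorLabel` is a hypothetical stand-in and is not the implementation of eRPC's ErrorFingerprint, which is not shown in this PR.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

var (
	hexRe = regexp.MustCompile(`0x[0-9a-fA-F]+`)
	numRe = regexp.MustCompile(`[0-9]+`)

	// Sample errors used only for this illustration.
	errSampleA = errors.New("block 12345 not found for 0xabc123")
	errSampleB = errors.New("block 99 not found for 0xdef")
)

// stableErrorLabel is a hypothetical normalizer: it replaces hex strings and
// decimal numbers with placeholders so similar errors share one label value.
func stableErrorLabel(err error) string {
	if err == nil {
		return ""
	}
	s := hexRe.ReplaceAllString(err.Error(), "<hex>")
	return numRe.ReplaceAllString(s, "<n>")
}

func main() {
	fmt.Println(stableErrorLabel(errSampleA))
	fmt.Println(stableErrorLabel(errSampleB))
}
```

Both sample errors collapse to the same label value, so the metric gains one series for this failure mode rather than one per block or hash.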

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 11 comments.

| File | Description |
| --- | --- |
| telemetry/metrics.go | Adjusts histogram label sets and help text to reduce cardinality. |
| telemetry/metrics_cardinality_test.go | Adds regression tests asserting the new histogram label schemas. |
| health/tracker.go | Updates upstream duration observer caching to match the new upstream histogram label set. |
| health/tracker_benchmark_test.go | Updates benchmark label binding to match the new upstream histogram label set. |
| erpc/projects.go | Emits `outcome=success\|cache\|error` for network latency. |
| architecture/evm/json_rpc_cache.go | Switches cache error label values from ErrorSummary to ErrorFingerprint. |
| monitoring/grafana/dashboards/erpc.json | Updates PromQL queries to match the new histogram labels. |
| monitoring/datadog/dashboard.json | Updates PromQL queries to match the new histogram labels. |
| docs/pages/operation/monitoring.mdx | Documents the intentional histogram coarsening/guardrails. |
| monitoring/cardinality-audit-2026-04.md | Adds an audit note describing the prod snapshot and expected impact. |
Comments suppressed due to low confidence (1)

health/tracker.go:218

  • getUpstreamRequestDurationObserver still takes userId but no longer uses it after the user label was dropped from MetricUpstreamRequestDuration. Go accepts unused function parameters, so this still compiles, but the dead parameter is misleading. Remove it (and update call sites), or rename it to _ to keep the signature temporarily.
func (t *Tracker) getUpstreamRequestDurationObserver(up common.Upstream, method, composite string, finality common.DataFinalityState, userId string) prometheus.Observer {
	key := urdoKey{
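
For context on the unused-parameter point: the Go compiler rejects unused local variables and imports, but not unused function parameters. A minimal sketch with hypothetical names (`observerKey`, `observerFor`, `observerForBlank`, and the sample argument values are illustrative and not from health/tracker.go) shows that both forms compile and behave identically:

```go
package main

import "fmt"

// observerKey is a hypothetical cache key without a user dimension.
type observerKey struct {
	upstream string
	method   string
}

// observerFor accepts userID but never reads it. Go compiles this as-is.
func observerFor(upstream, method, userID string) observerKey {
	return observerKey{upstream: upstream, method: method}
}

// observerForBlank keeps the same signature shape but marks the parameter
// as intentionally ignored with the blank identifier.
func observerForBlank(upstream, method, _ string) observerKey {
	return observerKey{upstream: upstream, method: method}
}

func main() {
	a := observerFor("upstream-1", "eth_call", "user-42")
	b := observerForBlank("upstream-1", "eth_call", "user-42")
	fmt.Println(a == b)
}
```

Renaming to `_` documents the intent without touching call sites; removing the parameter is cleaner once callers can be updated.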

Comment on lines +271 to +290
telemetry.MetricCacheGetErrorTotal.WithLabelValues(
c.projectId,
req.NetworkLabel(),
rpcReq.Method,
connector.Id(),
policy.String(),
policy.GetTTL().String(),
errorLabel,
req.UserId(),
).Inc()
telemetry.MetricCacheGetErrorDuration.WithLabelValues(
c.projectId,
req.NetworkLabel(),
rpcReq.Method,
connector.Id(),
policy.String(),
policy.GetTTL().String(),
errorLabel,
req.UserId(),
).Observe(time.Since(start).Seconds())

Copilot AI Apr 2, 2026

The telemetry.MetricCacheGetError* calls in this error branch are indented one level deeper than the surrounding statements (likely from a copy/paste). Running gofmt (or aligning indentation) would prevent noisy diffs and improve readability.

Suggested change
telemetry.MetricCacheGetErrorTotal.WithLabelValues(
c.projectId,
req.NetworkLabel(),
rpcReq.Method,
connector.Id(),
policy.String(),
policy.GetTTL().String(),
errorLabel,
req.UserId(),
).Inc()
telemetry.MetricCacheGetErrorDuration.WithLabelValues(
c.projectId,
req.NetworkLabel(),
rpcReq.Method,
connector.Id(),
policy.String(),
policy.GetTTL().String(),
errorLabel,
req.UserId(),
).Observe(time.Since(start).Seconds())

Copilot uses AI. Check for mistakes.
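
The indentation fix requested above is mechanical. As a small illustration, the standard-library `go/format` package applies gofmt-style formatting to a source string. The snippet being formatted (`overIndented`) and the helper name `formatGo` are made up for this sketch and are not code from the PR.

```go
package main

import (
	"fmt"
	"go/format"
)

// formatGo applies standard Go formatting to a source string.
func formatGo(src string) (string, error) {
	out, err := format.Source([]byte(src))
	if err != nil {
		return "", err
	}
	return string(out), nil
}

// overIndented is a made-up snippet whose statement is indented deeper than
// its enclosing block, similar in spirit to the flagged metric calls.
const overIndented = "package demo\n\nfunc f() {\n\t\t\tprintln(\"x\")\n}\n"

func main() {
	out, err := formatGo(overIndented)
	if err != nil {
		fmt.Println("format error:", err)
		return
	}
	fmt.Print(out)
}
```

In practice, running `gofmt -w` on the touched file (or formatting on save) achieves the same result and avoids whitespace-only diffs later.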
Comment on lines +584 to +603
telemetry.MetricCacheSetErrorTotal.WithLabelValues(
c.projectId,
req.NetworkLabel(),
rpcReq.Method,
connector.Id(),
policy.String(),
ttl.String(),
errorLabel,
req.UserId(),
).Inc()
telemetry.MetricCacheSetErrorDuration.WithLabelValues(
c.projectId,
req.NetworkLabel(),
rpcReq.Method,
connector.Id(),
policy.String(),
ttl.String(),
errorLabel,
req.UserId(),
).Observe(time.Since(start).Seconds())

Copilot AI Apr 2, 2026

In this goroutine branch, the telemetry.MetricCacheSetError* calls appear to have an extra indentation level relative to errorLabel := ... and the surrounding control flow. Please align indentation (gofmt) to keep the block readable and avoid future gofmt-only diffs.

Suggested change
telemetry.MetricCacheSetErrorTotal.WithLabelValues(
c.projectId,
req.NetworkLabel(),
rpcReq.Method,
connector.Id(),
policy.String(),
ttl.String(),
errorLabel,
req.UserId(),
).Inc()
telemetry.MetricCacheSetErrorDuration.WithLabelValues(
c.projectId,
req.NetworkLabel(),
rpcReq.Method,
connector.Id(),
policy.String(),
ttl.String(),
errorLabel,
req.UserId(),
).Observe(time.Since(start).Seconds())

Comment on lines +688 to +707
telemetry.MetricCacheSetErrorTotal.WithLabelValues(
c.projectId,
req.NetworkLabel(),
rpcReq.Method,
connector.Id(),
policy.String(),
ttl.String(),
errorLabel,
req.UserId(),
).Inc()
telemetry.MetricCacheSetErrorDuration.WithLabelValues(
c.projectId,
req.NetworkLabel(),
rpcReq.Method,
connector.Id(),
policy.String(),
ttl.String(),
errorLabel,
req.UserId(),
).Observe(time.Since(start).Seconds())

Copilot AI Apr 2, 2026

Similar to the earlier cache-set error path, these telemetry.MetricCacheSetError* calls are indented deeper than the enclosing if err != nil block, which makes the flow harder to read. Align indentation (gofmt) so the metric emission clearly sits within the error branch.

Suggested change
telemetry.MetricCacheSetErrorTotal.WithLabelValues(
c.projectId,
req.NetworkLabel(),
rpcReq.Method,
connector.Id(),
policy.String(),
ttl.String(),
errorLabel,
req.UserId(),
).Inc()
telemetry.MetricCacheSetErrorDuration.WithLabelValues(
c.projectId,
req.NetworkLabel(),
rpcReq.Method,
connector.Id(),
policy.String(),
ttl.String(),
errorLabel,
req.UserId(),
).Observe(time.Since(start).Seconds())


Use counters such as `erpc_network_successful_request_total`, `erpc_network_failed_request_total`, and `erpc_upstream_request_total` for high-cardinality drill-downs, and keep `scoreMetricsMode: compact` unless you explicitly need per-method routing telemetry.

Refer to [erpc/docker-compose.yml](https://github.com/erpc/erpc/blob/main/docker-compose.yml#L4-L17) and [erpc/monitoring](https://github.com/erpc/erpc/tree/main/monitoring) for ready-made templates to bring up montoring.

Copilot AI Apr 2, 2026

Typo in this sentence: "montoring" should be "monitoring".

Suggested change
Refer to [erpc/docker-compose.yml](https://github.com/erpc/erpc/blob/main/docker-compose.yml#L4-L17) and [erpc/monitoring](https://github.com/erpc/erpc/tree/main/monitoring) for ready-made templates to bring up montoring.
Refer to [erpc/docker-compose.yml](https://github.com/erpc/erpc/blob/main/docker-compose.yml#L4-L17) and [erpc/monitoring](https://github.com/erpc/erpc/tree/main/monitoring) for ready-made templates to bring up monitoring.

},
"editorMode": "code",
"expr": "histogram_quantile(0.99, sum(rate(erpc_network_request_duration_seconds_bucket{network=~\"${network:regex}\",upstream=~\"${upstream:regex}\",category=~\"${category:regex}\",vendor=~\"${vendor:regex}\",vendor!=\"<error>\",project=~\"${project:regex}\",finality=~\"${finality:regex}\",user=~\"${user:regex}\"}[1m])) by (le, network, category, finality))",
"expr": "histogram_quantile(0.99, sum(rate(erpc_network_request_duration_seconds_bucket{\n cluster=~\"${cluster:regex}\",\n region=~\"${region:regex}\",\n network=~\"${network:regex}\",\n category=~\"${category:regex}\",\n project=~\"${project:regex}\",\n finality=~\"${finality:regex}\",\n outcome=~\"success|cache\"\n}[1m])) by (le, network, category, finality))",

Copilot AI Apr 2, 2026

This query references ${cluster:regex} and ${region:regex}, but the dashboard templating variables are cluster_key and client_name (and there is no cluster/region variable defined). As-is, Grafana will not interpolate these and the panel query will fail. Either add corresponding variables to templating.list, or switch these matchers to the existing variables (e.g., cluster_key=~"${cluster_key:regex}") / remove them if not needed.

Suggested change
"expr": "histogram_quantile(0.99, sum(rate(erpc_network_request_duration_seconds_bucket{\n cluster=~\"${cluster:regex}\",\n region=~\"${region:regex}\",\n network=~\"${network:regex}\",\n category=~\"${category:regex}\",\n project=~\"${project:regex}\",\n finality=~\"${finality:regex}\",\n outcome=~\"success|cache\"\n}[1m])) by (le, network, category, finality))",
"expr": "histogram_quantile(0.99, sum(rate(erpc_network_request_duration_seconds_bucket{\n network=~\"${network:regex}\",\n category=~\"${category:regex}\",\n project=~\"${project:regex}\",\n finality=~\"${finality:regex}\",\n outcome=~\"success|cache\"\n}[1m])) by (le, network, category, finality))",

Copilot uses AI. Check for mistakes.
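
The undefined-variable problem flagged here can be caught before a dashboard ships. The sketch below extracts `${name}` and `${name:format}` references from a query string and reports those missing from a list of defined template variables. The helper name `undefinedVars`, the sample query, and the list of defined variables are assumptions for illustration; only `cluster_key` and `client_name` are named in the review comment.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// varRe matches references such as ${network:regex} or ${project}.
var varRe = regexp.MustCompile(`\$\{([A-Za-z_][A-Za-z0-9_]*)(?::[A-Za-z]+)?\}`)

// undefinedVars returns the sorted, de-duplicated variable names referenced
// in expr that do not appear in defined.
func undefinedVars(expr string, defined []string) []string {
	known := make(map[string]bool, len(defined))
	for _, d := range defined {
		known[d] = true
	}
	seen := make(map[string]bool)
	var missing []string
	for _, m := range varRe.FindAllStringSubmatch(expr, -1) {
		name := m[1]
		if !known[name] && !seen[name] {
			seen[name] = true
			missing = append(missing, name)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Shortened sample query; the defined list is an assumption.
	expr := `m_bucket{cluster=~"${cluster:regex}",region=~"${region:regex}",network=~"${network:regex}"}`
	defined := []string{"cluster_key", "client_name", "network"}
	fmt.Println(undefinedVars(expr, defined))
}
```

A check like this could run over every `expr` in the dashboard JSON as part of CI, alongside the existing `jq empty` validation.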
},
"editorMode": "code",
"expr": "histogram_quantile(0.5, sum(rate(erpc_network_request_duration_seconds_bucket{network=~\"${network:regex}\",upstream=~\"${upstream:regex}\",category=~\"${category:regex}\",vendor=~\"${vendor:regex}\",vendor!=\"<error>\",project=~\"${project:regex}\",finality=~\"${finality:regex}\",user=~\"${user:regex}\"}[1m])) by (le, network, category, finality))",
"expr": "histogram_quantile(0.5, sum(rate(erpc_network_request_duration_seconds_bucket{\n cluster=~\"${cluster:regex}\",\n region=~\"${region:regex}\",\n network=~\"${network:regex}\",\n category=~\"${category:regex}\",\n project=~\"${project:regex}\",\n finality=~\"${finality:regex}\",\n outcome=~\"success|cache\"\n}[1m])) by (le, network, category, finality))",

Copilot AI Apr 2, 2026

This query includes ${cluster:regex} / ${region:regex}, but those variables aren't defined in templating.list (only cluster_key exists). Update the matchers or add the missing variables; otherwise the panel query will fail.

Suggested change
"expr": "histogram_quantile(0.5, sum(rate(erpc_network_request_duration_seconds_bucket{\n cluster=~\"${cluster:regex}\",\n region=~\"${region:regex}\",\n network=~\"${network:regex}\",\n category=~\"${category:regex}\",\n project=~\"${project:regex}\",\n finality=~\"${finality:regex}\",\n outcome=~\"success|cache\"\n}[1m])) by (le, network, category, finality))",
"expr": "histogram_quantile(0.5, sum(rate(erpc_network_request_duration_seconds_bucket{\n network=~\"${network:regex}\",\n category=~\"${category:regex}\",\n project=~\"${project:regex}\",\n finality=~\"${finality:regex}\",\n outcome=~\"success|cache\"\n}[1m])) by (le, network, category, finality))",

},
"editorMode": "code",
"expr": "histogram_quantile(0.99, sum(rate(erpc_network_request_duration_seconds_bucket{network=~\"${network:regex}\",upstream=~\"${upstream:regex}\",category=~\"${category:regex}\",vendor=\"<error>\",project=~\"${project:regex}\",finality=~\"${finality:regex}\",user=~\"${user:regex}\"}[1m])) by (le, network, category, finality, vendor))",
"expr": "histogram_quantile(0.99, sum(rate(erpc_network_request_duration_seconds_bucket{\n cluster=~\"${cluster:regex}\",\n region=~\"${region:regex}\",\n network=~\"${network:regex}\",\n category=~\"${category:regex}\",\n project=~\"${project:regex}\",\n finality=~\"${finality:regex}\",\n outcome=\"error\"\n}[1m])) by (le, network, category, finality))",

Copilot AI Apr 2, 2026

This error-latency query uses ${cluster:regex} and ${region:regex}, which are not defined templating variables in this dashboard. Use the existing cluster_key variable (and add a region variable if desired) to avoid a broken query.

Suggested change
"expr": "histogram_quantile(0.99, sum(rate(erpc_network_request_duration_seconds_bucket{\n cluster=~\"${cluster:regex}\",\n region=~\"${region:regex}\",\n network=~\"${network:regex}\",\n category=~\"${category:regex}\",\n project=~\"${project:regex}\",\n finality=~\"${finality:regex}\",\n outcome=\"error\"\n}[1m])) by (le, network, category, finality))",
"expr": "histogram_quantile(0.99, sum(rate(erpc_network_request_duration_seconds_bucket{\n cluster=~\"${cluster_key:regex}\",\n network=~\"${network:regex}\",\n category=~\"${category:regex}\",\n project=~\"${project:regex}\",\n finality=~\"${finality:regex}\",\n outcome=\"error\"\n}[1m])) by (le, network, category, finality))",

"editorMode": "code",
"expr": "histogram_quantile(0.9, sum(rate(erpc_network_request_duration_seconds_bucket{network=~\"${network:regex}\",upstream=~\"${upstream:regex}\",category=~\"${category:regex}\",vendor=\"<error>\",project=~\"${project:regex}\",finality=~\"${finality:regex}\",user=~\"${user:regex}\"}[1m])) by (le, network, category, finality, vendor))",
"legendFormat": "{{network}}, {{category}}, {{vendor}}, {{finality}}",
"expr": "histogram_quantile(0.9, sum(rate(erpc_network_request_duration_seconds_bucket{\n cluster=~\"${cluster:regex}\",\n region=~\"${region:regex}\",\n network=~\"${network:regex}\",\n category=~\"${category:regex}\",\n project=~\"${project:regex}\",\n finality=~\"${finality:regex}\",\n outcome=\"error\"\n}[1m])) by (le, network, category, finality))",

Copilot AI Apr 2, 2026

This query references ${cluster:regex} / ${region:regex} but those variables don't exist in the dashboard templating configuration. Please replace with the existing variable names or add the missing template variables.

Suggested change
"expr": "histogram_quantile(0.9, sum(rate(erpc_network_request_duration_seconds_bucket{\n cluster=~\"${cluster:regex}\",\n region=~\"${region:regex}\",\n network=~\"${network:regex}\",\n category=~\"${category:regex}\",\n project=~\"${project:regex}\",\n finality=~\"${finality:regex}\",\n outcome=\"error\"\n}[1m])) by (le, network, category, finality))",
"expr": "histogram_quantile(0.9, sum(rate(erpc_network_request_duration_seconds_bucket{\n outcome=\"error\"\n}[1m])) by (le, network, category, finality))",

},
"editorMode": "code",
"expr": "histogram_quantile(0.5, sum(rate(erpc_network_request_duration_seconds_bucket{network=~\"${network:regex}\",upstream=~\"${upstream:regex}\",category=~\"${category:regex}\",vendor=\"<error>\",project=~\"${project:regex}\",finality=~\"${finality:regex}\"}[1m])) by (le, network, category, finality, vendor))",
"expr": "histogram_quantile(0.5, sum(rate(erpc_network_request_duration_seconds_bucket{\n cluster=~\"${cluster:regex}\",\n region=~\"${region:regex}\",\n network=~\"${network:regex}\",\n category=~\"${category:regex}\",\n project=~\"${project:regex}\",\n finality=~\"${finality:regex}\",\n outcome=\"error\"\n}[1m])) by (le, network, category, finality))",

Copilot AI Apr 2, 2026

This query also references ${cluster:regex} and ${region:regex} without corresponding dashboard variables. Update to use existing variables (e.g., cluster_key) or add the missing variables so the panel returns data.

Suggested change
"expr": "histogram_quantile(0.5, sum(rate(erpc_network_request_duration_seconds_bucket{\n cluster=~\"${cluster:regex}\",\n region=~\"${region:regex}\",\n network=~\"${network:regex}\",\n category=~\"${category:regex}\",\n project=~\"${project:regex}\",\n finality=~\"${finality:regex}\",\n outcome=\"error\"\n}[1m])) by (le, network, category, finality))",
"expr": "histogram_quantile(0.5, sum(rate(erpc_network_request_duration_seconds_bucket{\n cluster=~\"${cluster_key}\",\n region=~\"${region}\",\n network=~\"${network:regex}\",\n category=~\"${category:regex}\",\n project=~\"${project:regex}\",\n finality=~\"${finality:regex}\",\n outcome=\"error\"\n}[1m])) by (le, network, category, finality))",

},
"editorMode": "code",
"expr": "histogram_quantile(0.99, sum(rate(erpc_upstream_request_duration_seconds_bucket{network=~\"${network:regex}\",vendor=~\"${vendor:regex}\",category=~\"${category:regex}\",composite=\"none\",project=~\"${project:regex}\",vendor=~\"${vendor:regex}\"}[1m])) by (le, project, vendor, category))",
"expr": "histogram_quantile(0.99, sum(rate(erpc_upstream_request_duration_seconds_bucket{\n region=~\"${region:regex}\",\n cluster=~\"${cluster:regex}\",\n network=~\"${network:regex}\",\n vendor=~\"${vendor:regex}\",\n category=~\"${category:regex}\",\n composite=\"none\",\n project=~\"${project:regex}\"\n}[1m])) by (le, project, vendor, category))",

Copilot AI Apr 2, 2026

This upstream latency query filters on ${region:regex} and ${cluster:regex}, but the dashboard doesn't define region or cluster template variables (it defines cluster_key). As written, the query will not interpolate correctly. Use existing variable names / add the missing template variables.

Suggested change
"expr": "histogram_quantile(0.99, sum(rate(erpc_upstream_request_duration_seconds_bucket{\n region=~\"${region:regex}\",\n cluster=~\"${cluster:regex}\",\n network=~\"${network:regex}\",\n vendor=~\"${vendor:regex}\",\n category=~\"${category:regex}\",\n composite=\"none\",\n project=~\"${project:regex}\"\n}[1m])) by (le, project, vendor, category))",
"expr": "histogram_quantile(0.99, sum(rate(erpc_upstream_request_duration_seconds_bucket{\n region=~\"${region:regex}\",\n cluster=~\"${cluster_key:regex}\",\n network=~\"${network:regex}\",\n vendor=~\"${vendor:regex}\",\n category=~\"${category:regex}\",\n composite=\"none\",\n project=~\"${project:regex}\"\n}[1m])) by (le, project, vendor, category))",
