perf(erpc:PLA-1058): reduce metric cardinality #59
0x666c6f wants to merge 1 commit into morpho-main
Pull request overview
Reduces Prometheus histogram series cardinality in eRPC by coarsening high-cost latency histograms and stabilizing cache error labels, with accompanying dashboard/doc/test updates.
Changes:
- Coarsen `erpc_network_request_duration_seconds` labels to `project,network,category,finality,outcome` and remove `user` from `erpc_upstream_request_duration_seconds`.
- Normalize cache get/set error label values to `ErrorFingerprint(err)` to reduce churn.
- Update Grafana/Datadog dashboards, add a cardinality audit note, and add regression tests for histogram label schemas.
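To make the effect of the coarsening concrete, here is a rough back-of-the-envelope sketch. It estimates an upper bound on histogram series as the product of distinct label values times the series each label combination emits (one per finite bucket, plus `+Inf`, `_sum`, and `_count`). The per-label value counts and the bucket count are hypothetical placeholders, not figures from this PR or from production.

```go
package main

import "fmt"

// seriesCount returns an upper bound on the number of time series a
// Prometheus histogram can produce: every combination of label values
// emits one series per finite bucket, plus +Inf, _sum and _count.
func seriesCount(labelValues map[string]int, finiteBuckets int) int {
	combos := 1
	for _, n := range labelValues {
		combos *= n
	}
	return combos * (finiteBuckets + 3)
}

func main() {
	// Hypothetical distinct-value counts per label (assumptions for illustration).
	before := map[string]int{
		"project": 2, "network": 10, "vendor": 5, "upstream": 20,
		"category": 8, "finality": 4, "user": 50,
	}
	after := map[string]int{
		"project": 2, "network": 10, "category": 8, "finality": 4, "outcome": 3,
	}
	const buckets = 12 // assumed number of finite buckets

	fmt.Println("before (upper bound):", seriesCount(before, buckets))
	fmt.Println("after  (upper bound):", seriesCount(after, buckets))
}
```

Real series counts are usually far below this bound because most label combinations never occur, but the multiplicative effect of labels such as `user` and `upstream` is what the coarsening removes.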
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| telemetry/metrics.go | Adjusts histogram label sets and help text to reduce cardinality. |
| telemetry/metrics_cardinality_test.go | Adds regression tests asserting the new histogram label schemas. |
| health/tracker.go | Updates upstream duration observer caching to match the new upstream histogram label set. |
| health/tracker_benchmark_test.go | Updates benchmark label binding to match the new upstream histogram label set. |
| erpc/projects.go | Emits `outcome=success\|cache\|error` for network latency. |
| architecture/evm/json_rpc_cache.go | Switches cache error label values from ErrorSummary to ErrorFingerprint. |
| monitoring/grafana/dashboards/erpc.json | Updates PromQL queries to match the new histogram labels. |
| monitoring/datadog/dashboard.json | Updates PromQL queries to match the new histogram labels. |
| docs/pages/operation/monitoring.mdx | Documents the intentional histogram coarsening/guardrails. |
| monitoring/cardinality-audit-2026-04.md | Adds an audit note describing the prod snapshot and expected impact. |
Comments suppressed due to low confidence (1)
health/tracker.go:218
`getUpstreamRequestDurationObserver` still takes `userId` but no longer uses it after dropping the `user` label from `MetricUpstreamRequestDuration`. Go does not reject unused function parameters, so this still compiles, but the dead parameter is misleading. Remove the parameter (and update call sites), or rename it to `_` if you want to keep the signature temporarily.
func (t *Tracker) getUpstreamRequestDurationObserver(up common.Upstream, method, composite string, finality common.DataFinalityState, userId string) prometheus.Observer {
key := urdoKey{
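On the compile concern raised for this signature: the Go compiler reports unused local variables and unused imports, but not unused function parameters. The following minimal sketch shows both options from the comment. `observerKey` and `observerKeyBlank` are hypothetical stand-ins, not functions from `health/tracker.go`.

```go
package main

import "fmt"

// observerKey ignores userID. This compiles: unused function parameters
// are legal in Go, unlike unused local variables or imports.
func observerKey(method, composite, userID string) string {
	return method + "|" + composite
}

// observerKeyBlank keeps the same arity for existing call sites but uses
// the blank identifier to make it explicit that the argument is ignored.
func observerKeyBlank(method, composite, _ string) string {
	return method + "|" + composite
}

func main() {
	fmt.Println(observerKey("eth_call", "none", "user-1"))
	fmt.Println(observerKeyBlank("eth_call", "none", "user-2"))
}
```

Removing the parameter outright is the cleaner end state; the blank identifier is only useful as a transitional step while call sites are updated.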
telemetry.MetricCacheGetErrorTotal.WithLabelValues(
	c.projectId,
	req.NetworkLabel(),
	rpcReq.Method,
	connector.Id(),
	policy.String(),
	policy.GetTTL().String(),
	errorLabel,
	req.UserId(),
).Inc()
telemetry.MetricCacheGetErrorDuration.WithLabelValues(
	c.projectId,
	req.NetworkLabel(),
	rpcReq.Method,
	connector.Id(),
	policy.String(),
	policy.GetTTL().String(),
	errorLabel,
	req.UserId(),
).Observe(time.Since(start).Seconds())
The telemetry.MetricCacheGetError* calls in this error branch are indented one level deeper than the surrounding statements (likely from a copy/paste). Running gofmt (or aligning indentation) would prevent noisy diffs and improve readability.
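For reference, the indentation fix requested here is exactly what `gofmt` does mechanically. The sketch below uses the standard library's `go/format` package, which applies the same formatting as `gofmt`, on a tiny made-up source file with an over-indented statement. The sample source is an assumption for illustration, not code from this PR.

```go
package main

import (
	"fmt"
	"go/format"
)

// gofmtSource formats a complete Go source file the way gofmt would.
func gofmtSource(src string) (string, error) {
	out, err := format.Source([]byte(src))
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	// The println call is indented two levels too deep.
	src := "package p\n\nfunc f() {\n\t\t\tprintln(\"x\")\n}\n"

	out, err := gofmtSource(src)
	if err != nil {
		fmt.Println("format error:", err)
		return
	}
	fmt.Print(out)
	fmt.Println("changed:", out != src)
}
```

Running `gofmt -l` over the changed files in CI is a common way to catch this class of diff before review.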
telemetry.MetricCacheSetErrorTotal.WithLabelValues(
	c.projectId,
	req.NetworkLabel(),
	rpcReq.Method,
	connector.Id(),
	policy.String(),
	ttl.String(),
	errorLabel,
	req.UserId(),
).Inc()
telemetry.MetricCacheSetErrorDuration.WithLabelValues(
	c.projectId,
	req.NetworkLabel(),
	rpcReq.Method,
	connector.Id(),
	policy.String(),
	ttl.String(),
	errorLabel,
	req.UserId(),
).Observe(time.Since(start).Seconds())
In this goroutine branch, the telemetry.MetricCacheSetError* calls appear to have an extra indentation level relative to errorLabel := ... and the surrounding control flow. Please align indentation (gofmt) to keep the block readable and avoid future gofmt-only diffs.
telemetry.MetricCacheSetErrorTotal.WithLabelValues(
	c.projectId,
	req.NetworkLabel(),
	rpcReq.Method,
	connector.Id(),
	policy.String(),
	ttl.String(),
	errorLabel,
	req.UserId(),
).Inc()
telemetry.MetricCacheSetErrorDuration.WithLabelValues(
	c.projectId,
	req.NetworkLabel(),
	rpcReq.Method,
	connector.Id(),
	policy.String(),
	ttl.String(),
	errorLabel,
	req.UserId(),
).Observe(time.Since(start).Seconds())
Similar to the earlier cache-set error path, these telemetry.MetricCacheSetError* calls are indented deeper than the enclosing if err != nil block, which makes the flow harder to read. Align indentation (gofmt) so the metric emission clearly sits within the error branch.
Use counters such as `erpc_network_successful_request_total`, `erpc_network_failed_request_total`, and `erpc_upstream_request_total` for high-cardinality drill-downs, and keep `scoreMetricsMode: compact` unless you explicitly need per-method routing telemetry.

Refer to [erpc/docker-compose.yml](https://github.com/erpc/erpc/blob/main/docker-compose.yml#L4-L17) and [erpc/monitoring](https://github.com/erpc/erpc/tree/main/monitoring) for ready-made templates to bring up montoring.
Typo in this sentence: "montoring" should be "monitoring".
| Refer to [erpc/docker-compose.yml](https://github.com/erpc/erpc/blob/main/docker-compose.yml#L4-L17) and [erpc/monitoring](https://github.com/erpc/erpc/tree/main/monitoring) for ready-made templates to bring up montoring. | |
| Refer to [erpc/docker-compose.yml](https://github.com/erpc/erpc/blob/main/docker-compose.yml#L4-L17) and [erpc/monitoring](https://github.com/erpc/erpc/tree/main/monitoring) for ready-made templates to bring up monitoring. |
| }, | ||
| "editorMode": "code", | ||
| "expr": "histogram_quantile(0.99, sum(rate(erpc_network_request_duration_seconds_bucket{network=~\"${network:regex}\",upstream=~\"${upstream:regex}\",category=~\"${category:regex}\",vendor=~\"${vendor:regex}\",vendor!=\"<error>\",project=~\"${project:regex}\",finality=~\"${finality:regex}\",user=~\"${user:regex}\"}[1m])) by (le, network, category, finality))", | ||
| "expr": "histogram_quantile(0.99, sum(rate(erpc_network_request_duration_seconds_bucket{\n cluster=~\"${cluster:regex}\",\n region=~\"${region:regex}\",\n network=~\"${network:regex}\",\n category=~\"${category:regex}\",\n project=~\"${project:regex}\",\n finality=~\"${finality:regex}\",\n outcome=~\"success|cache\"\n}[1m])) by (le, network, category, finality))", |
This query references ${cluster:regex} and ${region:regex}, but the dashboard templating variables are cluster_key and client_name (and there is no cluster/region variable defined). As-is, Grafana will not interpolate these and the panel query will fail. Either add corresponding variables to templating.list, or switch these matchers to the existing variables (e.g., cluster_key=~"${cluster_key:regex}") / remove them if not needed.
| "expr": "histogram_quantile(0.99, sum(rate(erpc_network_request_duration_seconds_bucket{\n cluster=~\"${cluster:regex}\",\n region=~\"${region:regex}\",\n network=~\"${network:regex}\",\n category=~\"${category:regex}\",\n project=~\"${project:regex}\",\n finality=~\"${finality:regex}\",\n outcome=~\"success|cache\"\n}[1m])) by (le, network, category, finality))", | |
| "expr": "histogram_quantile(0.99, sum(rate(erpc_network_request_duration_seconds_bucket{\n network=~\"${network:regex}\",\n category=~\"${category:regex}\",\n project=~\"${project:regex}\",\n finality=~\"${finality:regex}\",\n outcome=~\"success|cache\"\n}[1m])) by (le, network, category, finality))", |
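A mismatch like this between query variables and `templating.list` can be caught mechanically. The sketch below scans a query string for Grafana's `${name}` and `${name:format}` references and reports names that are not in a given list of defined variables. It is an illustration under assumptions: `undefinedVars` is a hypothetical helper, the defined-variable list in `main` is made up apart from `cluster_key` and `client_name`, and the `$name` and `[[name]]` forms and Grafana's built-in globals (such as `${__interval}`) are not handled.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// varRef matches only the ${name} and ${name:format} variable forms.
var varRef = regexp.MustCompile(`\$\{(\w+)(?::\w+)?\}`)

// undefinedVars returns the sorted, de-duplicated variable names that are
// referenced in expr but missing from defined.
func undefinedVars(expr string, defined []string) []string {
	known := make(map[string]bool, len(defined))
	for _, d := range defined {
		known[d] = true
	}
	seen := make(map[string]bool)
	var missing []string
	for _, m := range varRef.FindAllStringSubmatch(expr, -1) {
		name := m[1]
		if !known[name] && !seen[name] {
			seen[name] = true
			missing = append(missing, name)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	expr := `erpc_network_request_duration_seconds_bucket{cluster=~"${cluster:regex}",region=~"${region:regex}",network=~"${network:regex}"}`
	// Assumed variable list; only cluster_key and client_name are named in the review.
	defined := []string{"cluster_key", "client_name", "network"}
	fmt.Println(undefinedVars(expr, defined))
}
```

A check of this shape could run over every `expr` in the dashboard JSON as part of the same validation step that already runs `jq empty`.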
| }, | ||
| "editorMode": "code", | ||
| "expr": "histogram_quantile(0.5, sum(rate(erpc_network_request_duration_seconds_bucket{network=~\"${network:regex}\",upstream=~\"${upstream:regex}\",category=~\"${category:regex}\",vendor=~\"${vendor:regex}\",vendor!=\"<error>\",project=~\"${project:regex}\",finality=~\"${finality:regex}\",user=~\"${user:regex}\"}[1m])) by (le, network, category, finality))", | ||
| "expr": "histogram_quantile(0.5, sum(rate(erpc_network_request_duration_seconds_bucket{\n cluster=~\"${cluster:regex}\",\n region=~\"${region:regex}\",\n network=~\"${network:regex}\",\n category=~\"${category:regex}\",\n project=~\"${project:regex}\",\n finality=~\"${finality:regex}\",\n outcome=~\"success|cache\"\n}[1m])) by (le, network, category, finality))", |
This query includes ${cluster:regex} / ${region:regex}, but those variables aren't defined in templating.list (only cluster_key exists). Update the matchers or add the missing variables; otherwise the panel query will fail.
| "expr": "histogram_quantile(0.5, sum(rate(erpc_network_request_duration_seconds_bucket{\n cluster=~\"${cluster:regex}\",\n region=~\"${region:regex}\",\n network=~\"${network:regex}\",\n category=~\"${category:regex}\",\n project=~\"${project:regex}\",\n finality=~\"${finality:regex}\",\n outcome=~\"success|cache\"\n}[1m])) by (le, network, category, finality))", | |
| "expr": "histogram_quantile(0.5, sum(rate(erpc_network_request_duration_seconds_bucket{\n network=~\"${network:regex}\",\n category=~\"${category:regex}\",\n project=~\"${project:regex}\",\n finality=~\"${finality:regex}\",\n outcome=~\"success|cache\"\n}[1m])) by (le, network, category, finality))", |
| }, | ||
| "editorMode": "code", | ||
| "expr": "histogram_quantile(0.99, sum(rate(erpc_network_request_duration_seconds_bucket{network=~\"${network:regex}\",upstream=~\"${upstream:regex}\",category=~\"${category:regex}\",vendor=\"<error>\",project=~\"${project:regex}\",finality=~\"${finality:regex}\",user=~\"${user:regex}\"}[1m])) by (le, network, category, finality, vendor))", | ||
| "expr": "histogram_quantile(0.99, sum(rate(erpc_network_request_duration_seconds_bucket{\n cluster=~\"${cluster:regex}\",\n region=~\"${region:regex}\",\n network=~\"${network:regex}\",\n category=~\"${category:regex}\",\n project=~\"${project:regex}\",\n finality=~\"${finality:regex}\",\n outcome=\"error\"\n}[1m])) by (le, network, category, finality))", |
This error-latency query uses ${cluster:regex} and ${region:regex}, which are not defined templating variables in this dashboard. Use the existing cluster_key variable (and add a region variable if desired) to avoid a broken query.
| "expr": "histogram_quantile(0.99, sum(rate(erpc_network_request_duration_seconds_bucket{\n cluster=~\"${cluster:regex}\",\n region=~\"${region:regex}\",\n network=~\"${network:regex}\",\n category=~\"${category:regex}\",\n project=~\"${project:regex}\",\n finality=~\"${finality:regex}\",\n outcome=\"error\"\n}[1m])) by (le, network, category, finality))", | |
| "expr": "histogram_quantile(0.99, sum(rate(erpc_network_request_duration_seconds_bucket{\n cluster=~\"${cluster_key:regex}\",\n network=~\"${network:regex}\",\n category=~\"${category:regex}\",\n project=~\"${project:regex}\",\n finality=~\"${finality:regex}\",\n outcome=\"error\"\n}[1m])) by (le, network, category, finality))", |
| "editorMode": "code", | ||
| "expr": "histogram_quantile(0.9, sum(rate(erpc_network_request_duration_seconds_bucket{network=~\"${network:regex}\",upstream=~\"${upstream:regex}\",category=~\"${category:regex}\",vendor=\"<error>\",project=~\"${project:regex}\",finality=~\"${finality:regex}\",user=~\"${user:regex}\"}[1m])) by (le, network, category, finality, vendor))", | ||
| "legendFormat": "{{network}}, {{category}}, {{vendor}}, {{finality}}", | ||
| "expr": "histogram_quantile(0.9, sum(rate(erpc_network_request_duration_seconds_bucket{\n cluster=~\"${cluster:regex}\",\n region=~\"${region:regex}\",\n network=~\"${network:regex}\",\n category=~\"${category:regex}\",\n project=~\"${project:regex}\",\n finality=~\"${finality:regex}\",\n outcome=\"error\"\n}[1m])) by (le, network, category, finality))", |
This query references ${cluster:regex} / ${region:regex} but those variables don't exist in the dashboard templating configuration. Please replace with the existing variable names or add the missing template variables.
| "expr": "histogram_quantile(0.9, sum(rate(erpc_network_request_duration_seconds_bucket{\n cluster=~\"${cluster:regex}\",\n region=~\"${region:regex}\",\n network=~\"${network:regex}\",\n category=~\"${category:regex}\",\n project=~\"${project:regex}\",\n finality=~\"${finality:regex}\",\n outcome=\"error\"\n}[1m])) by (le, network, category, finality))", | |
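| "expr": "histogram_quantile(0.9, sum(rate(erpc_network_request_duration_seconds_bucket{\n cluster=~\"${cluster_key:regex}\",\n network=~\"${network:regex}\",\n category=~\"${category:regex}\",\n project=~\"${project:regex}\",\n finality=~\"${finality:regex}\",\n outcome=\"error\"\n}[1m])) by (le, network, category, finality))", |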
| }, | ||
| "editorMode": "code", | ||
| "expr": "histogram_quantile(0.5, sum(rate(erpc_network_request_duration_seconds_bucket{network=~\"${network:regex}\",upstream=~\"${upstream:regex}\",category=~\"${category:regex}\",vendor=\"<error>\",project=~\"${project:regex}\",finality=~\"${finality:regex}\"}[1m])) by (le, network, category, finality, vendor))", | ||
| "expr": "histogram_quantile(0.5, sum(rate(erpc_network_request_duration_seconds_bucket{\n cluster=~\"${cluster:regex}\",\n region=~\"${region:regex}\",\n network=~\"${network:regex}\",\n category=~\"${category:regex}\",\n project=~\"${project:regex}\",\n finality=~\"${finality:regex}\",\n outcome=\"error\"\n}[1m])) by (le, network, category, finality))", |
This query also references ${cluster:regex} and ${region:regex} without corresponding dashboard variables. Update to use existing variables (e.g., cluster_key) or add the missing variables so the panel returns data.
| "expr": "histogram_quantile(0.5, sum(rate(erpc_network_request_duration_seconds_bucket{\n cluster=~\"${cluster:regex}\",\n region=~\"${region:regex}\",\n network=~\"${network:regex}\",\n category=~\"${category:regex}\",\n project=~\"${project:regex}\",\n finality=~\"${finality:regex}\",\n outcome=\"error\"\n}[1m])) by (le, network, category, finality))", | |
| "expr": "histogram_quantile(0.5, sum(rate(erpc_network_request_duration_seconds_bucket{\n cluster=~\"${cluster_key:regex}\",\n network=~\"${network:regex}\",\n category=~\"${category:regex}\",\n project=~\"${project:regex}\",\n finality=~\"${finality:regex}\",\n outcome=\"error\"\n}[1m])) by (le, network, category, finality))", |
| }, | ||
| "editorMode": "code", | ||
| "expr": "histogram_quantile(0.99, sum(rate(erpc_upstream_request_duration_seconds_bucket{network=~\"${network:regex}\",vendor=~\"${vendor:regex}\",category=~\"${category:regex}\",composite=\"none\",project=~\"${project:regex}\",vendor=~\"${vendor:regex}\"}[1m])) by (le, project, vendor, category))", | ||
| "expr": "histogram_quantile(0.99, sum(rate(erpc_upstream_request_duration_seconds_bucket{\n region=~\"${region:regex}\",\n cluster=~\"${cluster:regex}\",\n network=~\"${network:regex}\",\n vendor=~\"${vendor:regex}\",\n category=~\"${category:regex}\",\n composite=\"none\",\n project=~\"${project:regex}\"\n}[1m])) by (le, project, vendor, category))", |
This upstream latency query filters on ${region:regex} and ${cluster:regex}, but the dashboard doesn't define region or cluster template variables (it defines cluster_key). As written, the query will not interpolate correctly. Use existing variable names / add the missing template variables.
| "expr": "histogram_quantile(0.99, sum(rate(erpc_upstream_request_duration_seconds_bucket{\n region=~\"${region:regex}\",\n cluster=~\"${cluster:regex}\",\n network=~\"${network:regex}\",\n vendor=~\"${vendor:regex}\",\n category=~\"${category:regex}\",\n composite=\"none\",\n project=~\"${project:regex}\"\n}[1m])) by (le, project, vendor, category))", | |
| "expr": "histogram_quantile(0.99, sum(rate(erpc_upstream_request_duration_seconds_bucket{\n cluster=~\"${cluster_key:regex}\",\n network=~\"${network:regex}\",\n vendor=~\"${vendor:regex}\",\n category=~\"${category:regex}\",\n composite=\"none\",\n project=~\"${project:regex}\"\n}[1m])) by (le, project, vendor, category))", |
Summary
- Coarsen `erpc_network_request_duration_seconds` from `project,network,vendor,upstream,category,finality,user` to `project,network,category,finality,outcome`.
- Remove `user` from `erpc_upstream_request_duration_seconds`; normalize cache get/set error labels to `ErrorFingerprint(err)`.
- Validation: `go test ./telemetry ./health ./architecture/evm ./erpc ./upstream -run 'TestUpstreamRequestDurationOmitsUserLabel|TestNetworkRequestDurationUsesOutcomeInsteadOfHighCardinalityLabels'`, `make build`, `pnpm build`, and `jq empty monitoring/grafana/dashboards/erpc.json monitoring/datadog/dashboard.json`. Broader `make test-fast` remains blocked locally by unrelated untracked paths in the worktree (`sqd-simplify/`, `auth/authorizer_test.go`).

Changes
- Emit `outcome=success|cache|error` for network latency instead of `vendor`, `upstream`, and `user`.
- Switch cache error labels from `ErrorSummary(err)` to `ErrorFingerprint(err)`.
- Add `monitoring/cardinality-audit-2026-04.md` with prod snapshot, expected impact, and follow-up tickets `PLA-1064` and `PLA-1065`.

Metrics Diff Summary
- `erpc_network_request_duration_seconds`: `project,network,vendor,upstream,category,finality,user` -> `project,network,category,finality,outcome`.
- `erpc_upstream_request_duration_seconds`: `project,vendor,network,upstream,category,composite,finality,user` -> `project,vendor,network,upstream,category,composite,finality`.
- `erpc_cache_get_error_duration_seconds` / `erpc_cache_set_error_duration_seconds`: `ErrorSummary(err)` -> `ErrorFingerprint(err)`.
- Prod snapshot: `328k` active series overall; top app-side families included `erpc_upstream_request_duration_seconds_bucket` at about `51k` series and `erpc_network_request_duration_seconds_bucket` at about `41k`.