Enable dynamic toggling of adaptive routing stats metrics export #18135

Draft
timothy-e wants to merge 2 commits into apache:master from timothy-e:timothye-enable-adaptive-routing-metrics-dynamic-toggle

@timothy-e
Contributor

#18134 adds metrics for adaptive routing stats.

We may want to toggle whether stats are exported as metrics without a rolling restart of the brokers.

ServerRoutingStatsManager now implements PinotClusterConfigChangeListener. The periodic export task is always scheduled when stats collection is enabled; the enable.stats.metric.export flag is checked inside the task and can be updated at runtime via the Helix cluster config (POST /cluster/configs) without a broker restart.
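A minimal sketch of the toggle pattern described above, not the actual `ServerRoutingStatsManager` code. The listener interface here is a local stand-in whose signature is an assumption, and the class and method names are hypothetical; only the config key is taken from this PR.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

// Local stand-in for a cluster config change listener (signature is an assumption).
interface ClusterConfigListenerSketch {
  void onChange(Set<String> changedConfigs, Map<String, String> clusterConfigs);
}

// Hypothetical class: the export task is always scheduled, and the flag is read inside it.
final class ExportToggleSketch implements ClusterConfigListenerSketch {
  static final String EXPORT_KEY =
      "pinot.broker.adaptive.server.selector.enable.stats.metric.export";

  // volatile so the task thread sees updates made by the listener thread
  private volatile boolean exportEnabled;
  private final AtomicInteger exportCount = new AtomicInteger();

  ExportToggleSketch(boolean initiallyEnabled) {
    exportEnabled = initiallyEnabled;
  }

  @Override
  public void onChange(Set<String> changedConfigs, Map<String, String> clusterConfigs) {
    if (!changedConfigs.contains(EXPORT_KEY)) {
      return; // unrelated config change
    }
    // Assumption: a removed or unparsable value is treated as "disabled".
    exportEnabled = Boolean.parseBoolean(clusterConfigs.get(EXPORT_KEY));
  }

  // Body of the periodic task: a no-op while export is disabled.
  void runExportTask() {
    if (!exportEnabled) {
      return;
    }
    exportCount.incrementAndGet(); // placeholder for emitting the gauges
  }

  int exportCount() {
    return exportCount.get();
  }
}
```

Because the flag is read on every run, flipping it takes effect on the next scheduled run of the task, with no need to cancel or recreate the task.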

We also allow updating the metrics export frequency at runtime.

Testing

Deployed to an internal Stripe cluster.

SSHed into one of the controllers and ran:

curl -X POST localhost:9000/cluster/configs -H "Content-Type: application/json" -d '{"pinot.broker.adaptive.server.selector.enable.stats.metric.export": "false"}'
# wait a bit
curl -X POST localhost:9000/cluster/configs -H "Content-Type: application/json" -d '{"pinot.broker.adaptive.server.selector.enable.stats.metric.export": "true"}'
# wait a bit more
curl -X POST localhost:9000/cluster/configs -H "Content-Type: application/json" -d '{"pinot.broker.adaptive.server.selector.enable.stats.metric.export": "false"}'

Saw the metrics start/stop as expected.

cc stripe-private-oss-forks/pinot-reviewers
r?

When multistage adaptive server selector stats collection is enabled (pinot.broker.adaptive.server.selector.enable.stats.collection=true), the broker tracks per-server routing stats (in-flight request count, latency EMA, hybrid score) in memory, but there is no way to observe them externally without hitting the broker's debug API. This makes it difficult to monitor the health and behavior of adaptive routing in production.

This PR adds `setValueOfGaugeWithTag(gauge, tag, value)` to `AbstractMetrics`, which composes the metric name as `gaugeName.tag` without going through `getTableName()`. This ensures the server instance ID is always present in the metric name regardless of the `enableTableLevelMetrics` broker config. Also adds `BrokerMetrics.getTagForServer(serverInstanceId)` returning "server.<instanceId>", following the same pattern as `getTagForPreferredPool`.

The resulting Pinot metric names are of the form: `pinot.broker.adaptiveServerLatencyEma.server.Server_pinotdb1_8098`
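As a rough illustration of the naming scheme, the sketch below composes the name from a prefix, the gauge name, and the server tag. It is a standalone approximation, not the `AbstractMetrics` implementation: `GaugeNameSketch` and `composeGaugeName` are hypothetical, and the `pinot.broker.` prefix is taken from the example name above.

```java
// Hypothetical helper mirroring the "gaugeName.tag" composition described above.
final class GaugeNameSketch {
  private GaugeNameSketch() {
  }

  // Same shape as the tag described for BrokerMetrics.getTagForServer: "server.<instanceId>".
  static String getTagForServer(String serverInstanceId) {
    return "server." + serverInstanceId;
  }

  // Joins prefix + gaugeName + "." + tag without any table-name lookup.
  static String composeGaugeName(String prefix, String gaugeName, String tag) {
    return prefix + gaugeName + "." + tag;
  }
}
```

For example, `composeGaugeName("pinot.broker.", "adaptiveServerLatencyEma", getTagForServer("Server_pinotdb1_8098"))` yields the metric name shown above.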

**PluginConfig.yaml + statsd reporter**

Following https://trailhead.corp.stripe.com/docs/stream-analytics-internal/pinot/pinot-observability/pinot-metrics-guide, adds a labeled metric pattern that extracts the server instance ID as a server tag, so the metrics arrive in Veneur/Prometheus as:
  - Name: pinot_broker_adaptive_server_latency_ema
  - Tags: server=Server_pinotdb1_8098 (plus standard host tags)
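The effect of the labeled pattern can be approximated with a plain regex. This is only a sketch of the transformation, not the actual PluginConfig.yaml entry: the regex, the class name, and the camelCase-to-snake_case step are assumptions chosen to reproduce the name and tag listed above.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the labeled metric pattern.
final class ServerTagPatternSketch {
  // Group 1: camelCase gauge name; group 2: server instance ID.
  private static final Pattern SERVER_GAUGE =
      Pattern.compile("^pinot\\.broker\\.(adaptiveServer\\w+)\\.server\\.(.+)$");

  private ServerTagPatternSketch() {
  }

  // Returns {metricName, "server=<id>"}, or null if the input is not a per-server gauge.
  static String[] toNameAndServerTag(String pinotMetricName) {
    Matcher m = SERVER_GAUGE.matcher(pinotMetricName);
    if (!m.matches()) {
      return null;
    }
    // Insert "_" at each lower-to-upper boundary, then lowercase the result.
    String snake = m.group(1)
        .replaceAll("([a-z0-9])([A-Z])", "$1_$2")
        .toLowerCase(Locale.ROOT);
    return new String[] {"pinot_broker_" + snake, "server=" + m.group(2)};
  }
}
```

Keeping the server ID in a tag rather than in the metric name means one series name per gauge, so queries can aggregate or filter by `server`.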

[STREAMANALYTICS-4390](https://jira.corp.stripe.com/browse/STREAMANALYTICS-4390)

Unit tests verify that we:
* don't export when adaptive routing stats are disabled
* don't export when adaptive routing stats metric export is disabled
* transform the metric name properly.

[Deployed to QA rad-canary](https://amp.qa.corp.stripe.com/deploy/qa-deploy1.pdx.deploy.stripe.net%2Fdeploy_qBdrdOKZQ7iya8JplJbdaw), and [metrics appeared in grafana](https://g-8916660cfe.grafana-workspace.us-west-2.amazonaws.com/explore?schemaVersion=1&panes=%7B%223yy%22:%7B%22datasource%22:%22zb219lV4k%22,%22queries%22:%5B%7B%22refId%22:%22B%22,%22expr%22:%22%7B__name__%3D~%5C%22pinot_broker_adaptive_server_%28latency_ema%7Cnum_in_flight_requests%7Chybrid_score%29%5C%22,%5Cn%20%20pinot_cluster%3D%5C%22rad-canary%5C%22,host%3D%5C%22qa-pinotdbbroker--06621e6804f2c92db.northwest.stripe.io%5C%22%7D%22,%22range%22:true,%22instant%22:true,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%22zb219lV4k%22%7D,%22editorMode%22:%22code%22,%22legendFormat%22:%22__auto%22%7D%5D,%22range%22:%7B%22from%22:%221775151966533%22,%22to%22:%221775152978766%22%7D%7D%7D&orgId=1) as `{__name__=~"pinot_broker_adaptive_server_(latency_ema|num_in_flight_requests|hybrid_score)",
  pinot_cluster="rad-canary"}`.

Stripe-Original-Repo: stripe-private-oss-forks/pinot
Stripe-Monotonic-Timestamp: v2/2026-04-07T17:09:47Z/0
Stripe-Original-PR: https://git.corp.stripe.com/stripe-private-oss-forks/pinot/pull/581
(cherry picked from commit 8d84c8980f2122e5290b15d3889282eee8f5286f)
Committed-By-Agent: claude

cc stripe-private-oss-forks/pinot-reviewers
r?

https://git.corp.stripe.com/stripe-private-oss-forks/pinot/pull/581 added metrics for adaptive routing stats.

We may want to toggle whether or not stats are exported as metrics without needing to launch a rolling restart.

ServerRoutingStatsManager now implements PinotClusterConfigChangeListener. The periodic export task is always scheduled when stats collection is enabled; the enable.stats.metric.export flag is checked inside the task and can be updated at runtime via the Helix cluster config (POST /cluster/configs) without a broker restart.

We also allow updating the metrics export frequency at runtime.
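One way a runtime frequency change could be applied is to cancel the scheduled task and reschedule it with the new period. The sketch below shows that approach with a plain `ScheduledExecutorService`; it is an assumption about the mechanism rather than the code in this PR, and `ExportReschedulerSketch` and its methods are hypothetical names.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical helper that reschedules a periodic task when its period changes.
final class ExportReschedulerSketch {
  private final ScheduledExecutorService executor =
      Executors.newSingleThreadScheduledExecutor(r -> {
        Thread t = new Thread(r, "stats-export-sketch");
        t.setDaemon(true); // do not keep the JVM alive for this sketch
        return t;
      });
  private final Runnable exportTask;
  private ScheduledFuture<?> current;
  private long periodMs;

  ExportReschedulerSketch(Runnable exportTask, long initialPeriodMs) {
    this.exportTask = exportTask;
    schedule(initialPeriodMs);
  }

  // Called from the config listener; ignores invalid or unchanged values.
  synchronized void updatePeriod(long newPeriodMs) {
    if (newPeriodMs <= 0 || newPeriodMs == periodMs) {
      return;
    }
    current.cancel(false); // let an in-flight run finish
    schedule(newPeriodMs);
  }

  private void schedule(long newPeriodMs) {
    periodMs = newPeriodMs;
    current = executor.scheduleAtFixedRate(
        exportTask, newPeriodMs, newPeriodMs, TimeUnit.MILLISECONDS);
  }

  synchronized long periodMs() {
    return periodMs;
  }

  void shutdown() {
    executor.shutdownNow();
  }
}
```

Cancelling with `cancel(false)` avoids interrupting an export that is already running; the new period applies from the next scheduled run.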

[STREAMANALYTICS-4390](https://jira.corp.stripe.com/browse/STREAMANALYTICS-4390)

Deployed to [rad-canary QA](https://amp.qa.corp.stripe.com/deploy/qa-deploy1.pdx.deploy.stripe.net%2Fdeploy_r2WUckQcQ467TfeJvyp4zw).

`ssh`ed onto a rad-canary controller and ran
```
curl -X POST localhost:9000/cluster/configs -H "Content-Type: application/json" -d '{"pinot.broker.adaptive.server.selector.enable.stats.metric.export": "false"}'
curl -X POST localhost:9000/cluster/configs -H "Content-Type: application/json" -d '{"pinot.broker.adaptive.server.selector.enable.stats.metric.export": "true"}'
curl -X POST localhost:9000/cluster/configs -H "Content-Type: application/json" -d '{"pinot.broker.adaptive.server.selector.enable.stats.metric.export": "false"}'
```
I see that the metrics stopped / resumed as expected in [grafana](https://g-8916660cfe.grafana-workspace.us-west-2.amazonaws.com/explore?schemaVersion=1&panes=%7B%223yy%22:%7B%22datasource%22:%22zb219lV4k%22,%22queries%22:%5B%7B%22refId%22:%22B%22,%22expr%22:%22count%20by%20%28host%29%20%28%7B__name__%3D~%5C%22pinot_broker_adaptive_server_latency_ema%5C%22,%20pinot_cluster%3D%5C%22rad-canary%5C%22%7D%29%22,%22range%22:true,%22instant%22:true,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%22zb219lV4k%22%7D,%22editorMode%22:%22code%22,%22legendFormat%22:%22__auto%22%7D%5D,%22range%22:%7B%22from%22:%221775159834995%22,%22to%22:%221775160734995%22%7D%7D%7D&orgId=1)

<img width="1477" alt="Screenshot 2026-04-02 at 4 17 58 pm" src="https://git.corp.stripe.com/user-attachments/assets/dca06b6d-8b6f-4797-a079-95a1f21d7de7" />

Stripe-Original-Repo: stripe-private-oss-forks/pinot
Stripe-Monotonic-Timestamp: v2/2026-04-07T21:12:57Z/0
Stripe-Original-PR: https://git.corp.stripe.com/stripe-private-oss-forks/pinot/pull/582
(cherry picked from commit 3abc8faad5987ed824b6b7a11afa5693dbd2c831)
@timothy-e timothy-e force-pushed the timothye-enable-adaptive-routing-metrics-dynamic-toggle branch from d34c964 to 2b4eec6 Compare April 8, 2026 19:22
@timothy-e
Contributor Author

Based on #18134, but GitHub doesn't support cross-repo stacked PRs, so I'm leaving this as a draft with conflicts for now.
