
resource_group: add server-side allocation metrics for Resource Manager #10495

Open

JmPotato wants to merge 1 commit into tikv:master from JmPotato:resource-control/server-allocation-metrics

Conversation

@JmPotato
Member

@JmPotato JmPotato commented Mar 26, 2026

What problem does this PR solve?

Issue Number: ref #10488

What is changed and how does it work?

This is Phase 2 of the Resource Control observability improvements. It adds server-side metrics to the Resource Manager (pkg/mcs/resourcemanager/server/) for visibility into token allocation decisions, slot lifecycle, loan state, and trickle behavior.

Changes:

B2 — Granted tokens histogram

  • resource_manager_server_granted_tokens — histogram of tokens granted per token bucket request
  • Observed in AcquireTokenBuckets gRPC handler after each successful token grant

B3 — Slot lifecycle metrics

  • resource_manager_server_active_slot_count — gauge of active client slots per resource group
  • resource_manager_server_slot_events_total — counter of slot events (created/deleted/expired)
  • Slot events accumulated via slotsCreated/slotsDeleted/slotsExpired fields in GroupTokenBucketState, drained in 1Hz backgroundMetricsFlush
  • Active count read via thread-safe GetSlotMetrics() accessor

B4 — Token loan + trickle duration

  • resource_manager_server_token_loan — gauge of total loan (negative tokens in slots) per group
  • resource_manager_server_trickle_duration_ms — histogram of trickle duration per grant
  • Loan observed in 1Hz tick; trickle duration observed in gRPC handler

All new gauge metrics follow the existing gaugeMetrics caching pattern (pre-resolved WithLabelValues). Cleanup handled in deleteLabelValues.
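
For illustration, here is a minimal Go sketch of the pattern this paragraph describes: group-scoped metric vectors plus pre-resolved WithLabelValues handles that are cached per group and dropped on group removal. The metric names come from the list above; the struct and helper names (gaugeMetrics, newGaugeMetrics, deleteLabelValues) and the bucket choices are simplified assumptions, not the actual contents of metrics.go.

package server

import "github.com/prometheus/client_golang/prometheus"

// Sketch only: metric names follow the PR description; everything else is a
// simplified assumption about how metrics.go might wire them up.
var (
	grantedTokensHistogram = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Namespace: "resource_manager",
			Subsystem: "server",
			Name:      "granted_tokens",
			Help:      "Tokens granted per token bucket request.",
			Buckets:   prometheus.ExponentialBuckets(1, 2, 20),
		}, []string{"name", "keyspace"})

	activeSlotCountGauge = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Namespace: "resource_manager",
			Subsystem: "server",
			Name:      "active_slot_count",
			Help:      "Active client slots per resource group.",
		}, []string{"name", "keyspace"})

	tokenLoanGauge = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Namespace: "resource_manager",
			Subsystem: "server",
			Name:      "token_loan",
			Help:      "Total token loan (negative tokens in slots) per resource group.",
		}, []string{"name", "keyspace"})
)

func init() {
	prometheus.MustRegister(grantedTokensHistogram, activeSlotCountGauge, tokenLoanGauge)
}

// gaugeMetrics caches pre-resolved children so the periodic flush never pays
// for repeated WithLabelValues lookups.
type gaugeMetrics struct {
	activeSlotCount prometheus.Gauge
	tokenLoan       prometheus.Gauge
}

func newGaugeMetrics(groupName, keyspaceName string) *gaugeMetrics {
	return &gaugeMetrics{
		activeSlotCount: activeSlotCountGauge.WithLabelValues(groupName, keyspaceName),
		tokenLoan:       tokenLoanGauge.WithLabelValues(groupName, keyspaceName),
	}
}

// deleteLabelValues drops all series for a group once it is removed.
func deleteLabelValues(groupName, keyspaceName string) {
	grantedTokensHistogram.DeleteLabelValues(groupName, keyspaceName)
	activeSlotCountGauge.DeleteLabelValues(groupName, keyspaceName)
	tokenLoanGauge.DeleteLabelValues(groupName, keyspaceName)
}

Caching the resolved children keeps the 1Hz flush cheap, and deleting them when a group is removed prevents stale series from accumulating in Prometheus.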

Check List

Tests

  • Unit test
  • Existing tests verified (1 pre-existing failure: TestCleanUpTicker, confirmed on master)

Summary by CodeRabbit

  • New Features
    • Expanded observability: added metrics for granted token amounts, trickle durations, active slot counts, token loans, and slot lifecycle events (created/deleted/expired).
    • Per-resource-group/keyspace metric tracking and automatic cleanup when groups/keyspaces are removed.
    • Resource group snapshot and event-drain capabilities to surface slot counts, token loan totals, and recent slot events for reporting.

@ti-chi-bot
Contributor

ti-chi-bot Bot commented Mar 26, 2026

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot ti-chi-bot Bot added the dco-signoff: yes (indicates the PR's author has signed the DCO) and do-not-merge/release-note-label-needed (indicates that a PR should not merge because it's missing one of the release note labels) labels on Mar 26, 2026
@ti-chi-bot
Contributor

ti-chi-bot Bot commented Mar 26, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign qiuyesuifeng for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot added the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) on Mar 26, 2026
@coderabbitai

coderabbitai Bot commented Mar 26, 2026

📝 Walkthrough


Adds server-side observability for the token-bucket resource manager: new Prometheus metrics, per-group/keyspace metric bookkeeping, metric emission on token grants in the gRPC path, and periodic emission of slot metrics from the manager.

Changes

Cohort / File(s) and summary:

  • Metrics Implementation (pkg/mcs/resourcemanager/server/metrics.go): Add Prometheus metric vectors (granted_tokens, active_slot_count, slot_events_total, token_loan, trickle_duration_ms), register them, extend gaugeMetrics bookkeeping, add observeTokenGrant, and extend label cleanup.
  • gRPC Token Grant Path (pkg/mcs/resourcemanager/server/grpc_service.go): On RU-mode token grants, look up the keyspace name and call observeTokenGrant when available before appending granted tokens to the response.
  • Background Flush & Emission (pkg/mcs/resourcemanager/server/manager.go): Use local gauge handles in backgroundMetricsFlush, read ResourceGroup slot metrics to emit active_slot_count/token_loan, drain slot events, and increment slot event counters.
  • Resource Group Slot APIs (pkg/mcs/resourcemanager/server/resource_group.go): Add an exported SlotMetrics type, GetSlotMetrics() to return the current slot count and token loan, and DrainSlotEvents() to retrieve and reset slot-created/deleted/expired counters.
  • Slot Event Tracking (pkg/mcs/resourcemanager/server/token_buckets.go): Track slotsCreated, slotsDeleted, slotsExpired in GroupTokenBucketState; increment them when slots are created, deleted, or expired during balancing/cleanup. A sketch of this accumulate-and-drain pattern follows this table.
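
A hedged sketch of the accumulate-then-drain pattern summarized for token_buckets.go and resource_group.go above; the field names mirror the summary, while the surrounding type and its mutex are simplified assumptions rather than the actual GroupTokenBucketState.

package server

import "sync"

// slotEventCounters is a simplified stand-in for the accumulators described
// above (assumed shape, not the real GroupTokenBucketState fields).
type slotEventCounters struct {
	mu           sync.Mutex
	slotsCreated uint64
	slotsDeleted uint64
	slotsExpired uint64
}

// record* would be called from the token-bucket balancing/cleanup paths.
func (c *slotEventCounters) recordCreated() { c.mu.Lock(); c.slotsCreated++; c.mu.Unlock() }
func (c *slotEventCounters) recordDeleted() { c.mu.Lock(); c.slotsDeleted++; c.mu.Unlock() }
func (c *slotEventCounters) recordExpired() { c.mu.Lock(); c.slotsExpired++; c.mu.Unlock() }

// drain returns and resets the accumulated counts, so the periodic flush can
// turn them into counter increments without double counting.
func (c *slotEventCounters) drain() (created, deleted, expired uint64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	created, deleted, expired = c.slotsCreated, c.slotsDeleted, c.slotsExpired
	c.slotsCreated, c.slotsDeleted, c.slotsExpired = 0, 0, 0
	return
}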

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant gRPC as GRPC Service
  participant RG as ResourceGroup
  participant Keyspace as KeyspaceStore
  participant Metrics as Prometheus

  Client->>gRPC: AcquireTokenBuckets(req)
  gRPC->>RG: Acquire tokens (GroupMode_RU)
  RG-->>gRPC: GrantedRUTokenBucket (tokens)
  gRPC->>Keyspace: Lookup keyspaceName(keyspaceID)
  Keyspace-->>gRPC: keyspaceName / error
  alt keyspace found
    gRPC->>Metrics: observeTokenGrant(groupName, keyspaceName, tokens)
  end
  gRPC-->>Client: response with GrantedRUTokens
sequenceDiagram
  participant Manager
  participant RG as ResourceGroup
  participant Metrics as Prometheus

  Manager->>RG: GetSlotMetrics()
  RG-->>Manager: SlotMetrics{SlotCount, TokenLoan}
  Manager->>Metrics: set active_slot_count, token_loan gauges
  Manager->>RG: DrainSlotEvents()
  RG-->>Manager: created, deleted, expired
  Manager->>Metrics: increment slot_events_total{event=...}
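
A minimal sketch of the periodic emission path shown in the second diagram, assuming GetSlotMetrics and DrainSlotEvents behave roughly as summarized above. The slotSource interface flattens the SlotMetrics struct into two return values for brevity, and the vector names and 1Hz ticker are illustrative assumptions.

package server

import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical vectors; the PR caches pre-resolved handles per group instead.
var (
	activeSlots = prometheus.NewGaugeVec(prometheus.GaugeOpts{Name: "resource_manager_server_active_slot_count"}, []string{"name", "keyspace"})
	loanGauge   = prometheus.NewGaugeVec(prometheus.GaugeOpts{Name: "resource_manager_server_token_loan"}, []string{"name", "keyspace"})
	slotEvents  = prometheus.NewCounterVec(prometheus.CounterOpts{Name: "resource_manager_server_slot_events_total"}, []string{"name", "keyspace", "event"})
)

// slotSource abstracts the ResourceGroup accessors described in the summary.
type slotSource interface {
	Labels() (group, keyspace string)
	GetSlotMetrics() (slotCount int, tokenLoan float64)
	DrainSlotEvents() (created, deleted, expired uint64)
}

// backgroundMetricsFlush emits the slot gauges and drains slot-event counters
// once per second, keeping all of this off the gRPC request path.
func backgroundMetricsFlush(ctx context.Context, list func() []slotSource) {
	ticker := time.NewTicker(time.Second) // 1Hz tick
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for _, g := range list() {
				group, keyspace := g.Labels()
				count, loan := g.GetSlotMetrics()
				activeSlots.WithLabelValues(group, keyspace).Set(float64(count))
				loanGauge.WithLabelValues(group, keyspace).Set(loan)
				created, deleted, expired := g.DrainSlotEvents()
				slotEvents.WithLabelValues(group, keyspace, "created").Add(float64(created))
				slotEvents.WithLabelValues(group, keyspace, "deleted").Add(float64(deleted))
				slotEvents.WithLabelValues(group, keyspace, "expired").Add(float64(expired))
			}
		}
	}
}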

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~23 minutes


Suggested labels

size/M, release-note-none, type/development, lgtm, approved

Suggested reviewers

  • nolouch
  • niubell
  • rleungx

Poem

🐰 I hopped in code with metric seeds to sow,

Tokens counted where the small slots grow,
Grants and trickles now sing in the night,
Gauges glow softly, counters take flight,
🥕 Happy hops — observability aglow!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check: ✅ Passed. The title accurately summarizes the main change: adding server-side allocation metrics for the Resource Manager component in the resource_group package.
  • Description check: ✅ Passed. The description adequately covers the PR objectives, problem statement (ref #10488), detailed changes across five new metrics, implementation approach, and testing status.



Add new metrics to improve observability of the Resource Manager's token
allocation on the server side. This is the second phase of the
observability improvements tracked in tikv#10488.

Changes:
- B2: Add `granted_tokens` histogram — tokens granted per token bucket
  request, observed in AcquireTokenBuckets gRPC handler
- B3: Add `active_slot_count` gauge — number of active client slots per
  resource group, observed in backgroundMetricsFlush 1Hz tick
- B3: Add `slot_events_total` counter — slot lifecycle events (created,
  deleted, expired), accumulated in balanceSlotTokens and drained in
  backgroundMetricsFlush
- B4: Add `token_loan` gauge — current total token loan (negative tokens
  in slots), observed in backgroundMetricsFlush
- B4: Add `trickle_duration_ms` histogram — trickle duration per token
  grant, observed in AcquireTokenBuckets gRPC handler

New accessor methods on ResourceGroup:
- GetSlotMetrics() — thread-safe snapshot of slot count and loan
- DrainSlotEvents() — read and reset accumulated slot event counters

All gauge metrics observed in 1Hz tick (off request path). Histogram
observations in gRPC handler occur once per 5s per client per group.

Issue Number: ref tikv#10488

Signed-off-by: JmPotato <github@ipotato.me>
@JmPotato JmPotato force-pushed the resource-control/server-allocation-metrics branch from 30c64a8 to 1d4351f on March 26, 2026 04:44

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/mcs/resourcemanager/server/grpc_service.go`:
- Around line 255-263: This RPC path must not touch the mutable metrics cache
directly; replace the direct call to s.manager.getKeyspaceNameByID(...) and
s.manager.metrics.getGaugeMetrics(...) and the subsequent direct Observe calls
with a thread-safe, non-blocking metrics API (e.g. add/use a method like
metrics.ObserveGrant(keyspaceID, keyspaceName, rgName, grantedTokens,
trickleMs)) that either takes a snapshot, uses an internal mutex, or enqueues
the observation to a background goroutine so no map reads/writes happen on the
RPC path; avoid using s.ctx for a potentially blocking storage lookup here (use
a pre-resolved name or fall back to keyspaceID) and always emit the trickle
observation (observe 0 rather than skipping when TrickleTimeMs==0) so immediate
grants are recorded.

In `@pkg/mcs/resourcemanager/server/manager.go`:
- Around line 798-806: The slot event counters are always zero because
getResourceGroupList(true, true) returns cloned ResourceGroups whose
GroupTokenBucketState.clone() does not copy
slotsCreated/slotsDeleted/slotsExpired, so group.DrainSlotEvents() reads zeroed
values; fix by preserving the slot event accumulators during cloning or drain
events before cloning. Update GroupTokenBucketState.clone() to copy the
slotsCreated, slotsDeleted, and slotsExpired fields so clones carry the
accumulated events (or alternatively call DrainSlotEvents() on the live
ResourceGroup list before calling Clone in getResourceGroupList), and ensure
group.DrainSlotEvents() is invoked on objects that contain the real accumulator
values so gaugeM.slotCreatedCounter, slotDeletedCounter, and slotExpiredCounter
increment correctly.
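
One possible shape for the non-blocking metrics API that the first prompt above asks for; this is a sketch with hypothetical names (grantObserver, ObserveGrant), not code from the PR. Handlers do a buffered, non-blocking send and a single background goroutine performs the actual observations, so the RPC path never touches shared maps or blocks on metrics work.

package server

import "github.com/prometheus/client_golang/prometheus"

// grantObservation carries one token-grant observation off the RPC path.
type grantObservation struct {
	group, keyspace string
	tokens          float64
	trickleMs       int64
}

// grantObserver decouples gRPC handlers from the metrics bookkeeping: only
// the internal goroutine touches the histogram vectors.
type grantObserver struct {
	ch      chan grantObservation
	granted *prometheus.HistogramVec // e.g. granted_tokens
	trickle *prometheus.HistogramVec // e.g. trickle_duration_ms
}

func newGrantObserver(granted, trickle *prometheus.HistogramVec) *grantObserver {
	o := &grantObserver{ch: make(chan grantObservation, 1024), granted: granted, trickle: trickle}
	go o.loop()
	return o
}

// ObserveGrant is safe to call from concurrent AcquireTokenBuckets streams; it
// drops the observation instead of blocking when the buffer is full.
func (o *grantObserver) ObserveGrant(group, keyspace string, tokens float64, trickleMs int64) {
	select {
	case o.ch <- grantObservation{group: group, keyspace: keyspace, tokens: tokens, trickleMs: trickleMs}:
	default: // never block the RPC path
	}
}

func (o *grantObserver) loop() {
	for obs := range o.ch {
		o.granted.WithLabelValues(obs.group, obs.keyspace).Observe(obs.tokens)
		// Observe zero as well, so immediate (non-trickled) grants are counted.
		o.trickle.WithLabelValues(obs.group, obs.keyspace).Observe(float64(obs.trickleMs))
	}
}

A bounded buffer with drop-on-full trades occasional lost observations for a guarantee that token grants are never delayed by metrics work; a mutex-guarded map or a pre-resolved snapshot would be alternative ways to satisfy the same review comment.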

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: dd6f704e-eba1-40e7-83a3-c8c711cb67f9

📥 Commits

Reviewing files that changed from the base of the PR and between 3eb99ae and 30c64a8.

📒 Files selected for processing (5)
  • pkg/mcs/resourcemanager/server/grpc_service.go
  • pkg/mcs/resourcemanager/server/manager.go
  • pkg/mcs/resourcemanager/server/metrics.go
  • pkg/mcs/resourcemanager/server/resource_group.go
  • pkg/mcs/resourcemanager/server/token_buckets.go

Comment on lines +255 to +263
// B2 + B4: Observe granted tokens and trickle duration.
keyspaceName, nameErr := s.manager.getKeyspaceNameByID(s.ctx, keyspaceID)
if nameErr == nil {
	gaugeM := s.manager.metrics.getGaugeMetrics(keyspaceID, keyspaceName, rg.Name)
	gaugeM.grantedTokensObserver.Observe(tokens.GrantedTokens.GetTokens())
	if tokens.TrickleTimeMs > 0 {
		gaugeM.trickleDurationObserver.Observe(float64(tokens.TrickleTimeMs))
	}
}


⚠️ Potential issue | 🔴 Critical

Avoid the mutable metrics cache on the RPC path.

pkg/mcs/resourcemanager/server/metrics.go backs getGaugeMetrics() with a plain map, and backgroundMetricsFlush() touches the same cache from another goroutine. Calling it here from concurrent AcquireTokenBuckets() streams can race or hit concurrent map read and map write. This block also uses s.ctx for a storage-backed lookup and skips TrickleTimeMs == 0, so the observation can outlive the RPC and the trickle histogram misses immediate grants.

Possible fix
-					keyspaceName, nameErr := s.manager.getKeyspaceNameByID(s.ctx, keyspaceID)
+					keyspaceName, nameErr := s.manager.getKeyspaceNameByID(stream.Context(), keyspaceID)
 					if nameErr == nil {
-						gaugeM := s.manager.metrics.getGaugeMetrics(keyspaceID, keyspaceName, rg.Name)
-						gaugeM.grantedTokensObserver.Observe(tokens.GrantedTokens.GetTokens())
-						if tokens.TrickleTimeMs > 0 {
-							gaugeM.trickleDurationObserver.Observe(float64(tokens.TrickleTimeMs))
-						}
+						grantedTokensHistogram.WithLabelValues(rg.Name, keyspaceName).Observe(tokens.GrantedTokens.GetTokens())
+						trickleDurationHistogram.WithLabelValues(rg.Name, keyspaceName).Observe(float64(tokens.TrickleTimeMs))
 					}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/mcs/resourcemanager/server/grpc_service.go` around lines 255 - 263, This
RPC path must not touch the mutable metrics cache directly; replace the direct
call to s.manager.getKeyspaceNameByID(...) and
s.manager.metrics.getGaugeMetrics(...) and the subsequent direct Observe calls
with a thread-safe, non-blocking metrics API (e.g. add/use a method like
metrics.ObserveGrant(keyspaceID, keyspaceName, rgName, grantedTokens,
trickleMs)) that either takes a snapshot, uses an internal mutex, or enqueues
the observation to a background goroutine so no map reads/writes happen on the
RPC path; avoid using s.ctx for a potentially blocking storage lookup here (use
a pre-resolved name or fall back to keyspaceID) and always emit the trickle
observation (observe 0 rather than skipping when TrickleTimeMs==0) so immediate
grants are recorded.

Comment thread on pkg/mcs/resourcemanager/server/manager.go (outdated)
@codecov

codecov Bot commented Mar 26, 2026

Codecov Report

❌ Patch coverage is 81.25000% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.95%. Comparing base (3eb99ae) to head (1d4351f).

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10495      +/-   ##
==========================================
+ Coverage   78.88%   78.95%   +0.07%     
==========================================
  Files         530      530              
  Lines       71548    71609      +61     
==========================================
+ Hits        56439    56542     +103     
+ Misses      11092    11046      -46     
- Partials     4017     4021       +4     
Flag: unittests | Coverage Δ: 78.95% <81.25%> (+0.07%) ⬆️

Flags with carried forward coverage won't be shown.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
pkg/mcs/resourcemanager/server/grpc_service.go (1)

255-257: ⚠️ Potential issue | 🟠 Major

Use the RPC context here, and don’t drop the metric on lookup errors.

getKeyspaceNameByID can hit storage, so passing s.ctx lets this lookup outlive a canceled stream. The err == nil guard also leaves holes in the new grant metrics whenever name resolution fails; use stream.Context() and fall back to a stable placeholder label instead of skipping the observation.

Possible fix
- if keyspaceName, err := s.manager.getKeyspaceNameByID(s.ctx, keyspaceID); err == nil {
- 	observeTokenGrant(rg.Name, keyspaceName, tokens)
- }
+ keyspaceName, err := s.manager.getKeyspaceNameByID(stream.Context(), keyspaceID)
+ if err != nil {
+ 	keyspaceName = "unknown"
+ }
+ observeTokenGrant(rg.Name, keyspaceName, tokens)

Verification: this should show the s.ctx call site and the storage-backed RunInTxn path inside getKeyspaceNameByID.

#!/bin/bash
set -euo pipefail

echo "AcquireTokenBuckets call site:"
rg -n -C3 'getKeyspaceNameByID\(' pkg/mcs/resourcemanager/server/grpc_service.go

echo
echo "getKeyspaceNameByID implementation:"
rg -n -C12 'func \(m \*Manager\) getKeyspaceNameByID' pkg/mcs/resourcemanager/server/manager.go

As per coding guidelines, "First parameter must be context.Context for external effects; never store contexts in structs".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/mcs/resourcemanager/server/grpc_service.go` around lines 255 - 257, The
call to s.manager.getKeyspaceNameByID uses s.ctx and skips metric recording on
lookup errors; change the call site in the grant handling (where
observeTokenGrant(rg.Name, keyspaceName, tokens) is invoked) to use the RPC
stream context (stream.Context()) as the first arg to getKeyspaceNameByID, and
always call observeTokenGrant — if getKeyspaceNameByID returns an error, pass a
stable placeholder label (e.g., "unknown_keyspace") for keyspaceName instead of
skipping the observation so the metric is never dropped.
pkg/mcs/resourcemanager/server/metrics.go (1)

623-625: ⚠️ Potential issue | 🟠 Major

Observe zero-trickle grants too.

trickle_duration_ms is described as a per-grant metric, but the > 0 guard drops immediate grants. That skews the histogram toward delayed grants and makes its count disagree with the total number of grants.

Possible fix
  grantedTokensHistogram.WithLabelValues(groupName, keyspaceName).Observe(tokens.GrantedTokens.GetTokens())
- if tokens.TrickleTimeMs > 0 {
- 	trickleDurationHistogram.WithLabelValues(groupName, keyspaceName).Observe(float64(tokens.TrickleTimeMs))
- }
+ trickleDurationHistogram.WithLabelValues(groupName, keyspaceName).Observe(float64(tokens.TrickleTimeMs))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/mcs/resourcemanager/server/metrics.go` around lines 623 - 625, The
trickle-duration histogram currently skips immediate grants by only observing
when tokens.TrickleTimeMs > 0; update the logic around
grantedTokensHistogram.WithLabelValues(...) and
trickleDurationHistogram.WithLabelValues(...) to always call
trickleDurationHistogram.WithLabelValues(groupName,
keyspaceName).Observe(float64(tokens.TrickleTimeMs)) for every grant (including
zero), so the per-grant trickle_duration_ms metric counts every grant and its
histogram matches the total grants observed via grantedTokensHistogram and
GrantedTokens.GetTokens().
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/mcs/resourcemanager/server/metrics.go`:
- Around line 607-613: The five group-scoped metrics (grantedTokensHistogram,
activeSlotCountGauge, slotEventsCounter, tokenLoanGauge,
trickleDurationHistogram) are being deleted inside the per-ruType cleanup path
(via DeleteLabelValues) and must be removed from that path; instead, remove
those DeleteLabelValues calls from the code invoked during per-ruType stale
eviction (cleanupAllMetrics / deleteMetrics) and place them only in the
full-group removal code path (the routine that handles complete group deletion)
so group-scoped metrics are deleted only when the entire group is removed.

---

Duplicate comments:
In `@pkg/mcs/resourcemanager/server/grpc_service.go`:
- Around line 255-257: The call to s.manager.getKeyspaceNameByID uses s.ctx and
skips metric recording on lookup errors; change the call site in the grant
handling (where observeTokenGrant(rg.Name, keyspaceName, tokens) is invoked) to
use the RPC stream context (stream.Context()) as the first arg to
getKeyspaceNameByID, and always call observeTokenGrant — if getKeyspaceNameByID
returns an error, pass a stable placeholder label (e.g., "unknown_keyspace") for
keyspaceName instead of skipping the observation so the metric is never dropped.

In `@pkg/mcs/resourcemanager/server/metrics.go`:
- Around line 623-625: The trickle-duration histogram currently skips immediate
grants by only observing when tokens.TrickleTimeMs > 0; update the logic around
grantedTokensHistogram.WithLabelValues(...) and
trickleDurationHistogram.WithLabelValues(...) to always call
trickleDurationHistogram.WithLabelValues(groupName,
keyspaceName).Observe(float64(tokens.TrickleTimeMs)) for every grant (including
zero), so the per-grant trickle_duration_ms metric counts every grant and its
histogram matches the total grants observed via grantedTokensHistogram and
GrantedTokens.GetTokens().

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 53096e8e-256d-4896-873e-437c51beb25b

📥 Commits

Reviewing files that changed from the base of the PR and between 30c64a8 and 1d4351f.

📒 Files selected for processing (5)
  • pkg/mcs/resourcemanager/server/grpc_service.go
  • pkg/mcs/resourcemanager/server/manager.go
  • pkg/mcs/resourcemanager/server/metrics.go
  • pkg/mcs/resourcemanager/server/resource_group.go
  • pkg/mcs/resourcemanager/server/token_buckets.go
✅ Files skipped from review due to trivial changes (1)
  • pkg/mcs/resourcemanager/server/token_buckets.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/mcs/resourcemanager/server/manager.go
  • pkg/mcs/resourcemanager/server/resource_group.go

Comment on lines +607 to +613
grantedTokensHistogram.DeleteLabelValues(groupName, keyspaceName)
activeSlotCountGauge.DeleteLabelValues(groupName, keyspaceName)
slotEventsCounter.DeleteLabelValues(groupName, keyspaceName, "created")
slotEventsCounter.DeleteLabelValues(groupName, keyspaceName, "deleted")
slotEventsCounter.DeleteLabelValues(groupName, keyspaceName, "expired")
tokenLoanGauge.DeleteLabelValues(groupName, keyspaceName)
trickleDurationHistogram.DeleteLabelValues(groupName, keyspaceName)


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "deleteMetrics/deleteLabelValues definitions:"
rg -n -C5 'func \(m \*metrics\) deleteMetrics|func deleteLabelValues' pkg/mcs/resourcemanager/server/metrics.go

echo
echo "cleanupAllMetrics call sites:"
rg -n -C6 '\bcleanupAllMetrics\(' pkg/mcs/resourcemanager/server

echo
echo "deleteMetrics call sites:"
rg -n -C4 '\bdeleteMetrics\(' pkg/mcs/resourcemanager/server

Repository: tikv/pd

Length of output: 4846


🏁 Script executed:

sed -n '585,630p' pkg/mcs/resourcemanager/server/metrics.go | cat -n

Repository: tikv/pd

Length of output: 3362


These group-scoped metrics must not be deleted from the per-ruType cleanup path.

The new metrics—granted_tokens, active_slot_count, slot_events_total, token_loan, and trickle_duration_ms—are declared with only (groupName, keyspaceName) labels (no ruLabelType). Because they sit inside deleteLabelValues, they are deleted whenever deleteMetrics is called, which happens during every per-ruType stale consumption eviction via cleanupAllMetrics. This means a single expired ruType record will wipe group-scoped metrics that remain relevant for other ruType values or when the group is otherwise active.

Move these five metrics to a separate deletion path that is only invoked for full group removal, not per-ruType stale cleanup.
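
A sketch of the split suggested above, using the metric variable names from the quoted snippet; the helper name deleteGroupScopedMetrics and the compact local declarations are hypothetical stand-ins for the real definitions in metrics.go, and the per-ruType eviction path would simply stop calling DeleteLabelValues on these five vectors.

package server

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical stand-ins for the five group-scoped vectors named above.
var (
	grantedTokensHistogram   = prometheus.NewHistogramVec(prometheus.HistogramOpts{Name: "resource_manager_server_granted_tokens"}, []string{"name", "keyspace"})
	activeSlotCountGauge     = prometheus.NewGaugeVec(prometheus.GaugeOpts{Name: "resource_manager_server_active_slot_count"}, []string{"name", "keyspace"})
	tokenLoanGauge           = prometheus.NewGaugeVec(prometheus.GaugeOpts{Name: "resource_manager_server_token_loan"}, []string{"name", "keyspace"})
	trickleDurationHistogram = prometheus.NewHistogramVec(prometheus.HistogramOpts{Name: "resource_manager_server_trickle_duration_ms"}, []string{"name", "keyspace"})
	slotEventsCounter        = prometheus.NewCounterVec(prometheus.CounterOpts{Name: "resource_manager_server_slot_events_total"}, []string{"name", "keyspace", "event"})
)

// deleteGroupScopedMetrics would be invoked only on full group removal, never
// from the per-ruType stale-consumption eviction path.
func deleteGroupScopedMetrics(groupName, keyspaceName string) {
	grantedTokensHistogram.DeleteLabelValues(groupName, keyspaceName)
	activeSlotCountGauge.DeleteLabelValues(groupName, keyspaceName)
	tokenLoanGauge.DeleteLabelValues(groupName, keyspaceName)
	trickleDurationHistogram.DeleteLabelValues(groupName, keyspaceName)
	for _, event := range []string{"created", "deleted", "expired"} {
		slotEventsCounter.DeleteLabelValues(groupName, keyspaceName, event)
	}
}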

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/mcs/resourcemanager/server/metrics.go` around lines 607 - 613, The five
group-scoped metrics (grantedTokensHistogram, activeSlotCountGauge,
slotEventsCounter, tokenLoanGauge, trickleDurationHistogram) are being deleted
inside the per-ruType cleanup path (via DeleteLabelValues) and must be removed
from that path; instead, remove those DeleteLabelValues calls from the code
invoked during per-ruType stale eviction (cleanupAllMetrics / deleteMetrics) and
place them only in the full-group removal code path (the routine that handles
complete group deletion) so group-scoped metrics are deleted only when the
entire group is removed.

