resource_group: add server-side allocation metrics for Resource Manager #10495
JmPotato wants to merge 1 commit into tikv:master from
Conversation
Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
📝 Walkthrough
Adds server-side observability for the token-bucket resource manager: new Prometheus metrics, per-group/keyspace metric bookkeeping, metric emission on token grants in the gRPC path, and periodic emission of slot metrics from the manager.
Sequence Diagram(s)
sequenceDiagram
participant Client
participant gRPC as GRPC Service
participant RG as ResourceGroup
participant Keyspace as KeyspaceStore
participant Metrics as Prometheus
Client->>gRPC: AcquireTokenBuckets(req)
gRPC->>RG: Acquire tokens (GroupMode_RU)
RG-->>gRPC: GrantedRUTokenBucket (tokens)
gRPC->>Keyspace: Lookup keyspaceName(keyspaceID)
Keyspace-->>gRPC: keyspaceName / error
alt keyspace found
gRPC->>Metrics: observeTokenGrant(groupName, keyspaceName, tokens)
end
gRPC-->>Client: response with GrantedRUTokens
sequenceDiagram
participant Manager
participant RG as ResourceGroup
participant Metrics as Prometheus
Manager->>RG: GetSlotMetrics()
RG-->>Manager: SlotMetrics{SlotCount, TokenLoan}
Manager->>Metrics: set active_slot_count, token_loan gauges
Manager->>RG: DrainSlotEvents()
RG-->>Manager: created, deleted, expired
Manager->>Metrics: increment slot_events_total{event=...}
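To make the second diagram concrete, here is a minimal Go sketch of what one flush pass per group might do. The metric vectors, the SlotMetrics shape, and the accessor signatures below are assumptions taken from this walkthrough, not the PR's actual code.

```go
package sketch

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical metric vectors mirroring the ones this PR describes; the real
// names, help text, and label sets in metrics.go may differ.
var (
	activeSlotCountGauge = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{Name: "resource_manager_server_active_slot_count"},
		[]string{"name", "keyspace"})
	tokenLoanGauge = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{Name: "resource_manager_server_token_loan"},
		[]string{"name", "keyspace"})
	slotEventsCounter = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "resource_manager_server_slot_events_total"},
		[]string{"name", "keyspace", "event"})
)

// SlotMetrics matches the snapshot shape shown in the diagram above.
type SlotMetrics struct {
	SlotCount int
	TokenLoan float64
}

// slotSource stands in for the ResourceGroup accessors added by this PR.
type slotSource interface {
	GetSlotMetrics() SlotMetrics
	DrainSlotEvents() (created, deleted, expired uint64)
}

// flushGroupMetrics is one iteration of the 1Hz tick for a single group:
// set the gauges from a snapshot, then drain the slot event counters.
func flushGroupMetrics(group slotSource, name, keyspace string) {
	sm := group.GetSlotMetrics()
	activeSlotCountGauge.WithLabelValues(name, keyspace).Set(float64(sm.SlotCount))
	tokenLoanGauge.WithLabelValues(name, keyspace).Set(sm.TokenLoan)

	created, deleted, expired := group.DrainSlotEvents()
	slotEventsCounter.WithLabelValues(name, keyspace, "created").Add(float64(created))
	slotEventsCounter.WithLabelValues(name, keyspace, "deleted").Add(float64(deleted))
	slotEventsCounter.WithLabelValues(name, keyspace, "expired").Add(float64(expired))
}
```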
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~23 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Add new metrics to improve observability of the Resource Manager's token allocation on the server side. This is the second phase of the observability improvements tracked in tikv#10488.

Changes:
- B2: Add `granted_tokens` histogram — tokens granted per token bucket request, observed in the AcquireTokenBuckets gRPC handler
- B3: Add `active_slot_count` gauge — number of active client slots per resource group, observed in the backgroundMetricsFlush 1Hz tick
- B3: Add `slot_events_total` counter — slot lifecycle events (created, deleted, expired), accumulated in balanceSlotTokens and drained in backgroundMetricsFlush
- B4: Add `token_loan` gauge — current total token loan (negative tokens in slots), observed in backgroundMetricsFlush
- B4: Add `trickle_duration_ms` histogram — trickle duration per token grant, observed in the AcquireTokenBuckets gRPC handler

New accessor methods on ResourceGroup:
- GetSlotMetrics() — thread-safe snapshot of slot count and loan
- DrainSlotEvents() — read and reset the accumulated slot event counters

All gauge metrics are observed in the 1Hz tick (off the request path). Histogram observations in the gRPC handler occur once per 5s per client per group.

Issue Number: ref tikv#10488

Signed-off-by: JmPotato <github@ipotato.me>
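For illustration, here is a rough sketch of how the two accessors named above could be shaped. The stand-in struct, the lock, and the loan computation are assumptions for this example, not the code in this commit.

```go
package sketch

import "sync"

// SlotMetrics is a point-in-time snapshot of a group's slot state.
type SlotMetrics struct {
	SlotCount int
	TokenLoan float64
}

// groupSlotState is a stand-in for the slot-related part of
// GroupTokenBucketState; the real struct holds many more fields.
type groupSlotState struct {
	mu           sync.Mutex
	slots        map[uint64]float64 // clientID -> remaining tokens (hypothetical shape)
	slotsCreated uint64
	slotsDeleted uint64
	slotsExpired uint64
}

// GetSlotMetrics returns the slot count and the total loan (the sum of
// negative token balances) under the lock, so the 1Hz flush reads a
// consistent snapshot.
func (s *groupSlotState) GetSlotMetrics() SlotMetrics {
	s.mu.Lock()
	defer s.mu.Unlock()
	var loan float64
	for _, tokens := range s.slots {
		if tokens < 0 {
			loan -= tokens
		}
	}
	return SlotMetrics{SlotCount: len(s.slots), TokenLoan: loan}
}

// DrainSlotEvents reads and resets the accumulated lifecycle counters so
// each flush reports only the events since the previous tick.
func (s *groupSlotState) DrainSlotEvents() (created, deleted, expired uint64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	created, deleted, expired = s.slotsCreated, s.slotsDeleted, s.slotsExpired
	s.slotsCreated, s.slotsDeleted, s.slotsExpired = 0, 0, 0
	return
}
```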
30c64a8 to 1d4351f
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/mcs/resourcemanager/server/grpc_service.go`:
- Around line 255-263: This RPC path must not touch the mutable metrics cache
directly; replace the direct call to s.manager.getKeyspaceNameByID(...) and
s.manager.metrics.getGaugeMetrics(...) and the subsequent direct Observe calls
with a thread-safe, non-blocking metrics API (e.g. add/use a method like
metrics.ObserveGrant(keyspaceID, keyspaceName, rgName, grantedTokens,
trickleMs)) that either takes a snapshot, uses an internal mutex, or enqueues
the observation to a background goroutine so no map reads/writes happen on the
RPC path; avoid using s.ctx for a potentially blocking storage lookup here (use
a pre-resolved name or fall back to keyspaceID) and always emit the trickle
observation (observe 0 rather than skipping when TrickleTimeMs==0) so immediate
grants are recorded.
In `@pkg/mcs/resourcemanager/server/manager.go`:
- Around line 798-806: The slot event counters are always zero because
getResourceGroupList(true, true) returns cloned ResourceGroups whose
GroupTokenBucketState.clone() does not copy
slotsCreated/slotsDeleted/slotsExpired, so group.DrainSlotEvents() reads zeroed
values; fix by preserving the slot event accumulators during cloning or drain
events before cloning. Update GroupTokenBucketState.clone() to copy the
slotsCreated, slotsDeleted, and slotsExpired fields so clones carry the
accumulated events (or alternatively call DrainSlotEvents() on the live
ResourceGroup list before calling Clone in getResourceGroupList), and ensure
group.DrainSlotEvents() is invoked on objects that contain the real accumulator
values so gaugeM.slotCreatedCounter, slotDeletedCounter, and slotExpiredCounter
increment correctly.
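A minimal sketch of the first option above (carrying the accumulators across clone()). The stand-in struct below has only the fields relevant to the fix and is not the repository's actual GroupTokenBucketState.

```go
package sketch

// groupTokenBucketState is a stand-in with only the fields relevant here;
// the real GroupTokenBucketState has many more.
type groupTokenBucketState struct {
	slotsCreated uint64
	slotsDeleted uint64
	slotsExpired uint64
}

// clone copies the slot event accumulators explicitly, so a metrics flush
// that drains events from cloned groups no longer reads zeroes.
func (s *groupTokenBucketState) clone() *groupTokenBucketState {
	if s == nil {
		return nil
	}
	return &groupTokenBucketState{
		// ...copy the existing token bucket fields here, as before...
		slotsCreated: s.slotsCreated,
		slotsDeleted: s.slotsDeleted,
		slotsExpired: s.slotsExpired,
	}
}
```

The alternative the comment mentions is to call DrainSlotEvents() on the live ResourceGroup list before cloning in getResourceGroupList, instead of changing clone().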
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: dd6f704e-eba1-40e7-83a3-c8c711cb67f9
📒 Files selected for processing (5)
- pkg/mcs/resourcemanager/server/grpc_service.go
- pkg/mcs/resourcemanager/server/manager.go
- pkg/mcs/resourcemanager/server/metrics.go
- pkg/mcs/resourcemanager/server/resource_group.go
- pkg/mcs/resourcemanager/server/token_buckets.go
// B2 + B4: Observe granted tokens and trickle duration.
keyspaceName, nameErr := s.manager.getKeyspaceNameByID(s.ctx, keyspaceID)
if nameErr == nil {
	gaugeM := s.manager.metrics.getGaugeMetrics(keyspaceID, keyspaceName, rg.Name)
	gaugeM.grantedTokensObserver.Observe(tokens.GrantedTokens.GetTokens())
	if tokens.TrickleTimeMs > 0 {
		gaugeM.trickleDurationObserver.Observe(float64(tokens.TrickleTimeMs))
	}
}
Avoid the mutable metrics cache on the RPC path.
pkg/mcs/resourcemanager/server/metrics.go backs getGaugeMetrics() with a plain map, and backgroundMetricsFlush() touches the same cache from another goroutine. Calling it here from concurrent AcquireTokenBuckets() streams can race or hit concurrent map read and map write. This block also uses s.ctx for a storage-backed lookup and skips TrickleTimeMs == 0, so the observation can outlive the RPC and the trickle histogram misses immediate grants.
Possible fix
- keyspaceName, nameErr := s.manager.getKeyspaceNameByID(s.ctx, keyspaceID)
+ keyspaceName, nameErr := s.manager.getKeyspaceNameByID(stream.Context(), keyspaceID)
if nameErr == nil {
- gaugeM := s.manager.metrics.getGaugeMetrics(keyspaceID, keyspaceName, rg.Name)
- gaugeM.grantedTokensObserver.Observe(tokens.GrantedTokens.GetTokens())
- if tokens.TrickleTimeMs > 0 {
- gaugeM.trickleDurationObserver.Observe(float64(tokens.TrickleTimeMs))
- }
+ grantedTokensHistogram.WithLabelValues(rg.Name, keyspaceName).Observe(tokens.GrantedTokens.GetTokens())
+ trickleDurationHistogram.WithLabelValues(rg.Name, keyspaceName).Observe(float64(tokens.TrickleTimeMs))
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/mcs/resourcemanager/server/grpc_service.go` around lines 255 - 263, This
RPC path must not touch the mutable metrics cache directly; replace the direct
call to s.manager.getKeyspaceNameByID(...) and
s.manager.metrics.getGaugeMetrics(...) and the subsequent direct Observe calls
with a thread-safe, non-blocking metrics API (e.g. add/use a method like
metrics.ObserveGrant(keyspaceID, keyspaceName, rgName, grantedTokens,
trickleMs)) that either takes a snapshot, uses an internal mutex, or enqueues
the observation to a background goroutine so no map reads/writes happen on the
RPC path; avoid using s.ctx for a potentially blocking storage lookup here (use
a pre-resolved name or fall back to keyspaceID) and always emit the trickle
observation (observe 0 rather than skipping when TrickleTimeMs==0) so immediate
grants are recorded.
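For reference, here is a self-contained sketch of the stateless direction the "Possible fix" above takes: histogram children are resolved per call via WithLabelValues, which is safe for concurrent use, so the RPC path never touches the shared gaugeMetrics map. The metric names and buckets below are assumptions, and the real observeTokenGrant in this PR takes the token response directly rather than two floats.

```go
package sketch

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical histogram vectors for the two per-grant observations; the
// real declarations in metrics.go may use different help text and buckets.
var (
	grantedTokensHistogram = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "resource_manager_server_granted_tokens",
			Buckets: prometheus.ExponentialBuckets(1, 2, 20),
		},
		[]string{"name", "keyspace"})
	trickleDurationHistogram = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "resource_manager_server_trickle_duration_ms",
			Buckets: prometheus.ExponentialBuckets(1, 2, 16),
		},
		[]string{"name", "keyspace"})
)

// observeTokenGrant records one grant. HistogramVec.WithLabelValues and
// Observe are safe for concurrent use, so no extra locking or map access is
// needed on the AcquireTokenBuckets path, and a zero trickle duration is
// observed rather than skipped.
func observeTokenGrant(groupName, keyspaceName string, grantedTokens, trickleMs float64) {
	grantedTokensHistogram.WithLabelValues(groupName, keyspaceName).Observe(grantedTokens)
	trickleDurationHistogram.WithLabelValues(groupName, keyspaceName).Observe(trickleMs)
}
```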
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #10495 +/- ##
==========================================
+ Coverage 78.88% 78.95% +0.07%
==========================================
Files 530 530
Lines 71548 71609 +61
==========================================
+ Hits 56439 56542 +103
+ Misses 11092 11046 -46
- Partials 4017 4021 +4
Flags with carried forward coverage won't be shown. Click here to find out more.
Actionable comments posted: 1
♻️ Duplicate comments (2)
pkg/mcs/resourcemanager/server/grpc_service.go (1)
255-257: ⚠️ Potential issue | 🟠 Major
Use the RPC context here, and don’t drop the metric on lookup errors.
`getKeyspaceNameByID` can hit storage, so passing `s.ctx` lets this lookup outlive a canceled stream. The `err == nil` guard also leaves holes in the new grant metrics whenever name resolution fails; use `stream.Context()` and fall back to a stable placeholder label instead of skipping the observation.
Possible fix
- if keyspaceName, err := s.manager.getKeyspaceNameByID(s.ctx, keyspaceID); err == nil {
-     observeTokenGrant(rg.Name, keyspaceName, tokens)
- }
+ keyspaceName, err := s.manager.getKeyspaceNameByID(stream.Context(), keyspaceID)
+ if err != nil {
+     keyspaceName = "unknown"
+ }
+ observeTokenGrant(rg.Name, keyspaceName, tokens)
Verification: this should show the `s.ctx` call site and the storage-backed `RunInTxn` path inside `getKeyspaceNameByID`.
#!/bin/bash
set -euo pipefail
echo "AcquireTokenBuckets call site:"
rg -n -C3 'getKeyspaceNameByID\(' pkg/mcs/resourcemanager/server/grpc_service.go
echo
echo "getKeyspaceNameByID implementation:"
rg -n -C12 'func \(m \*Manager\) getKeyspaceNameByID' pkg/mcs/resourcemanager/server/manager.go
As per coding guidelines, "First parameter must be context.Context for external effects; never store contexts in structs".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/mcs/resourcemanager/server/grpc_service.go` around lines 255 - 257, The call to s.manager.getKeyspaceNameByID uses s.ctx and skips metric recording on lookup errors; change the call site in the grant handling (where observeTokenGrant(rg.Name, keyspaceName, tokens) is invoked) to use the RPC stream context (stream.Context()) as the first arg to getKeyspaceNameByID, and always call observeTokenGrant — if getKeyspaceNameByID returns an error, pass a stable placeholder label (e.g., "unknown_keyspace") for keyspaceName instead of skipping the observation so the metric is never dropped.
pkg/mcs/resourcemanager/server/metrics.go (1)
623-625: ⚠️ Potential issue | 🟠 Major
Observe zero-trickle grants too.
`trickle_duration_ms` is described as a per-grant metric, but the `> 0` guard drops immediate grants. That skews the histogram toward delayed grants and makes its count disagree with the total number of grants.
Possible fix
  grantedTokensHistogram.WithLabelValues(groupName, keyspaceName).Observe(tokens.GrantedTokens.GetTokens())
- if tokens.TrickleTimeMs > 0 {
-     trickleDurationHistogram.WithLabelValues(groupName, keyspaceName).Observe(float64(tokens.TrickleTimeMs))
- }
+ trickleDurationHistogram.WithLabelValues(groupName, keyspaceName).Observe(float64(tokens.TrickleTimeMs))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/mcs/resourcemanager/server/metrics.go` around lines 623 - 625, The trickle-duration histogram currently skips immediate grants by only observing when tokens.TrickleTimeMs > 0; update the logic around grantedTokensHistogram.WithLabelValues(...) and trickleDurationHistogram.WithLabelValues(...) to always call trickleDurationHistogram.WithLabelValues(groupName, keyspaceName).Observe(float64(tokens.TrickleTimeMs)) for every grant (including zero), so the per-grant trickle_duration_ms metric counts every grant and its histogram matches the total grants observed via grantedTokensHistogram and GrantedTokens.GetTokens().
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/mcs/resourcemanager/server/metrics.go`:
- Around line 607-613: The five group-scoped metrics (grantedTokensHistogram,
activeSlotCountGauge, slotEventsCounter, tokenLoanGauge,
trickleDurationHistogram) are being deleted inside the per-ruType cleanup path
(via DeleteLabelValues) and must be removed from that path; instead, remove
those DeleteLabelValues calls from the code invoked during per-ruType stale
eviction (cleanupAllMetrics / deleteMetrics) and place them only in the
full-group removal code path (the routine that handles complete group deletion)
so group-scoped metrics are deleted only when the entire group is removed.
---
Duplicate comments:
In `@pkg/mcs/resourcemanager/server/grpc_service.go`:
- Around line 255-257: The call to s.manager.getKeyspaceNameByID uses s.ctx and
skips metric recording on lookup errors; change the call site in the grant
handling (where observeTokenGrant(rg.Name, keyspaceName, tokens) is invoked) to
use the RPC stream context (stream.Context()) as the first arg to
getKeyspaceNameByID, and always call observeTokenGrant — if getKeyspaceNameByID
returns an error, pass a stable placeholder label (e.g., "unknown_keyspace") for
keyspaceName instead of skipping the observation so the metric is never dropped.
In `@pkg/mcs/resourcemanager/server/metrics.go`:
- Around line 623-625: The trickle-duration histogram currently skips immediate
grants by only observing when tokens.TrickleTimeMs > 0; update the logic around
grantedTokensHistogram.WithLabelValues(...) and
trickleDurationHistogram.WithLabelValues(...) to always call
trickleDurationHistogram.WithLabelValues(groupName,
keyspaceName).Observe(float64(tokens.TrickleTimeMs)) for every grant (including
zero), so the per-grant trickle_duration_ms metric counts every grant and its
histogram matches the total grants observed via grantedTokensHistogram and
GrantedTokens.GetTokens().
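Both prompts above converge on the same call-site pattern, sketched here as a tiny self-contained helper. The lookup signature is a placeholder; the actual getKeyspaceNameByID method on the manager differs.

```go
package sketch

import "context"

// resolveKeyspaceLabel makes sure the keyspace label is always populated so
// the grant observation is never dropped when name resolution fails; pass
// the RPC stream's context rather than a server-lifetime context.
func resolveKeyspaceLabel(ctx context.Context,
	lookup func(context.Context, uint32) (string, error), keyspaceID uint32) string {
	name, err := lookup(ctx, keyspaceID)
	if err != nil {
		return "unknown" // stable placeholder label, as the review suggests
	}
	return name
}
```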
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 53096e8e-256d-4896-873e-437c51beb25b
📒 Files selected for processing (5)
- pkg/mcs/resourcemanager/server/grpc_service.go
- pkg/mcs/resourcemanager/server/manager.go
- pkg/mcs/resourcemanager/server/metrics.go
- pkg/mcs/resourcemanager/server/resource_group.go
- pkg/mcs/resourcemanager/server/token_buckets.go
✅ Files skipped from review due to trivial changes (1)
- pkg/mcs/resourcemanager/server/token_buckets.go
🚧 Files skipped from review as they are similar to previous changes (2)
- pkg/mcs/resourcemanager/server/manager.go
- pkg/mcs/resourcemanager/server/resource_group.go
grantedTokensHistogram.DeleteLabelValues(groupName, keyspaceName)
activeSlotCountGauge.DeleteLabelValues(groupName, keyspaceName)
slotEventsCounter.DeleteLabelValues(groupName, keyspaceName, "created")
slotEventsCounter.DeleteLabelValues(groupName, keyspaceName, "deleted")
slotEventsCounter.DeleteLabelValues(groupName, keyspaceName, "expired")
tokenLoanGauge.DeleteLabelValues(groupName, keyspaceName)
trickleDurationHistogram.DeleteLabelValues(groupName, keyspaceName)
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "deleteMetrics/deleteLabelValues definitions:"
rg -n -C5 'func \(m \*metrics\) deleteMetrics|func deleteLabelValues' pkg/mcs/resourcemanager/server/metrics.go
echo
echo "cleanupAllMetrics call sites:"
rg -n -C6 '\bcleanupAllMetrics\(' pkg/mcs/resourcemanager/server
echo
echo "deleteMetrics call sites:"
rg -n -C4 '\bdeleteMetrics\(' pkg/mcs/resourcemanager/server
Repository: tikv/pd
Length of output: 4846
🏁 Script executed:
sed -n '585,630p' pkg/mcs/resourcemanager/server/metrics.go | cat -n
Repository: tikv/pd
Length of output: 3362
These group-scoped metrics must not be deleted from the per-ruType cleanup path.
The new metrics—granted_tokens, active_slot_count, slot_events_total, token_loan, and trickle_duration_ms—are declared with only (groupName, keyspaceName) labels (no ruLabelType). Because they sit inside deleteLabelValues, they are deleted whenever deleteMetrics is called, which happens during every per-ruType stale consumption eviction via cleanupAllMetrics. This means a single expired ruType record will wipe group-scoped metrics that remain relevant for other ruType values or when the group is otherwise active.
Move these five metrics to a separate deletion path that is only invoked for full group removal, not per-ruType stale cleanup.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/mcs/resourcemanager/server/metrics.go` around lines 607 - 613, The five
group-scoped metrics (grantedTokensHistogram, activeSlotCountGauge,
slotEventsCounter, tokenLoanGauge, trickleDurationHistogram) are being deleted
inside the per-ruType cleanup path (via DeleteLabelValues) and must be removed
from that path; instead, remove those DeleteLabelValues calls from the code
invoked during per-ruType stale eviction (cleanupAllMetrics / deleteMetrics) and
place them only in the full-group removal code path (the routine that handles
complete group deletion) so group-scoped metrics are deleted only when the
entire group is removed.
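A sketch of the split the comment asks for. To stay self-contained it takes the vectors through a small interface instead of referencing the declarations in metrics.go; wiring the helper into the real group-removal path is left as described in the prompt above.

```go
package sketch

// labelDeleter is the subset of a Prometheus metric vector used here;
// GaugeVec, CounterVec, and HistogramVec all satisfy it.
type labelDeleter interface {
	DeleteLabelValues(lvs ...string) bool
}

// deleteGroupScopedMetrics removes the (group, keyspace)-labeled series added
// by this PR. Per the comment above it should be called only from the
// full-group removal path, never from per-ruType stale eviction. Callers pass
// the two-label vectors (grantedTokensHistogram, activeSlotCountGauge,
// tokenLoanGauge, trickleDurationHistogram) plus the slot events counter.
func deleteGroupScopedMetrics(groupName, keyspaceName string,
	twoLabelVecs []labelDeleter, slotEvents labelDeleter) {
	for _, vec := range twoLabelVecs {
		vec.DeleteLabelValues(groupName, keyspaceName)
	}
	for _, event := range []string{"created", "deleted", "expired"} {
		slotEvents.DeleteLabelValues(groupName, keyspaceName, event)
	}
}
```

The per-ruType stale eviction path would then stop calling DeleteLabelValues on these five vectors and keep touching only its ruType-labeled series.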
What problem does this PR solve?
Issue Number: ref #10488
What is changed and how does it work?
This is Phase 2 of the Resource Control observability improvements. It adds server-side metrics to the Resource Manager (pkg/mcs/resourcemanager/server/) for visibility into token allocation decisions, slot lifecycle, loan state, and trickle behavior.

Changes:

B2 — Granted tokens histogram
- `resource_manager_server_granted_tokens` — histogram of tokens granted per token bucket request, observed in the AcquireTokenBuckets gRPC handler after each successful token grant

B3 — Slot lifecycle metrics
- `resource_manager_server_active_slot_count` — gauge of active client slots per resource group
- `resource_manager_server_slot_events_total` — counter of slot events (created/deleted/expired), accumulated in the slotsCreated/slotsDeleted/slotsExpired fields in GroupTokenBucketState and drained in the 1Hz backgroundMetricsFlush
- Slot count and loan are read via the GetSlotMetrics() accessor

B4 — Token loan + trickle duration
- `resource_manager_server_token_loan` — gauge of total loan (negative tokens in slots) per group
- `resource_manager_server_trickle_duration_ms` — histogram of trickle duration per grant

All new gauge metrics follow the existing gaugeMetrics caching pattern (pre-resolved WithLabelValues). Cleanup is handled in deleteLabelValues.

Check List
Tests