intentresolver: chunk intent resolution to limit in-flight requests#166156
intentresolver: chunk intent resolution to limit in-flight requests#166156tbg wants to merge 3 commits intocockroachdb:masterfrom
Conversation
Previously, `resolveIntents` submitted all of a transaction's intents to the request batcher in a tight loop, then collected all responses. For a txn with 70k intents, this created ~700 batcher goroutines. With many such txns processed concurrently by MVCC GC, the node was overwhelmed by hundreds of thousands of goroutines. Now, intents are submitted in chunks of 1000 and each chunk's responses are collected before the next chunk is submitted. When `len(reqs) <= 1000` (the common case), there is a single iteration — identical to the previous behavior. Fixes: cockroachdb#166037 Epic: none Release note: None Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
|
Merging to
|
|
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
Add an IntPool-based budget that limits the total number of in-flight intents across all concurrent GC-initiated async cleanup goroutines to 1000. Each goroutine acquires budget equal to its transaction's lock span count before proceeding with intent resolution. For transactions with more intents than the budget capacity (e.g. 70k), the IntPool truncates the acquisition to the full capacity, effectively consuming the entire budget until that goroutine completes. This is the "overshoot by one txn" behavior — we can't split a transaction, so one large txn is allowed through but blocks all others. This complements the existing goroutine semaphore (ir.sem, WaitForSem: false) which limits the number of concurrent goroutines. The intent budget operates inside those goroutines, blocking (not skipping) until budget is available. Together with the per-caller chunking from the previous commit, this bounds the actual in-flight intent resolution work at the node level. Informs: cockroachdb#166037 Epic: none Release note: None Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Add doc.go with a comprehensive overview of the intent resolver's use cases, resolution pipeline, request batching, concurrency limiters, and admission control policy. The doc walks through each caller (foreground resolution, post-txn cleanup, encountered-intent cleanup, MVCC GC) and explains which limiters apply, why the request batcher is intentionally unlimited, and how admission control priority is adjusted to prevent lock priority inversion. Epic: none Release note: None Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
themavik
left a comment
There was a problem hiding this comment.
Reviewed the changes — the implementation is clean and addresses the reported issue.
themavik
left a comment
There was a problem hiding this comment.
What this PR does: Chunks intent resolution to limit in-flight requests and adds gcIntentBudget to prevent GC from overwhelming the node.
Done well: The doc.go is excellent—clear explanation of the four use cases, resolution pipeline, request batching, and all concurrency limiters (ir.sem, gcIntentBudget, maxIntentsInFlightPerCaller, lockInFlightTxnCleanup, inFlightPushes). The chunking in resolveIntents (maxIntentsInFlightPerCaller=1000) correctly collects responses per chunk before submitting the next. gcIntentBudget with IntPool truncation handles large transactions.
Note: When a transaction has >1000 intents, it consumes the full budget until completion—doc.go explains this. The ctx.Done and stopper.ShouldQuiesce checks in the chunk loop are good for graceful shutdown.
themavik
left a comment
There was a problem hiding this comment.
What it does: Chunks intent resolution to limit in-flight requests—adds gcIntentBudget (1000 intents) and per-caller chunking (1000 intents) to prevent goroutine explosion from large transactions.
Done well: Excellent doc.go—clearly documents the four use cases, resolution pipeline, request batching, and how the limiters compose. The gcIntentBudget acquisition in CleanupTxnIntentsOnGCAsync with IntPool truncation for oversized txns is the right approach. Chunking resolveIntents with maxIntentsInFlightPerCaller prevents 70k intents from spawning 70k batcher goroutines. Proper ctx.Done and stopper handling in the chunk loop.
Suggestion: Consider making gcMaxIntentsInFlight and maxIntentsInFlightPerCaller cluster settings for tuning in production—1000 may not fit all workloads. The doc already mentions gcIntentBudget capacity; adding the chunk size to the doc would help.
Summary
Two-pronged approach to limit the goroutine/intent storm caused by MVCC
GC resolving intents for transactions with many intents (e.g. 70k):
Commit 1: Per-caller chunking —
resolveIntentsnow submitsintents to the request batcher in chunks of 1000 (instead of all at
once), collecting responses for each chunk before submitting the next.
When
len(reqs) <= 1000(the common case), behavior is identical tobefore.
Commit 2: Node-level intent budget — An IntPool-based budget
limits the total number of in-flight intents across all concurrent
GC-initiated async cleanup goroutines to 1000. Goroutines block (not
skip) until budget is available. For large transactions exceeding the
budget, the IntPool truncates the acquisition to the full capacity,
effectively consuming the entire budget until that goroutine
completes ("overshoot by one txn").
Fixes: #166037
Epic: none
Release note: None