Fix pinBlocks/unpinBlocksById to correctly handle multiple window sizes #12325
thorjohnsen wants to merge 3 commits into NVIDIA:main from
Conversation
pinBlocks now returns an unordered_map<windowSize, blockIds> so callers know exactly which blocks were pinned per window. unpinBlocksById accepts the same map type and correctly routes each block to its window manager, fixing a bug where only the first window manager's blocks were unpinned.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
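For illustration, a minimal sketch of the new round-trip call pattern (the surrounding transfer scenario is hypothetical; only pinBlocks/unpinBlocksById come from this PR):

```cpp
// Sketch only: kvCacheManager is any implementation of the updated interface.
// Pin every block the request currently holds, across all attention windows.
auto pinnedBlocks = kvCacheManager.pinBlocks(requestId);
// pinnedBlocks maps window size -> block IDs pinned in that window,
// e.g. {1024: [0, 1, 2], 512: [7, 8]}.

// ... the sequence can be removed while a transfer is in flight;
// pinned blocks survive eviction in every window ...

// Hand the same map back so each entry is routed to its own window manager.
kvCacheManager.unpinBlocksById(pinnedBlocks);
```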
📝 Walkthrough
The changes modify KV cache block management APIs to track pinned blocks per attention window.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h`:
- Line 1564: Re-run LLVM clang-format (configured for 120-column width) on the
header and commit the resulting formatting changes; ensure the declaration of
virtual std::unordered_map<SizeType32, std::vector<KVCacheBlock::IdType>>
pinBlocks(LlmRequest::RequestIdType requestId) and the nearby declarations (the
symbols around the pinBlocks method and the other affected declarations near the
KVCacheManager interface) are reformatted to match the project's clang-format
rules and line-length limits, then stage and commit the updated header so CI no
longer reports formatting diffs.
In `@cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp`:
- Around line 2676-2683: The file fails clang-format checks; run LLVM
clang-format over the implementation (including the KVCacheManager::pinBlocks
and KVCacheManager::unpinBlocksById regions) to apply the project's formatting
rules, verify the formatting changes (e.g., spacing/brace alignment) are
correct, and commit the formatted file so the CI formatting hook passes.
In `@cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp`:
- Around line 4225-4233: The test only checks the pinnedBlocks entry for
maxAttentionWindow so it misses the multi-window unpin regression; update the
test to configure at least two attention windows in the setup, then after
calling kvCacheManager.pinBlocks(requestId) iterate over every entry in the
returned pinnedBlocks map (not just pinnedBlocks.at(maxAttentionWindow)) and for
each window key verify each blockId is present in the corresponding result of
kvCacheManager.getCacheBlockIds(requestId, windowKey) (or the appropriate index)
so pin/unpin behavior is validated across all windows.
📒 Files selected for processing (3)
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
- cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
    /// @brief Pin blocks associated with a request to prevent eviction.
    /// @param requestId The ID of the request whose blocks should be pinned.
-   virtual void pinBlocks(LlmRequest::RequestIdType requestId) = 0;
+   virtual std::unordered_map<SizeType32, std::vector<KVCacheBlock::IdType>> pinBlocks(LlmRequest::RequestIdType requestId) = 0;
Re-run clang-format for updated declarations in this header.
CI reports formatting changes were auto-applied, and this file is listed as affected. Please format and commit the resulting diff before merge (notably around Lines 1564, 1738, and 2037-2039).
Suggested formatting shape (example):
-virtual std::unordered_map<SizeType32, std::vector<KVCacheBlock::IdType>> pinBlocks(LlmRequest::RequestIdType requestId) = 0;
+virtual std::unordered_map<SizeType32, std::vector<KVCacheBlock::IdType>> pinBlocks(
+    LlmRequest::RequestIdType requestId)
+    = 0;
-virtual void unpinBlocksById(std::unordered_map<SizeType32, std::vector<KVCacheBlock::IdType>> const& pinnedBlocks) = 0;
+virtual void unpinBlocksById(
+    std::unordered_map<SizeType32, std::vector<KVCacheBlock::IdType>> const& pinnedBlocks)
+    = 0;

Also applies to: 1738, 2037-2039
 std::unordered_map<SizeType32, std::vector<KVCacheBlock::IdType>> KVCacheManager::pinBlocks(RequestIdType requestId)
 {
     auto& sequence = getSequence(requestId);
-    mBlockManager.pinBlocks(sequence);
+    return mBlockManager.pinBlocks(sequence);
 }

-void KVCacheManager::unpinBlocksById(std::vector<KVCacheBlock::IdType> const& blockIds)
+void KVCacheManager::unpinBlocksById(
+    std::unordered_map<SizeType32, std::vector<KVCacheBlock::IdType>> const& pinnedBlocks)
Format this implementation file before merge (CI is failing on clang-format).
Release checks indicate formatting hooks modified this file. Please run clang-format and commit the result.
As per coding guidelines: "Use the LLVM clang-format tool for formatting C++ code changes prior to submitting the PR".
    auto const pinnedBlocks = kvCacheManager.pinBlocks(requestId);
    EXPECT_FALSE(pinnedBlocks.empty());
    ASSERT_TRUE(pinnedBlocks.count(maxAttentionWindow));
    auto const& pinnedIds = pinnedBlocks.at(maxAttentionWindow);
    auto const& allBlockIds = kvCacheManager.getCacheBlockIds(requestId, maxAttentionWindow)[0];
    std::vector<SizeType32> pinnedBlockIds(allBlockIds.begin(), allBlockIds.end());
    for (auto blockId : pinnedIds)
    {
        EXPECT_NE(std::find(allBlockIds.begin(), allBlockIds.end(), blockId), allBlockIds.end());
    }
This test still does not validate the multi-window unpin regression.
At Line 4227, the assertion only checks maxAttentionWindow, and the test setup uses a single window. That means the bug fixed by this PR (unpinning only the first window manager) would not be caught here. Please make this test configure at least two windows and validate pin/unpin across all returned map entries.
✅ Suggested test strengthening
 auto constexpr maxAttentionWindow = tokensPerBlock * blocksInPrimaryPool;
+auto constexpr minAttentionWindow = maxAttentionWindow / 2;
-BlocksPerWindow const blocksPerWindow{{maxAttentionWindow, {blocksInPrimaryPool, blocksInSecondaryPool}}};
+BlocksPerWindow const blocksPerWindow{
+    {maxAttentionWindow, {blocksInPrimaryPool, blocksInSecondaryPool}},
+    {minAttentionWindow, {blocksInPrimaryPool, blocksInSecondaryPool}},
+};
 KVCacheManager kvCacheManager(numLayers, numKvHeads, sizePerHead, tokensPerBlock, blocksPerWindow, maxNumSequences,
-    beamWidth, std::vector<BlockManager::SizeType32>{maxAttentionWindow}, std::nullopt, nvinfer1::DataType::kHALF,
+    beamWidth, std::vector<BlockManager::SizeType32>{maxAttentionWindow, minAttentionWindow}, std::nullopt,
+    nvinfer1::DataType::kHALF,
     0, stream, maxAttentionWindow, true, onboardBlocks);

 auto const pinnedBlocks = kvCacheManager.pinBlocks(requestId);
 EXPECT_FALSE(pinnedBlocks.empty());
-ASSERT_TRUE(pinnedBlocks.count(maxAttentionWindow));
-auto const& pinnedIds = pinnedBlocks.at(maxAttentionWindow);
-auto const& allBlockIds = kvCacheManager.getCacheBlockIds(requestId, maxAttentionWindow)[0];
-for (auto blockId : pinnedIds)
+ASSERT_THAT(pinnedBlocks, ::testing::Contains(::testing::Key(maxAttentionWindow)));
+ASSERT_THAT(pinnedBlocks, ::testing::Contains(::testing::Key(minAttentionWindow)));
+for (auto const& [windowSize, pinnedIds] : pinnedBlocks)
 {
-    EXPECT_NE(std::find(allBlockIds.begin(), allBlockIds.end(), blockId), allBlockIds.end());
+    auto const& allBlockIds = kvCacheManager.getCacheBlockIds(requestId, windowSize)[0];
+    for (auto blockId : pinnedIds)
+    {
+        EXPECT_NE(std::find(allBlockIds.begin(), allBlockIds.end(), blockId), allBlockIds.end());
+    }
 }

Also applies to: 4238
… API
Update resource_manager.py and py_executor.py to match the updated C++ signatures where pinBlocks returns unordered_map<windowSize, blockIds> and unpinBlocksById accepts the same map type.
- KVCacheManager.pin_blocks now returns dict[int, list[int]] instead of None
- KVCacheManager.unpin_blocks_by_id parameter changed from int to dict[int, list[int]]
- AsyncTransferManager stores the pinned_blocks map (not a bare block_id) and passes it back to unpin_blocks_by_id on transfer completion

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@eopXD could you review this? thanks
Summary
- pinBlocks now returns unordered_map<SizeType32, vector<KVCacheBlock::IdType>>, mapping each window size to the IDs of blocks pinned in that window
- unpinBlocksById updated to accept the same map type, routing each block to the correct WindowBlockManager for unpinning (see the sketch after this list)
- Previously, BlockManager::unpinBlocksById only unpinned blocks from the first window manager, leaving blocks in other windows permanently pinned after sequence removal
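A minimal sketch of what that per-window routing could look like; the mWindowBlockManagers member and the unpinBlockById helper are assumed names for illustration, not the actual TensorRT-LLM code:

```cpp
void BlockManager::unpinBlocksById(
    std::unordered_map<SizeType32, std::vector<KVCacheBlock::IdType>> const& pinnedBlocks)
{
    for (auto const& [windowSize, blockIds] : pinnedBlocks)
    {
        // Look up the manager for this window instead of always using the
        // first one, which was the root cause of the leaked pins.
        auto& windowManager = mWindowBlockManagers.at(windowSize);
        for (auto blockId : blockIds)
        {
            windowManager.unpinBlockById(blockId);
        }
    }
}
```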
Test plan
- Extended the PinAndUnpinBlocksById unit test to verify the returned map contains the expected window size and block IDs
- Verified pinned blocks survive removeSequence when pinned and are released by unpinBlocksById
- cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp requires no code changes; nanobind auto-converts unordered_map to dict[int, list[int]] (the headers already include nanobind/stl/unordered_map.h and nanobind/stl/vector.h)

🤖 Generated with Claude Code
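To illustrate why the bindings stay untouched, a nanobind module along these lines (module and names are illustrative, not the actual binding file) would already expose the map as dict[int, list[int]]:

```cpp
#include <nanobind/nanobind.h>
#include <nanobind/stl/unordered_map.h> // unordered_map <-> dict conversion
#include <nanobind/stl/vector.h>        // vector <-> list conversion

namespace nb = nanobind;

NB_MODULE(kv_cache_example, m)
{
    nb::class_<KVCacheManager>(m, "KVCacheManager")
        // With the STL headers above, the C++ return type
        // unordered_map<SizeType32, vector<IdType>> surfaces in Python as
        // dict[int, list[int]] without any custom conversion code.
        .def("pin_blocks", &KVCacheManager::pinBlocks)
        .def("unpin_blocks_by_id", &KVCacheManager::unpinBlocksById);
}
```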
Summary by CodeRabbit
Refactor
Tests