[None][feat] Cache Transceivers for Mamba States #10934

NVShreyas · 2026-01-22T19:38:34Z

Summary by CodeRabbit

New Features
- Added RNN state management with cache block allocation and per-request state indexing.
- Introduced disaggregated RNN cache transfer with asynchronous and synchronous operations.
- Enhanced cache buffer management for distributed transfers.
- Expanded attention metadata support for Mamba models with state index management.
Tests
- Added unit tests for RNN cache formatter and target rank computation.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>

Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>

NVShreyas · 2026-01-22T19:39:46Z

/bot run

tensorrt-cicd · 2026-01-22T19:45:46Z

PR_Github #33217 [ run ] triggered by Bot. Commit: 559ca89

coderabbitai · 2026-01-22T19:53:21Z

📝 Walkthrough

Walkthrough

This pull request introduces comprehensive RNN/Mamba cache transfer infrastructure for disaggregated state management. New components include RnnCacheFormatter, RnnCacheTransceiver with sender/receiver pairs, RnnStateManager cache block allocation, buffer management framework, and RnnCacheState serialization support. Python bindings and attention backends are updated to integrate Mamba metadata handling.

Changes

Cohort / File(s)	Summary
RNN Cache State & Serialization `cpp/include/tensorrt_llm/executor/dataTransceiverState.h`, `cpp/tensorrt_llm/executor/serialization.cpp`, `cpp/include/tensorrt_llm/executor/serialization.h`, `cpp/tensorrt_llm/executor/serializeUtils.h`	New `rnn_cache::RnnCacheState` with ModelConfig, ParallelConfig; integrated into DataTransceiverState as optional member. Serialization/deserialization support added with field-by-field composition and size calculation.
RNN Cache Transfer Framework `cpp/include/tensorrt_llm/batch_manager/rnnCacheFormatter.h`, `cpp/tensorrt_llm/batch_manager/rnnCacheFormatter.cpp`	New `RnnCacheFormatter` class deriving from `BaseCacheFormatter`; implements RNN-specific support inquiry, counterpart mapping, and connection selection via `rnnTargetIRanks` helper. Handles state compatibility checking and PP/TP rank targeting.
RNN Cache Transceiver `cpp/include/tensorrt_llm/batch_manager/rnnCacheTransceiver.h`, `cpp/tensorrt_llm/batch_manager/rnnCacheTransceiver.cpp`	New `RnnCacheSender`, `RnnCacheReceiver`, and `RnnCacheTransceiver` classes managing disaggregated RNN state transfer. Supports async/sync context/generation transfer, MPI/UCX backend initialization, cross-rank synchronization, and layered transfer progress tracking with futures.
RNN State Manager Enhancements `cpp/include/tensorrt_llm/batch_manager/rnnStateManager.h`, `cpp/tensorrt_llm/batch_manager/rnnStateManager.cpp`	Added constructor overload with WorldConfig and layer masking; new cache block allocation/freeing (allocateCacheBlocks, freeCacheBlock). State index mapping via getStateIndices with padding support. Per-layer state accessors (getConvStates, getSsmStates) with offset-based lookup.
Base Transfer Buffer Manager `cpp/tensorrt_llm/batch_manager/baseTransBuffer.h`, `cpp/tensorrt_llm/batch_manager/baseTransBuffer.cpp`	New `BaseTransBufferManager` providing centralized buffer management for cache transfers. Handles buffer index assignment/release, dynamic/pre-allocated buffers, FabricMemory integration, and concurrency resource management via mutex-protected allocations.
Cache Transfer Buffer Manager `cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h`, `cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp`	Refactored `CacheTransBufferManager` to inherit from `BaseTransBufferManager`; simplified constructor with `computeTransferBufferSize` helper selecting data type from KV cache pools. Removed per-buffer lifecycle methods in favor of base class delegation.
Data Transceiver Base Classes `cpp/tensorrt_llm/batch_manager/dataTransceiver.h`, `cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp`	New `BaseCacheSenderImpl` and `BaseCacheReceiverImpl` base classes replacing prior Impl; add threading, promise/future tracking, per-request cancellation, and async send/receive workflows. `CacheSender` and `CacheReceiver` now inherit from bases with protected state accessors; extended public surface for request lifecycle management.
Batch Manager Build `cpp/tensorrt_llm/batch_manager/CMakeLists.txt`	Added source files: `baseTransBuffer.cpp`, `rnnCacheFormatter.cpp`, `rnnCacheTransceiver.cpp`, `rnnCacheTransBuffer.cpp` to SRCS list.
Python Bindings - Batch Manager `cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp`, `cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp`	Exposed new RnnStateManager constructor overload with extended parameters (dState, dConv, numHeads, nGroups, headDim, numLayers, maxBatchSize, worldConfig, stream, dtype, ssmCacheDtype, layerMask). Added Python methods: get_cache_index, get_conv_states, get_ssm_states, allocate_cache_blocks, free_cache_block, get_state_indices with GIL guards.
Attention Backend Mamba Metadata `tensorrt_llm/_torch/attention_backend/interface.py`, `tensorrt_llm/_torch/attention_backend/flashinfer.py`, `tensorrt_llm/_torch/attention_backend/trtllm.py`, `tensorrt_llm/_torch/attention_backend/vanilla.py`	Added `mamba_metadata` and `mamba_chunk_size` fields to `AttentionMetadata`. New `_prepare_mamba_metadata()` method initializes MambaCacheManager state; all prepare() methods now call superclass prepare().
Python Torch Models `tensorrt_llm/_torch/models/modeling_nemotron_h.py`, `tensorrt_llm/_torch/models/modeling_qwen3_next.py`	Refactored to use `attn_metadata.mamba_metadata` directly instead of per-instance caching; removed self.mamba_metadata initialization and preparation logic, reducing redundancy.
Python Mamba Modules `tensorrt_llm/_torch/modules/mamba/mamba2_metadata.py`, `tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py`	New state_indices initialization in Mamba2Metadata.prepare() via KVCacheManager.get_state_indices with padding awareness. State index source changed from attn_metadata.kv_cache_manager to mamba_metadata.state_indices in mixer forward pass.
Python Mamba Cache Manager `tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py`	Replaced internal buffer management with delegation to new `RnnStateManagerCpp` (mamba_impl). Removed explicit conv/ssm state tensors and block tracking; added WorldConfig-based setup, stream binding, and new accessors: get_cache_index, get_state_indices, get_conv_states, get_ssm_states. Extended MambaHybridCacheManager with execution_stream parameter.
Python CUDA Graph & Model Engine `tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py`, `tensorrt_llm/_torch/pyexecutor/model_engine.py`	Removed MambaCacheManager padding reordering logic from CUDA graph runner. Added token count aggregation via dist.tp_allgather in model_engine prepare paths when attention_dp enabled, populating all_rank_num_tokens.
Cache Transceiver & Unit Tests `cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h`, `cpp/tests/unit_tests/batch_manager/CMakeLists.txt`, `cpp/tests/unit_tests/batch_manager/rnnCacheFormatterTest.cpp`	Added `gatherRequestIds()` free function for cross-rank request ID aggregation. New unit test rnnCacheFormatterTest validates RnnTargetIRanks logic with varying PP/TP configurations, non-uniform layer distributions, and inquireSupport compatibility checks.
Tools `tensorrt_llm/tools/layer_wise_benchmarks/runner.py`	Changed `add_dummy_requests` call to use keyword argument `token_nums` instead of positional parameter.

Sequence Diagram(s)

sequenceDiagram
    actor Req as Request
    participant RnnCacheTx as RnnCacheTransceiver
    participant RnnSend as RnnCacheSender
    participant RnnRecv as RnnCacheReceiver
    participant CommGrp as CacheTransceiverComm<br/>(MPI/UCX)
    participant RnnMgr as RnnStateManager
    participant Fmt as RnnCacheFormatter

    Req->>RnnCacheTx: respondAndSendAsync(llmRequest)
    activate RnnCacheTx
    RnnCacheTx->>RnnCacheTx: setContextState(llmRequest)
    RnnCacheTx->>RnnSend: sendAsync(llmRequest)
    activate RnnSend
    RnnSend->>Fmt: inquireSupport(selfState, peerState)
    RnnSend->>Fmt: getCounterparts(selfState, idx, peerState)
    RnnSend->>CommGrp: send RNN state data
    RnnSend-->>RnnCacheTx: return future
    deactivate RnnSend
    deactivate RnnCacheTx

    Req->>RnnCacheTx: requestAndReceiveAsync(llmRequest)
    activate RnnCacheTx
    RnnCacheTx->>RnnRecv: receiveAsync(llmRequest)
    activate RnnRecv
    RnnRecv->>Fmt: inquireSupport(selfState, peerState)
    RnnRecv->>CommGrp: wait for RNN state data
    RnnRecv->>RnnMgr: allocateCacheBlocks(requestIds)
    RnnRecv-->>RnnCacheTx: return future
    deactivate RnnRecv
    deactivate RnnCacheTx

    Req->>RnnCacheTx: checkContextTransferStatus()
    activate RnnCacheTx
    RnnCacheTx->>CommGrp: aggregate completion across ranks
    RnnCacheTx->>Req: update request state
    deactivate RnnCacheTx

    Req->>RnnCacheTx: checkGenTransferStatus()
    activate RnnCacheTx
    RnnCacheTx->>CommGrp: synchronize generation transfer completion
    RnnCacheTx->>RnnMgr: query state indices
    RnnCacheTx->>Req: update generation progress
    deactivate RnnCacheTx

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 18.13% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check	❓ Inconclusive	The PR description is mostly the template with placeholder sections. While it follows the structure, all substantive sections (Description, Test Coverage) remain unfilled with actual details about what is being implemented or tested.	Fill in the Description section explaining why cache transceivers for Mamba are needed and how they work. Add Test Coverage section documenting which tests validate the changes.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main feature being added: cache transceivers for Mamba/RNN states.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing touches

📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 17

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (10)

tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py (1)
1-1: Update header year to include 2026.
This file was modified in 2026; please bump the copyright range.
Proposed fix
-# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
tensorrt_llm/_torch/modules/mamba/mamba2_metadata.py (1)
1-2: Update SPDX year to 2026.
This file now includes 2026 modifications but the header still ends at 2024.
📝 Proposed update
-# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
As per coding guidelines, please update the year.
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h (1)
1-2: Update copyright year to 2026.
The file now has 2026 modifications but the header still ends at 2024.
📝 Proposed update
- * Copyright (c) 2023-2024, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2023-2026, NVIDIA CORPORATION.  All rights reserved.
As per coding guidelines, please update the year.
tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py (1)
1-2: Update SPDX year to 2026.
This file now includes 2026 modifications but the header still ends at 2025.
📝 Proposed update
-# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
As per coding guidelines, please update the year.
tensorrt_llm/_torch/attention_backend/interface.py (1)
1-7: Add the NVIDIA SPDX header.
This source file is missing the required SPDX/license header.
📝 Proposed header
+# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 import copy
As per coding guidelines, please add the SPDX header.
tensorrt_llm/_torch/attention_backend/flashinfer.py (1)
1-3: Add the NVIDIA SPDX header.
This source file is missing the required SPDX/license header.
📝 Proposed header
+# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 import math
As per coding guidelines, please add the SPDX header.
tensorrt_llm/_torch/attention_backend/vanilla.py (1)
1-3: Add the NVIDIA SPDX header.
This source file is missing the required SPDX/license header.
📝 Proposed header
+# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 import math
As per coding guidelines, please add the SPDX header.
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.cpp (1)
131-131: Typo in function name: supportFbaricMemory should be supportFabricMemory.

The function name has a typo - "Fbaric" instead of "Fabric".
🔤 Proposed fix
-bool FabricMemory::supportFbaricMemory()
+bool FabricMemory::supportFabricMemory()
Note: This will also require updating the header file declaration and any call sites.
cpp/tensorrt_llm/batch_manager/cacheTransBuffer.h (1)
53-53: Same typo in header declaration: supportFbaricMemory.

This matches the typo flagged in the .cpp file. Both should be fixed together.
🔤 Proposed fix
-    static bool supportFbaricMemory();
+    static bool supportFabricMemory();
cpp/include/tensorrt_llm/batch_manager/rnnStateManager.h (1)
1-15: Update copyright year to include 2025.

The copyright header shows "2023-2024" but this file has been modified with significant new functionality in 2025. Per coding guidelines, the year should reflect the latest meaningful modification.
📅 Proposed fix
 /*
- * Copyright (c) 2023-2024, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2023-2025, NVIDIA CORPORATION.  All rights reserved.
  *

🤖 Fix all issues with AI agents

In `@cpp/include/tensorrt_llm/batch_manager/rnnCacheTransceiver.h`:
- Around line 8-44: Replace the non-compliant header protection and comment
style: remove "#pragma once" and add the explicit include guard
"TRTLLM_RNNCACHETRANSCEIVER_H" (i.e., `#ifndef` TRTLLM_RNNCACHETRANSCEIVER_H /
`#define` ... / `#endif`) at the top/bottom of the header, and convert C++03-style
triple-slash Doxygen comments (e.g., "/// `@brief`" and other "///" comments
associated with RnnCacheSender, its constructor and members) to the project
standard "//!"/"//!<" style so class-level docs and member descriptions (for
RnnCacheSender, RnnCacheFormatter references, etc.) use //! and //!< where
appropriate. Ensure the guard macro name matches the file identifier exactly
(TRTLLM_RNNCACHETRANSCEIVER_H) and update any file-local comment markers
accordingly.
- Around line 1-6: Update the file header comment's copyright year from 2025 to
2026 in the top-of-file comment block of
tensorrt_llm/batch_manager/rnnCacheTransceiver.h; locate the copyright line in
the file's leading comment (the block beginning "/*" at the top) and change
"2025" to "2026" so the header reflects the latest modification year.

In `@cpp/include/tensorrt_llm/executor/dataTransceiverState.h`:
- Around line 610-615: Remove the inline comment " // is this needed?" next to
the mRnnCacheState declaration in the class (friend class Serialization;
std::optional<rnn_cache::RnnCacheState> mRnnCacheState), and if this represents
a tracked TODO open an issue or add a documented TODO elsewhere rather than
leaving an ambiguous inline note; simply delete the comment or replace it with a
proper tracked reference.

In `@cpp/tensorrt_llm/batch_manager/baseTransBuffer.cpp`:
- Around line 1-16: Update the copyright year in the file header comment: locate
the SPDX header line that reads "Copyright (c) 2025 NVIDIA CORPORATION &
AFFILIATES" (the SPDX comment block near the top of baseTransBuffer.cpp) and
change the year from 2025 to 2026 so the header reflects the current
modification year; ensure the SPDX-License-Identifier and surrounding comment
formatting remain unchanged.

In `@cpp/tensorrt_llm/batch_manager/baseTransBuffer.h`:
- Around line 1-16: The file header's copyright year is outdated (shows 2025);
update the year to 2026 in the top comment — specifically modify the
SPDX-FileCopyrightText line in baseTransBuffer.h to read 2026 so the header
matches the latest changes.
- Around line 18-44: The header currently uses `#pragma` once and ‘///’ comments;
replace `#pragma` once with a traditional include guard macro named
TRTLLM_BASETRANSBUFFER_H (wrap the entire header between `#ifndef`
TRTLLM_BASETRANSBUFFER_H / `#define` TRTLLM_BASETRANSBUFFER_H and end with `#endif`)
and convert the public API doc comments (e.g., the comment above the
BaseTransBufferManager class and any member docs) from triple-slash “///” to the
required Doxygen style using “//!” or “//!<” as appropriate; keep all existing
includes, namespaces and declarations (class BaseTransBufferManager, forward
declaration FabricMemory, etc.) unchanged other than the guard and comment style
swap.

In `@cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp`:
- Around line 458-630: The response loop condition is wrong and mReadyResponses
access is not fully protected causing UB; change the loop in
BaseCacheSenderImpl::response to continue while (!mTerminate || mAnyReady) so we
still drain ready responses after terminate, and ensure all reads/iterators of
mReadyResponses are done under mSenderMutex: inside response() acquire
mSenderMutex, locate the iterator via
mReadyResponses.find/getCurrentRequestId(), move the Response out (std::move)
and erase the map entry while still holding mSenderMutex, then release the lock
before calling sendResponse/sendAndRemoveResponse/asyncSendAndRemoveResponse;
remove or adjust getCurrentResponse() so it does not return a dangling iterator
(or make it only be used while holding mSenderMutex) and ensure removeResponse()
still updates mAnyReady under mCondMutex as before.

In `@cpp/tensorrt_llm/batch_manager/rnnCacheFormatter.cpp`:
- Around line 123-140: Validate that selfRank yields a valid PP rank before
indexing selfNumLayerPerPP: ensure selfRank is non-negative and that computed
selfPPRank (selfRank / selfTPNum) is less than selfNumLayerPerPP.size() (or
equivalently selfRank < selfTPNum * selfPPNum). If the check fails, fail fast
(e.g., use TLLM_CHECK or return an error) so that the later use of
selfNumLayerPerPP[selfPPRank] in the function (around variables selfTPRank,
selfPPRank, selfStartLayerId, selfEndLayerId) cannot index out of bounds.
- Around line 18-26: Add a direct include for the header that declares std::iota
by adding `#include` <numeric> at the top of rnnCacheFormatter.cpp (near the other
includes) so that uses of std::iota (seen in this file) do not rely on
transitive includes; update the include section in the file containing
rnnCacheFormatter.cpp to reference <numeric>.
- Around line 36-46: RnnCacheFormatter::format and ::unformat currently always
TLLM_THROW and will crash runtime; replace the throws with real implementations
or safe no-ops: implement format(TransferSession& session) to check for RNN
state on the session (e.g., session.hasRnnState()/session.getRnnState()),
serialize the state into the session transfer buffer using the TransferSession
write/append APIs and set any flags/lengths expected by the receiver; implement
unformat(TransferSession& session) to parse the incoming buffer, deserialize RNN
state and restore it into the session (e.g.,
session.setRnnState()/session.applyRnnState()); if you cannot implement
serialization yet, remove the TLLM_THROW and instead perform a guarded
early-return that logs a warning and clears any RNN flags on TransferSession so
callers on sendAsync()/receiveAsync() paths do not hard-fail.

In `@cpp/tensorrt_llm/batch_manager/rnnCacheTransceiver.cpp`:
- Around line 138-142: The code calls mCacheTransceiverConfig.value() before
verifying the optional is set, which can throw; fix by guarding the optional
first (e.g., check mCacheTransceiverConfig.has_value() before calling value() or
use mCacheTransceiverConfig->getBackendType()) and update the
TLLM_CHECK_WITH_INFO to validate that mCacheTransceiverConfig has a value and
that its getBackendType() is present and not DEFAULT; reference
mCacheTransceiverConfig, CacheTransceiverConfig::BackendType, getBackendType,
and TLLM_CHECK_WITH_INFO to locate where to change the check.
- Around line 61-95: The RNN-specific request/response hooks are unimplemented:
implement RnnCacheSender::recvRequestInfo to deserialize a RequestInfo into the
sender's state and return a valid RequestInfo (mirroring other cache types), and
implement RnnCacheReceiver::sendRequestInfo to build and return a
TransferSession populated with RNN-specific request metadata (including
RnnCacheState and CommState via mManager->getCommState()) so
BaseCacheSenderImpl::response() and BaseCacheReceiverImpl::requestSync() can
operate; if you cannot implement transfer logic right now, explicitly gate the
feature by having these functions return a clear no-op/default (and not throw)
or return an error TransferSession/RequestInfo that upstream handles,
referencing the methods RnnCacheSender::recvRequestInfo and
RnnCacheReceiver::sendRequestInfo and the member mSelfState and mManager when
making changes.
- Around line 152-156: The call to dlerror() in the TLLM_CHECK_WITH_INFO around
mWrapperLibHandle (after dllOpen) is not portable and breaks Windows builds;
change the error text construction to use platform-specific error retrieval: on
POSIX keep dlerror(), on Windows use GetLastError() + FormatMessage to produce a
printable message, then pass that message into TLLM_CHECK_WITH_INFO. Update the
block around mWrapperLibHandle (and the same pattern wherever dlerror() is used)
so the error string is created conditionally (e.g., using `#ifdef` _WIN32) before
calling TLLM_CHECK_WITH_INFO.

In `@cpp/tensorrt_llm/batch_manager/rnnStateManager.cpp`:
- Around line 225-288: The shared containers mFreeBlocks, mCacheIndex, and
mTempBlocks are accessed concurrently; add a mutex member (e.g., mMutex) and use
a lock (std::lock_guard or std::scoped_lock) at the start of
allocateCacheBlocks, freeCacheBlock, getCacheIndex, and getStateIndices to
protect all reads/writes to those members (follow the same locking pattern used
in kvCacheManager/peftCacheManager); ensure locks cover early-return checks and
modifications (e.g., the TLLM_CHECK_WITH_INFO checks and
pop_back/push_back/erase operations) so no unsynchronized access occurs while
the GIL is released.

In `@cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp`:
- Around line 398-432: The exposed RnnStateManager methods allocateCacheBlocks,
freeCacheBlock, getStateIndices and the reader getCacheIndex access and mutate
shared members (mCacheIndex, mFreeBlocks, mTempBlocks) without synchronization;
remove nb::call_guard<nb::gil_scoped_release>() from those .def(...) bindings so
the GIL is held during calls, or alternatively add a std::mutex member to
tb::rnn_state_manager::RnnStateManager and lock it (e.g., std::scoped_lock)
inside the implementations of allocateCacheBlocks, freeCacheBlock,
getStateIndices and getCacheIndex to make container access thread-safe, then
keep or remove the gil_scoped_release accordingly.

In `@tensorrt_llm/_torch/attention_backend/interface.py`:
- Around line 295-312: The _prepare_mamba_metadata method can leave an existing
mamba_metadata instance sized for a previous max_num_requests, causing
buffer/share-size mismatches; update it to detect when an existing
self.mamba_metadata does not match current sizing (e.g., compare its configured
max_num_requests or mamba_chunk_size) and reinitialize it by creating a fresh
Mamba2Metadata(self.max_num_requests, self.mamba_chunk_size) (or set
self.mamba_metadata = False and recreate) before calling prepare; ensure this
logic lives in _prepare_mamba_metadata and references self.mamba_metadata,
self.max_num_requests, self.mamba_chunk_size and Mamba2Metadata so the metadata
is recreated whenever sizing changes.

In `@tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py`:
- Around line 94-111: The add_dummy_requests method currently has an unused
kwargs parameter causing lint ARG002; update its signature to rename kwargs to
_kwargs (def add_dummy_requests(self, request_ids: List[int], **_kwargs):) or
explicitly del kwargs at the start of the method, leaving the body calling
self.mamba_impl.allocate_cache_blocks(request_ids) unchanged; this silences the
linter while preserving behavior in add_dummy_requests and keeps references to
free_resources, get_cache_index, get_state_indices, get_conv_states, and
get_ssm_states intact.

🧹 Nitpick comments (12)

tensorrt_llm/_torch/modules/mamba/mamba2_metadata.py (1)
21-23: Keep CUDA graph runner namespace in imports.
Prefer module-qualified access rather than pulling the constant directly into the local namespace.
♻️ Proposed refactor
-from tensorrt_llm._torch.pyexecutor.cuda_graph_runner import \
-    CUDA_GRAPH_DUMMY_REQUEST_ID
+import tensorrt_llm._torch.pyexecutor.cuda_graph_runner as cuda_graph_runner
@@
-                req_id == CUDA_GRAPH_DUMMY_REQUEST_ID
+                req_id == cuda_graph_runner.CUDA_GRAPH_DUMMY_REQUEST_ID
As per coding guidelines, keep module namespaces for imports.
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h (1)
280-285: Align Doxygen style and parameter naming.
Use //! comments in headers and avoid the m-prefix for non-member parameters. Also update the definition in cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp to match.
♻️ Proposed refactor
-/// `@brief` Gather request IDs across all ranks in the communicator.
-/// `@param` comm The communicator to use for gathering.
-/// `@param` requestIds The local request IDs to gather.
-/// `@return` All request IDs from all ranks.
+//! `@brief` Gather request IDs across all ranks in the communicator.
+//! `@param` comm The communicator to use for gathering.
+//! `@param` requestIds The local request IDs to gather.
+//! `@return` All request IDs from all ranks.
 std::vector<LlmRequest::RequestIdType> gatherRequestIds(
-    std::shared_ptr<CacheTransceiverComm> const& mComm, std::vector<LlmRequest::RequestIdType> const& requestIds);
+    std::shared_ptr<CacheTransceiverComm> const& comm, std::vector<LlmRequest::RequestIdType> const& requestIds);
As per coding guidelines, please align comment style and naming.
tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py (1)
20-26: Preserve namespace for torch_dtype_to_binding import.
Keep module namespace to align with the import guidelines.
♻️ Proposed refactor
-import tensorrt_llm.bindings
+import tensorrt_llm.bindings
+import tensorrt_llm._utils as tllm_utils
@@
-from tensorrt_llm._utils import torch_dtype_to_binding
@@
-        dtype_binding = torch_dtype_to_binding(dtype)
-        ssm_cache_dtype_binding = torch_dtype_to_binding(
+        dtype_binding = tllm_utils.torch_dtype_to_binding(dtype)
+        ssm_cache_dtype_binding = tllm_utils.torch_dtype_to_binding(
             ssm_cache_dtype if ssm_cache_dtype is not None else dtype)
As per coding guidelines, please keep module namespaces for imports.
tensorrt_llm/_torch/attention_backend/interface.py (1)
24-25: Keep mamba_cache_manager namespace in imports.
Prefer module-qualified access instead of importing the class directly.
♻️ Proposed refactor
-from ..pyexecutor.mamba_cache_manager import MambaCacheManager
+from ..pyexecutor import mamba_cache_manager
@@
-            if (self.kv_cache_manager is not None
-                    and isinstance(self.kv_cache_manager, MambaCacheManager)):
+            if (self.kv_cache_manager is not None
+                    and isinstance(self.kv_cache_manager,
+                                   mamba_cache_manager.MambaCacheManager)):
As per coding guidelines, please keep module namespaces for imports.
tensorrt_llm/_torch/models/modeling_nemotron_h.py (1)

416-416: Consider removing unused attribute.

The self.mamba_metadata attribute is declared but no longer used in the forward() method since mamba_metadata is now derived from attn_metadata.mamba_metadata. This attribute appears to be dead code after this refactoring.

tensorrt_llm/_torch/models/modeling_qwen3_next.py (1)

1200-1200: Consider removing unused attribute.

Similar to modeling_nemotron_h.py, the self.mamba_metadata attribute is declared but no longer used in forward() after this refactoring.

cpp/include/tensorrt_llm/executor/serialization.h (1)

114-117: Add Doxygen comments for the new RnnCacheState serialization APIs.
These are new public header declarations; please add //! docs for deserialize/serialize/serializedSize to keep the header compliant.

As per coding guidelines, please document new public APIs.
cpp/tests/unit_tests/batch_manager/rnnCacheFormatterTest.cpp (1)
19-35: Make the fixed model-config values const/constexpr.
These locals never change; marking them immutable reduces accidental edits and aligns with style.
♻️ Proposed tweak
-        SizeType32 dState = 16;
-        SizeType32 dConv = 4;
-        SizeType32 hiddenSize = 256;
-        SizeType32 headDim = 64;
-        SizeType32 convDimSize = 128;
-        SizeType32 nGroups = 1;
-        SizeType32 numHeads = 4;
-        auto convDtype = nvinfer1::DataType::kFLOAT;
-        auto ssmDtype = nvinfer1::DataType::kFLOAT;
+        SizeType32 const dState = 16;
+        SizeType32 const dConv = 4;
+        SizeType32 const hiddenSize = 256;
+        SizeType32 const headDim = 64;
+        SizeType32 const convDimSize = 128;
+        SizeType32 const nGroups = 1;
+        SizeType32 const numHeads = 4;
+        auto const convDtype = nvinfer1::DataType::kFLOAT;
+        auto const ssmDtype = nvinfer1::DataType::kFLOAT;
As per coding guidelines, please prefer `const` for immutable locals.
cpp/include/tensorrt_llm/executor/dataTransceiverState.h (1)

419-584: Add Doxygen //! docs for the new RnnCacheState and accessors.
These are new public header APIs; please document them with //! (and //!< for members where applicable).

As per coding guidelines, new public interfaces should be Doxygen-documented.
cpp/tensorrt_llm/batch_manager/rnnStateManager.cpp (1)
86-133: Consider simplifying getPpLayers - the ppSize==1 branch duplicates logic.

The two branches (ppSize==1 and ppSize>1) have similar logic that could potentially be unified. When ppSize==1, the formula startLayer = ppRank * layersPerRank + std::min(ppRank, remainder) would yield startLayer = 0 and endLayer = totalLayers, making the multi-rank branch work correctly for the single-rank case.
♻️ Suggested simplification
 std::vector<RnnStateManager::SizeType32> RnnStateManager::getPpLayers(
     SizeType32 numLayers, WorldConfig const& worldConfig, std::optional<std::vector<bool>> const& layerMask)
 {
     auto const ppSize = worldConfig.getPipelineParallelism();
     auto const ppRank = worldConfig.getPipelineParallelRank();
 
     std::vector<SizeType32> ppLayers;
 
-    if (ppSize == 1)
-    {
-        SizeType32 totalLayers = numLayers;
-        if (layerMask.has_value())
-        {
-            totalLayers = static_cast<SizeType32>(layerMask->size());
-        }
-        for (SizeType32 i = 0; i < totalLayers; ++i)
-        {
-            if (!layerMask.has_value() || (*layerMask)[i])
-            {
-                ppLayers.push_back(i);
-            }
-        }
-    }
-    else
-    {
-        SizeType32 totalLayers = numLayers;
-        if (layerMask.has_value())
-        {
-            totalLayers = static_cast<SizeType32>(layerMask->size());
-        }
+    SizeType32 totalLayers = numLayers;
+    if (layerMask.has_value())
+    {
+        totalLayers = static_cast<SizeType32>(layerMask->size());
+    }
 
-        SizeType32 layersPerRank = totalLayers / ppSize;
-        SizeType32 remainder = totalLayers % ppSize;
+    SizeType32 layersPerRank = totalLayers / ppSize;
+    SizeType32 remainder = totalLayers % ppSize;
 
-        SizeType32 startLayer = ppRank * layersPerRank + std::min(ppRank, remainder);
-        SizeType32 endLayer = startLayer + layersPerRank + (ppRank < remainder ? 1 : 0);
+    SizeType32 startLayer = ppRank * layersPerRank + std::min(ppRank, remainder);
+    SizeType32 endLayer = startLayer + layersPerRank + (ppRank < remainder ? 1 : 0);
 
-        for (SizeType32 i = startLayer; i < endLayer; ++i)
+    for (SizeType32 i = startLayer; i < endLayer; ++i)
+    {
+        if (!layerMask.has_value() || (*layerMask)[i])
         {
-            if (!layerMask.has_value() || (*layerMask)[i])
-            {
-                ppLayers.push_back(i);
-            }
+            ppLayers.push_back(i);
         }
     }
 
     return ppLayers;
 }
cpp/tensorrt_llm/batch_manager/rnnCacheTransceiver.cpp (2)
162-164: Avoid C‑style cast for dlsym result.

Line 163 uses a C‑style cast, which violates the C++ guidelines and is non‑idiomatic for function pointers.
♻️ Suggested approach
-        std::unique_ptr<executor::kv_cache::ConnectionManager> (*makeUcxConnectionManager)();
-        *(void**) (&makeUcxConnectionManager) = load_sym(mWrapperLibHandle, "makeUcxConnectionManager");
+        using MakeUcxFn = std::unique_ptr<executor::kv_cache::ConnectionManager>(*)();
+        auto makeUcxConnectionManager =
+            reinterpret_cast<MakeUcxFn>(load_sym(mWrapperLibHandle, "makeUcxConnectionManager"));
235-239: Layer‑wise RNN transfer currently throws.

Line 238 throws unconditionally. If callers can reach this override, it will abort the request flow. Consider implementing it or explicitly documenting/guarding this path.

coderabbitai · 2026-01-22T19:53:24Z

cpp/include/tensorrt_llm/batch_manager/rnnCacheTransceiver.h

+/*
+ * Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * ...
+ */


⚠️ Potential issue | 🟡 Minor

Update the copyright year to 2026.

Line 2 still lists 2025. Please update to the year of latest meaningful modification.
As per coding guidelines, this should be 2026.

🤖 Prompt for AI Agents

In `@cpp/include/tensorrt_llm/batch_manager/rnnCacheTransceiver.h` around lines 1 - 6, Update the file header comment's copyright year from 2025 to 2026 in the top-of-file comment block of tensorrt_llm/batch_manager/rnnCacheTransceiver.h; locate the copyright line in the file's leading comment (the block beginning "/*" at the top) and change "2025" to "2026" so the header reflects the latest modification year.

coderabbitai · 2026-01-22T19:53:24Z

cpp/include/tensorrt_llm/batch_manager/rnnCacheTransceiver.h

+#pragma once
+
+#include "tensorrt_llm/batch_manager/cacheTransceiver.h"
+#include "tensorrt_llm/batch_manager/common.h"
+#include "tensorrt_llm/batch_manager/dataTransceiver.h"
+#include "tensorrt_llm/batch_manager/llmRequest.h"
+#include "tensorrt_llm/batch_manager/rnnCacheFormatter.h"
+#include "tensorrt_llm/batch_manager/rnnStateManager.h"
+#include "tensorrt_llm/executor/cacheCommunicator.h"
+#include "tensorrt_llm/executor/dataTransceiverState.h"
+#include "tensorrt_llm/runtime/worldConfig.h"
+
+#include <future>
+#include <memory>
+#include <mutex>
+#include <optional>
+#include <vector>
+
+using SizeType32 = tensorrt_llm::runtime::SizeType32;
+
+namespace tensorrt_llm::batch_manager
+{
+
+namespace rnn_state_manager
+{
+class RnnStateManager;
+}
+
+/// @brief RNN cache sender - inherits from BaseCacheSenderImpl with RNN-specific logic.
+class RnnCacheSender : public BaseCacheSenderImpl
+{
+public:
+    /// @brief Constructor.
+    RnnCacheSender(executor::kv_cache::ConnectionManager* manager, executor::rnn_cache::RnnCacheState selfCacheState,
+        SizeType32 selfIndex, std::unique_ptr<BaseCacheFormatter> formatter);
+
+    /// @brief Receive request information - RNN-specific implementation.


⚠️ Potential issue | 🟡 Minor

Use TRTLLM_ include guards and //! Doxygen comments.*

Line 8 uses #pragma once, and the public API docs use ///. The guidelines require an explicit TRTLLM_<FILENAME>_H guard and //!///!< Doxygen style.

As per coding guidelines, please switch to the mandated guard and comment format.

🧰 Tools

🪛 Clang (14.0.6)

[error] 10-10: 'tensorrt_llm/batch_manager/cacheTransceiver.h' file not found

(clang-diagnostic-error)

🤖 Prompt for AI Agents

In `@cpp/include/tensorrt_llm/batch_manager/rnnCacheTransceiver.h` around lines 8 - 44, Replace the non-compliant header protection and comment style: remove "#pragma once" and add the explicit include guard "TRTLLM_RNNCACHETRANSCEIVER_H" (i.e., `#ifndef` TRTLLM_RNNCACHETRANSCEIVER_H / `#define` ... / `#endif`) at the top/bottom of the header, and convert C++03-style triple-slash Doxygen comments (e.g., "/// `@brief`" and other "///" comments associated with RnnCacheSender, its constructor and members) to the project standard "//!"/"//!<" style so class-level docs and member descriptions (for RnnCacheSender, RnnCacheFormatter references, etc.) use //! and //!< where appropriate. Ensure the guard macro name matches the file identifier exactly (TRTLLM_RNNCACHETRANSCEIVER_H) and update any file-local comment markers accordingly.

coderabbitai · 2026-01-22T19:53:24Z

cpp/include/tensorrt_llm/executor/dataTransceiverState.h

 private:
    friend class Serialization;
    std::optional<kv_cache::CacheState> mCacheState;
    std::optional<kv_cache::CommState> mCommState;
+    std::optional<rnn_cache::RnnCacheState> mRnnCacheState; // is this needed?
 };


⚠️ Potential issue | 🟡 Minor

Remove the inline “is this needed?” comment.
If this is a TODO, please track it explicitly; otherwise remove the ambiguity.

🧹 Suggested cleanup

- std::optional<rnn_cache::RnnCacheState> mRnnCacheState; // is this needed? + std::optional<rnn_cache::RnnCacheState> mRnnCacheState;

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

private:

friend class Serialization;

std::optional<kv_cache::CacheState> mCacheState;

std::optional<kv_cache::CommState> mCommState;

std::optional<rnn_cache::RnnCacheState> mRnnCacheState; // is this needed?

};

private:

friend class Serialization;

std::optional<kv_cache::CacheState> mCacheState;

std::optional<kv_cache::CommState> mCommState;

std::optional<rnn_cache::RnnCacheState> mRnnCacheState;

};

🤖 Prompt for AI Agents

In `@cpp/include/tensorrt_llm/executor/dataTransceiverState.h` around lines 610 - 615, Remove the inline comment " // is this needed?" next to the mRnnCacheState declaration in the class (friend class Serialization; std::optional<rnn_cache::RnnCacheState> mRnnCacheState), and if this represents a tracked TODO open an issue or add a documented TODO elsewhere rather than leaving an ambiguous inline note; simply delete the comment or replace it with a proper tracked reference.

coderabbitai · 2026-01-22T19:53:24Z

cpp/tensorrt_llm/batch_manager/baseTransBuffer.cpp

+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */


⚠️ Potential issue | 🟡 Minor

Update the copyright year to 2026.

Line 2 still lists 2025; this file is new in a 2026 PR.
As per coding guidelines, the header should reflect the latest meaningful modification year.

🤖 Prompt for AI Agents

In `@cpp/tensorrt_llm/batch_manager/baseTransBuffer.cpp` around lines 1 - 16, Update the copyright year in the file header comment: locate the SPDX header line that reads "Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES" (the SPDX comment block near the top of baseTransBuffer.cpp) and change the year from 2025 to 2026 so the header reflects the current modification year; ensure the SPDX-License-Identifier and surrounding comment formatting remain unchanged.

coderabbitai · 2026-01-22T19:53:24Z

cpp/tensorrt_llm/batch_manager/baseTransBuffer.h

+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */


⚠️ Potential issue | 🟡 Minor

Update the copyright year to 2026.

Line 2 still lists 2025; please align with the 2026 modification.
As per coding guidelines, the header year should match the latest meaningful change.

🤖 Prompt for AI Agents

In `@cpp/tensorrt_llm/batch_manager/baseTransBuffer.h` around lines 1 - 16, The file header's copyright year is outdated (shows 2025); update the year to 2026 in the top comment — specifically modify the SPDX-FileCopyrightText line in baseTransBuffer.h to read 2026 so the header matches the latest changes.

coderabbitai · 2026-01-22T19:53:25Z

cpp/tensorrt_llm/batch_manager/rnnCacheTransceiver.cpp

+        std::lock_guard<std::mutex> lock(mDllMutex);
+        mWrapperLibHandle = dllOpen(UCX_WRAPPER_LIB_NAME);
+        TLLM_CHECK_WITH_INFO(
+            mWrapperLibHandle != nullptr, "UCX wrapper library is not open correctly. error : %s", dlerror());
+        auto load_sym = [](void* handle, char const* name)


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

cat -n cpp/tensorrt_llm/batch_manager/rnnCacheTransceiver.cpp | sed -n '140,170p'

Repository: NVIDIA/TensorRT-LLM

Length of output: 1938

🏁 Script executed:

head -80 cpp/tensorrt_llm/batch_manager/rnnCacheTransceiver.cpp

Repository: NVIDIA/TensorRT-LLM

Length of output: 2911

🏁 Script executed:

cat -n cpp/tensorrt_llm/batch_manager/rnnCacheTransceiver.cpp | sed -n '130,180p'

Repository: NVIDIA/TensorRT-LLM

Length of output: 2937

Windows build breaks: dlerror() is undefined on Windows.

Line 155 calls dlerror() without platform guards. On Windows, dlerror() doesn't exist in the C standard library. Use GetLastError() / FormatMessage() for Windows and dlerror() for non-Windows platforms.

Example fix

- TLLM_CHECK_WITH_INFO( - mWrapperLibHandle != nullptr, "UCX wrapper library is not open correctly. error : %s", dlerror()); +#if defined(_WIN32) + TLLM_CHECK_WITH_INFO(mWrapperLibHandle != nullptr, + "UCX wrapper library is not open correctly. error: %lu", GetLastError()); +#else + TLLM_CHECK_WITH_INFO( + mWrapperLibHandle != nullptr, "UCX wrapper library is not open correctly. error : %s", dlerror()); +#endif

🤖 Prompt for AI Agents

In `@cpp/tensorrt_llm/batch_manager/rnnCacheTransceiver.cpp` around lines 152 - 156, The call to dlerror() in the TLLM_CHECK_WITH_INFO around mWrapperLibHandle (after dllOpen) is not portable and breaks Windows builds; change the error text construction to use platform-specific error retrieval: on POSIX keep dlerror(), on Windows use GetLastError() + FormatMessage to produce a printable message, then pass that message into TLLM_CHECK_WITH_INFO. Update the block around mWrapperLibHandle (and the same pattern wherever dlerror() is used) so the error string is created conditionally (e.g., using `#ifdef` _WIN32) before calling TLLM_CHECK_WITH_INFO.

coderabbitai · 2026-01-22T19:53:25Z

cpp/tensorrt_llm/batch_manager/rnnStateManager.cpp

+void RnnStateManager::allocateCacheBlocks(std::vector<RequestIdType> const& requestIds)
+{
+    for (auto const& requestId : requestIds)
+    {
+        auto it = mCacheIndex.find(requestId);
+        if (it == mCacheIndex.end())
+        {
+            TLLM_CHECK_WITH_INFO(!mFreeBlocks.empty(), "Run out of RNN state cache blocks");
+            SizeType32 const block = mFreeBlocks.back();
+            mFreeBlocks.pop_back();
+            mCacheIndex[requestId] = block;
+        }
+    }
+}
+
+void RnnStateManager::freeCacheBlock(RequestIdType requestId)
+{
+    auto it = mCacheIndex.find(requestId);
+    if (it != mCacheIndex.end())
+    {
+        mFreeBlocks.push_back(it->second);
+        mCacheIndex.erase(it);
+    }
+}
+
+RnnStateManager::SizeType32 RnnStateManager::getCacheIndex(RequestIdType requestId) const
+{
+    auto it = mCacheIndex.find(requestId);
+    TLLM_CHECK_WITH_INFO(it != mCacheIndex.end(), "Request ID not found in cache index");
+    return it->second;
+}
+
+std::vector<RnnStateManager::SizeType32> RnnStateManager::getStateIndices(
+    std::vector<RequestIdType> const& requestIds, std::vector<bool> const& isPadding)
+{
+    TLLM_CHECK_WITH_INFO(requestIds.size() == isPadding.size(), "requestIds and isPadding must have the same size");
+
+    for (auto const& block : mTempBlocks)
+    {
+        mFreeBlocks.push_back(block);
+    }
+    mTempBlocks.clear();
+
+    std::vector<SizeType32> result;
+    result.reserve(requestIds.size());
+
+    for (size_t i = 0; i < requestIds.size(); ++i)
+    {
+        if (isPadding[i])
+        {
+            TLLM_CHECK_WITH_INFO(!mFreeBlocks.empty(), "Run out of RNN state cache blocks for padding");
+            SizeType32 const block = mFreeBlocks.back();
+            mFreeBlocks.pop_back();
+            mTempBlocks.push_back(block);
+            result.push_back(block);
+        }
+        else
+        {
+            result.push_back(getCacheIndex(requestIds[i]));
+        }
+    }
+
+    return result;
+}


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Find RnnStateManager class definition and declaration fd -e h -e cpp | xargs grep -l "class RnnStateManager" | head -5

Repository: NVIDIA/TensorRT-LLM

Length of output: 182

🏁 Script executed:

#!/bin/bash # Get the RnnStateManager class definition cat -n cpp/tensorrt_llm/batch_manager/rnnStateManager.h | head -100

Repository: NVIDIA/TensorRT-LLM

Length of output: 143

🏁 Script executed:

#!/bin/bash # Check for mutexes and synchronization in RnnStateManager rg -n "mutex|lock|Lock|Mutex" cpp/tensorrt_llm/batch_manager/rnnStateManager.h cpp/tensorrt_llm/batch_manager/rnnStateManager.cpp

Repository: NVIDIA/TensorRT-LLM

Length of output: 1981

🏁 Script executed:

#!/bin/bash # Find all call sites of these RnnStateManager methods rg -n "allocateCacheBlocks|freeCacheBlock|getStateIndices|getCacheIndex" --type cpp | grep -v "rnnStateManager.cpp:" | head -20

Repository: NVIDIA/TensorRT-LLM

Length of output: 1890

🏁 Script executed:

#!/bin/bash # Read the RnnStateManager header file cat -n cpp/include/tensorrt_llm/batch_manager/rnnStateManager.h

Repository: NVIDIA/TensorRT-LLM

Length of output: 4688

🏁 Script executed:

#!/bin/bash # Check how RnnStateManager is used in the batch manager rg -n "RnnStateManager" cpp/tensorrt_llm/batch_manager/ --type cpp | grep -E "(\.allocateCacheBlocks|\.freeCacheBlock|\.getStateIndices)" | head -20

Repository: NVIDIA/TensorRT-LLM

Length of output: 45

🏁 Script executed:

#!/bin/bash # Check batch manager threading model and synchronization rg -n "std::mutex|std::lock|mutable.*mutex" cpp/tensorrt_llm/batch_manager/*.cpp | head -20

Repository: NVIDIA/TensorRT-LLM

Length of output: 2369

🏁 Script executed:

#!/bin/bash # Find where RnnStateManager methods are called in the batch manager rg -n "allocateCacheBlocks|freeCacheBlock|getStateIndices" cpp/tensorrt_llm/batch_manager/ --type cpp -B3 -A3 | head -80

Repository: NVIDIA/TensorRT-LLM

Length of output: 1828

🏁 Script executed:

#!/bin/bash # Check if RnnStateManager is stored as a member in batch manager rg -n "RnnStateManager" cpp/tensorrt_llm/batch_manager/batch_manager.h cpp/tensorrt_llm/batch_manager/batch_manager.cpp | head -30

Repository: NVIDIA/TensorRT-LLM

Length of output: 238

🏁 Script executed:

#!/bin/bash # Check the batch manager's threading model and synchronization rg -n "mRequestMutex|mGenerationMutex|mQueueMutex" cpp/tensorrt_llm/batch_manager/batch_manager.h -A2 -B2 | head -50

Repository: NVIDIA/TensorRT-LLM

Length of output: 149

🏁 Script executed:

#!/bin/bash # Find batch manager files find . -name "*batch_manager*" -type f | grep -E "\.(h|cpp)$" | head -20

Repository: NVIDIA/TensorRT-LLM

Length of output: 45

🏁 Script executed:

#!/bin/bash # Search for the class that owns RnnStateManager rg -n "RnnStateManager\s+mRnnStateManager|std::.*RnnStateManager" --type cpp --type h | head -20

Repository: NVIDIA/TensorRT-LLM

Length of output: 722

🏁 Script executed:

#!/bin/bash # Check Python binding patterns to understand threading context rg -n "\.allocate_cache_blocks|\.free_cache_block|\.get_state_indices" cpp/pybind/batch_manager/bindings.cpp -B5 -A5 | head -100

Repository: NVIDIA/TensorRT-LLM

Length of output: 140

🏁 Script executed:

#!/bin/bash # Find nanobind bindings file find . -name "*bindings*" -type f | grep -E "\.(h|cpp)$" | head -20

Repository: NVIDIA/TensorRT-LLM

Length of output: 1048

🏁 Script executed:

#!/bin/bash # Check trtGptModelInflightBatching to understand threading model rg -n "mRnnStateManager" cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp -B3 -A3 | head -60

Repository: NVIDIA/TensorRT-LLM

Length of output: 1633

🏁 Script executed:

#!/bin/bash # Look for mutex in trtGptModelInflightBatching.h rg -n "mutex|mMutex|mLock" cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h | head -20

Repository: NVIDIA/TensorRT-LLM

Length of output: 45

🏁 Script executed:

#!/antml:bash # Check for thread safety patterns around allocateCacheBlocks and freeCacheBlock calls rg -n "allocateCacheBlocks|freeCacheBlock" cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp -B10 -A5 | head -100

Repository: NVIDIA/TensorRT-LLM

Length of output: 45

🏁 Script executed:

#!/bin/bash # Find where allocateCacheBlocks is actually called rg -n "allocateCacheBlocks" --type cpp -B5 -A5 | head -80

Repository: NVIDIA/TensorRT-LLM

Length of output: 4868

🏁 Script executed:

#!/bin/bash # Check the runtime buffers to see if RnnStateManager is accessed there rg -n "mRnnStateManager" cpp/tensorrt_llm/batch_manager/ --type cpp -B3 -A3 | head -80

Repository: NVIDIA/TensorRT-LLM

Length of output: 6714

🏁 Script executed:

#!/bin/bash # Look for synchronization patterns in trtGptModelInflightBatching rg -n "mMutex|mLock|lock_guard|unique_lock" cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h -B2 -A2 | head -50

Repository: NVIDIA/TensorRT-LLM

Length of output: 45

🏁 Script executed:

#!/bin/bash # Check if there's any documentation about thread safety in RnnStateManager rg -n "thread|Thread|concurrent|Concurrent|safe|Safe" cpp/tensorrt_llm/batch_manager/rnnStateManager.cpp -B2 -A2

Repository: NVIDIA/TensorRT-LLM

Length of output: 45

🏁 Script executed:

#!/bin/bash # Check the py::call_guard documentation pattern - see if it's used elsewhere and how rg -n "py::call_guard" cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp | head -30

Repository: NVIDIA/TensorRT-LLM

Length of output: 618

🏁 Script executed:

#!/bin/bash # Look for RuntimeBuffers to understand if prepareStep also accesses RnnStateManager rg -n "class RuntimeBuffers" --type cpp -A30 | head -60

Repository: NVIDIA/TensorRT-LLM

Length of output: 5838

🏁 Script executed:

#!/bin/bash # Verify the GIL release pattern is intentional by checking if other batch manager methods also use it rg -n "py::call_guard<py::gil_scoped_release>" cpp/tensorrt_llm/pybind/batch_manager/bindings.cpp | wc -l

Repository: NVIDIA/TensorRT-LLM

Length of output: 64

🏁 Script executed:

#!/bin/bash # Check if there's any documentation or TODO about thread safety in RnnStateManager rg -n "TODO|FIXME|thread|Thread|concurrent|safe" cpp/include/tensorrt_llm/batch_manager/rnnStateManager.h -i

Repository: NVIDIA/TensorRT-LLM

Length of output: 45

🏁 Script executed:

#!/antml:bash # Look for any synchronization at the batch manager level that might protect RnnStateManager grep -n "mRnnStateManager" cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.h -B5 -A5 | head -40

Repository: NVIDIA/TensorRT-LLM

Length of output: 746

Add thread synchronization to protect concurrent access to cache block state.

The Python bindings explicitly release the GIL with py::call_guard<py::gil_scoped_release>(), allowing concurrent calls from multiple Python threads. However, allocateCacheBlocks, freeCacheBlock, and getStateIndices modify shared state (mFreeBlocks, mCacheIndex, mTempBlocks) without any synchronization. Standard library containers like std::vector and std::unordered_map are not thread-safe for concurrent access.

Add a mutex (e.g., mMutex) to protect these operations, following the pattern used in kvCacheManager and peftCacheManager. Alternatively, document that single-threaded access is guaranteed by design and remove the GIL release guards.

🤖 Prompt for AI Agents

In `@cpp/tensorrt_llm/batch_manager/rnnStateManager.cpp` around lines 225 - 288, The shared containers mFreeBlocks, mCacheIndex, and mTempBlocks are accessed concurrently; add a mutex member (e.g., mMutex) and use a lock (std::lock_guard or std::scoped_lock) at the start of allocateCacheBlocks, freeCacheBlock, getCacheIndex, and getStateIndices to protect all reads/writes to those members (follow the same locking pattern used in kvCacheManager/peftCacheManager); ensure locks cover early-return checks and modifications (e.g., the TLLM_CHECK_WITH_INFO checks and pop_back/push_back/erase operations) so no unsynchronized access occurs while the GIL is released.

coderabbitai · 2026-01-22T19:53:25Z

cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp

    nb::class_<tb::rnn_state_manager::RnnStateManager>(m, "RnnStateManager")
        .def(nb::init<tr::SizeType32, tr::ModelConfig, tr::WorldConfig, tr::BufferManager>(),
-            nb::arg("max_num_sequences"), nb::arg("model_config"), nb::arg("world_config"), nb::arg("buffer_manager"));
+            nb::arg("max_num_sequences"), nb::arg("model_config"), nb::arg("world_config"), nb::arg("buffer_manager"),
+            nb::call_guard<nb::gil_scoped_release>())
+        .def(nb::init<tr::SizeType32, tr::SizeType32, tr::SizeType32, tr::SizeType32, tr::SizeType32, tr::SizeType32,
+                 tr::SizeType32, tr::WorldConfig const&, int64_t, nvinfer1::DataType, nvinfer1::DataType,
+                 std::optional<std::vector<bool>> const&>(),
+            nb::arg("d_state"), nb::arg("d_conv"), nb::arg("num_heads"), nb::arg("n_groups"), nb::arg("head_dim"),
+            nb::arg("num_layers"), nb::arg("max_batch_size"), nb::arg("world_config"), nb::arg("stream"),
+            nb::arg("dtype"), nb::arg("ssm_cache_dtype"), nb::arg("layer_mask") = std::nullopt,
+            nb::call_guard<nb::gil_scoped_release>())
+        .def("get_cache_index", &tb::rnn_state_manager::RnnStateManager::getCacheIndex, nb::arg("request_id"),
+            nb::call_guard<nb::gil_scoped_release>())
+        .def(
+            "get_conv_states",
+            [](tb::rnn_state_manager::RnnStateManager& self, tr::SizeType32 layerIdx) -> at::Tensor
+            {
+                auto tensor = self.getConvStates(layerIdx);
+                return tr::Torch::tensor(tensor);
+            },
+            nb::arg("layer_idx"), nb::call_guard<nb::gil_scoped_release>())
+        .def(
+            "get_ssm_states",
+            [](tb::rnn_state_manager::RnnStateManager& self, tr::SizeType32 layerIdx) -> at::Tensor
+            {
+                auto tensor = self.getSsmStates(layerIdx);
+                return tr::Torch::tensor(tensor);
+            },
+            nb::arg("layer_idx"), nb::call_guard<nb::gil_scoped_release>())
+        .def("allocate_cache_blocks", &tb::rnn_state_manager::RnnStateManager::allocateCacheBlocks,
+            nb::arg("request_ids"), nb::call_guard<nb::gil_scoped_release>())
+        .def("free_cache_block", &tb::rnn_state_manager::RnnStateManager::freeCacheBlock, nb::arg("request_id"),
+            nb::call_guard<nb::gil_scoped_release>())
+        .def("get_state_indices", &tb::rnn_state_manager::RnnStateManager::getStateIndices, nb::arg("request_ids"),
+            nb::arg("is_padding"), nb::call_guard<nb::gil_scoped_release>());


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# Find RnnStateManager class definition find . -type f $ -name "*.h" -o -name "*.hpp" -o -name "*.cpp" $ | xargs grep -l "class RnnStateManager" | head -20

Repository: NVIDIA/TensorRT-LLM

Length of output: 182

🏁 Script executed:

# Search for the rnn_state_manager namespace/directory find . -type d -name "*rnn*" | grep -E "state|manager"

Repository: NVIDIA/TensorRT-LLM

Length of output: 45

🏁 Script executed:

# Look for files in batch_manager or related directories that contain RnnStateManager find cpp -type f -name "*.h" -o -name "*.hpp" | xargs grep -l "RnnStateManager" | head -20

Repository: NVIDIA/TensorRT-LLM

Length of output: 405

🏁 Script executed:

# Read the RnnStateManager header file cat -n cpp/include/tensorrt_llm/batch_manager/rnnStateManager.h

Repository: NVIDIA/TensorRT-LLM

Length of output: 4688

🏁 Script executed:

# Find RnnStateManager implementation file find cpp -type f -name "*.cpp" | xargs grep -l "RnnStateManager::" | head -10

Repository: NVIDIA/TensorRT-LLM

Length of output: 217

🏁 Script executed:

# Read the RnnStateManager implementation file cat -n cpp/tensorrt_llm/batch_manager/rnnStateManager.cpp

Repository: NVIDIA/TensorRT-LLM

Length of output: 14506

Keep GIL or add synchronization for RnnStateManager cache operations.

allocateCacheBlocks, freeCacheBlock, and getStateIndices mutate shared containers (mCacheIndex, mFreeBlocks, mTempBlocks) without internal synchronization. Releasing the GIL with nb::call_guard<nb::gil_scoped_release>() allows concurrent Python threads to trigger data races on these containers. Similarly, getCacheIndex reads mCacheIndex which can be concurrently modified.

Either retain the GIL for these methods or add a mutex to RnnStateManager for thread-safe container access.

🤖 Prompt for AI Agents

In `@cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp` around lines 398 - 432, The exposed RnnStateManager methods allocateCacheBlocks, freeCacheBlock, getStateIndices and the reader getCacheIndex access and mutate shared members (mCacheIndex, mFreeBlocks, mTempBlocks) without synchronization; remove nb::call_guard<nb::gil_scoped_release>() from those .def(...) bindings so the GIL is held during calls, or alternatively add a std::mutex member to tb::rnn_state_manager::RnnStateManager and lock it (e.g., std::scoped_lock) inside the implementations of allocateCacheBlocks, freeCacheBlock, getStateIndices and getCacheIndex to make container access thread-safe, then keep or remove the gil_scoped_release accordingly.

coderabbitai · 2026-01-22T19:53:25Z

tensorrt_llm/_torch/attention_backend/interface.py

+        self._prepare_mamba_metadata()
+
+    def _prepare_mamba_metadata(self):
+        if self.mamba_metadata is False:
+            return
+
+        if self.mamba_metadata is None:
+            if (self.kv_cache_manager is not None
+                    and isinstance(self.kv_cache_manager, MambaCacheManager)):
+                from ..modules.mamba.mamba2_metadata import Mamba2Metadata
+                self.mamba_metadata = Mamba2Metadata(self.max_num_requests,
+                                                     self.mamba_chunk_size)
+            else:
+                self.mamba_metadata = False
+                return
+
+        self.mamba_metadata.prepare(self)



⚠️ Potential issue | 🟠 Major

Reinitialize Mamba metadata when sizing changes.
create_cuda_graph_metadata() uses a shallow copy; if mamba_metadata was already initialized, the new metadata instance can share buffers sized for the old max_num_requests, risking out-of-bounds or stale state. Reinitialize when the requested max changes.

🛠️ Suggested fix

def _prepare_mamba_metadata(self): if self.mamba_metadata is False: return - if self.mamba_metadata is None: - if (self.kv_cache_manager is not None - and isinstance(self.kv_cache_manager, MambaCacheManager)): - from ..modules.mamba.mamba2_metadata import Mamba2Metadata - self.mamba_metadata = Mamba2Metadata(self.max_num_requests, - self.mamba_chunk_size) - else: - self.mamba_metadata = False - return + if self.kv_cache_manager is None or not isinstance( + self.kv_cache_manager, MambaCacheManager): + self.mamba_metadata = False + return + + from ..modules.mamba.mamba2_metadata import Mamba2Metadata + if (self.mamba_metadata is None + or self.mamba_metadata.max_batch_size != self.max_num_requests): + self.mamba_metadata = Mamba2Metadata(self.max_num_requests, + self.mamba_chunk_size) self.mamba_metadata.prepare(self)

🤖 Prompt for AI Agents

In `@tensorrt_llm/_torch/attention_backend/interface.py` around lines 295 - 312, The _prepare_mamba_metadata method can leave an existing mamba_metadata instance sized for a previous max_num_requests, causing buffer/share-size mismatches; update it to detect when an existing self.mamba_metadata does not match current sizing (e.g., compare its configured max_num_requests or mamba_chunk_size) and reinitialize it by creating a fresh Mamba2Metadata(self.max_num_requests, self.mamba_chunk_size) (or set self.mamba_metadata = False and recreate) before calling prepare; ensure this logic lives in _prepare_mamba_metadata and references self.mamba_metadata, self.max_num_requests, self.mamba_chunk_size and Mamba2Metadata so the metadata is recreated whenever sizing changes.

coderabbitai · 2026-01-22T19:53:25Z

tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py

    def free_resources(self, request: LlmRequest):
-        request_id = request.py_request_id
-        if request_id in self.mamba_cache_index:
-            block = self.mamba_cache_index.pop(request_id)
-            self.mamba_cache_free_blocks.append(block)
+        self.mamba_impl.free_cache_block(request.py_request_id)
+
+    def add_dummy_requests(self, request_ids: List[int], **kwargs):
+        self.mamba_impl.allocate_cache_blocks(request_ids)

-    def get_state_indices(self) -> torch.Tensor:
-        return self.state_indices
+    def get_cache_index(self, request_id: int) -> int:
+        return self.mamba_impl.get_cache_index(request_id)
+
+    def get_state_indices(self, request_ids: List[int],
+                          is_padding: List[bool]) -> List[int]:
+        return self.mamba_impl.get_state_indices(request_ids, is_padding)

    def get_conv_states(self, layer_idx: int) -> torch.Tensor:
-        layer_offset = self.mamba_layer_offsets[layer_idx]
-        return self.conv_states[layer_offset]
+        return self.mamba_impl.get_conv_states(layer_idx)

    def get_ssm_states(self, layer_idx: int) -> torch.Tensor:
-        layer_offset = self.mamba_layer_offsets[layer_idx]
-        return self.ssm_states[layer_offset]
+        return self.mamba_impl.get_ssm_states(layer_idx)


⚠️ Potential issue | 🟡 Minor

Silence unused kwargs in add_dummy_requests.
Ruff flags the unused kwargs (ARG002). If the signature is intentional, rename to _kwargs (or del kwargs) to keep lint clean.

🧹 Proposed fix

- def add_dummy_requests(self, request_ids: List[int], **kwargs): + def add_dummy_requests(self, request_ids: List[int], **_kwargs): self.mamba_impl.allocate_cache_blocks(request_ids)

🧰 Tools

🪛 Ruff (0.14.13)

97-97: Unused method argument: kwargs

(ARG002)

🤖 Prompt for AI Agents

In `@tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py` around lines 94 - 111, The add_dummy_requests method currently has an unused kwargs parameter causing lint ARG002; update its signature to rename kwargs to _kwargs (def add_dummy_requests(self, request_ids: List[int], **_kwargs):) or explicitly del kwargs at the start of the method, leaving the body calling self.mamba_impl.allocate_cache_blocks(request_ids) unchanged; this silences the linter while preserving behavior in add_dummy_requests and keeps references to free_resources, get_cache_index, get_state_indices, get_conv_states, and get_ssm_states intact.

tensorrt-cicd · 2026-01-22T20:01:41Z

PR_Github #33217 [ run ] completed with state FAILURE. Commit: 559ca89
/LLM/main/L0_MergeRequest_PR pipeline #25664 completed with status: 'FAILURE'

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

NVShreyas · 2026-01-23T02:42:34Z

Will break this up into smaller PRs after cherry picking refactor only commits

NVShreyas · 2026-01-23T16:24:29Z

first PR #10957

Tabrizian and others added 8 commits January 16, 2026 00:45

[TRTLLM-10273][feat] Move MambaCacheManager from Python to C++

23c26eb

Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>

chore: add rnn cache state

416b487

Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>

refactor: cache sender

e90f07b

Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>

refactor cache receiver

ad0bd11

Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>

fix: serialization

872243c

Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>

feat: start rnn cache formatter impl

8553b3c

Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>

feat: rnn cache transceiver impl

69cc6a2

Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>

refactor: cache trans buffer managers

559ca89

Signed-off-by: Shreyas Misra <shreyasm@nvidia.com>

NVShreyas requested review from a team as code owners January 22, 2026 19:38

NVShreyas requested review from QiJune, Tabrizian, Wanli-Jiang, omera-nv and schetlur-nv January 22, 2026 19:38

coderabbitai bot reviewed Jan 22, 2026

View reviewed changes

pcastonguay requested review from Shixiaowei02 and chuangz0 January 22, 2026 21:06

[None][feat] Cache Transceivers for Mamba States #10934

Are you sure you want to change the base?

[None][feat] Cache Transceivers for Mamba States #10934

Uh oh!

Conversation

NVShreyas commented Jan 22, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Description

Test Coverage

PR Checklist

GitHub Bot Help

kill

skip

reuse-pipeline

Uh oh!

NVShreyas commented Jan 22, 2026

Uh oh!

tensorrt-cicd commented Jan 22, 2026

Uh oh!

coderabbitai bot commented Jan 22, 2026

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jan 22, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jan 22, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jan 22, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jan 22, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jan 22, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jan 22, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jan 22, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jan 22, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jan 22, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jan 22, 2026

Choose a reason for hiding this comment

Uh oh!

tensorrt-cicd commented Jan 22, 2026

Uh oh!

NVShreyas commented Jan 23, 2026

Uh oh!

NVShreyas commented Jan 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

NVShreyas commented Jan 22, 2026 •

edited by coderabbitai bot

Loading