[TRTLLM-12339][feat] Support T5 encoder-decoder models in the PyTorch backend #13919
cascade812 wants to merge 19 commits into NVIDIA:main from
Conversation
Signed-off-by: Guiju Zhang <7135567+cascade812@users.noreply.github.com>
📝 Walkthrough
Adds dual-pool KV caching with cross-attention, encoder iteration, scheduler gating, extended attention backends, new T5/BART models, API/bindings for encoder inputs and encoder outputs, and comprehensive tests/docs.
Changes
Encoder–Decoder Integration
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
✨ Finishing Touches: 🧪 Generate unit tests (beta)
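For orientation, a minimal usage sketch of the request path this PR targets. It is assembled only from the API names discussed in this review (KvCacheConfig.cross_kv_cache_fraction, encoder-only prompt dicts, generate_async); the exact accepted input shapes and the checkpoint name are assumptions, not taken from the PR.

```python
# Illustrative only: input shapes are assumptions based on the review comments,
# not verbatim from the PR.
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig

# Reserve half of the KV cache memory for the cross-attention pool.
kv_cache_config = KvCacheConfig(cross_kv_cache_fraction=0.5)

llm = LLM(model="t5-small", kv_cache_config=kv_cache_config)

# Encoder-only request: the review notes prompt_inputs() now accepts a dict
# carrying only encoder tokens.
future = llm.generate_async(
    {"encoder_input_token_ids": [37, 423, 12, 1]},
    sampling_params=SamplingParams(max_tokens=32),
)
print(future.result().outputs[0].text)
```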
Actionable comments posted: 13
Note
Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (12)
tensorrt_llm/executor/executor.py (1)
123-170: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Mirror encoder_input_token_ids on generate().
This wires encoder inputs through generate_async(), but the synchronous generate() wrapper still has no way to accept or forward them. Users of the blocking executor API are therefore missing the new encoder-decoder request path.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/executor/executor.py` around lines 123 - 170, The synchronous wrapper generate() does not accept or forward the new encoder_input_token_ids parameter, so callers using the blocking API can't use the encoder-decoder path; update the generate() signature to include encoder_input_token_ids: Optional[Union[torch.Tensor, np.ndarray, list]] = None and ensure it passes this argument through when constructing/dispatching the GenerationRequest (same name used in generate_async and GenerationRequest). Also propagate encoder_input_token_ids through any internal call sites or helper methods that currently forward other generation arguments to generate_async or create the GenerationRequest so behavior is identical to generate_async().
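A minimal sketch of the kind of change this asks for, assuming the blocking wrapper simply delegates to generate_async(); the surrounding parameter list is abbreviated and illustrative, not the real signature.

```python
# Sketch only: the real generate() method has more parameters; only the
# encoder_input_token_ids plumbing is shown here.
from typing import List, Optional, Union

import numpy as np
import torch


def generate(
    self,
    prompt_token_ids: Union[List[int], List[List[int]]],
    sampling_params=None,
    encoder_input_token_ids: Optional[Union[torch.Tensor, np.ndarray,
                                            list]] = None,
):
    # Forward the encoder tokens unchanged so the blocking API matches
    # generate_async() and the GenerationRequest sees the same field.
    future = self.generate_async(
        prompt_token_ids,
        sampling_params=sampling_params,
        encoder_input_token_ids=encoder_input_token_ids,
    )
    return future.result()
```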
tensorrt_llm/_torch/attention_backend/vanilla.py (1)
400-410: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Do not use a causal mask on flash-attn cross-attention.
The SDPA fallback correctly disables causality when Q/K lengths differ, but the flash-attn path still sets causal=True for the default mask. For encoder-decoder cross-attention that blocks later encoder tokens and produces wrong attention scores in fp16/bf16.
Suggested fix
 attn_output_unpad = flash_attn_varlen_func(
     q,
     k,
     v,
     cu_seqlens_q,
     cu_seqlens_k,
     max_seqlen_q,
     max_seqlen_k,
     dropout_p=0.0,
     softmax_scale=softmax_scale,
-    causal=attention_mask == PredefinedAttentionMask.CAUSAL,
+    causal=(attention_mask == PredefinedAttentionMask.CAUSAL
+            and not is_cross),
     alibi_slopes=None,
     deterministic=False,
     return_attn_probs=False,
 )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/attention_backend/vanilla.py` around lines 400 - 410, The flash-attn call incorrectly applies a causal mask for cross-attention; update the causal argument of flash_attn_varlen_func so it is True only for true self-attention. Compute a cross-attention condition (e.g., compare cu_seqlens_q and cu_seqlens_k shapes or values) and set causal = (attention_mask == PredefinedAttentionMask.CAUSAL) and not is_cross_attention before calling flash_attn_varlen_func; use the same logic/sentinel used by the SDPA fallback to detect differing Q/K lengths so encoder-decoder attention does not get a causal mask.
tensorrt_llm/_torch/pyexecutor/llm_request.py (2)
334-339: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Don't put raw encoder_output tensors into Diff without normalizing them first.
Diff is the cross-rank sync payload, but get_diff() only moves logits to CPU. As written, a CUDA encoder_output can escape into the transport/serialization path and either fail there or keep a large device allocation alive unexpectedly.
Suggested fix
 def get_diff(self) -> Diff:
     for i, context_logits in enumerate(self.diff.context_logits_list):
         self.diff.context_logits_list[i] = context_logits.to("cpu")
     for i, generation_logits in enumerate(self.diff.generation_logits_list):
         self.diff.generation_logits_list[i] = generation_logits.to("cpu")
+    if self.diff.encoder_output is not None:
+        self.diff.encoder_output = self.diff.encoder_output.to(
+            "cpu", non_blocking=True
+        )
     return self.diff
Also applies to: 357-358, 423-425
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/pyexecutor/llm_request.py` around lines 334 - 339, get_diff currently only moves logits to CPU but may leave raw CUDA encoder_output tensors inside Diff and other cross-rank payloads; update get_diff (and the analogous places around lines where encoder_output is added — e.g., the blocks referenced at 357-358 and 423-425) to normalize tensors before inserting into Diff: for each tensor (encoder_output, items in context_logits_list and generation_logits_list) call .detach().cpu().contiguous() (or .clone().detach().cpu() if contiguous not needed) to remove CUDA/device references and break autograd, and replace the originals in Diff with these CPU tensors so no CUDA memory or device-bound tensors are serialized or kept alive. Ensure you apply the same normalization routine wherever encoder_output or other tensors are appended to Diff.
649-679: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Add the NVIDIA header to this modified Python file.
This module was updated in the PR but still lacks the required NVIDIA copyright header/current modification year.
As per coding guidelines, "All C++, Python, and other source files must contain NVIDIA copyright header with current modification year".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/pyexecutor/llm_request.py` around lines 649 - 679, Add the required NVIDIA copyright header (with the current modification year) to the top of this Python source file (llm_request.py); place the standard NVIDIA header block before any imports or code (above the class/constructor that contains the shown super().__init__ calls and symbols like encoder_input_tokens, encoder_output_len, py_return_encoder_output) and update the copyright year to the current year.
cpp/include/tensorrt_llm/batch_manager/capacityScheduler.h (1)
169-183: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Update the copyright year in this modified header.
This header was changed for the PR, but its NVIDIA copyright line still ends at 2024.
As per coding guidelines, "Include NVIDIA copyright header on all new files; update year on modified files".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/include/tensorrt_llm/batch_manager/capacityScheduler.h` around lines 169 - 183, The file capacityScheduler.h was modified but its NVIDIA copyright header still ends at 2024; update the copyright year to the current year (2026) in the header comment so the top-of-file NVIDIA copyright line reflects the modification (e.g., replace "2024" with "2026" or use a range like "2024-2026"), then save and commit the change.
cpp/tensorrt_llm/thop/attentionOp.cpp (1)
679-699: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Require encoder lengths on the cross-attention path.
cross_attention=true is now accepted even when encoder_input_lengths is absent, but Runner::run() only populates the cross-attention length pointer when that tensor exists. That leaves the enqueue params under-specified for the new mode and turns a bad caller into wrong kernel bounds instead of a deterministic error.
Suggested guard
 TLLM_CHECK_WITH_INFO(is_mla_enable || is_fused_qkv || use_sage_attn || cross_attention,
     "Only fused QKV is supported for non-MLA non-SageAttention non-cross attention now");
 TLLM_CHECK_WITH_INFO(update_kv_cache || cross_attention, "KV cache update cannot be disabled now");
+TLLM_CHECK_WITH_INFO(
+    !cross_attention || encoder_input_lengths.has_value(),
+    "encoder_input_lengths must be provided for cross attention");
+TLLM_CHECK_WITH_INFO(
+    !cross_attention || !update_kv_cache || cross_kv.has_value(),
+    "cross_kv must be provided when cross attention updates KV cache");
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/tensorrt_llm/thop/attentionOp.cpp` around lines 679 - 699, When cross_attention is true you must require encoder_input_lengths to be present to avoid creating under-specified enqueue params; add a runtime check in the attention op (near the existing TLLM_CHECK_WITH_INFO calls in the function that has parameters cross_attention and encoder_input_lengths) that asserts encoder_input_lengths.has_value() when cross_attention is true and emits a clear error via TLLM_CHECK_WITH_INFO (or similar) referencing cross_attention and encoder_input_lengths; this will surface a deterministic error (rather than letting Runner::run produce missing cross-attention length pointers and invalid kernel bounds).
tests/unittest/_torch/executor/test_kv_cache_v2_scheduler.py (1)
60-89: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Initialize is_encoder_init_state explicitly on the request mocks.
KVCacheV2Scheduler._sort_requests() now branches on req.is_encoder_init_state. On plain Mock objects, a missing attribute becomes a truthy child mock, so context requests get misclassified as encoder-init and the new chunk-order assertions can pass even if that logic regresses.
Suggested fix
 def make_ctx_request(
@@
     req.lora_task_id = lora_task_id
     req.is_context_init_state = True
+    req.is_encoder_init_state = False
     req.is_generation_in_progress_state = False
     req.encoder_output_len = encoder_output_len
     req.py_encoder_output_ready_event = None
     req.py_skip_cross_kv_projection = False
     return req

 def make_encoder_request(request_id, encoder_output_len, lora_task_id=None):
@@
     req.encoder_output_len = encoder_output_len
     req.lora_task_id = lora_task_id
     req.is_context_init_state = False
+    req.is_encoder_init_state = True
     req.is_generation_in_progress_state = False
     req.is_first_context_chunk = True
     return req
Also applies to: 92-102
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/_torch/executor/test_kv_cache_v2_scheduler.py` around lines 60 - 89, The request mocks created by make_ctx_request are missing an explicit is_encoder_init_state attribute so Mock returns a truthy child mock and causes KVCacheV2Scheduler._sort_requests to misclassify context requests; update make_ctx_request to set req.is_encoder_init_state = False (and likewise set is_encoder_init_state explicitly in the other helper around lines 92-102) so encoder-init branching in _sort_requests behaves deterministically.
tensorrt_llm/_torch/pyexecutor/_util.py (1)
1-25: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Add the required NVIDIA file header.
This modified Python file still starts directly with imports, so it is missing the required NVIDIA copyright header/current modification year.
As per coding guidelines, "**/*.{cpp,cc,cxx,h,hpp,py}: Read and follow CODING_GUIDELINES.md for all code changes (C++ and Python) Include NVIDIA copyright header on all new files; update year on modified files" and "**/*.{cpp,h,hpp,cu,cuh,py}: All C++, Python, and other source files must contain NVIDIA copyright header with current modification year".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/pyexecutor/_util.py` around lines 1 - 25, This file is missing the required NVIDIA copyright/header; add the official NVIDIA file header (with the current modification year) to the very top of tensorrt_llm/_torch/pyexecutor/_util.py before any imports so the header precedes the existing module-level imports and symbols (e.g., torch, tensorrt_llm, DecodingMode, MODEL_CLASS_VISION_ENCODER_MAPPING); if this is a modification of an existing file, update the year in the header to the current year.
tensorrt_llm/_torch/pyexecutor/scheduler/scheduler.py (1)
1-16: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Add the required NVIDIA file header.
This modified Python file still starts directly with imports, so it is missing the required NVIDIA copyright header/current modification year.
As per coding guidelines, "**/*.{cpp,cc,cxx,h,hpp,py}: Read and follow CODING_GUIDELINES.md for all code changes (C++ and Python) Include NVIDIA copyright header on all new files; update year on modified files" and "**/*.{cpp,h,hpp,cu,cuh,py}: All C++, Python, and other source files must contain NVIDIA copyright header with current modification year".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/pyexecutor/scheduler/scheduler.py` around lines 1 - 16, Add the required NVIDIA copyright/header block to the very top of this Python module (above all imports) per CODING_GUIDELINES.md, updating the modification year to the current year; ensure the header matches the project's standard NVIDIA copyright template and includes the copyright owner and year range, then commit the file with the header in place (this file contains symbols like the module imports tensorrt_llm.bindings/internal, the import lines for LlmRequest and LlmRequestState, and references to CapacitySchedulerPolicy and logger to help locate the file).
tensorrt_llm/executor/result.py (1)
1-1: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Add required NVIDIA source header to this modified Python file.
This file was modified but still lacks the required NVIDIA copyright/license header.
Suggested patch
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 import asyncio
As per coding guidelines, "All C++, Python, and other source files must contain NVIDIA copyright header with current modification year."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/executor/result.py` at line 1, This Python module (tensorrt_llm.executor.result.py) is missing the required NVIDIA copyright/license header; add the standard NVIDIA source header (with the current modification year) at the very top of result.py before any code (e.g., before the existing "import asyncio") so the file complies with the project's C++/Python source header policy.
cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (1)
2-3: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Update copyright year range on this modified file.
The file changed in this PR but the header still ends at 2025.
Suggested patch
- * SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
As per coding guidelines, "Include NVIDIA copyright header on all new files; update year on modified files."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp` around lines 2 - 3, Update the copyright header's year range in the SPDX comments: change the SPDX-FileCopyrightText line that currently ends "2025" to include the current year (e.g., "2026") so the range reads "2022-2026" while leaving the SPDX-License-Identifier line unchanged.
tests/unittest/_torch/executor/test_request_utils.py (1)
1-1: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Add the required NVIDIA source header to this modified test file.
Even for unit-test Python sources, the repository rule requires the NVIDIA copyright/license header.
Suggested patch
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 """Tests for request_utils.py functions.
As per coding guidelines, "All C++, Python, and other source files must contain NVIDIA copyright header with current modification year."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/_torch/executor/test_request_utils.py` at line 1, Add the required NVIDIA source header to the top of the modified test module test_request_utils.py by inserting the standard NVIDIA copyright/license header (with the current modification year) as the first thing in the file, before the existing module docstring and imports so the file complies with the repository’s source-header rule.
🟡 Minor comments (8)
legacy_enc_dec_architecture.md (1)
123-123: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Add language identifier to fenced code block.
The fenced code block should specify a language identifier for better rendering and to satisfy markdown linting rules.
📝 Proposed fix
-```
+```text
 out/<tpX>/encoder/rank0.engine, config.json
 out/<tpX>/decoder/rank0.engine, config.json
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@legacy_enc_dec_architecture.md` at line 123, The fenced code block at the end of the file lacks a language identifier; update the block delimiter to include a language (e.g., change ``` to ```text) so it becomes a labeled fenced code block for proper rendering and linting; target the fenced block containing the lines "out/<tpX>/encoder/rank0.engine, config.json" and "out/<tpX>/decoder/rank0.engine, config.json" and add the language identifier.
tests/unittest/llmapi/test_llm_args.py (1)
342-353: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Add boundary acceptance tests for cross_kv_cache_fraction.
These new tests cover invalid values, but not valid boundaries. Please also assert that 0.0, 1.0, and None are accepted.
🧪 Suggested test additions
+@pytest.mark.parametrize("value", [0.0, 1.0, None])
+def test_KvCacheConfig_accepts_cross_kv_cache_fraction_boundaries(value):
+    cfg = KvCacheConfig(cross_kv_cache_fraction=value)
+    assert cfg.cross_kv_cache_fraction == value
As per coding guidelines, "Coverage expectations: Assess whether new/changed tests cover happy path, important edge cases, and failure modes relevant to the feature or fix."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/llmapi/test_llm_args.py` around lines 342 - 353, Add positive-boundary tests for KvCacheConfig to ensure cross_kv_cache_fraction accepts 0.0, 1.0, and None: create new tests (e.g., test_KvCacheConfig_accepts_zero_fraction, test_KvCacheConfig_accepts_one_fraction, test_KvCacheConfig_accepts_none_fraction) that instantiate KvCacheConfig(cross_kv_cache_fraction=0.0), KvCacheConfig(cross_kv_cache_fraction=1.0), and KvCacheConfig(cross_kv_cache_fraction=None) respectively and assert the object is created (no exception) and the attribute equals the provided value; place these alongside the existing failing-value tests that reference KvCacheConfig so behavior is fully covered.
tensorrt_llm/inputs/data.py (1)
38-45: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
The public prompt types still require prompt / prompt_token_ids.
prompt_inputs() now accepts encoder-only dictionaries, but TextPrompt and TokensPrompt still model prompt / prompt_token_ids as required keys. That leaves the published type contract out of sync with the runtime API for encoder-decoder requests.
Also applies to: 78-85
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/inputs/data.py` around lines 38 - 45, TextPrompt and TokensPrompt currently require prompt / prompt_token_ids but prompt_inputs() accepts encoder-only dicts; update the prompt type definitions to make prompt and prompt_token_ids optional (NotRequired/Optional) to match encoder-decoder requests. Locate the TextPrompt and TokensPrompt type/class definitions and change the required keys for "prompt" and "prompt_token_ids" to optional (align with encoder_input_token_ids / encoder_inputs used elsewhere) so the public type contract matches runtime behavior; ensure any related docstrings/comments reflect the optionality.
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
704-732: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Add the NVIDIA header to this modified Python file.
This file was changed in the PR but still starts without the required NVIDIA copyright header/current modification year.
As per coding guidelines, "All C++, Python, and other source files must contain NVIDIA copyright header with current modification year".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py` around lines 704 - 732, This Python source (tensorrt_llm/_torch/pyexecutor/resource_manager.py) is missing the required NVIDIA copyright header; add the standard NVIDIA copyright/header comment block (with the current modification year) at the very top of the file before any imports or code, ensuring it follows the project's header format used in other .py files; no functional code changes to symbols like CacheTypeCpp, kv_cache_type, add_sequence_batch, refresh_blocks, or class/method definitions are needed—just insert the header comment block exactly as mandated by the coding guidelines.
tensorrt_llm/_torch/modules/encoder_decoder_layer.py (1)
29-36: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Split abstract method declarations onto proper multi-line bodies to comply with PEP 8.
Both forward methods have one-line bodies that violate flake8 E704 ("multiple statements on one line"). Move the ... ellipsis to its own line:
Suggested fix
 @abstractmethod
 def forward(
     self,
     hidden_states: torch.Tensor,
     attn_metadata: AttentionMetadata,
     position_ids: Optional[torch.IntTensor] = None,
     **kwargs,
-) -> torch.Tensor: ...
+) -> torch.Tensor:
+    ...
@@
 @abstractmethod
 def forward(
     self,
     position_ids: torch.IntTensor,
     hidden_states: torch.Tensor,
     attn_metadata: AttentionMetadata,
     encoder_hidden_states: Optional[torch.Tensor] = None,
     cross_attn_metadata: Optional[AttentionMetadata] = None,
     skip_cross_kv_projection: bool = False,
     **kwargs,
-) -> torch.Tensor: ...
+) -> torch.Tensor:
+    ...
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/modules/encoder_decoder_layer.py` around lines 29 - 36, The abstract forward method currently uses a one-line body with the ellipsis, violating flake8 E704; update the forward declaration in the class (the def forward(...) method) so the signature ends with a colon and the ellipsis is placed on its own indented line (i.e., split the method into a proper multi-line body) to satisfy PEP 8 and flake8 checks.
encoder_decoder_porting_guide.md (1)
9-10: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Update the top-level contract to match the implementation.
This still says enc-dec is "V2-only" and baseline beam width is 1, but this PR adds V1 wiring/smoke coverage and explicitly calls out beam_width > 1 generation support. Please align the guide's opening contract so readers do not build against stale constraints.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@encoder_decoder_porting_guide.md` around lines 9 - 10, The top-level contract incorrectly states "V2-only" and beam width = 1; update the opening paragraph to reflect that encoder-decoder models (T5, BART, mBART) now include V1 wiring/smoke coverage and support generation with beam_width > 1. Change references to require use_kv_cache_manager_v2=True and KVCacheManagerV2 constraints to note that V2 is required for full enc-dec features (explicit SELF + CROSS pools) but V1 has limited wiring/smoke tests in this PR, and ensure the contract clearly documents supported beam_width behavior and runtime requirements (PyTorch vs. other runtimes).
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
4400-4402: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Consider converting assert to explicit RuntimeError for production safety.
The assertion on line 4400-4402 validates that the packed encoder length doesn't exceed max_num_tokens. While the assertion message is informative, assertions can be disabled with the -O flag in production Python. Consider using an explicit check with RuntimeError for consistent error behavior:
🛡️ Suggested change
-    assert num_tokens <= self.max_num_tokens, (
-        f"encoder packed length ({num_tokens}) exceeds max_num_tokens "
-        f"({self.max_num_tokens})")
+    if num_tokens > self.max_num_tokens:
+        raise RuntimeError(
+            f"encoder packed length ({num_tokens}) exceeds max_num_tokens "
+            f"({self.max_num_tokens}). Reduce batch size or sequence lengths.")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/pyexecutor/model_engine.py` around lines 4400 - 4402, The assert checking packed encoder length should be replaced with an explicit runtime check to avoid being bypassed by Python -O; in the scope where num_tokens and self.max_num_tokens are available (the code block containing the current assert), change to an if statement that tests if num_tokens > self.max_num_tokens and raise RuntimeError with the same formatted message (include num_tokens and self.max_num_tokens) so the error is always raised in production.
tensorrt_llm/_torch/pyexecutor/scheduler/scheduler.py (1)
449-450: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
PyMicroBatchScheduler.schedule() bypasses the new readiness gate.
_can_be_scheduled() now rejects decoder-context requests until py_encoder_output_ready_event is set, but the scheduling loop re-inlines only the state-range check. A direct call to schedule() can still admit an unready CONTEXT_INIT request.
Also applies to: 488-494
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/pyexecutor/scheduler/scheduler.py` around lines 449 - 450, The scheduling loop in PyMicroBatchScheduler.schedule() bypasses the new decoder-context readiness gate, so update schedule() to consult the same logic as _can_be_scheduled(): before admitting a request (particularly CONTEXT_INIT / decoder-context requests) call is_decoder_context_request_ready(req) or call _can_be_scheduled(req) so that py_encoder_output_ready_event is respected; apply the same fix to the second occurrence referenced (the block around the other admission at lines ~488-494) so no CONTEXT_INIT request is scheduled until py_encoder_output_ready_event allows it.
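A minimal sketch of the gate this comment describes, assuming the scheduling loop can simply reuse _can_be_scheduled(); the surrounding loop structure is illustrative, not the real method body.

```python
# Sketch: reuse the readiness check instead of re-inlining only the state-range test.
def schedule(self, active_requests):
    scheduled = []
    for req in active_requests:
        # _can_be_scheduled() already rejects decoder-context (CONTEXT_INIT)
        # requests whose py_encoder_output_ready_event has not been set yet.
        if not self._can_be_scheduled(req):
            continue
        scheduled.append(req)
    return scheduled
```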
🧹 Nitpick comments (12)
tests/unittest/llmapi/test_llm_args.py (1)
342-353: QA list update is unnecessary for this test change.
This PR delta is unit-test scope in tests/unittest/...; no tests/integration/defs/ entry changes are needed for QA scheduled lists.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/llmapi/test_llm_args.py` around lines 342 - 353, The change is only a unit-test update under tests/unittest and the QA scheduled lists should not be modified; revert or remove any edits in tests/integration/defs/ (or any QA list files) introduced by this PR, leaving the two new/updated tests (test_KvCacheConfig_rejects_cross_kv_cache_fraction_below_zero and test_KvCacheConfig_rejects_cross_kv_cache_fraction_above_one) and the KvCacheConfig-related assertions as the sole change set.
tensorrt_llm/llmapi/llm_args.py (1)
2607-2617: ⚡ Quick win
Prefer declarative field constraints for cross_kv_cache_fraction.
The custom validator at lines 2607–2616 duplicates constraint logic that belongs in the field definition. Pydantic v2 Field(ge=0, le=1) on Optional[float] enforces range validation on provided values while allowing None, matching the current validator behavior. This aligns with the codebase pattern (e.g., free_gpu_memory_fraction uses the same approach) and the coding guideline: "In Python Pydantic models, prefer PositiveInt, NonNegativeInt, NonNegativeFloat, PositiveFloat, or Field(gt=0) for numeric constraints instead of custom validators."
♻️ Proposed refactor
 cross_kv_cache_fraction: Optional[float] = Field(
     default=None,
+    ge=0,
+    le=1,
     description=
     "The fraction of the KV Cache memory should be reserved for cross attention. If set to p, self attention will use 1-p of KV Cache memory and cross attention will use p of KV Cache memory. Default is 50%. Should only be set when using encoder-decoder model."
 )
@@
-@field_validator('cross_kv_cache_fraction')
-@classmethod
-def validate_cross_kv_cache_fraction(cls, v: Optional[float]):
-    if v is None:
-        return v
-    if not 0 <= v <= 1:
-        raise ValueError(
-            "kv_cache_config.cross_kv_cache_fraction must be a float between 0 and 1"
-        )
-    return v
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/llmapi/llm_args.py` around lines 2607 - 2617, The custom validator validate_cross_kv_cache_fraction duplicates range checks; remove that validator and declare the field cross_kv_cache_fraction as Optional[float] with Pydantic constraints (e.g., Field(ge=0, le=1)) in its definition so None is allowed but any provided value must be between 0 and 1, matching the pattern used for free_gpu_memory_fraction.
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
77-80: ⚡ Quick win
Declare CROSS_KV_CACHE_MANAGER as a normal enum member.
Using locals() here is just reflection for a plain enum constant. It makes the symbol harder to grep and reason about, and the direct declaration is clearer.
Suggested fix
 class ResourceManagerType(enum.Enum):
     KV_CACHE_MANAGER = "KV_CACHE_MANAGER"
     DRAFT_KV_CACHE_MANAGER = "DRAFT_KV_CACHE_MANAGER"
-    locals()["CR" + "OSS_KV_CACHE_MANAGER"] = "CR" + "OSS_KV_CACHE_MANAGER"
+    CROSS_KV_CACHE_MANAGER = "CROSS_KV_CACHE_MANAGER"
     PEFT_CACHE_MANAGER = "PEFT_CACHE_MANAGER"
     SEQ_SLOT_MANAGER = "SEQ_SLOT_MANAGER"
     SPEC_RESOURCE_MANAGER = "SPEC_RESOURCE_MANAGER"
As per coding guidelines, "Python code should avoid reflection when functionality can be easily achieved without it".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py` around lines 77 - 80, The enum ResourceManagerType currently injects a member via reflection using locals()["CR" + "OSS_KV_CACHE_MANAGER"] = "CR" + "OSS_KV_CACHE_MANAGER"; replace that reflective line with a direct enum member declaration like CROSS_KV_CACHE_MANAGER = "CROSS_KV_CACHE_MANAGER" inside the ResourceManagerType class so the member is explicit and discoverable (update any references to ResourceManagerType.CROSS_KV_CACHE_MANAGER if needed).
tests/unittest/_torch/executor/test_encoder_step.py (1)
406-427: ⚡ Quick win
Add a beam-expanded case here.
beam_width > 1 is a headline behavior of this PR, but this helper is only tested with one row per request. A regression where encoder lengths or cached-token counts are not replicated correctly per beam would still pass; a small parametrized case with duplicated beams and mixed cached/new cross-KV would close that gap.
As per coding guidelines, "Coverage expectations: Assess whether new/changed tests cover happy path, important edge cases, and failure modes relevant to the feature or fix ... Note missing negative tests, missing parametrization where multiple backends or dtypes apply."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/_torch/executor/test_encoder_step.py` around lines 406 - 427, The test test_builds_metadata_for_mixed_projection_and_cached_sequences lacks a beam-expanded case; add a subcase where you call engine._prepare_encoder_decoder_cross_attention_inputs with beam_width > 1 (e.g., 2) and ensure inputs and metadata are validated for replicated beams: pass encoder_hidden_states duplicated per beam (or a batch with beam dimension), provide encoder_seq_lens and encoder_num_cached_tokens_per_seq expanded/repeated per beam, and assert that inputs["encoder_hidden_states"] points to the original tensor or its proper beam-expanded view, inputs["skip_cross_kv_projection"] remains False, inputs["cross_attn_metadata"] is metadata.cross_metadata with metadata.cross_metadata.prepared True, metadata.cross_kv_cache_manager equals the _FakeResourceManager's cross_manager, and that metadata.encoder_seq_lens and metadata.encoder_num_cached_tokens_per_seq have correct replicated values for each beam; reference test_builds_metadata_for_mixed_projection_and_cached_sequences, _prepare_encoder_decoder_cross_attention_inputs, _FakeAttentionMetadata, and _FakeResourceManager when making these additions.
tests/integration/defs/llmapi/test_llm_api_pytorch_t5.py (1)
374-606: Consider wiring this into the QA functional list.
This is core single-node functional coverage for a new encoder-decoder flow, so it likely deserves an entry in tests/integration/test_lists/qa/llm_function_core.txt if you want it exercised in scheduled QA. From this file alone, a perf-list update does not look necessary.
As per coding guidelines, "If the change adds or materially alters an integration test under tests/integration/defs/ ... call out whether an entry is needed under tests/integration/test_lists/qa/."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/defs/llmapi/test_llm_api_pytorch_t5.py` around lines 374 - 606, The new encoder-decoder integration tests (test_t5_pytorch_generate_encoder_decoder_end_to_end and test_t5_pytorch_generate_encoder_decoder_cuda_graph_mixed_encoder_lengths_batch) should be added to the QA functional test list so they run under scheduled QA; update the QA list (llm_function_core.txt) to include a short entry referencing these tests (model encoder-decoder flow / T5 PyTorch backend) so the change under tests/integration/defs/... is explicitly exercised by the QA pipeline.
tensorrt_llm/llmapi/llm.py (1)
628-636: 💤 Low value
Minor indentation style issue flagged by linter.
Static analysis flagged E121 (continuation line under-indented) at line 636. The closing parenthesis should align with the opening of the function call or maintain consistent indentation with the parameters.
🔧 Suggested fix for indentation
 (prompt_token_ids, prompt, query_token_ids, multimodal_params,
  encoder_input_token_ids) = self._preprocess(
- inputs,
- sampling_params,
- disaggregated_params,
- encoder_inputs=encoder_inputs,
- encoder_input_token_ids=encoder_input_token_ids,
- decoder_input_token_ids=decoder_input_token_ids,
- )
+ inputs,
+ sampling_params,
+ disaggregated_params,
+ encoder_inputs=encoder_inputs,
+ encoder_input_token_ids=encoder_input_token_ids,
+ decoder_input_token_ids=decoder_input_token_ids,
+ )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/llmapi/llm.py` around lines 628 - 636, The continuation call to self._preprocess is under-indented causing E121; fix by aligning the closing parenthesis with the opening parenthesis of the call or by indenting it to match the parameter block so the call looks consistently formatted; locate the call to _preprocess (the line starting with (prompt_token_ids, prompt, query_token_ids, multimodal_params, encoder_input_token_ids) = self._preprocess(...)) and adjust the indentation of the closing parenthesis and any continuation lines so they align with the opening self._preprocess( or match the indentation level used for the other multi-line calls in the file.
tests/unittest/_torch/modeling/test_modeling_enc_dec.py (1)
684-688: 💤 Low value
Consider extracting _make_cross_attn as a module-level helper.
The smoke benchmark instantiates TestCrossAttentionTrtllmBackend() just to call its _make_cross_attn method. This is functional but couples the smoke test to the test class structure. Consider moving _make_cross_attn to a module-level helper function (similar to _build_trtllm_cross_metadata) for cleaner reuse across test classes.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/_torch/modeling/test_modeling_enc_dec.py` around lines 684 - 688, Extract the TestCrossAttentionTrtllmBackend._make_cross_attn method into a module-level helper function (e.g., make_cross_attn) and update callers to use that helper instead of instantiating TestCrossAttentionTrtllmBackend; specifically, move the implementation out of the class into a top-level function alongside helpers like _build_trtllm_cross_metadata, keep the same parameters (hidden_size, num_heads, head_dim, dtype, device), preserve behavior and returned tensor/device handling, and replace usages such as TestCrossAttentionTrtllmBackend()._make_cross_attn(...) with make_cross_attn(...).
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
1872-1905: 💤 Low value
Consider extracting the nested update_cross_metadata function.
The nested update_cross_metadata function directly modifies internal attributes like _seq_lens and _seq_lens_cuda. While this works, it couples tightly to AttentionMetadata internals. Additionally, the function captures variables from the enclosing scope (attn_metadata, encoder_seq_lens_tensor, etc.) which makes the data flow less explicit. Consider whether this could be a method on AttentionMetadata itself or at minimum accept all dependencies as parameters for clarity.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/pyexecutor/model_engine.py` around lines 1872 - 1905, The nested function update_cross_metadata tightly couples to AttentionMetadata internals and captures external variables (attn_metadata, encoder_seq_lens_tensor, encoder_num_cached_tokens_per_seq, cross_kv_cache_manager), making data flow implicit; refactor by extracting update_cross_metadata into either an instance/class method on AttentionMetadata (e.g., AttentionMetadata.update_for_cross(attn_metadata, encoder_seq_lens_tensor, encoder_num_cached_tokens_per_seq, cross_kv_cache_manager)) or a standalone function that takes all dependencies as explicit parameters, and move creation of KVCacheParams (currently constructed inside update_cross_metadata) into that method so it only mutates public attributes (avoid direct writes to _seq_lens/_seq_lens_cuda) and document required inputs.
4468-4477: 💤 Low value
Consider using _get_top_level_model() for consistency with the embedding lookup pattern.
The encoder discovery at lines 4468-4472 retrieves the encoder directly from self.model and self.model.model, but the embedding lookup at line 4484 uses _get_top_level_model() to unwrap compiled models. For consistency and to guard against potential edge cases with torch.compile wrapped submodules, refactor the encoder discovery to use the same _get_top_level_model() pattern.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/pyexecutor/model_engine.py` around lines 4468 - 4477, The encoder lookup currently reads encoder from self.model and self.model.model directly which can miss wrapped/compiled submodules; change the discovery to first call _get_top_level_model(self.model) (the same helper used by the embedding lookup) and then get the "encoder" attribute (and fallback to inner.model.encoder if necessary) so encoder resolution mirrors the embedding pattern—update the block that references encoder and inner to use _get_top_level_model(...) and retain the existing AttributeError message if no encoder is found.
tests/unittest/_torch/executor/test_py_scheduler.py (1)
2506-2516: ⚡ Quick win
Add the constrained cross-pool case for MAX_UTILIZATION.
This test uses a roomy cross pool, so it still passes if ENCODER_INIT is accidentally charged against cross-KV blocks in the max-utilization path. Using something like MockKVCacheManager(num_free_blocks=1, blocks_per_request=2) here would mirror the guaranteed-no-evict regression case and actually prove encoder admission is cross-pool-neutral under both policies.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/_torch/executor/test_py_scheduler.py` around lines 2506 - 2516, The test test_max_utilization_admits_encoder_with_cross_pool currently uses a roomy cross pool; change the cross_kv MockKVCacheManager instantiation to mimic the constrained cross-pool case (e.g., MockKVCacheManager(num_free_blocks=1, blocks_per_request=2)) so the test verifies MAX_UTILIZATION does not incorrectly charge ENCODER_INIT to cross-KV blocks; keep the same requests (make_encoder_request and make_generation_request), call scheduler.schedule_request, and retain the existing assertions on fitting and paused to ensure encoder admission is cross-pool-neutral under this policy.
tests/unittest/llmapi/test_encoder_decoder_request_api.py (1)
77-82: ⚡ Quick win
Add one behavioral BaseLLM.generate test, not just a signature check.
The async path is covered end-to-end, but the sync wrapper only gets a parameter-order assertion. If generate() drops encoder_input_token_ids or handles the conflict checks differently from generate_async(), this file stays green.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/llmapi/test_encoder_decoder_request_api.py` around lines 77 - 82, The test currently only asserts parameter order for BaseLLM.generate; add a behavioral unit test that calls BaseLLM.generate (or a small test subclass/mocked LLM implementing generate_async) to verify that when both priority and encoder_input_token_ids/encoder_inputs kwargs are provided, generate forwards encoder inputs and enforces the same conflict/priority behavior as generate_async — specifically instantiate a fake LLM where generate_async records received kwargs, call the synchronous BaseLLM.generate wrapper with both priority and encoder_input_token_ids/encoder_inputs, and assert the recorded args include the encoder inputs (and that priority wins/behavior matches generate_async) to exercise the actual code paths rather than just signatures.
tests/unittest/_torch/executor/test_request_utils.py (1)
64-109: QA list update is not needed for this test change.
This is unit-test scope only (tests/unittest/**), so integration/release QA list edits are unnecessary in this PR.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/_torch/executor/test_request_utils.py` around lines 64 - 109, This PR only changes unit tests (e.g., test_executor_request_to_llm_request_preserves_encoder_tokens and test_scheduled_requests_does_not_query_encoder_context_chunk_state using ScheduledRequests/EncoderInitRequest), so remove any unrelated QA list / integration/release QA edits from the commit and keep only the test file modifications; ensure the diff contains solely the test changes and no QA list or release checklist updates before finalizing the PR.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: be032c9d-11c9-4f74-b7b7-0350127c8b65
📥 Commits
Reviewing files that changed from the base of the PR and between f8572abec4bf623dc142d7df8babc67ab1616996 and e4752c379acc0e88142ba6d285b588b3e869be7b.
📒 Files selected for processing (48)
* cpp/include/tensorrt_llm/batch_manager/capacityScheduler.h
* cpp/tensorrt_llm/batch_manager/capacityScheduler.cpp
* cpp/tensorrt_llm/common/attentionOp.cpp
* cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp
* cpp/tensorrt_llm/nanobind/thop/bindings.cpp
* cpp/tensorrt_llm/thop/attentionOp.cpp
* cpp/tensorrt_llm/thop/attentionOp.h
* cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
* encoder_decoder_porting_guide.md
* legacy_enc_dec_architecture.md
* tensorrt_llm/_torch/attention_backend/interface.py
* tensorrt_llm/_torch/attention_backend/trtllm.py
* tensorrt_llm/_torch/attention_backend/trtllm_gen.py
* tensorrt_llm/_torch/attention_backend/vanilla.py
* tensorrt_llm/_torch/model_config.py
* tensorrt_llm/_torch/models/__init__.py
* tensorrt_llm/_torch/models/modeling_bart.py
* tensorrt_llm/_torch/models/modeling_t5.py
* tensorrt_llm/_torch/modules/cross_attention.py
* tensorrt_llm/_torch/modules/encoder_decoder_layer.py
* tensorrt_llm/_torch/modules/rms_norm.py
* tensorrt_llm/_torch/pyexecutor/_util.py
* tensorrt_llm/_torch/pyexecutor/llm_request.py
* tensorrt_llm/_torch/pyexecutor/model_engine.py
* tensorrt_llm/_torch/pyexecutor/model_loader.py
* tensorrt_llm/_torch/pyexecutor/py_executor.py
* tensorrt_llm/_torch/pyexecutor/resource_manager.py
* tensorrt_llm/_torch/pyexecutor/scheduler/scheduler.py
* tensorrt_llm/_torch/pyexecutor/scheduler/scheduler_v2.py
* tensorrt_llm/executor/base_worker.py
* tensorrt_llm/executor/executor.py
* tensorrt_llm/executor/request.py
* tensorrt_llm/executor/result.py
* tensorrt_llm/inputs/data.py
* tensorrt_llm/llmapi/llm.py
* tensorrt_llm/llmapi/llm_args.py
* tests/integration/defs/llmapi/test_llm_api_pytorch_t5.py
* tests/unittest/_torch/executor/test_dual_pool_kv_cache.py
* tests/unittest/_torch/executor/test_encoder_step.py
* tests/unittest/_torch/executor/test_kv_cache_v2_scheduler.py
* tests/unittest/_torch/executor/test_py_scheduler.py
* tests/unittest/_torch/executor/test_request_utils.py
* tests/unittest/_torch/modeling/test_modeling_enc_dec.py
* tests/unittest/_torch/test_model_config.py
* tests/unittest/api_stability/references/llm.yaml
* tests/unittest/api_stability/references/request_output.yaml
* tests/unittest/llmapi/test_encoder_decoder_request_api.py
* tests/unittest/llmapi/test_llm_args.py
tensorrt_llm/_torch/attention_backend/interface.py (code under review, lines 457-475)
        if encoder_num_cached_tokens_per_seq is not None:
            from ..metadata import KVCacheParams
            base_params = self.kv_cache_params
            cross_md.kv_cache_params = KVCacheParams(
                use_cache=base_params.use_cache if base_params is not None else
                (cross_kv_cache_manager is not None),
                num_cached_tokens_per_seq=list(
                    encoder_num_cached_tokens_per_seq),
                block_ids_per_seq=base_params.block_ids_per_seq
                if base_params is not None else None,
                host_max_attention_window_sizes=base_params.
                host_max_attention_window_sizes
                if base_params is not None else None,
                host_sink_token_length=base_params.host_sink_token_length
                if base_params is not None else None,
                num_extra_kv_tokens=base_params.num_extra_kv_tokens
                if base_params is not None else 0,
            )
            cross_md.__post_init__()
None does not currently default cross cached-token counts to zero.
create_cross_metadata() documents that omitting encoder_num_cached_tokens_per_seq means “nothing cached yet”, but in that branch cross_md keeps the shallow-copied self.kv_cache_params. On generation steps that can reuse the decoder cached-token counts for the cross pool and corrupt cross-attention cache accounting.
Suggested fix
- if encoder_num_cached_tokens_per_seq is not None:
- from ..metadata import KVCacheParams
- base_params = self.kv_cache_params
- cross_md.kv_cache_params = KVCacheParams(
- use_cache=base_params.use_cache if base_params is not None else
- (cross_kv_cache_manager is not None),
- num_cached_tokens_per_seq=list(
- encoder_num_cached_tokens_per_seq),
- block_ids_per_seq=base_params.block_ids_per_seq
- if base_params is not None else None,
- host_max_attention_window_sizes=base_params.
- host_max_attention_window_sizes
- if base_params is not None else None,
- host_sink_token_length=base_params.host_sink_token_length
- if base_params is not None else None,
- num_extra_kv_tokens=base_params.num_extra_kv_tokens
- if base_params is not None else 0,
- )
+ from ..metadata import KVCacheParams
+ base_params = self.kv_cache_params
+ cached_tokens = (
+ list(encoder_num_cached_tokens_per_seq)
+ if encoder_num_cached_tokens_per_seq is not None
+ else [0] * int(encoder_seq_lens.numel())
+ )
+ cross_md.kv_cache_params = KVCacheParams(
+ use_cache=base_params.use_cache if base_params is not None else
+ (cross_kv_cache_manager is not None),
+ num_cached_tokens_per_seq=cached_tokens,
+ block_ids_per_seq=base_params.block_ids_per_seq
+ if base_params is not None else None,
+ host_max_attention_window_sizes=base_params.
+ host_max_attention_window_sizes
+ if base_params is not None else None,
+ host_sink_token_length=base_params.host_sink_token_length
+ if base_params is not None else None,
+ num_extra_kv_tokens=base_params.num_extra_kv_tokens
+ if base_params is not None else 0,
+ )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tensorrt_llm/_torch/attention_backend/interface.py` around lines 457 - 475,
The branch that handles encoder_num_cached_tokens_per_seq leaves cross_md using
the shallow-copied self.kv_cache_params when encoder_num_cached_tokens_per_seq
is None, so cross cached-token counts should instead be explicitly initialized
to "nothing cached yet" (zeros) to avoid reusing decoder counts; modify
create_cross_metadata so that when encoder_num_cached_tokens_per_seq is None you
construct a new KVCacheParams (like you do in the not-None branch) based on
base_params/self.kv_cache_params but override num_cached_tokens_per_seq with a
list of zeros of the appropriate length (e.g.,
len(base_params.num_cached_tokens_per_seq) if base_params is present, otherwise
an empty list or zeros based on expected sequence count), preserving other
fields (use_cache, block_ids_per_seq, host_max_attention_window_sizes,
host_sink_token_length, num_extra_kv_tokens) and then call
cross_md.__post_init__.
tensorrt_llm/_torch/attention_backend/trtllm_gen.py (code under review, lines 164-167)
        # Cross-attention generation steps legitimately need
        # update_kv_cache=False (encoder K/V are already cached).
        if not update_kv_cache and not cross_attention:
            return False, "KV cache update cannot be disabled now."
Keep the update_kv_cache=False exemption generation-only.
This relaxation is correct for cross-attention decode, but is_supported() still defaults to phase="both". As written, cross_attention=True + update_kv_cache=False is also accepted for context/both-phase checks, even though cross-attention context is where we populate the cross cache from cross_kv_input.
Suggested fix
- # Cross-attention generation steps legitimately need
- # update_kv_cache=False (encoder K/V are already cached).
- if not update_kv_cache and not cross_attention:
+ # Cross-attention decode legitimately needs update_kv_cache=False
+ # because encoder K/V are already cached. Context/prefill still has
+ # to materialize those K/V into the cross cache.
+ allow_disabled_kv_update = cross_attention and phase == "generation"
+ if not update_kv_cache and not allow_disabled_kv_update:
return False, "KV cache update cannot be disabled now."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tensorrt_llm/_torch/attention_backend/trtllm_gen.py` around lines 164 - 167,
The current check in is_supported() allows update_kv_cache=False whenever
cross_attention is True, which incorrectly permits that exemption for
non-generation phases (phase defaults to "both"); change the condition so
update_kv_cache can only be disabled when cross_attention is True AND phase ==
"generation". Concretely, update the if guarding update_kv_cache/cross_attention
to something like: if not update_kv_cache and not (cross_attention and phase ==
"generation"): return False, "...", so the exemption is generation-only; reference
symbols: is_supported(), update_kv_cache, cross_attention, phase.
encoder_seq_lens = (metadata.kv_lens_cuda_runtime
                    if metadata.is_cross else None)
Don’t use total cached KV lengths for context-side cross_kv_input.
In mixed reuse batches, metadata.kv_lens_cuda_runtime includes cached cross-KV, but cross_kv_input only contains the freshly projected encoder rows. For a batch like encoder_seq_lens=[2,0,0] and cached lengths [0,5,7], this passes lengths [2,5,7] alongside a 2-row cross_kv_input, so the C++ path overestimates the append geometry. Use the raw projection lengths when update_kv_cache is true, and reserve kv_lens_* for cache-read-only steps.
Also applies to: 1474-1505
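To make the intended selection concrete, here is a small standalone sketch; the helper name and arguments are illustrative only, not the backend's actual API:

import torch

# Illustrative helper, not the backend's real code: picks the lengths that
# must accompany cross_kv_input depending on whether we append or only read.
def select_encoder_seq_lens(is_cross: bool, update_kv_cache: bool,
                            projection_lens: torch.Tensor,
                            cached_kv_lens: torch.Tensor):
    # Self-attention paths pass no encoder lengths at all.
    if not is_cross:
        return None
    # Appending freshly projected encoder rows: describe exactly what is in
    # cross_kv_input, e.g. [2, 0, 0] when only request 0 ran the encoder.
    if update_kv_cache:
        return projection_lens
    # Cache-read-only decode step: the kernel needs the total cached lengths,
    # e.g. [0, 5, 7] for requests whose cross-KV was reused.
    return cached_kv_lens

print(select_encoder_seq_lens(True, True,
                              torch.tensor([2, 0, 0]),
                              torch.tensor([0, 5, 7])))  # tensor([2, 0, 0])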
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tensorrt_llm/_torch/attention_backend/trtllm.py` around lines 1351 - 1352,
The code currently sets encoder_seq_lens = metadata.kv_lens_cuda_runtime when
metadata.is_cross, but in mixed reuse batches this includes cached cross-KV rows
and overestimates lengths for the freshly-projected cross_kv_input; change the
logic in trtllm.py where encoder_seq_lens is computed so that when
metadata.is_cross and metadata.update_kv_cache is true (i.e., we are appending
new projected encoder rows) you use the raw projection lengths (the per-example
projection lengths source used to build cross_kv_input) instead of
metadata.kv_lens_cuda_runtime, and only use metadata.kv_lens_cuda_runtime /
kv_lens_* when doing cache-read-only steps (update_kv_cache is false); update
the analogous block(s) referenced around encoder_seq_lens and the later repeated
section (the area corresponding to the second occurrence ~1474-1505) to apply
the same conditional.
@@ -1,6 +1,8 @@
import transformers
Add the required NVIDIA source header.
This modified Python source file is missing the required copyright/license header.
Suggested patch
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
import transformers

As per coding guidelines, "All C++, Python, and other source files must contain NVIDIA copyright header with current modification year."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tensorrt_llm/_torch/models/__init__.py` at line 1, This file is missing the
required NVIDIA source/license header; add the standard NVIDIA copyright/license
header (with the current modification year) at the very top of the Python module
before the existing import transformers statement so the header appears above
all code in __init__.py.
if position_ids is not None:
    hidden_states = hidden_states + self.embed_positions(position_ids)
hidden_states = self.layernorm_embedding(hidden_states)
Apply +2 positional-embedding offset before lookup in encoder and decoder.
HF BART/mBART reserve rows 0 and 1 in embed_positions; actual token position 0 lives at index 2. The code correctly stores position_id_offset = 2 (line 435) with an explanatory comment but never applies this offset when accessing the embeddings. Both encoder (lines 351–353) and decoder (lines 396–398) index with raw position_ids, so loading pretrained checkpoints will read misaligned rows and produce incorrect positional embeddings.
Minimal fix
if position_ids is not None:
- hidden_states = hidden_states + self.embed_positions(position_ids)
+ hidden_states = hidden_states + self.embed_positions(position_ids + 2)Apply the same fix at both lines 352 and 397.
Also applies to: encoder (line 352), decoder (line 397)
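For reference, a tiny self-contained sketch of the HF convention the fix relies on; the sizes and names here are illustrative, not taken from the PR:

import torch
import torch.nn as nn

# Illustrative sizes; HF's BartLearnedPositionalEmbedding hard-codes offset = 2.
# A checkpoint with max_position_embeddings=1024 ships a (1024 + 2)-row table,
# and token position p must be read from row p + 2.
max_positions, d_model, offset = 1024, 16, 2
embed_positions = nn.Embedding(max_positions + offset, d_model)

position_ids = torch.arange(5)                        # token positions 0..4
pos_embeds = embed_positions(position_ids + offset)   # rows 2..6 of the table
print(pos_embeds.shape)                               # torch.Size([5, 16])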
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tensorrt_llm/_torch/models/modeling_bart.py` around lines 351 - 353, The
positional-embedding lookup uses raw position_ids but HF BART reserves rows 0–1,
so apply the stored offset before indexing: when adding embeddings in the
encoder and decoder (where you do hidden_states = hidden_states +
self.embed_positions(position_ids)), replace position_ids with position_ids +
position_id_offset (use the existing position_id_offset = 2 variable) so
embed_positions is indexed with position_ids + position_id_offset; update both
occurrences (the encoder use around the embed_positions call and the decoder use
around its embed_positions call) and keep the subsequent layernorm_embedding
calls unchanged.
try:
    self.encoder_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(self.encoder_stream):
        encoder_hidden_states, encoder_seq_lens = (
            self.model_engine.forward_encoder(
                encoder_requests,
                resource_manager=self.resource_manager,
            ))
except Exception as e:
    traceback.print_exc()
    error_msg = str(e)
    logger.error(
        f"Encountered an error in encoder forward: {error_msg}")
    self._handle_errors(error_msg, requests=encoder_requests)
    return
Catch only specific, request-scoped exceptions in the encoder forward path.
The broad except Exception at line 3273 violates the coding guideline "When using try-except blocks, limit the except to the smallest set of errors possible" and masks internal errors like AttributeError (missing encoder submodule) that should propagate to fail fast. Catch only ValueError for invalid input errors and let other exceptions (RuntimeError, AttributeError, etc.) propagate so the executor detects corrupted state.
🧰 Tools
🪛 Ruff (0.15.12)
[warning] 3273-3273: Do not catch blind exception: Exception
(BLE001)
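A self-contained sketch of the narrowed pattern; the function and callback names below are placeholders, not the executor's real API:

import logging
import traceback

logger = logging.getLogger(__name__)

# Placeholder wrapper, not PyExecutor's real method: only request-scoped
# input errors are absorbed, everything else propagates.
def run_encoder_step(forward_encoder, encoder_requests, handle_errors):
    try:
        return forward_encoder(encoder_requests)
    except ValueError as e:
        # Bad inputs for this batch only: report and fail just these requests.
        traceback.print_exc()
        logger.error("Encountered an error in encoder forward: %s", e)
        handle_errors(str(e), requests=encoder_requests)
        return None
    # Anything else (RuntimeError, AttributeError, ...) propagates so the
    # executor fails fast instead of masking corrupted internal state.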
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tensorrt_llm/_torch/pyexecutor/py_executor.py` around lines 3265 - 3279,
Replace the broad "except Exception" around the encoder forward block so only
input-validation errors are caught: change the handler to "except ValueError as
e:" (keeping traceback.print_exc(), logger.error(f'Encountered an error in
encoder forward: {error_msg}'), calling self._handle_errors(error_msg,
requests=encoder_requests) and returning) and remove/avoid catching other
exception types so RuntimeError/AttributeError and other internal faults from
self.model_engine.forward_encoder, self.encoder_stream, or missing submodules
propagate and fail fast; keep references to encoder_stream,
model_engine.forward_encoder, resource_manager, encoder_requests, and
_handle_errors to locate and implement the change.
assert len(encoder_seq_lens) == len(encoder_requests), (
    "Encoder packed sequence lengths must match the number of "
    "encoder requests")
assert encoder_hidden_states.shape[0] == sum(encoder_seq_lens), (
    "Encoder packed hidden states first dim must equal "
    "sum(encoder_seq_lens)")

offset = 0
for req, seq_len in zip(encoder_requests, encoder_seq_lens):
    req.py_encoder_output = encoder_hidden_states[offset:offset +
                                                  seq_len]
    req.py_skip_cross_kv_projection = False
    req.state = LlmRequestState.CONTEXT_INIT
    offset += seq_len
Use explicit runtime checks instead of asserts for encoder output distribution.
These invariants currently depend on assert, which can be disabled when Python runs with optimization flags (-O/-OO). When disabled, the plain zip() call will silently truncate if the lists have different lengths, leaving a suffix of requests without py_encoder_output and without the required state transition to CONTEXT_INIT—a difficult-to-debug production issue.
Replace asserts with explicit if checks raising RuntimeError, and add strict=True to the zip() call to catch length mismatches at runtime:
Suggested fix
- assert len(encoder_seq_lens) == len(encoder_requests), (
- "Encoder packed sequence lengths must match the number of "
- "encoder requests")
- assert encoder_hidden_states.shape[0] == sum(encoder_seq_lens), (
- "Encoder packed hidden states first dim must equal "
- "sum(encoder_seq_lens)")
+ if len(encoder_seq_lens) != len(encoder_requests):
+ raise RuntimeError(
+ "Encoder packed sequence lengths must match the number of "
+ "encoder requests")
+ if encoder_hidden_states.shape[0] != sum(encoder_seq_lens):
+ raise RuntimeError(
+ "Encoder packed hidden states first dim must equal "
+ "sum(encoder_seq_lens)")
@@
- for req, seq_len in zip(encoder_requests, encoder_seq_lens):
+ for req, seq_len in zip(
+ encoder_requests, encoder_seq_lens, strict=True):🧰 Tools
🪛 Ruff (0.15.12)
[warning] 3324-3324: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tensorrt_llm/_torch/pyexecutor/py_executor.py` around lines 3316 - 3329,
Replace the two assert checks with explicit runtime validation: verify
len(encoder_seq_lens) == len(encoder_requests) and
encoder_hidden_states.shape[0] == sum(encoder_seq_lens); if either fails, raise
a RuntimeError with a clear message. Also iterate using zip(encoder_requests,
encoder_seq_lens, strict=True) to ensure length mismatches raise immediately,
then continue assigning req.py_encoder_output, req.py_skip_cross_kv_projection =
False and setting req.state = LlmRequestState.CONTEXT_INIT as before (use the
same symbols encoder_seq_lens, encoder_requests, encoder_hidden_states,
py_encoder_output, py_skip_cross_kv_projection, LlmRequestState.CONTEXT_INIT).
cross_action = self._try_schedule_cross_context(req)
if cross_action is not ScheduleAction.SCHEDULED:
    self._suspend_request(req)
    return cross_action, 0, False
Roll back the cross pool too when decoder-context admission fails.
By this point the request has already resized the self-KV cache, and _try_schedule_cross_context() may also have restored or allocated cross-KV state. Calling _suspend_request(req) only frees the main and draft managers, so cross-pool pages stay pinned and later encoder-decoder admissions can stall once that pool fills.
Suggested fix
def _suspend_request(self, req: LlmRequest) -> None:
"""Suspend a request's KV cache in both main and draft managers.
@@
self.kv_cache_manager.suspend_request(req)
if self.draft_kv_cache_manager is not None:
self.draft_kv_cache_manager.suspend_request(req)
+ if self.cross_kv_cache_manager is not None:
+ self.cross_kv_cache_manager.suspend_request(req)Also applies to: 531-534
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tensorrt_llm/_torch/pyexecutor/scheduler/scheduler_v2.py` around lines 465 -
468, The failure path after calling _try_schedule_cross_context(req) must undo
any cross-pool KV allocations before returning; modify the two failure sites
(the block that checks cross_action != ScheduleAction.SCHEDULED) to call a
rollback/free helper for cross-context state (e.g. add a call to a new or
existing _rollback_cross_context(req) or extend _suspend_request(req) to free
cross-pool pages) so that any allocations or pins made by
_try_schedule_cross_context are released prior to returning; apply the same
change at the other matching location that checks cross_action (the one around
the secondary occurrence at lines 531-534).
def can_schedule(self, requests: RequestList) -> bool:
    requests = filter_unready_decoder_context_requests(requests)
    fitting_requests, _, _ = self.capacity_scheduler.schedule_request(requests)
    return len(fitting_requests) == len(requests)
can_schedule() now returns True for filtered-out requests.
If the only input is an unready CONTEXT_INIT request, filter_unready_decoder_context_requests() turns it into [], and the length check succeeds. That makes can_schedule([req]) report schedulable even though the request is explicitly blocked on encoder output readiness.
♻️ Suggested fix
def can_schedule(self, requests: RequestList) -> bool:
- requests = filter_unready_decoder_context_requests(requests)
- fitting_requests, _, _ = self.capacity_scheduler.schedule_request(requests)
- return len(fitting_requests) == len(requests)
+ ready_requests = filter_unready_decoder_context_requests(requests)
+ if len(ready_requests) != len(requests):
+ return False
+ fitting_requests, _, _ = self.capacity_scheduler.schedule_request(
+ ready_requests
+ )
+ return len(fitting_requests) == len(ready_requests)Also applies to: 1830-1834
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tensorrt_llm/_torch/pyexecutor/scheduler/scheduler.py` around lines 383 -
386, can_schedule currently filters out unready CONTEXT_INIT requests before
comparing lengths, causing it to return True when all original requests were
filtered out; update can_schedule to first capture the original requests list,
run filter_unready_decoder_context_requests, and immediately return False if any
requests were filtered out (i.e., len(filtered) != len(original)); otherwise
call self.capacity_scheduler.schedule_request(filtered) and return True only if
len(fitting_requests) == len(original) (or len(filtered) since they’re equal),
referencing can_schedule, filter_unready_decoder_context_requests, and
capacity_scheduler.schedule_request to locate the logic to change.
assert inputs.get("prompt") is not None \
-    or inputs.get("prompt_token_ids") is not None
+    or inputs.get("prompt_token_ids") is not None \
+    or inputs.get("encoder_inputs") is not None \
+    or inputs.get("encoder_input_token_ids") is not None
Accept decoder_input_token_ids in dict prompts.
This guard still rejects a request that only provides decoder_input_token_ids, so one of the new encoder-decoder prompt shapes fails before it reaches the executor.
Suggested fix
assert inputs.get("prompt") is not None \
or inputs.get("prompt_token_ids") is not None \
or inputs.get("encoder_inputs") is not None \
- or inputs.get("encoder_input_token_ids") is not None
+ or inputs.get("encoder_input_token_ids") is not None \
+ or inputs.get("decoder_input_token_ids") is not None🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tensorrt_llm/inputs/data.py` around lines 98 - 101, The current assertion in
the inputs validation checks for "prompt", "prompt_token_ids", "encoder_inputs",
or "encoder_input_token_ids" but misses the new decoder-only shape; update the
guard to also accept inputs.get("decoder_input_token_ids") (i.e., add an OR
inputs.get("decoder_input_token_ids") is not None) so that requests supplying
only decoder_input_token_ids pass validation; modify the assertion line that
references inputs.get("prompt") / "prompt_token_ids" / "encoder_inputs" /
"encoder_input_token_ids" accordingly.
Description
This PR adds the core encoder-decoder request flow, including encoder execution, decoder context/generation handling, cross-attention metadata, and cross-KV cache management.
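As a rough sketch of the resulting user-facing flow (the dict keys mirror the encoder_input_token_ids / decoder_input_token_ids fields discussed in the review comments; the exact prompt schema in the merged API may differ):

from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="t5-small")  # encoder-decoder checkpoint on the PyTorch backend
sampling_params = SamplingParams(max_tokens=32)

outputs = llm.generate(
    [{
        # Assumed prompt shape based on the data.py / executor.py review notes.
        "encoder_input_token_ids": [37, 423, 288, 1],  # tokenized source text
        "decoder_input_token_ids": [0],                # decoder start token
    }],
    sampling_params,
)
print(outputs[0].outputs[0].text)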
Key Changes
beam_width > 1 for encoder-decoder generation.
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.
Summary by CodeRabbit
Release Notes
New Features
Documentation