
[TRTLLM-12339][feat] Support T5 encoder-decoder models in the PyTorch backend#13919

Open
cascade812 wants to merge 19 commits into NVIDIA:main from cascade812:guiju/en_de3

Conversation

@cascade812
Collaborator

@cascade812 cascade812 commented May 8, 2026

Description

This PR adds the core encoder-decoder request flow, including encoder execution, decoder context/generation handling, cross-attention metadata, and cross-KV cache management.

Key Changes

  • Add PyTorch backend support for T5 model execution.
  • Enable TRTLLM attention backend support for encoder-decoder cross-attention.
  • Add dual KV cache pool support:
    • self-attention KV cache for decoder tokens
    • cross-attention KV cache for encoder outputs
  • Add precision conversion support for encoder-decoder model weights.
  • Support beam_width > 1 for encoder-decoder generation.
  • Plumb encoder-side inputs through LLM API request handling and executor flow.
  • Add unit and integration coverage for encoder-decoder request handling, scheduling, KV cache behavior, and T5 execution.
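The dual KV cache pool layout can be sketched in a few lines. The code below is an illustrative toy, not the TensorRT-LLM implementation: `KVPool`, `allocate`, and `reserve_for_request` are hypothetical names, and the real managers deal in device memory rather than plain block indices. The point it shows is the split: the self-attention pool grows with decoder tokens, while the cross-attention pool is sized once from the encoder output length.

```python
# Toy sketch of dual-pool KV cache reservation (names are illustrative).
class KVPool:
    def __init__(self, num_blocks: int, tokens_per_block: int):
        self.free = list(range(num_blocks))
        self.tokens_per_block = tokens_per_block

    def allocate(self, num_tokens: int) -> list:
        need = -(-num_tokens // self.tokens_per_block)  # ceiling division
        if need > len(self.free):
            raise RuntimeError("out of KV blocks")
        blocks, self.free = self.free[:need], self.free[need:]
        return blocks


def reserve_for_request(self_pool, cross_pool, prompt_len, encoder_len):
    # Self-attention cache grows with decoder tokens; cross-attention cache
    # is fixed by the encoder sequence length and written exactly once.
    return {
        "self": self_pool.allocate(prompt_len),
        "cross": cross_pool.allocate(encoder_len),
    }


self_pool = KVPool(num_blocks=16, tokens_per_block=32)
cross_pool = KVPool(num_blocks=16, tokens_per_block=32)
blocks = reserve_for_request(self_pool, cross_pool, prompt_len=40, encoder_len=100)
print(len(blocks["self"]), len(blocks["cross"]))  # 2 4
```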

Test Coverage

  • Added T5 PyTorch integration tests.
  • Added encoder-decoder request API tests.
  • Added dual-pool KV cache and scheduler unit tests.

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for encoder-decoder models (T5, BART, mBART) with dedicated cross-attention KV cache pooling
    • Extended generation API to accept encoder inputs and decoder inputs alongside standard prompts
    • Implemented dual-pool KV cache management for efficient cross-attention memory allocation
    • Added encoder output exposure in generation results
  • Documentation

    • Added comprehensive architecture guide for legacy encoder-decoder execution flow
    • Added encoder-decoder porting guide for PyTorch runtime migration

cascade812 added 19 commits May 7, 2026 16:29
Signed-off-by: Guiju Zhang <7135567+cascade812@users.noreply.github.com>
@cascade812 cascade812 requested review from a team as code owners May 8, 2026 20:41
@coderabbitai
Contributor

coderabbitai Bot commented May 8, 2026


📝 Walkthrough

Walkthrough

Adds dual-pool KV caching with cross-attention, encoder iteration, scheduler gating, extended attention backends, new T5/BART models, API/bindings for encoder inputs and encoder outputs, and comprehensive tests/docs.

Changes

Encoder–Decoder Integration

  • C++ Schedulers (cpp/include/.../capacityScheduler.h, cpp/tensorrt_llm/batch_manager/capacityScheduler.cpp): schedulers accept a cross-KV manager; encoder-init path; cross budgeting and mirrored state.
  • C++ Attention/Bindings (cpp/tensorrt_llm/common/attentionOp.cpp, cpp/tensorrt_llm/thop/*, cpp/tensorrt_llm/nanobind/*): cross-attention parameters wired through the bindings and attention runner.
  • Py Attention Backends (tensorrt_llm/_torch/attention_backend/*): cross-aware metadata, assertions, kernel selection, enqueue params, and SDPA fallback.
  • Models & Modules (tensorrt_llm/_torch/models/*, .../modules/*, .../model_config.py): add T5/BART encoder-decoder models, CrossAttention, abstract layers, registrations, and loaders.
  • Executor/Scheduler/Core (tensorrt_llm/_torch/pyexecutor/*): build cross KV, split budgets, encoder iteration, gating, and dual-pool reservations (V1/V2).
  • API/Surface (tensorrt_llm/llmapi/llm.py, executor/*, inputs/data.py, llm_args.py): expose encoder inputs/outputs; forward through executor/worker; validate cross_kv_cache_fraction.
  • Docs (encoder_decoder_porting_guide.md, legacy_enc_dec_architecture.md): add porting guide and legacy architecture overview.
  • Tests (tests/*): extensive unit/integration coverage for models, attention, KV pools, schedulers, API, and configs.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 13

Note

Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (12)
tensorrt_llm/executor/executor.py (1)

123-170: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Mirror encoder_input_token_ids on generate().

This wires encoder inputs through generate_async(), but the synchronous generate() wrapper still has no way to accept or forward them. Users of the blocking executor API are therefore missing the new encoder-decoder request path.
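A minimal sketch of the parity the comment asks for, with the class and argument names as stand-ins rather than the actual tensorrt_llm executor API: the blocking wrapper forwards every keyword argument to the async path, so a new parameter such as `encoder_input_token_ids` cannot silently diverge between the two entry points.

```python
# Illustrative only: keeping a sync wrapper in parity with its async twin.
from typing import List, Optional


class Executor:
    def generate_async(self, prompt_token_ids: List[int],
                       encoder_input_token_ids: Optional[List[int]] = None):
        # Stand-in for the real async path: capture everything forwarded.
        return {"prompt": prompt_token_ids, "encoder": encoder_input_token_ids}

    def generate(self, prompt_token_ids: List[int],
                 encoder_input_token_ids: Optional[List[int]] = None):
        # Forward *every* argument so the blocking API exposes the same surface.
        return self.generate_async(
            prompt_token_ids, encoder_input_token_ids=encoder_input_token_ids)


result = Executor().generate([1, 2, 3], encoder_input_token_ids=[7, 8])
print(result["encoder"])  # [7, 8]
```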

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/executor/executor.py` around lines 123 - 170, The synchronous
wrapper generate() does not accept or forward the new encoder_input_token_ids
parameter, so callers using the blocking API can't use the encoder-decoder path;
update the generate() signature to include encoder_input_token_ids:
Optional[Union[torch.Tensor, np.ndarray, list]] = None and ensure it passes this
argument through when constructing/dispatching the GenerationRequest (same name
used in generate_async and GenerationRequest). Also propagate
encoder_input_token_ids through any internal call sites or helper methods that
currently forward other generation arguments to generate_async or create the
GenerationRequest so behavior is identical to generate_async().
tensorrt_llm/_torch/attention_backend/vanilla.py (1)

400-410: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not use a causal mask on flash-attn cross-attention.

The SDPA fallback correctly disables causality when Q/K lengths differ, but the flash-attn path still sets causal=True for the default mask. For encoder-decoder cross-attention that blocks later encoder tokens and produces wrong attention scores in fp16/bf16.

Suggested fix
         attn_output_unpad = flash_attn_varlen_func(
             q,
             k,
             v,
             cu_seqlens_q,
             cu_seqlens_k,
             max_seqlen_q,
             max_seqlen_k,
             dropout_p=0.0,
             softmax_scale=softmax_scale,
-            causal=attention_mask == PredefinedAttentionMask.CAUSAL,
+            causal=(attention_mask == PredefinedAttentionMask.CAUSAL
+                    and not is_cross),
             alibi_slopes=None,
             deterministic=False,
             return_attn_probs=False,
         )
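A tiny mask count makes the failure mode concrete. This is a self-contained illustration, unrelated to any TensorRT-LLM API: with 2 decoder queries over 5 encoder keys, a causal mask (key position at most query position) hides most of the encoder output, whereas cross-attention should see all of it.

```python
# Why a causal mask is wrong for cross-attention: count visible positions.
def causal_mask(num_q, num_k):
    # Standard causal rule: query i may only see keys 0..i.
    return [[k <= q for k in range(num_k)] for q in range(num_q)]

def cross_mask(num_q, num_k):
    # Cross-attention: every decoder query sees the full encoder sequence.
    return [[True] * num_k for _ in range(num_q)]

num_q, num_k = 2, 5  # decoder steps vs. encoder length
visible_causal = sum(map(sum, causal_mask(num_q, num_k)))
visible_cross = sum(map(sum, cross_mask(num_q, num_k)))
print(visible_causal, visible_cross)  # 3 10 - causal hides 7 of 10 positions
```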
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/attention_backend/vanilla.py` around lines 400 - 410, The
flash-attn call incorrectly applies a causal mask for cross-attention; update
the causal argument of flash_attn_varlen_func so it is True only for true
self-attention. Compute a cross-attention condition (e.g., compare cu_seqlens_q
and cu_seqlens_k shapes or values) and set causal = (attention_mask ==
PredefinedAttentionMask.CAUSAL) and not is_cross_attention before calling
flash_attn_varlen_func; use the same logic/sentinel used by the SDPA fallback to
detect differing Q/K lengths so encoder-decoder attention does not get a causal
mask.
tensorrt_llm/_torch/pyexecutor/llm_request.py (2)

334-339: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t put raw encoder_output tensors into Diff without normalizing them first.

Diff is the cross-rank sync payload, but get_diff() only moves logits to CPU. As written, a CUDA encoder_output can escape into the transport/serialization path and either fail there or keep a large device allocation alive unexpectedly.

Suggested fix
     def get_diff(self) -> Diff:
         for i, context_logits in enumerate(self.diff.context_logits_list):
             self.diff.context_logits_list[i] = context_logits.to("cpu")
         for i, generation_logits in enumerate(self.diff.generation_logits_list):
             self.diff.generation_logits_list[i] = generation_logits.to("cpu")
+        if self.diff.encoder_output is not None:
+            self.diff.encoder_output = self.diff.encoder_output.to(
+                "cpu", non_blocking=True
+            )
         return self.diff

Also applies to: 357-358, 423-425

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/pyexecutor/llm_request.py` around lines 334 - 339,
get_diff currently only moves logits to CPU but may leave raw CUDA
encoder_output tensors inside Diff and other cross-rank payloads; update
get_diff (and the analogous places around lines where encoder_output is added —
e.g., the blocks referenced at 357-358 and 423-425) to normalize tensors before
inserting into Diff: for each tensor (encoder_output, items in
context_logits_list and generation_logits_list) call
.detach().cpu().contiguous() (or .clone().detach().cpu() if contiguous not
needed) to remove CUDA/device references and break autograd, and replace the
originals in Diff with these CPU tensors so no CUDA memory or device-bound
tensors are serialized or kept alive. Ensure you apply the same normalization
routine wherever encoder_output or other tensors are appended to Diff.

649-679: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add the NVIDIA header to this modified Python file.

This module was updated in the PR but still lacks the required NVIDIA copyright header/current modification year.

As per coding guidelines, "All C++, Python, and other source files must contain NVIDIA copyright header with current modification year".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/pyexecutor/llm_request.py` around lines 649 - 679, Add
the required NVIDIA copyright header (with the current modification year) to the
top of this Python source file (llm_request.py); place the standard NVIDIA
header block before any imports or code (above the class/constructor that
contains the shown super().__init__ calls and symbols like encoder_input_tokens,
encoder_output_len, py_return_encoder_output) and update the copyright year to
the current year.
cpp/include/tensorrt_llm/batch_manager/capacityScheduler.h (1)

169-183: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update the copyright year in this modified header.

This header was changed for the PR, but its NVIDIA copyright line still ends at 2024.

As per coding guidelines, "Include NVIDIA copyright header on all new files; update year on modified files".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/include/tensorrt_llm/batch_manager/capacityScheduler.h` around lines 169
- 183, The file capacityScheduler.h was modified but its NVIDIA copyright header
still ends at 2024; update the copyright year to the current year (2026) in the
header comment so the top-of-file NVIDIA copyright line reflects the
modification (e.g., replace "2024" with "2026" or use a range like "2024-2026"),
then save and commit the change.
cpp/tensorrt_llm/thop/attentionOp.cpp (1)

679-699: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Require encoder lengths on the cross-attention path.

cross_attention=true is now accepted even when encoder_input_lengths is absent, but Runner::run() only populates the cross-attention length pointer when that tensor exists. That leaves the enqueue params under-specified for the new mode and turns a bad caller into wrong kernel bounds instead of a deterministic error.

Suggested guard
     TLLM_CHECK_WITH_INFO(is_mla_enable || is_fused_qkv || use_sage_attn || cross_attention,
         "Only fused QKV is supported for non-MLA non-SageAttention non-cross attention now");
     TLLM_CHECK_WITH_INFO(update_kv_cache || cross_attention, "KV cache update cannot be disabled now");
+    TLLM_CHECK_WITH_INFO(
+        !cross_attention || encoder_input_lengths.has_value(),
+        "encoder_input_lengths must be provided for cross attention");
+    TLLM_CHECK_WITH_INFO(
+        !cross_attention || !update_kv_cache || cross_kv.has_value(),
+        "cross_kv must be provided when cross attention updates KV cache");
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tensorrt_llm/thop/attentionOp.cpp` around lines 679 - 699, When
cross_attention is true you must require encoder_input_lengths to be present to
avoid creating under-specified enqueue params; add a runtime check in the
attention op (near the existing TLLM_CHECK_WITH_INFO calls in the function that
has parameters cross_attention and encoder_input_lengths) that asserts
encoder_input_lengths.has_value() when cross_attention is true and emits a clear
error via TLLM_CHECK_WITH_INFO (or similar) referencing cross_attention and
encoder_input_lengths; this will surface a deterministic error (rather than
letting Runner::run produce missing cross-attention length pointers and invalid
kernel bounds).
tests/unittest/_torch/executor/test_kv_cache_v2_scheduler.py (1)

60-89: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Initialize is_encoder_init_state explicitly on the request mocks.

KVCacheV2Scheduler._sort_requests() now branches on req.is_encoder_init_state. On plain Mock objects, a missing attribute becomes a truthy child mock, so context requests get misclassified as encoder-init and the new chunk-order assertions can pass even if that logic regresses.

Suggested fix
 def make_ctx_request(
@@
     req.lora_task_id = lora_task_id
     req.is_context_init_state = True
+    req.is_encoder_init_state = False
     req.is_generation_in_progress_state = False
     req.encoder_output_len = encoder_output_len
     req.py_encoder_output_ready_event = None
     req.py_skip_cross_kv_projection = False
     return req
 
 
 def make_encoder_request(request_id, encoder_output_len, lora_task_id=None):
@@
     req.encoder_output_len = encoder_output_len
     req.lora_task_id = lora_task_id
     req.is_context_init_state = False
+    req.is_encoder_init_state = True
     req.is_generation_in_progress_state = False
     req.is_first_context_chunk = True
     return req

Also applies to: 92-102
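The Mock pitfall the comment describes is easy to reproduce: accessing an attribute that was never set on a plain `unittest.mock.Mock` auto-creates a child Mock, which is truthy, so an `if req.is_encoder_init_state:` branch fires for a request that was meant to be a context request.

```python
# Demonstrating why request mocks must set is_encoder_init_state explicitly.
from unittest.mock import Mock

req = Mock()
req.is_context_init_state = True  # encoder flag intentionally left unset

auto_truthy = bool(req.is_encoder_init_state)  # auto-created child Mock
print(auto_truthy)  # True - the branch would misfire

req.is_encoder_init_state = False  # explicit init keeps the branch honest
print(bool(req.is_encoder_init_state))  # False
```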

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/executor/test_kv_cache_v2_scheduler.py` around lines 60
- 89, The request mocks created by make_ctx_request are missing an explicit
is_encoder_init_state attribute so Mock returns a truthy child mock and causes
KVCacheV2Scheduler._sort_requests to misclassify context requests; update
make_ctx_request to set req.is_encoder_init_state = False (and likewise set
is_encoder_init_state explicitly in the other helper around lines 92-102) so
encoder-init branching in _sort_requests behaves deterministically.
tensorrt_llm/_torch/pyexecutor/_util.py (1)

1-25: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add the required NVIDIA file header.

This modified Python file still starts directly with imports, so it is missing the required NVIDIA copyright header/current modification year.

As per coding guidelines, "**/*.{cpp,cc,cxx,h,hpp,py}: Read and follow CODING_GUIDELINES.md for all code changes (C++ and Python) Include NVIDIA copyright header on all new files; update year on modified files" and "**/*.{cpp,h,hpp,cu,cuh,py}: All C++, Python, and other source files must contain NVIDIA copyright header with current modification year".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/pyexecutor/_util.py` around lines 1 - 25, This file is
missing the required NVIDIA copyright/header; add the official NVIDIA file
header (with the current modification year) to the very top of
tensorrt_llm/_torch/pyexecutor/_util.py before any imports so the header
precedes the existing module-level imports and symbols (e.g., torch,
tensorrt_llm, DecodingMode, MODEL_CLASS_VISION_ENCODER_MAPPING); if this is a
modification of an existing file, update the year in the header to the current
year.
tensorrt_llm/_torch/pyexecutor/scheduler/scheduler.py (1)

1-16: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add the required NVIDIA file header.

This modified Python file still starts directly with imports, so it is missing the required NVIDIA copyright header/current modification year.

As per coding guidelines, "**/*.{cpp,cc,cxx,h,hpp,py}: Read and follow CODING_GUIDELINES.md for all code changes (C++ and Python) Include NVIDIA copyright header on all new files; update year on modified files" and "**/*.{cpp,h,hpp,cu,cuh,py}: All C++, Python, and other source files must contain NVIDIA copyright header with current modification year".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/pyexecutor/scheduler/scheduler.py` around lines 1 - 16,
Add the required NVIDIA copyright/header block to the very top of this Python
module (above all imports) per CODING_GUIDELINES.md, updating the modification
year to the current year; ensure the header matches the project's standard
NVIDIA copyright template and includes the copyright owner and year range, then
commit the file with the header in place (this file contains symbols like the
module imports tensorrt_llm.bindings/internal, the import lines for LlmRequest
and LlmRequestState, and references to CapacitySchedulerPolicy and logger to
help locate the file).
tensorrt_llm/executor/result.py (1)

1-1: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add required NVIDIA source header to this modified Python file.

This file was modified but still lacks the required NVIDIA copyright/license header.

Suggested patch
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 import asyncio

As per coding guidelines, "All C++, Python, and other source files must contain NVIDIA copyright header with current modification year."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/executor/result.py` at line 1, This Python module
(tensorrt_llm.executor.result.py) is missing the required NVIDIA
copyright/license header; add the standard NVIDIA source header (with the
current modification year) at the very top of result.py before any code (e.g.,
before the existing "import asyncio") so the file complies with the project's
C++/Python source header policy.
cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (1)

2-3: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Update copyright year range on this modified file.

The file changed in this PR but the header still ends at 2025.

Suggested patch
- * SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

As per coding guidelines, "Include NVIDIA copyright header on all new files; update year on modified files."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp` around lines 2 - 3,
Update the copyright header's year range in the SPDX comments: change the
SPDX-FileCopyrightText line that currently ends "2025" to include the current
year (e.g., "2026") so the range reads "2022-2026" while leaving the
SPDX-License-Identifier line unchanged.
tests/unittest/_torch/executor/test_request_utils.py (1)

1-1: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add the required NVIDIA source header to this modified test file.

Even for unit-test Python sources, the repository rule requires the NVIDIA copyright/license header.

Suggested patch
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 """Tests for request_utils.py functions.

As per coding guidelines, "All C++, Python, and other source files must contain NVIDIA copyright header with current modification year."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/executor/test_request_utils.py` at line 1, Add the
required NVIDIA source header to the top of the modified test module
test_request_utils.py by inserting the standard NVIDIA copyright/license header
(with the current modification year) as the first thing in the file, before the
existing module docstring and imports so the file complies with the repository’s
source-header rule.
🟡 Minor comments (8)
legacy_enc_dec_architecture.md-123-123 (1)

123-123: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add language identifier to fenced code block.

The fenced code block should specify a language identifier for better rendering and to satisfy markdown linting rules.

📝 Proposed fix
-```
+```text
 out/<tpX>/encoder/rank0.engine, config.json
 out/<tpX>/decoder/rank0.engine, config.json
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @legacy_enc_dec_architecture.md at line 123, The fenced code block at the end
of the file lacks a language identifier; update the block delimiter to include a
language (e.g., change ``` to ```text) so it becomes a labeled fenced code block
for proper rendering and linting—target the fenced block containing the lines
"out/<tpX>/encoder/rank0.engine, config.json" and
"out/<tpX>/decoder/rank0.engine, config.json" and add the language identifier.


</details>

</blockquote></details>
<details>
<summary>tests/unittest/llmapi/test_llm_args.py-342-353 (1)</summary><blockquote>

`342-353`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_

**Add boundary acceptance tests for `cross_kv_cache_fraction`.**

These new tests cover invalid values, but not valid boundaries. Please also assert that `0.0`, `1.0`, and `None` are accepted.

<details>
<summary>🧪 Suggested test additions</summary>

```diff
+@pytest.mark.parametrize("value", [0.0, 1.0, None])
+def test_KvCacheConfig_accepts_cross_kv_cache_fraction_boundaries(value):
+    cfg = KvCacheConfig(cross_kv_cache_fraction=value)
+    assert cfg.cross_kv_cache_fraction == value
```
</details>

As per coding guidelines, "Coverage expectations: Assess whether new/changed tests cover happy path, important edge cases, and failure modes relevant to the feature or fix."

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/llmapi/test_llm_args.py` around lines 342 - 353, Add
positive-boundary tests for KvCacheConfig to ensure cross_kv_cache_fraction
accepts 0.0, 1.0, and None: create new tests (e.g.,
test_KvCacheConfig_accepts_zero_fraction,
test_KvCacheConfig_accepts_one_fraction,
test_KvCacheConfig_accepts_none_fraction) that instantiate
KvCacheConfig(cross_kv_cache_fraction=0.0),
KvCacheConfig(cross_kv_cache_fraction=1.0), and
KvCacheConfig(cross_kv_cache_fraction=None) respectively and assert the object
is created (no exception) and the attribute equals the provided value; place
these alongside the existing failing-value tests that reference KvCacheConfig so
behavior is fully covered.
```

</details>

</blockquote></details>
<details>
<summary>tensorrt_llm/inputs/data.py-38-45 (1)</summary><blockquote>

`38-45`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_

**The public prompt types still require `prompt` / `prompt_token_ids`.**

`prompt_inputs()` now accepts encoder-only dictionaries, but `TextPrompt` and `TokensPrompt` still model `prompt` / `prompt_token_ids` as required keys. That leaves the published type contract out of sync with the runtime API for encoder-decoder requests.

 


Also applies to: 78-85
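One standard way to express the relaxed contract is a TypedDict with `total=False` (or `NotRequired` per key on newer Pythons). The class below is a stand-in for illustration, not the actual definition in tensorrt_llm/inputs/data.py:

```python
# Sketch: making prompt_token_ids optional so encoder-only dicts type-check.
from typing import List, TypedDict


class TokensPrompt(TypedDict, total=False):
    # total=False marks every key optional; typing.NotRequired (3.11+) can
    # instead mark individual keys while keeping others required.
    prompt_token_ids: List[int]
    encoder_input_token_ids: List[int]


# An encoder-only request now satisfies the declared type.
enc_only: TokensPrompt = {"encoder_input_token_ids": [101, 2054, 102]}
print("prompt_token_ids" in enc_only)  # False - and that is now allowed
```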

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/inputs/data.py` around lines 38 - 45, TextPrompt and
TokensPrompt currently require prompt / prompt_token_ids but prompt_inputs()
accepts encoder-only dicts; update the prompt type definitions to make prompt
and prompt_token_ids optional (NotRequired/Optional) to match encoder-decoder
requests. Locate the TextPrompt and TokensPrompt type/class definitions and
change the required keys for "prompt" and "prompt_token_ids" to optional (align
with encoder_input_token_ids / encoder_inputs used elsewhere) so the public type
contract matches runtime behavior; ensure any related docstrings/comments
reflect the optionality.
```

</details>

</blockquote></details>
<details>
<summary>tensorrt_llm/_torch/pyexecutor/resource_manager.py-704-732 (1)</summary><blockquote>

`704-732`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_

**Add the NVIDIA header to this modified Python file.**

This file was changed in the PR but still starts without the required NVIDIA copyright header/current modification year.

 
As per coding guidelines, "All C++, Python, and other source files must contain NVIDIA copyright header with current modification year".

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py` around lines 704 - 732,
This Python source (tensorrt_llm/_torch/pyexecutor/resource_manager.py) is
missing the required NVIDIA copyright header; add the standard NVIDIA
copyright/header comment block (with the current modification year) at the very
top of the file before any imports or code, ensuring it follows the project's
header format used in other .py files; no functional code changes to symbols
like CacheTypeCpp, kv_cache_type, add_sequence_batch, refresh_blocks, or
class/method definitions are needed—just insert the header comment block exactly
as mandated by the coding guidelines.
```

</details>

</blockquote></details>
<details>
<summary>tensorrt_llm/_torch/modules/encoder_decoder_layer.py-29-36 (1)</summary><blockquote>

`29-36`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_

**Split abstract method declarations onto proper multi-line bodies to comply with PEP 8.**

Both `forward` methods have one-line bodies that violate flake8 E704 ("multiple statements on one line"). Move the `...` ellipsis to its own line:

<details>
<summary>Suggested fix</summary>

```diff
     @abstractmethod
     def forward(
         self,
         hidden_states: torch.Tensor,
         attn_metadata: AttentionMetadata,
         position_ids: Optional[torch.IntTensor] = None,
         **kwargs,
-    ) -> torch.Tensor: ...
+    ) -> torch.Tensor:
+        ...
@@
     @abstractmethod
     def forward(
         self,
         position_ids: torch.IntTensor,
         hidden_states: torch.Tensor,
         attn_metadata: AttentionMetadata,
         encoder_hidden_states: Optional[torch.Tensor] = None,
         cross_attn_metadata: Optional[AttentionMetadata] = None,
         skip_cross_kv_projection: bool = False,
         **kwargs,
-    ) -> torch.Tensor: ...
+    ) -> torch.Tensor:
+        ...
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/modules/encoder_decoder_layer.py` around lines 29 - 36,
The abstract forward method currently uses a one-line body with the ellipsis,
violating flake8 E704; update the forward declaration in the class (the def
forward(...) method) so the signature ends with a colon and the ellipsis is
placed on its own indented line (i.e., split the method into a proper multi-line
body) to satisfy PEP 8 and flake8 checks.
```

</details>

</blockquote></details>
<details>
<summary>encoder_decoder_porting_guide.md-9-10 (1)</summary><blockquote>

`9-10`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_

**Update the top-level contract to match the implementation.**

This still says enc-dec is “V2-only” and baseline beam width is `1`, but this PR adds V1 wiring/smoke coverage and explicitly calls out `beam_width > 1` generation support. Please align the guide’s opening contract so readers do not build against stale constraints.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@encoder_decoder_porting_guide.md` around lines 9 - 10, The top-level contract
incorrectly states "V2-only" and beam width = 1; update the opening paragraph to
reflect that encoder-decoder models (T5, BART, mBART) now include V1
wiring/smoke coverage and support generation with beam_width > 1. Change
references to require use_kv_cache_manager_v2=True and KVCacheManagerV2
constraints to note that V2 is required for full enc-dec features (explicit SELF
+ CROSS pools) but V1 has limited wiring/smoke tests in this PR, and ensure the
contract clearly documents supported beam_width behavior and runtime
requirements (PyTorch vs. other runtimes).
```

</details>

</blockquote></details>
<details>
<summary>tensorrt_llm/_torch/pyexecutor/model_engine.py-4400-4402 (1)</summary><blockquote>

`4400-4402`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_

**Consider converting assert to explicit RuntimeError for production safety.**

The assertion at lines 4400-4402 validates that the packed encoder length doesn't exceed `max_num_tokens`. While the assertion message is informative, assertions are stripped when Python runs with the `-O` flag in production. Consider an explicit check that raises `RuntimeError` for consistent error behavior:

<details>
<summary>🛡️ Suggested change</summary>

```diff
-        assert num_tokens <= self.max_num_tokens, (
-            f"encoder packed length ({num_tokens}) exceeds max_num_tokens "
-            f"({self.max_num_tokens})")
+        if num_tokens > self.max_num_tokens:
+            raise RuntimeError(
+                f"encoder packed length ({num_tokens}) exceeds max_num_tokens "
+                f"({self.max_num_tokens}). Reduce batch size or sequence lengths.")
```
</details>
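The operational difference can be sketched with a standalone helper (the function name and message wording here are illustrative, not taken from the PR):

```python
def check_packed_length(num_tokens: int, max_num_tokens: int) -> None:
    """Explicit guard that survives `python -O`, unlike a bare assert.

    Under `-O`, Python sets __debug__ to False and removes assert
    statements entirely, so an assert-based check would silently pass.
    """
    if num_tokens > max_num_tokens:
        raise RuntimeError(
            f"encoder packed length ({num_tokens}) exceeds max_num_tokens "
            f"({max_num_tokens})")

check_packed_length(8, 16)  # within budget, no error
try:
    check_packed_length(32, 16)
except RuntimeError as e:
    print("raised:", e)
```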

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/pyexecutor/model_engine.py` around lines 4400 - 4402, The
assert checking packed encoder length should be replaced with an explicit
runtime check to avoid being bypassed by Python -O; in the scope where
num_tokens and self.max_num_tokens are available (the code block containing the
current assert), change to an if statement that tests if num_tokens >
self.max_num_tokens and raise RuntimeError with the same formatted message
(include num_tokens and self.max_num_tokens) so the error is always raised in
production.
```

</details>

</blockquote></details>
<details>
<summary>tensorrt_llm/_torch/pyexecutor/scheduler/scheduler.py-449-450 (1)</summary><blockquote>

`449-450`: _⚠️ Potential issue_ | _🟡 Minor_ | _⚡ Quick win_

**`PyMicroBatchScheduler.schedule()` bypasses the new readiness gate.**

`_can_be_scheduled()` now rejects decoder-context requests until `py_encoder_output_ready_event` is set, but the scheduling loop re-inlines only the state-range check. A direct call to `schedule()` can still admit an unready `CONTEXT_INIT` request.
 


Also applies to: 488-494

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/pyexecutor/scheduler/scheduler.py` around lines 449 -
450, The scheduling loop in PyMicroBatchScheduler.schedule() bypasses the new
decoder-context readiness gate, so update schedule() to consult the same logic
as _can_be_scheduled(): before admitting a request (particularly CONTEXT_INIT /
decoder-context requests) call is_decoder_context_request_ready(req) or call
_can_be_scheduled(req) so that py_encoder_output_ready_event is respected; apply
the same fix to the second occurrence referenced (the block around the other
admission at lines ~488-494) so no CONTEXT_INIT request is scheduled until
py_encoder_output_ready_event allows it.
```

</details>
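A minimal sketch of the shared-gate fix, assuming simplified stand-ins for the PR's scheduler types (`Request`, the state strings, and `schedule` below are all hypothetical; only `py_encoder_output_ready_event` mirrors the PR):

```python
import threading
from dataclasses import dataclass, field

@dataclass
class Request:
    state: str  # e.g. "CONTEXT_INIT", "GENERATION_IN_PROGRESS"
    py_encoder_output_ready_event: threading.Event = field(
        default_factory=threading.Event)

def can_be_scheduled(req: Request) -> bool:
    # Decoder-context requests must wait for their encoder output.
    if req.state == "CONTEXT_INIT":
        return req.py_encoder_output_ready_event.is_set()
    return True

def schedule(requests):
    # The fix: reuse the readiness predicate instead of re-inlining
    # only the state-range check in the scheduling loop.
    return [r for r in requests if can_be_scheduled(r)]
```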

</blockquote></details>

</blockquote></details>

<details>
<summary>🧹 Nitpick comments (12)</summary><blockquote>

<details>
<summary>tests/unittest/llmapi/test_llm_args.py (1)</summary><blockquote>

`342-353`: **QA list update is unnecessary for this test change.**

This PR delta is unit-test scope in `tests/unittest/...`; no `tests/integration/defs/` entry changes are needed for QA scheduled lists.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/llmapi/test_llm_args.py` around lines 342 - 353, The change is
only a unit-test update under tests/unittest and the QA scheduled lists should
not be modified; revert or remove any edits in tests/integration/defs/ (or any
QA list files) introduced by this PR, leaving the two new/updated tests
(test_KvCacheConfig_rejects_cross_kv_cache_fraction_below_zero and
test_KvCacheConfig_rejects_cross_kv_cache_fraction_above_one) and the
KvCacheConfig-related assertions as the sole change set.
```

</details>

</blockquote></details>
<details>
<summary>tensorrt_llm/llmapi/llm_args.py (1)</summary><blockquote>

`2607-2617`: _⚡ Quick win_

**Prefer declarative field constraints for `cross_kv_cache_fraction`.**

The custom validator at lines 2607–2616 duplicates constraint logic that belongs in the field definition. Pydantic v2 `Field(ge=0, le=1)` on `Optional[float]` enforces range validation on provided values while allowing `None`, matching the current validator behavior. This aligns with the codebase pattern (e.g., `free_gpu_memory_fraction` uses the same approach) and the coding guideline: "In Python Pydantic models, prefer `PositiveInt`, `NonNegativeInt`, `NonNegativeFloat`, `PositiveFloat`, or `Field(gt=0)` for numeric constraints instead of custom validators."

<details>
<summary>♻️ Proposed refactor</summary>

```diff
     cross_kv_cache_fraction: Optional[float] = Field(
         default=None,
+        ge=0,
+        le=1,
         description=
         "The fraction of the KV Cache memory should be reserved for cross attention. If set to p, self attention will use 1-p of KV Cache memory and cross attention will use p of KV Cache memory. Default is 50%. Should only be set when using encoder-decoder model."
     )

@@
-    @field_validator('cross_kv_cache_fraction')
-    @classmethod
-    def validate_cross_kv_cache_fraction(cls, v: Optional[float]):
-        if v is None:
-            return v
-        if not 0 <= v <= 1:
-            raise ValueError(
-                "kv_cache_config.cross_kv_cache_fraction must be a float between 0 and 1"
-            )
-        return v
```
</details>
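As a runnable illustration of the declarative pattern (the model name below is hypothetical and the snippet requires Pydantic v2):

```python
from typing import Optional
from pydantic import BaseModel, Field, ValidationError

class KvCacheConfigSketch(BaseModel):
    # Field(ge=0, le=1) on Optional[float] keeps None valid while
    # rejecting out-of-range values, matching the removed validator.
    cross_kv_cache_fraction: Optional[float] = Field(default=None, ge=0, le=1)

KvCacheConfigSketch()                                 # None is allowed
KvCacheConfigSketch(cross_kv_cache_fraction=0.5)      # in range
try:
    KvCacheConfigSketch(cross_kv_cache_fraction=1.5)  # rejected by le=1
except ValidationError as e:
    print("rejected:", e.errors()[0]["type"])
```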

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/llmapi/llm_args.py` around lines 2607 - 2617, The custom
validator validate_cross_kv_cache_fraction duplicates range checks; remove that
validator and declare the field cross_kv_cache_fraction as Optional[float] with
Pydantic constraints (e.g., Field(ge=0, le=1)) in its definition so None is
allowed but any provided value must be between 0 and 1, matching the pattern
used for free_gpu_memory_fraction.
```

</details>

</blockquote></details>
<details>
<summary>tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)</summary><blockquote>

`77-80`: _⚡ Quick win_

**Declare `CROSS_KV_CACHE_MANAGER` as a normal enum member.**

Using `locals()` here is just reflection for a plain enum constant. It makes the symbol harder to grep and reason about, and the direct declaration is clearer.

  
<details>
<summary>Suggested fix</summary>

```diff
 class ResourceManagerType(enum.Enum):
     KV_CACHE_MANAGER = "KV_CACHE_MANAGER"
     DRAFT_KV_CACHE_MANAGER = "DRAFT_KV_CACHE_MANAGER"
-    locals()["CR" + "OSS_KV_CACHE_MANAGER"] = "CR" + "OSS_KV_CACHE_MANAGER"
+    CROSS_KV_CACHE_MANAGER = "CROSS_KV_CACHE_MANAGER"
     PEFT_CACHE_MANAGER = "PEFT_CACHE_MANAGER"
     SEQ_SLOT_MANAGER = "SEQ_SLOT_MANAGER"
     SPEC_RESOURCE_MANAGER = "SPEC_RESOURCE_MANAGER"
```
</details>
As per coding guidelines, "Python code should avoid reflection when functionality can be easily achieved without it".
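A quick sanity check that the plain declaration behaves like any other enum member (trimmed-down class for illustration):

```python
import enum

class ResourceManagerType(enum.Enum):
    KV_CACHE_MANAGER = "KV_CACHE_MANAGER"
    # Direct declaration: greppable, no locals() reflection needed.
    CROSS_KV_CACHE_MANAGER = "CROSS_KV_CACHE_MANAGER"

# Lookup by value and attribute access both resolve to the same member.
assert (ResourceManagerType("CROSS_KV_CACHE_MANAGER")
        is ResourceManagerType.CROSS_KV_CACHE_MANAGER)
```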

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py` around lines 77 - 80, The
enum ResourceManagerType currently injects a member via reflection using
locals()["CR" + "OSS_KV_CACHE_MANAGER"] = "CR" + "OSS_KV_CACHE_MANAGER"; replace
that reflective line with a direct enum member declaration like
CROSS_KV_CACHE_MANAGER = "CROSS_KV_CACHE_MANAGER" inside the ResourceManagerType
class so the member is explicit and discoverable (update any references to
ResourceManagerType.CROSS_KV_CACHE_MANAGER if needed).
```

</details>

</blockquote></details>
<details>
<summary>tests/unittest/_torch/executor/test_encoder_step.py (1)</summary><blockquote>

`406-427`: _⚡ Quick win_

**Add a beam-expanded case here.**

`beam_width > 1` is a headline behavior of this PR, but this helper is only tested with one row per request. A regression where encoder lengths or cached-token counts are not replicated correctly per beam would still pass; a small parametrized case with duplicated beams and mixed cached/new cross-KV would close that gap.



As per coding guidelines, "Coverage expectations: Assess whether new/changed tests cover happy path, important edge cases, and failure modes relevant to the feature or fix ... Note missing negative tests, missing parametrization where multiple backends or dtypes apply."

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/executor/test_encoder_step.py` around lines 406 - 427,
The test test_builds_metadata_for_mixed_projection_and_cached_sequences lacks a
beam-expanded case; add a subcase where you call
engine._prepare_encoder_decoder_cross_attention_inputs with beam_width > 1
(e.g., 2) and ensure inputs and metadata are validated for replicated beams:
pass encoder_hidden_states duplicated per beam (or a batch with beam dimension),
provide encoder_seq_lens and encoder_num_cached_tokens_per_seq expanded/repeated
per beam, and assert that inputs["encoder_hidden_states"] points to the original
tensor or its proper beam-expanded view, inputs["skip_cross_kv_projection"]
remains False, inputs["cross_attn_metadata"] is metadata.cross_metadata with
metadata.cross_metadata.prepared True, metadata.cross_kv_cache_manager equals
the _FakeResourceManager's cross_manager, and that metadata.encoder_seq_lens and
metadata.encoder_num_cached_tokens_per_seq have correct replicated values for
each beam; reference
test_builds_metadata_for_mixed_projection_and_cached_sequences,
_prepare_encoder_decoder_cross_attention_inputs, _FakeAttentionMetadata, and
_FakeResourceManager when making these additions.
```

</details>

</blockquote></details>
<details>
<summary>tests/integration/defs/llmapi/test_llm_api_pytorch_t5.py (1)</summary><blockquote>

`374-606`: **Consider wiring this into the QA functional list.**

This is core single-node functional coverage for a new encoder-decoder flow, so it likely deserves an entry in `tests/integration/test_lists/qa/llm_function_core.txt` if you want it exercised in scheduled QA. From this file alone, a perf-list update does not look necessary.



As per coding guidelines, "If the change adds or materially alters an integration test under tests/integration/defs/ ... call out whether an entry is needed under tests/integration/test_lists/qa/."

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/defs/llmapi/test_llm_api_pytorch_t5.py` around lines 374 -
606, The new encoder-decoder integration tests
(test_t5_pytorch_generate_encoder_decoder_end_to_end and
test_t5_pytorch_generate_encoder_decoder_cuda_graph_mixed_encoder_lengths_batch)
should be added to the QA functional test list so they run under scheduled QA;
update the QA list (llm_function_core.txt) to include a short entry referencing
these tests (model encoder-decoder flow / T5 PyTorch backend) so the change
under tests/integration/defs/... is explicitly exercised by the QA pipeline.
```

</details>

</blockquote></details>
<details>
<summary>tensorrt_llm/llmapi/llm.py (1)</summary><blockquote>

`628-636`: _💤 Low value_

**Minor indentation style issue flagged by linter.**

Static analysis flagged E121 (continuation line under-indented) at line 636. The closing parenthesis should align with the opening of the function call or maintain consistent indentation with the parameters.



<details>
<summary>🔧 Suggested fix for indentation</summary>

```diff
         (prompt_token_ids, prompt, query_token_ids, multimodal_params,
          encoder_input_token_ids) = self._preprocess(
-             inputs,
-             sampling_params,
-             disaggregated_params,
-             encoder_inputs=encoder_inputs,
-             encoder_input_token_ids=encoder_input_token_ids,
-             decoder_input_token_ids=decoder_input_token_ids,
-         )
+            inputs,
+            sampling_params,
+            disaggregated_params,
+            encoder_inputs=encoder_inputs,
+            encoder_input_token_ids=encoder_input_token_ids,
+            decoder_input_token_ids=decoder_input_token_ids,
+        )
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/llmapi/llm.py` around lines 628 - 636, The continuation call to
self._preprocess is under-indented causing E121; fix by aligning the closing
parenthesis with the opening parenthesis of the call or by indenting it to match
the parameter block so the call looks consistently formatted; locate the call to
_preprocess (the line starting with (prompt_token_ids, prompt, query_token_ids,
multimodal_params, encoder_input_token_ids) = self._preprocess(...)) and adjust
the indentation of the closing parenthesis and any continuation lines so they
align with the opening self._preprocess( or match the indentation level used for
the other multi-line calls in the file.
```

</details>

</blockquote></details>
<details>
<summary>tests/unittest/_torch/modeling/test_modeling_enc_dec.py (1)</summary><blockquote>

`684-688`: _💤 Low value_

**Consider extracting `_make_cross_attn` as a module-level helper.**

The smoke benchmark instantiates `TestCrossAttentionTrtllmBackend()` just to call its `_make_cross_attn` method. This is functional but couples the smoke test to the test class structure.

Consider moving `_make_cross_attn` to a module-level helper function (similar to `_build_trtllm_cross_metadata`) for cleaner reuse across test classes.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/modeling/test_modeling_enc_dec.py` around lines 684 -
688, Extract the TestCrossAttentionTrtllmBackend._make_cross_attn method into a
module-level helper function (e.g., make_cross_attn) and update callers to use
that helper instead of instantiating TestCrossAttentionTrtllmBackend;
specifically, move the implementation out of the class into a top-level function
alongside helpers like _build_trtllm_cross_metadata, keep the same parameters
(hidden_size, num_heads, head_dim, dtype, device), preserve behavior and
returned tensor/device handling, and replace usages such as
TestCrossAttentionTrtllmBackend()._make_cross_attn(...) with
make_cross_attn(...).
```

</details>

</blockquote></details>
<details>
<summary>tensorrt_llm/_torch/pyexecutor/model_engine.py (2)</summary><blockquote>

`1872-1905`: _💤 Low value_

**Consider extracting the nested `update_cross_metadata` function.**

The nested `update_cross_metadata` function directly modifies internal attributes like `_seq_lens` and `_seq_lens_cuda`. While this works, it couples tightly to `AttentionMetadata` internals.

Additionally, the function captures variables from the enclosing scope (`attn_metadata`, `encoder_seq_lens_tensor`, etc.) which makes the data flow less explicit. Consider whether this could be a method on `AttentionMetadata` itself or at minimum accept all dependencies as parameters for clarity.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/pyexecutor/model_engine.py` around lines 1872 - 1905, The
nested function update_cross_metadata tightly couples to AttentionMetadata
internals and captures external variables (attn_metadata,
encoder_seq_lens_tensor, encoder_num_cached_tokens_per_seq,
cross_kv_cache_manager), making data flow implicit; refactor by extracting
update_cross_metadata into either an instance/class method on AttentionMetadata
(e.g., AttentionMetadata.update_for_cross(attn_metadata,
encoder_seq_lens_tensor, encoder_num_cached_tokens_per_seq,
cross_kv_cache_manager)) or a standalone function that takes all dependencies as
explicit parameters, and move creation of KVCacheParams (currently constructed
inside update_cross_metadata) into that method so it only mutates public
attributes (avoid direct writes to _seq_lens/_seq_lens_cuda) and document
required inputs.
```

</details>

---

`4468-4477`: _💤 Low value_

**Consider using `_get_top_level_model()` for consistency with the embedding lookup pattern.**

The encoder discovery at lines 4468-4472 retrieves the encoder directly from `self.model` and `self.model.model`, but the embedding lookup at line 4484 uses `_get_top_level_model()` to unwrap compiled models. For consistency and to guard against potential edge cases with `torch.compile` wrapped submodules, refactor the encoder discovery to use the same `_get_top_level_model()` pattern.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/pyexecutor/model_engine.py` around lines 4468 - 4477, The
encoder lookup currently reads encoder from self.model and self.model.model
directly which can miss wrapped/compiled submodules; change the discovery to
first call _get_top_level_model(self.model) (the same helper used by the
embedding lookup) and then get the "encoder" attribute (and fallback to
inner.model.encoder if necessary) so encoder resolution mirrors the embedding
pattern—update the block that references encoder and inner to use
_get_top_level_model(...) and retain the existing AttributeError message if no
encoder is found.
```

</details>

</blockquote></details>
<details>
<summary>tests/unittest/_torch/executor/test_py_scheduler.py (1)</summary><blockquote>

`2506-2516`: _⚡ Quick win_

**Add the constrained cross-pool case for `MAX_UTILIZATION`.**

This test uses a roomy cross pool, so it still passes if `ENCODER_INIT` is accidentally charged against cross-KV blocks in the max-utilization path. Using something like `MockKVCacheManager(num_free_blocks=1, blocks_per_request=2)` here would mirror the guaranteed-no-evict regression case and actually prove encoder admission is cross-pool-neutral under both policies.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/executor/test_py_scheduler.py` around lines 2506 -
2516, The test test_max_utilization_admits_encoder_with_cross_pool currently
uses a roomy cross pool; change the cross_kv MockKVCacheManager instantiation to
mimic the constrained cross-pool case (e.g.,
MockKVCacheManager(num_free_blocks=1, blocks_per_request=2)) so the test
verifies MAX_UTILIZATION does not incorrectly charge ENCODER_INIT to cross-KV
blocks; keep the same requests (make_encoder_request and
make_generation_request), call scheduler.schedule_request, and retain the
existing assertions on fitting and paused to ensure encoder admission is
cross-pool-neutral under this policy.
```

</details>

</blockquote></details>
<details>
<summary>tests/unittest/llmapi/test_encoder_decoder_request_api.py (1)</summary><blockquote>

`77-82`: _⚡ Quick win_

**Add one behavioral `BaseLLM.generate` test, not just a signature check.**

The async path is covered end-to-end, but the sync wrapper only gets a parameter-order assertion. If `generate()` drops `encoder_input_token_ids` or handles the conflict checks differently from `generate_async()`, this file stays green.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/llmapi/test_encoder_decoder_request_api.py` around lines 77 -
82, The test currently only asserts parameter order for BaseLLM.generate; add a
behavioral unit test that calls BaseLLM.generate (or a small test
subclass/mocked LLM implementing generate_async) to verify that when both
priority and encoder_input_token_ids/encoder_inputs kwargs are provided,
generate forwards encoder inputs and enforces the same conflict/priority
behavior as generate_async — specifically instantiate a fake LLM where
generate_async records received kwargs, call the synchronous BaseLLM.generate
wrapper with both priority and encoder_input_token_ids/encoder_inputs, and
assert the recorded args include the encoder inputs (and that priority
wins/behavior matches generate_async) to exercise the actual code paths rather
than just signatures.
```

</details>

</blockquote></details>
<details>
<summary>tests/unittest/_torch/executor/test_request_utils.py (1)</summary><blockquote>

`64-109`: **QA list update is not needed for this test change.**

This is unit-test scope only (`tests/unittest/**`), so integration/release QA list edits are unnecessary in this PR.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/executor/test_request_utils.py` around lines 64 - 109,
This PR only changes unit tests (e.g.,
test_executor_request_to_llm_request_preserves_encoder_tokens and
test_scheduled_requests_does_not_query_encoder_context_chunk_state using
ScheduledRequests/EncoderInitRequest), so remove any unrelated QA list /
integration/release QA edits from the commit and keep only the test file
modifications; ensure the diff contains solely the test changes and no QA list
or release checklist updates before finalizing the PR.
```

</details>

</blockquote></details>

</blockquote></details>

---

<details>
<summary>ℹ️ Review info</summary>

<details>
<summary>⚙️ Run configuration</summary>

**Configuration used**: Path: .coderabbit.yaml

**Review profile**: CHILL

**Plan**: Enterprise

**Run ID**: `be032c9d-11c9-4f74-b7b7-0350127c8b65`

</details>

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between f8572abec4bf623dc142d7df8babc67ab1616996 and e4752c379acc0e88142ba6d285b588b3e869be7b.

</details>

<details>
<summary>📒 Files selected for processing (48)</summary>

* `cpp/include/tensorrt_llm/batch_manager/capacityScheduler.h`
* `cpp/tensorrt_llm/batch_manager/capacityScheduler.cpp`
* `cpp/tensorrt_llm/common/attentionOp.cpp`
* `cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp`
* `cpp/tensorrt_llm/nanobind/thop/bindings.cpp`
* `cpp/tensorrt_llm/thop/attentionOp.cpp`
* `cpp/tensorrt_llm/thop/attentionOp.h`
* `cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp`
* `encoder_decoder_porting_guide.md`
* `legacy_enc_dec_architecture.md`
* `tensorrt_llm/_torch/attention_backend/interface.py`
* `tensorrt_llm/_torch/attention_backend/trtllm.py`
* `tensorrt_llm/_torch/attention_backend/trtllm_gen.py`
* `tensorrt_llm/_torch/attention_backend/vanilla.py`
* `tensorrt_llm/_torch/model_config.py`
* `tensorrt_llm/_torch/models/__init__.py`
* `tensorrt_llm/_torch/models/modeling_bart.py`
* `tensorrt_llm/_torch/models/modeling_t5.py`
* `tensorrt_llm/_torch/modules/cross_attention.py`
* `tensorrt_llm/_torch/modules/encoder_decoder_layer.py`
* `tensorrt_llm/_torch/modules/rms_norm.py`
* `tensorrt_llm/_torch/pyexecutor/_util.py`
* `tensorrt_llm/_torch/pyexecutor/llm_request.py`
* `tensorrt_llm/_torch/pyexecutor/model_engine.py`
* `tensorrt_llm/_torch/pyexecutor/model_loader.py`
* `tensorrt_llm/_torch/pyexecutor/py_executor.py`
* `tensorrt_llm/_torch/pyexecutor/resource_manager.py`
* `tensorrt_llm/_torch/pyexecutor/scheduler/scheduler.py`
* `tensorrt_llm/_torch/pyexecutor/scheduler/scheduler_v2.py`
* `tensorrt_llm/executor/base_worker.py`
* `tensorrt_llm/executor/executor.py`
* `tensorrt_llm/executor/request.py`
* `tensorrt_llm/executor/result.py`
* `tensorrt_llm/inputs/data.py`
* `tensorrt_llm/llmapi/llm.py`
* `tensorrt_llm/llmapi/llm_args.py`
* `tests/integration/defs/llmapi/test_llm_api_pytorch_t5.py`
* `tests/unittest/_torch/executor/test_dual_pool_kv_cache.py`
* `tests/unittest/_torch/executor/test_encoder_step.py`
* `tests/unittest/_torch/executor/test_kv_cache_v2_scheduler.py`
* `tests/unittest/_torch/executor/test_py_scheduler.py`
* `tests/unittest/_torch/executor/test_request_utils.py`
* `tests/unittest/_torch/modeling/test_modeling_enc_dec.py`
* `tests/unittest/_torch/test_model_config.py`
* `tests/unittest/api_stability/references/llm.yaml`
* `tests/unittest/api_stability/references/request_output.yaml`
* `tests/unittest/llmapi/test_encoder_decoder_request_api.py`
* `tests/unittest/llmapi/test_llm_args.py`

</details>

</details>

<!-- This is an auto-generated comment by CodeRabbit for review status -->

Comment on lines +457 to +475
```python
        if encoder_num_cached_tokens_per_seq is not None:
            from ..metadata import KVCacheParams
            base_params = self.kv_cache_params
            cross_md.kv_cache_params = KVCacheParams(
                use_cache=base_params.use_cache if base_params is not None else
                (cross_kv_cache_manager is not None),
                num_cached_tokens_per_seq=list(
                    encoder_num_cached_tokens_per_seq),
                block_ids_per_seq=base_params.block_ids_per_seq
                if base_params is not None else None,
                host_max_attention_window_sizes=base_params.
                host_max_attention_window_sizes
                if base_params is not None else None,
                host_sink_token_length=base_params.host_sink_token_length
                if base_params is not None else None,
                num_extra_kv_tokens=base_params.num_extra_kv_tokens
                if base_params is not None else 0,
            )
        cross_md.__post_init__()
```

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

**`None` does not currently default cross cached-token counts to zero.**

`create_cross_metadata()` documents that omitting `encoder_num_cached_tokens_per_seq` means "nothing cached yet", but in that branch `cross_md` keeps the shallow-copied `self.kv_cache_params`. On generation steps this can reuse the decoder cached-token counts for the cross pool and corrupt cross-attention cache accounting.

<details>
<summary>Suggested fix</summary>

```diff
-        if encoder_num_cached_tokens_per_seq is not None:
-            from ..metadata import KVCacheParams
-            base_params = self.kv_cache_params
-            cross_md.kv_cache_params = KVCacheParams(
-                use_cache=base_params.use_cache if base_params is not None else
-                (cross_kv_cache_manager is not None),
-                num_cached_tokens_per_seq=list(
-                    encoder_num_cached_tokens_per_seq),
-                block_ids_per_seq=base_params.block_ids_per_seq
-                if base_params is not None else None,
-                host_max_attention_window_sizes=base_params.
-                host_max_attention_window_sizes
-                if base_params is not None else None,
-                host_sink_token_length=base_params.host_sink_token_length
-                if base_params is not None else None,
-                num_extra_kv_tokens=base_params.num_extra_kv_tokens
-                if base_params is not None else 0,
-            )
+        from ..metadata import KVCacheParams
+        base_params = self.kv_cache_params
+        cached_tokens = (
+            list(encoder_num_cached_tokens_per_seq)
+            if encoder_num_cached_tokens_per_seq is not None
+            else [0] * int(encoder_seq_lens.numel())
+        )
+        cross_md.kv_cache_params = KVCacheParams(
+            use_cache=base_params.use_cache if base_params is not None else
+            (cross_kv_cache_manager is not None),
+            num_cached_tokens_per_seq=cached_tokens,
+            block_ids_per_seq=base_params.block_ids_per_seq
+            if base_params is not None else None,
+            host_max_attention_window_sizes=base_params.
+            host_max_attention_window_sizes
+            if base_params is not None else None,
+            host_sink_token_length=base_params.host_sink_token_length
+            if base_params is not None else None,
+            num_extra_kv_tokens=base_params.num_extra_kv_tokens
+            if base_params is not None else 0,
+        )
```
</details>
<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/attention_backend/interface.py` around lines 457 - 475,
The branch that handles encoder_num_cached_tokens_per_seq leaves cross_md using
the shallow-copied self.kv_cache_params when encoder_num_cached_tokens_per_seq
is None, so cross cached-token counts should instead be explicitly initialized
to "nothing cached yet" (zeros) to avoid reusing decoder counts; modify
create_cross_metadata so that when encoder_num_cached_tokens_per_seq is None you
construct a new KVCacheParams (like you do in the not-None branch) based on
base_params/self.kv_cache_params but override num_cached_tokens_per_seq with a
list of zeros of the appropriate length (e.g.,
len(base_params.num_cached_tokens_per_seq) if base_params is present, otherwise
an empty list or zeros based on expected sequence count), preserving other
fields (use_cache, block_ids_per_seq, host_max_attention_window_sizes,
host_sink_token_length, num_extra_kv_tokens) and then call
cross_md.__post_init__.
```

</details>

Comment on lines +164 to 167
```python
        # Cross-attention generation steps legitimately need
        # update_kv_cache=False (encoder K/V are already cached).
        if not update_kv_cache and not cross_attention:
            return False, "KV cache update cannot be disabled now."
```

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

**Keep the `update_kv_cache=False` exemption generation-only.**

This relaxation is correct for cross-attention decode, but `is_supported()` still defaults to `phase="both"`. As written, `cross_attention=True` + `update_kv_cache=False` is also accepted for context/both-phase checks, even though cross-attention context is where we populate the cross cache from `cross_kv_input`.

<details>
<summary>Suggested fix</summary>

```diff
-        # Cross-attention generation steps legitimately need
-        # update_kv_cache=False (encoder K/V are already cached).
-        if not update_kv_cache and not cross_attention:
+        # Cross-attention decode legitimately needs update_kv_cache=False
+        # because encoder K/V are already cached. Context/prefill still has
+        # to materialize those K/V into the cross cache.
+        allow_disabled_kv_update = cross_attention and phase == "generation"
+        if not update_kv_cache and not allow_disabled_kv_update:
             return False, "KV cache update cannot be disabled now."
```
</details>
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/attention_backend/trtllm_gen.py` around lines 164 - 167,
The current check in is_supported() allows update_kv_cache=False whenever
cross_attention is True, which incorrectly permits that exemption for
non-generation phases (phase defaults to "both"); change the condition so
update_kv_cache can only be disabled when cross_attention is True AND phase ==
"generation". Concretely, update the if guarding update_kv_cache/cross_attention
to something like: if not update_kv_cache and not (cross_attention and phase ==
"generation"): return False, "...", so the exemption is generation-only; reference
symbols: is_supported(), update_kv_cache, cross_attention, phase.
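The tightened condition can be exercised on its own. `kv_update_supported` below is a hypothetical stand-alone rendering of the guard, not the real `is_supported()` method; it assumes the same `phase` values ("context", "generation", "both") used in the discussion above.

```python
def kv_update_supported(update_kv_cache: bool,
                        cross_attention: bool,
                        phase: str = "both") -> tuple:
    """Disabling KV-cache updates is only legitimate for cross-attention
    generation, where encoder K/V are already cached. Context (and the
    default phase="both") must still write encoder K/V into the cross
    cache, so the exemption does not apply there."""
    allow_disabled_kv_update = cross_attention and phase == "generation"
    if not update_kv_cache and not allow_disabled_kv_update:
        return False, "KV cache update cannot be disabled now."
    return True, ""
```

Note that with the original condition, `cross_attention=True` alone would have passed even for `phase="both"`; the extra `phase` check is what makes the exemption generation-only.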

Comment on lines +1351 to +1352
encoder_seq_lens = (metadata.kv_lens_cuda_runtime
if metadata.is_cross else None)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t use total cached KV lengths for context-side cross_kv_input.

In mixed reuse batches, metadata.kv_lens_cuda_runtime includes cached cross-KV, but cross_kv_input only contains the freshly projected encoder rows. For a batch like encoder_seq_lens=[2,0,0] and cached lengths [0,5,7], this passes lengths [2,5,7] alongside a 2-row cross_kv_input, so the C++ path overestimates the append geometry. Use the raw projection lengths when update_kv_cache is true, and reserve kv_lens_* for cache-read-only steps.

Also applies to: 1474-1505

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/attention_backend/trtllm.py` around lines 1351 - 1352,
The code currently sets encoder_seq_lens = metadata.kv_lens_cuda_runtime when
metadata.is_cross, but in mixed reuse batches this includes cached cross-KV rows
and overestimates lengths for the freshly-projected cross_kv_input; change the
logic in trtllm.py where encoder_seq_lens is computed so that when
metadata.is_cross and metadata.update_kv_cache is true (i.e., we are appending
new projected encoder rows) you use the raw projection lengths (the per-example
projection lengths source used to build cross_kv_input) instead of
metadata.kv_lens_cuda_runtime, and only use metadata.kv_lens_cuda_runtime /
kv_lens_* when doing cache-read-only steps (update_kv_cache is false); update
the analogous block(s) referenced around encoder_seq_lens and the later repeated
section (the area corresponding to the second occurrence ~1474-1505) to apply
the same conditional.
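The length-selection rule can be isolated as a tiny pure function. Both helpers below are hypothetical illustrations (they are not names from the TRT-LLM source): one picks the length source based on whether the step appends, the other checks that the chosen lengths match the number of freshly projected rows in cross_kv_input.

```python
def pick_encoder_seq_lens(update_kv_cache: bool,
                          projection_lens: list,
                          total_kv_lens: list) -> list:
    """Append steps must describe geometry with the raw projection
    lengths; totals that include cached cross-KV are only valid for
    cache-read-only steps."""
    return list(projection_lens) if update_kv_cache else list(total_kv_lens)


def append_geometry_ok(lens: list, cross_kv_rows: int) -> bool:
    # cross_kv_input only holds freshly projected rows, so the chosen
    # lengths must sum to exactly that row count.
    return sum(lens) == cross_kv_rows
```

Using the mixed-reuse batch from the comment (one fresh 2-token encoder input, two requests with 5 and 7 cached cross-KV tokens), the cached totals `[2, 5, 7]` sum to 14 against a 2-row cross_kv_input, while the projection lengths `[2, 0, 0]` match it exactly.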

@@ -1,6 +1,8 @@
import transformers

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add the required NVIDIA source header.

This modified Python source file is missing the required copyright/license header.

Suggested patch
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 import transformers

As per coding guidelines, "All C++, Python, and other source files must contain NVIDIA copyright header with current modification year."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/models/__init__.py` at line 1, This file is missing the
required NVIDIA source/license header; add the standard NVIDIA copyright/license
header (with the current modification year) at the very top of the Python module
before the existing import transformers statement so the header appears above
all code in __init__.py.

Comment on lines +351 to +353
if position_ids is not None:
hidden_states = hidden_states + self.embed_positions(position_ids)
hidden_states = self.layernorm_embedding(hidden_states)

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

🌐 Web query:

Hugging Face BART positional embeddings offset HF embeddings reserved rows

💡 Result:

In Hugging Face's Transformers library implementation of BART, the BartLearnedPositionalEmbedding class uses a hardcoded self.offset = 2[1][2][3]. This offsets the position indices before looking up embeddings in the nn.Embedding layer, which is sized num_embeddings + 2 (e.g., 1024 + 2 = 1026 for max_position_embeddings=1024)[2][4]. The offset reserves the first two embedding rows (indices 0 and 1) and is a BART-specific hack to align with pretraining conventions[1][2]. BART's pad_token_id is 1[5], and historically, avoiding padding_idx in nn.Embedding prevented zeroing out learnable positions (e.g., position 1 would be untrainable if set as padding_idx)[6]. Instead of using padding_idx, positions start from 2: a sequence position 0 maps to embedding index 2, position 1 to 3, etc.[1][6]. This ensures compatibility with pretrained BART checkpoints (e.g., facebook/bart-base, facebook/bart-large) without losing positional information at low indices[6]. The comment explicitly states: "Bart is set up so that if padding_idx is specified then offset the embedding ids by 2 and adjust num_embeddings appropriately. Other models don't have this hack"[1][2]. No evidence of changes post-2023; the main branch still uses offset=2 as of 2026[2].



Apply +2 positional-embedding offset before lookup in encoder and decoder.

HF BART/mBART reserve rows 0 and 1 in embed_positions; actual token position 0 lives at index 2. The code correctly stores position_id_offset = 2 (line 435) with an explanatory comment but never applies this offset when accessing the embeddings. Both encoder (lines 351–353) and decoder (lines 396–398) index with raw position_ids, so loading pretrained checkpoints will read misaligned rows and produce incorrect positional embeddings.

Minimal fix
        if position_ids is not None:
-           hidden_states = hidden_states + self.embed_positions(position_ids)
+           hidden_states = hidden_states + self.embed_positions(position_ids + 2)

Apply the same fix at both lines 352 and 397.

Also applies to: encoder (line 352), decoder (line 397)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/models/modeling_bart.py` around lines 351 - 353, The
positional-embedding lookup uses raw position_ids but HF BART reserves rows 0–1,
so apply the stored offset before indexing: when adding embeddings in the
encoder and decoder (where you do hidden_states = hidden_states +
self.embed_positions(position_ids)), replace position_ids with position_ids +
position_id_offset (use the existing position_id_offset = 2 variable) so
embed_positions is indexed with position_ids + position_id_offset; update both
occurrences (the encoder use around the embed_positions call and the decoder use
around its embed_positions call) and keep the subsequent layernorm_embedding
calls unchanged.
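Why the offset matters can be shown with a toy lookup table (pure Python, no torch; the table contents are illustrative, not real weights). HF BART sizes its positional table `max_positions + 2` and hardcodes `offset = 2`, so token position 0 lives at row 2; raw indexing reads the two reserved rows instead.

```python
OFFSET = 2            # BartLearnedPositionalEmbedding's hardcoded self.offset
MAX_POSITIONS = 4     # toy value; real BART uses 1024 (table size 1026)

# Row i carries a recognizable marker; rows 0-1 are the reserved rows.
table = [f"reserved-{i}" if i < OFFSET else f"pos-{i - OFFSET}"
         for i in range(MAX_POSITIONS + OFFSET)]


def embed_positions(position_ids):
    # Correct lookup: shift by OFFSET before indexing, matching the
    # checkpoint layout.
    return [table[p + OFFSET] for p in position_ids]


def embed_positions_buggy(position_ids):
    # The flagged bug: raw indexing reads the reserved rows for the
    # first two positions and misaligns every later one by two.
    return [table[p] for p in position_ids]
```

With the offset applied, position 0 correctly resolves to the first learned position; without it, positions 0 and 1 hit the reserved rows.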

Comment on lines +3265 to +3279
try:
self.encoder_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(self.encoder_stream):
encoder_hidden_states, encoder_seq_lens = (
self.model_engine.forward_encoder(
encoder_requests,
resource_manager=self.resource_manager,
))
except Exception as e:
traceback.print_exc()
error_msg = str(e)
logger.error(
f"Encountered an error in encoder forward: {error_msg}")
self._handle_errors(error_msg, requests=encoder_requests)
return

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Catch only specific, request-scoped exceptions in the encoder forward path.

The broad except Exception at line 3273 violates the coding guideline "When using try-except blocks, limit the except to the smallest set of errors possible" and masks internal errors like AttributeError (missing encoder submodule) that should propagate to fail fast. Catch only ValueError for invalid input errors and let other exceptions (RuntimeError, AttributeError, etc.) propagate so the executor detects corrupted state.

🧰 Tools
🪛 Ruff (0.15.12)

[warning] 3273-3273: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/pyexecutor/py_executor.py` around lines 3265 - 3279,
Replace the broad "except Exception" around the encoder forward block so only
input-validation errors are caught: change the handler to "except ValueError as
e:" (keeping traceback.print_exc(), logger.error(f'Encountered an error in
encoder forward: {error_msg}'), calling self._handle_errors(error_msg,
requests=encoder_requests) and returning) and remove/avoid catching other
exception types so RuntimeError/AttributeError and other internal faults from
self.model_engine.forward_encoder, self.encoder_stream, or missing submodules
propagate and fail fast; keep references to encoder_stream,
model_engine.forward_encoder, resource_manager, encoder_requests, and
_handle_errors to locate and implement the change.
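The narrowed handler can be sketched with injected callables. `run_encoder_step` is a hypothetical harness, not executor code: input-validation failures become per-request errors, while internal faults propagate so the executor fails fast.

```python
def run_encoder_step(forward_encoder, requests, handle_errors):
    """Only ValueError (invalid input) is converted into a per-request
    failure; RuntimeError, AttributeError, etc. propagate unchanged so
    corrupted executor state is detected immediately."""
    try:
        return forward_encoder(requests)
    except ValueError as e:
        handle_errors(str(e), requests)
        return None
```

A quick check: a ValueError from the forward call is routed to the error handler, while a RuntimeError escapes.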

Comment on lines +3316 to +3329
assert len(encoder_seq_lens) == len(encoder_requests), (
"Encoder packed sequence lengths must match the number of "
"encoder requests")
assert encoder_hidden_states.shape[0] == sum(encoder_seq_lens), (
"Encoder packed hidden states first dim must equal "
"sum(encoder_seq_lens)")

offset = 0
for req, seq_len in zip(encoder_requests, encoder_seq_lens):
req.py_encoder_output = encoder_hidden_states[offset:offset +
seq_len]
req.py_skip_cross_kv_projection = False
req.state = LlmRequestState.CONTEXT_INIT
offset += seq_len

⚠️ Potential issue | 🟠 Major | ⚡ Quick win


Use explicit runtime checks instead of asserts for encoder output distribution.

These invariants currently depend on assert, which can be disabled when Python runs with optimization flags (-O/-OO). When disabled, the plain zip() call will silently truncate if the lists have different lengths, leaving a suffix of requests without py_encoder_output and without the required state transition to CONTEXT_INIT—a difficult-to-debug production issue.

Replace asserts with explicit if checks raising RuntimeError, and add strict=True to the zip() call to catch length mismatches at runtime:

Suggested fix
-        assert len(encoder_seq_lens) == len(encoder_requests), (
-            "Encoder packed sequence lengths must match the number of "
-            "encoder requests")
-        assert encoder_hidden_states.shape[0] == sum(encoder_seq_lens), (
-            "Encoder packed hidden states first dim must equal "
-            "sum(encoder_seq_lens)")
+        if len(encoder_seq_lens) != len(encoder_requests):
+            raise RuntimeError(
+                "Encoder packed sequence lengths must match the number of "
+                "encoder requests")
+        if encoder_hidden_states.shape[0] != sum(encoder_seq_lens):
+            raise RuntimeError(
+                "Encoder packed hidden states first dim must equal "
+                "sum(encoder_seq_lens)")
@@
-        for req, seq_len in zip(encoder_requests, encoder_seq_lens):
+        for req, seq_len in zip(
+                encoder_requests, encoder_seq_lens, strict=True):
🧰 Tools
🪛 Ruff (0.15.12)

[warning] 3324-3324: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/pyexecutor/py_executor.py` around lines 3316 - 3329,
Replace the two assert checks with explicit runtime validation: verify
len(encoder_seq_lens) == len(encoder_requests) and
encoder_hidden_states.shape[0] == sum(encoder_seq_lens); if either fails, raise
a RuntimeError with a clear message. Also iterate using zip(encoder_requests,
encoder_seq_lens, strict=True) to ensure length mismatches raise immediately,
then continue assigning req.py_encoder_output, req.py_skip_cross_kv_projection =
False and setting req.state = LlmRequestState.CONTEXT_INIT as before (use the
same symbols encoder_seq_lens, encoder_requests, encoder_hidden_states,
py_encoder_output, py_skip_cross_kv_projection, LlmRequestState.CONTEXT_INIT).

Comment on lines +465 to +468
cross_action = self._try_schedule_cross_context(req)
if cross_action is not ScheduleAction.SCHEDULED:
self._suspend_request(req)
return cross_action, 0, False

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Roll back the cross pool too when decoder-context admission fails.

By this point the request has already resized the self-KV cache, and _try_schedule_cross_context() may also have restored or allocated cross-KV state. Calling _suspend_request(req) only frees the main and draft managers, so cross-pool pages stay pinned and later encoder-decoder admissions can stall once that pool fills.

Suggested fix
     def _suspend_request(self, req: LlmRequest) -> None:
         """Suspend a request's KV cache in both main and draft managers.
@@
         self.kv_cache_manager.suspend_request(req)
         if self.draft_kv_cache_manager is not None:
             self.draft_kv_cache_manager.suspend_request(req)
+        if self.cross_kv_cache_manager is not None:
+            self.cross_kv_cache_manager.suspend_request(req)

Also applies to: 531-534

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/pyexecutor/scheduler/scheduler_v2.py` around lines 465 -
468, The failure path after calling _try_schedule_cross_context(req) must undo
any cross-pool KV allocations before returning; modify the two failure sites
(the block that checks cross_action != ScheduleAction.SCHEDULED) to call a
rollback/free helper for cross-context state (e.g. add a call to a new or
existing _rollback_cross_context(req) or extend _suspend_request(req) to free
cross-pool pages) so that any allocations or pins made by
_try_schedule_cross_context are released prior to returning; apply the same
change at the other matching location that checks cross_action (the one around
the secondary occurrence at lines 531-534).

Comment on lines 383 to 386
def can_schedule(self, requests: RequestList) -> bool:
requests = filter_unready_decoder_context_requests(requests)
fitting_requests, _, _ = self.capacity_scheduler.schedule_request(requests)
return len(fitting_requests) == len(requests)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

can_schedule() now returns True for filtered-out requests.

If the only input is an unready CONTEXT_INIT request, filter_unready_decoder_context_requests() turns it into [], and the length check succeeds. That makes can_schedule([req]) report schedulable even though the request is explicitly blocked on encoder output readiness.

♻️ Suggested fix
     def can_schedule(self, requests: RequestList) -> bool:
-        requests = filter_unready_decoder_context_requests(requests)
-        fitting_requests, _, _ = self.capacity_scheduler.schedule_request(requests)
-        return len(fitting_requests) == len(requests)
+        ready_requests = filter_unready_decoder_context_requests(requests)
+        if len(ready_requests) != len(requests):
+            return False
+        fitting_requests, _, _ = self.capacity_scheduler.schedule_request(
+            ready_requests
+        )
+        return len(fitting_requests) == len(ready_requests)

Also applies to: 1830-1834

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/pyexecutor/scheduler/scheduler.py` around lines 383 -
386, can_schedule currently filters out unready CONTEXT_INIT requests before
comparing lengths, causing it to return True when all original requests were
filtered out; update can_schedule to first capture the original requests list,
run filter_unready_decoder_context_requests, and immediately return False if any
requests were filtered out (i.e., len(filtered) != len(original)); otherwise
call self.capacity_scheduler.schedule_request(filtered) and return True only if
len(fitting_requests) == len(original) (or len(filtered) since they’re equal),
referencing can_schedule, filter_unready_decoder_context_requests, and
capacity_scheduler.schedule_request to locate the logic to change.
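The early-return shape of the fix is easy to test with injected callables. This free-function rendering is illustrative (the real method lives on the scheduler and calls `self.capacity_scheduler`); the filter and scheduler here are trivial stand-ins.

```python
def can_schedule(requests, filter_unready, schedule_request):
    """A batch containing any not-yet-ready decoder-context request is
    never schedulable: filtered-out requests count as failures instead
    of vanishing from the length comparison."""
    ready = filter_unready(requests)
    if len(ready) != len(requests):
        return False
    fitting, _, _ = schedule_request(ready)
    return len(fitting) == len(ready)
```

With the original code, a batch consisting only of an unready request filtered down to `[]`, and `len([]) == len([])` reported it schedulable; the early return closes that hole.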

Comment on lines 98 to +101
assert inputs.get("prompt") is not None \
or inputs.get("prompt_token_ids") is not None
or inputs.get("prompt_token_ids") is not None \
or inputs.get("encoder_inputs") is not None \
or inputs.get("encoder_input_token_ids") is not None

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Accept decoder_input_token_ids in dict prompts.

This guard still rejects a request that only provides decoder_input_token_ids, so one of the new encoder-decoder prompt shapes fails before it reaches the executor.

Suggested fix
         assert inputs.get("prompt") is not None \
             or inputs.get("prompt_token_ids") is not None \
             or inputs.get("encoder_inputs") is not None \
-            or inputs.get("encoder_input_token_ids") is not None
+            or inputs.get("encoder_input_token_ids") is not None \
+            or inputs.get("decoder_input_token_ids") is not None
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/inputs/data.py` around lines 98 - 101, The current assertion in
the inputs validation checks for "prompt", "prompt_token_ids", "encoder_inputs",
or "encoder_input_token_ids" but misses the new decoder-only shape; update the
guard to also accept inputs.get("decoder_input_token_ids") (i.e., add an OR
inputs.get("decoder_input_token_ids") is not None) so that requests supplying
only decoder_input_token_ids pass validation; modify the assertion line that
references inputs.get("prompt") / "prompt_token_ids" / "encoder_inputs" /
"encoder_input_token_ids" accordingly.
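The corrected guard amounts to an any-of check over the accepted prompt keys. `has_valid_prompt` is a hypothetical helper equivalent to the suggested assertion, including the new decoder-only shape.

```python
_PROMPT_KEYS = ("prompt", "prompt_token_ids", "encoder_inputs",
                "encoder_input_token_ids", "decoder_input_token_ids")


def has_valid_prompt(inputs: dict) -> bool:
    """True if the request supplies at least one accepted prompt field,
    including the decoder-only encoder-decoder shape."""
    return any(inputs.get(k) is not None for k in _PROMPT_KEYS)
```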
