[#13909][fix] Reuse hidden_states buffer across CUDA graph captures in Eagle3#13920
ml-inference wants to merge 1 commit into NVIDIA:main
Conversation
Previously, Eagle3OneModelSpecMetadata allocated a new hidden_states buffer (max_num_tokens × hidden_size × num_capture_layers) per CUDA graph capture. This caused memory to grow linearly with the number of graphs captured, wasting ~2.75 GiB for 16 captures on Qwen3-235B.
Fix: Always create Eagle3ResourceManager in the one-model flow and reuse its pre-allocated hidden_states buffer instead of allocating a new one per capture.
Fixes: NVIDIA#13909
Signed-off-by: Spurthi Sandiri <spurthi@amazon.com>
📝 Walkthrough
This PR fixes excessive GPU memory allocation during CUDA graph capture in Eagle3 speculative decoding. The resource manager factory now unconditionally returns an Eagle3ResourceManager in the one-model flow, so captures reuse its pre-allocated hidden_states buffer.
Changes: Eagle3 memory reuse during graph capture
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (warning)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (1)
1-1: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Add/update NVIDIA copyright header in this modified source file
This file is modified but does not show the required NVIDIA copyright header/current modification year.
As per coding guidelines, "All C++, Python, and other source files must contain NVIDIA copyright header with current modification year."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py` at line 1, This file (starts with the import bisect statement) is missing the required NVIDIA copyright header; add the standard NVIDIA copyright/header block at the very top of tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (above the existing import bisect), update the modification year to the current year, and ensure the header formatting matches other project source files so linters and legal checks recognize it.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py`:
- Around line 394-399: The log line in capture uses self.config.mapping.rank
unguarded which can raise AttributeError when mapping is None; update the
logger.debug call in the cuda_graph_runner capture path to safely read rank
(e.g., use a conditional/getattr to produce a safe default like "None" or -1
when self.config.mapping is absent) while keeping the rest of the fields
(self.config.is_draft_model, len(self.graphs), torch.cuda.memory_allocated())
unchanged; locate the logger.debug invocation in the class/method that performs
cuda graph capture and replace the direct .rank access with the guarded
expression.
In `@tensorrt_llm/_torch/speculative/eagle3.py`:
- Line 372: The assignment self.hidden_states =
self.spec_resource_manager.hidden_states fails in dynamic-tree one-model mode
because Eagle3OneModelDynamicTreeResourceManager lacks hidden_states; change the
initialization to safely handle both managers by checking for the attribute
(e.g., use getattr(self.spec_resource_manager, "hidden_states", None)) or by
conditional logic that only sets self.hidden_states when spec_resource_manager
exposes it so metadata init doesn't raise AttributeError.
---
Outside diff comments:
In `@tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py`:
- Line 1: This file (starts with the import bisect statement) is missing the
required NVIDIA copyright header; add the standard NVIDIA copyright/header block
at the very top of tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py (above
the existing import bisect), update the modification year to the current year,
and ensure the header formatting matches other project source files so linters
and legal checks recognize it.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: d7902bb3-cc50-4f06-a8ca-42f9b27649fc
📒 Files selected for processing (3)
tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py
tensorrt_llm/_torch/speculative/eagle3.py
tensorrt_llm/_torch/speculative/utils.py
```python
logger.debug(
    f"[Memory-Debug][rank={self.config.mapping.rank}][cuda_graph:capture] "
    f"key={key} is_draft_model={self.config.is_draft_model} "
    f"graphs_so_far={len(self.graphs)} "
    f"torch_memory_allocated={torch.cuda.memory_allocated()/1024**3:.2f}GiB"
)
```
Guard optional mapping before logging rank
self.config.mapping is optional, but Line 395 dereferences .rank unconditionally. That can crash capture with AttributeError when mapping is absent.
Proposed fix
- logger.debug(
- f"[Memory-Debug][rank={self.config.mapping.rank}][cuda_graph:capture] "
+ rank = self.config.mapping.rank if self.config.mapping is not None else -1
+ logger.debug(
+ f"[Memory-Debug][rank={rank}][cuda_graph:capture] "
f"key={key} is_draft_model={self.config.is_draft_model} "
f"graphs_so_far={len(self.graphs)} "
f"torch_memory_allocated={torch.cuda.memory_allocated()/1024**3:.2f}GiB"
    )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py` around lines 394 - 399,
The log line in capture uses self.config.mapping.rank unguarded which can raise
AttributeError when mapping is None; update the logger.debug call in the
cuda_graph_runner capture path to safely read rank (e.g., use a
conditional/getattr to produce a safe default like "None" or -1 when
self.config.mapping is absent) while keeping the rest of the fields
(self.config.is_draft_model, len(self.graphs), torch.cuda.memory_allocated())
unchanged; locate the logger.debug invocation in the class/method that performs
cuda graph capture and replace the direct .rank access with the guarded
expression.
```python
        self.hidden_size * len(self.layers_to_capture)),
    dtype=self.dtype,
    device='cuda')
self.hidden_states = self.spec_resource_manager.hidden_states
```
Critical: Line 372 breaks dynamic-tree one-model mode (AttributeError).
self.spec_resource_manager.hidden_states is accessed unconditionally, but dynamic-tree mode uses Eagle3OneModelDynamicTreeResourceManager (no hidden_states field). This can fail during metadata init.
Proposed fix
- self.hidden_states = self.spec_resource_manager.hidden_states
+ if isinstance(self.spec_resource_manager, Eagle3ResourceManager):
+ self.hidden_states = self.spec_resource_manager.hidden_states
+ else:
+ # Dynamic-tree manager does not provide hidden_states.
+ # Keep per-metadata allocation for that mode.
+ self.hidden_states = torch.empty(
+ (self.max_num_tokens, self.hidden_size * self.num_capture_layers),
+ dtype=self.dtype,
+ device='cuda',
+ )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tensorrt_llm/_torch/speculative/eagle3.py` at line 372, The assignment
self.hidden_states = self.spec_resource_manager.hidden_states fails in
dynamic-tree one-model mode because Eagle3OneModelDynamicTreeResourceManager
lacks hidden_states; change the initialization to safely handle both managers by
checking for the attribute (e.g., use getattr(self.spec_resource_manager,
"hidden_states", None)) or by conditional logic that only sets
self.hidden_states when spec_resource_manager exposes it so metadata init
doesn't raise AttributeError.
Description
During CUDA graph capture in the Eagle3 one-model flow, Eagle3OneModelSpecMetadata allocated a new hidden_states buffer (max_num_tokens × hidden_size × num_capture_layers) per graph capture. This caused memory to grow linearly with the number of graphs captured, wasting ~2.75 GiB for 16 captures on Qwen3-235B.
Fix: Always create Eagle3ResourceManager in the one-model flow (previously it was created only when sa_manager was not None) and reuse its pre-allocated hidden_states buffer instead of allocating a new one per capture.
Fixes: #13909
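The linear growth described above follows directly from the buffer-size formula. A quick arithmetic sketch with illustrative parameters (the real values come from the model config; these numbers are assumptions and do not reproduce the exact ~2.75 GiB figure):

```python
# All numbers below are hypothetical stand-ins for the model config.
max_num_tokens = 8192
hidden_size = 4096
num_capture_layers = 3
bytes_per_elem = 2   # e.g. bfloat16
num_graphs = 16

# Size of one hidden_states buffer (max_num_tokens x hidden_size x layers).
per_capture = max_num_tokens * hidden_size * num_capture_layers * bytes_per_elem

# Before the fix: one buffer allocated per captured graph -> linear growth.
before = num_graphs * per_capture
# After the fix: a single buffer owned by Eagle3ResourceManager is shared.
after = per_capture

print(f"before: {before / 2**30:.2f} GiB, after: {after / 2**30:.2f} GiB")
# prints: before: 3.00 GiB, after: 0.19 GiB
```

With these assumed parameters the saving is roughly 16x; the actual figure depends on max_num_tokens, hidden size, number of capture layers, and dtype.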
Test Coverage
Manually validated on Qwen3-235B-A22B with FP8 KV cache, 16 CUDA graph captures:
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
[x] Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.