[https://nvbugs/6143599][fix] Re-apply proven fix from commit 295615d8bf (not present in HEAD): subtract 2× pr by tensorrt-cicd · Pull Request #13915 · NVIDIA/TensorRT-LLM

tensorrt-cicd · 2026-05-08T19:36:35Z

Summary

Root cause: The KV cache budget is sized from a small dummy profiling request (2.62 GiB of activations), which leaves no headroom for the MoE permute and DeepGEMM workspaces that real batches of 1024 or more require, so PyTorch memory use reaches the physical GPU limit and inference fails with OOM.
Fix: Re-apply proven fix from commit 295615d (not present in HEAD): subtract 2× profiled activation_bytes from kv_cache_max_memory in configure_kv_cache_capacity, and point extract_stress_test_metrics() default artifacts_dir at os.getcwd()/artifacts to match where aiperf writes.
Automated fix generated by repair-bot

Test plan

Verify fix on the same GPU type as the original failure
Check for regressions in related tests

Links

Bug: https://nvbugs/6143599

Summary by CodeRabbit

Bug Fixes
- Improved KV-cache memory management by reserving additional headroom for peak activations and workspace allocations, preventing potential out-of-memory errors.
Configuration
- Updated artifact directory defaults for stress tests to use the current working directory for better consistency.
Tests
- Updated stress test waivers to reflect current configurations.

…d fix stress test artifacts directory

The DeepSeek-V3 tp8 stress test on B200 was failing with CUDA OOM during high-concurrency inference. Two issues:

1. KV cache budget calculation did not account for dynamic activation memory that scales with batch size. Profiling captures activations for a small dummy request (2.6 GiB), but runtime activations (MoE permute buffers, MLA KV projections) at full batch size are significantly larger. Reserve 2x the profiled activation memory from the KV cache budget to prevent OOM under sustained load.
2. extract_stress_test_metrics() looked for aiperf artifacts relative to the script file location, but aiperf writes them relative to the current working directory. Use os.getcwd() instead.

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
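The artifact-path issue in the commit message can be illustrated with a minimal sketch. This is not the code from `stress_test.py`: the helper names `resolve_artifacts_dir` and `script_relative_artifacts_dir` are hypothetical, and the only details taken from the PR are the `artifacts` directory name and the switch to `os.getcwd()`.

```python
import os
from typing import Optional


def script_relative_artifacts_dir(script_path: str) -> str:
    """Old behaviour (as described in the PR): look next to the script file."""
    return os.path.join(os.path.dirname(os.path.abspath(script_path)), "artifacts")


def resolve_artifacts_dir(artifacts_dir: Optional[str] = None) -> str:
    """New behaviour (hypothetical helper): default to <cwd>/artifacts.

    An explicitly supplied directory is returned unchanged.
    """
    if artifacts_dir is None:
        artifacts_dir = os.path.join(os.getcwd(), "artifacts")
    return artifacts_dir


if __name__ == "__main__":
    # The two locations differ whenever the test is launched from a
    # directory other than the one containing the script.
    print("script-relative:", script_relative_artifacts_dir(__file__))
    print("cwd-relative:   ", resolve_artifacts_dir())
```

Resolving the default at call time, rather than as a default argument value, means the path tracks the working directory in effect when metrics are extracted.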

coderabbitai · 2026-05-08T19:39:45Z

📝 Walkthrough

Walkthrough

KV cache memory allocation now reserves headroom for peak activation memory by subtracting twice the profiled activation bytes from the cache budget. Test infrastructure defaults artifact directories to the current working directory instead of script-relative paths, and stress test waivers are updated to remove the GUARANTEED_NO_EVICT variant for DeepSeek-V3.

Changes

KV Cache and Test Stability

Layer / File(s)	Summary
KV Cache Memory Reservation `tensorrt_llm/_torch/pyexecutor/_util.py`	`KvCacheCreator.configure_kv_cache_capacity` subtracts `2 * activation_bytes` from estimated `kv_cache_max_memory` (when `activation_bytes > 0`), clamps to zero, and logs reserved and adjusted KV-cache budget.
Test Artifact Path and Waivers `tests/integration/defs/stress_test/stress_test.py`, `tests/integration/test_lists/waives.txt`	`extract_stress_test_metrics()` defaults `artifacts_dir` to current working directory instead of script-relative path; waiver for `DeepSeek-V3_tp8` with `GUARANTEED_NO_EVICT` scheduler is removed while `MAX_UTILIZATION` variant remains.
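The reservation described in the first table row can be sketched as below. This is an illustration under stated assumptions, not the code in `_util.py`: the function name `reserve_activation_headroom` and the 100 GiB starting budget are hypothetical, while the 2x multiplier, the `activation_bytes > 0` guard, the clamp to zero, and the 2.62 GiB profiled figure come from the PR text.

```python
GIB = 1 << 30


def reserve_activation_headroom(kv_cache_max_memory: int,
                                activation_bytes: int,
                                multiplier: int = 2) -> int:
    """Return the KV-cache budget after reserving activation headroom.

    Hypothetical helper: subtracts `multiplier * activation_bytes` from the
    estimated budget when activations were profiled, and never goes below 0.
    """
    if activation_bytes <= 0:
        # Nothing was profiled, so leave the estimate untouched.
        return kv_cache_max_memory
    reserved = multiplier * activation_bytes
    return max(kv_cache_max_memory - reserved, 0)


if __name__ == "__main__":
    profiled = int(2.62 * GIB)   # activation size quoted in the PR summary
    budget = 100 * GIB           # assumed starting budget, for illustration only
    adjusted = reserve_activation_headroom(budget, profiled)
    print(f"reserved {2 * profiled / GIB:.2f} GiB, "
          f"adjusted KV-cache budget {adjusted / GIB:.2f} GiB")
```

A fixed 2x multiplier is a heuristic: it trades some KV-cache capacity for headroom and does not guarantee that activations at every batch size fit.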

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name	Status	Explanation	Resolution
Title check	❓ Inconclusive	The title is cut off mid-word ('subtract 2× pr' instead of complete phrase) and lacks the complete fix description, making it incomplete and unclear.	Complete the title with the full description, e.g., '[https://nvbugs/6143599][fix] Re-apply fix from commit `295615d`: subtract 2× activation bytes from KV cache budget'.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description check	✅ Passed	The description adequately explains the root cause, the fix applied, and includes test coverage verification and relevant links, though the PR checklist is not completed.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tests/integration/defs/stress_test/stress_test.py (1)
1-1: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Update the NVIDIA copyright year on this modified file.

This file was modified, but the header still ends at 2024.
Suggested fix
-# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
As per coding guidelines: “Include NVIDIA copyright header on all new files; update year on modified files” and “All C++, Python, and other source files must contain NVIDIA copyright header with current modification year”.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/defs/stress_test/stress_test.py` at line 1, Update the
copyright header in tests/integration/defs/stress_test/stress_test.py to include
the current modification year (replace "2024" with the current year) so the file
header matches the project's copyright guidelines; locate the top-of-file SPDX
header line and update the year range accordingly.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 00837d04-0ed9-49bb-9a21-c8dc482cf9a4

📥 Commits

Reviewing files that changed from the base of the PR and between f8572ab and 27c853c.

📒 Files selected for processing (3)

tensorrt_llm/_torch/pyexecutor/_util.py
tests/integration/defs/stress_test/stress_test.py
tests/integration/test_lists/waives.txt

💤 Files with no reviewable changes (1)

tests/integration/test_lists/waives.txt

tensorrt-cicd requested review from a team as code owners May 8, 2026 19:36

tensorrt-cicd assigned dominicshanshan May 8, 2026

tensorrt-cicd requested a review from byshiue May 8, 2026 19:36

github-actions Bot assigned tensorrt-cicd May 8, 2026

coderabbitai Bot reviewed May 8, 2026

tensorrt-cicd wants to merge 1 commit into NVIDIA:main from tensorrt-cicd:repair-bot-bug6143599
