[https://nvbugs/6143599][fix] Re-apply proven fix from commit 295615d8bf (not present in HEAD): subtract 2× pr#13915
…d fix stress test artifacts directory

The DeepSeek-V3 tp8 stress test on B200 was failing with CUDA OOM during high-concurrency inference. Two issues:

1. KV cache budget calculation did not account for dynamic activation memory that scales with batch size. Profiling captures activations for a small dummy request (2.6 GiB), but runtime activations (MoE permute buffers, MLA KV projections) at full batch size are significantly larger. Reserve 2x the profiled activation memory from the KV cache budget to prevent OOM under sustained load.
2. extract_stress_test_metrics() looked for aiperf artifacts relative to the script file location, but aiperf writes them relative to the current working directory. Use os.getcwd() instead.

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
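The budget adjustment in item 1 can be sketched as follows. This is a minimal illustration, not the code in `_util.py`: the function name, its parameters, and the 100 GiB free-memory figure are assumptions made for the example. Only the 2x factor and the 2.6 GiB profiled activation size come from the commit message.

```python
GIB = 1024 ** 3


def kv_cache_budget_bytes(free_bytes: int,
                          profiled_activation_bytes: int,
                          headroom_factor: int = 2) -> int:
    """Hypothetical helper: shrink the KV cache budget by a multiple of
    the activation memory observed during profiling.

    The profiling pass only sees a small dummy request, so the measured
    activations underestimate what a full batch needs at runtime.
    """
    reserved = headroom_factor * profiled_activation_bytes
    # Clamp at zero so a very small free pool never yields a negative budget.
    return max(free_bytes - reserved, 0)


if __name__ == "__main__":
    # Illustrative numbers: 100 GiB free is an assumption, 2.6 GiB is the
    # profiled activation size quoted in the commit message.
    budget = kv_cache_budget_bytes(100 * GIB, int(2.6 * GIB))
    print(f"KV cache budget: {budget / GIB:.1f} GiB")
```

A fixed multiplier is a coarse heuristic: it trades some KV cache capacity for a safety margin rather than modelling how activations scale with batch size.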
📝 Walkthrough

KV cache memory allocation now reserves headroom for peak activation memory by subtracting twice the profiled activation bytes from the cache budget. Test infrastructure defaults artifact directories to the current working directory instead of script-relative paths, and stress test waivers are updated to remove the GUARANTEED_NO_EVICT variant for DeepSeek-V3.

Changes
KV Cache and Test Stability
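The artifact directory change can be illustrated with a small sketch. The helper name `resolve_artifacts_dir` and the `artifacts` subdirectory name are hypothetical and not taken from `stress_test.py`. The only point carried over from the PR is that the default base is `os.getcwd()` rather than the script's own location.

```python
import os
from typing import Optional


def resolve_artifacts_dir(artifacts_dir: Optional[str] = None,
                          subdir: str = "artifacts") -> str:
    """Hypothetical helper: locate benchmark artifacts.

    When no directory is given, fall back to the current working
    directory, since that is where the benchmarking tool writes its
    output, instead of os.path.dirname(__file__).
    """
    base = artifacts_dir if artifacts_dir is not None else os.getcwd()
    return os.path.join(base, subdir)


if __name__ == "__main__":
    # Default: resolved against the directory the test was launched from.
    print(resolve_artifacts_dir())
    # An explicit base directory still takes precedence.
    print(resolve_artifacts_dir("/tmp/stress_run"))
```

Resolving against the working directory keeps the lookup correct when the test is launched from a directory other than the one containing the script.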
🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 inconclusive)
✅ Passed checks (4 passed)
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/integration/defs/stress_test/stress_test.py (1)
1-1: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Update the NVIDIA copyright year on this modified file.
This file was modified, but the header still ends at 2024.
Suggested fix
-# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

As per coding guidelines: “Include NVIDIA copyright header on all new files; update year on modified files” and “All C++, Python, and other source files must contain NVIDIA copyright header with current modification year”.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/integration/defs/stress_test/stress_test.py` at line 1, Update the copyright header in tests/integration/defs/stress_test/stress_test.py to include the current modification year (replace "2024" with the current year) so the file header matches the project's copyright guidelines; locate the top-of-file SPDX header line and update the year range accordingly.
📒 Files selected for processing (3)
- tensorrt_llm/_torch/pyexecutor/_util.py
- tests/integration/defs/stress_test/stress_test.py
- tests/integration/test_lists/waives.txt
💤 Files with no reviewable changes (1)
- tests/integration/test_lists/waives.txt