[https://nvbugs/6094107][fix] Exclude PP send/recv from piecewise CUDA graph capture#13296
tensorrt-cicd wants to merge 1 commit into NVIDIA:main from
Conversation
The piecewise CUDA graph optimizer was capturing NCCL point-to-point communication ops (pp_send_tensors/pp_recv_tensors) inside CUDA graph sections. When replayed across pipeline-parallel ranks, these captured NCCL operations could intermittently deadlock, causing the PP4 + torch_compile + piecewise CUDA graph configuration to hang.

Add pp_send_tensors and pp_recv_tensors as graph-break points in the piecewise optimizer so they always run eagerly, similar to how attention custom ops are already excluded.

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
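As a rough illustration of the graph-break idea described above, the sketch below partitions a linear sequence of ops into capturable segments, forcing any op in a no-capture set to run eagerly between segments. The names (`NO_CAPTURE_OPS`, `partition`, the op strings) are hypothetical simplifications for illustration, not the actual TensorRT-LLM piecewise-optimizer API.

```python
# Ops that must never be inside a CUDA graph segment; pp_send_tensors and
# pp_recv_tensors are the ops this PR adds (attention was already excluded).
NO_CAPTURE_OPS = {"pp_send_tensors", "pp_recv_tensors", "attention"}

def partition(ops):
    """Split a flat op sequence into ('capture', [...]) segments,
    breaking around ops that must run eagerly."""
    segments, current = [], []
    for op in ops:
        if op in NO_CAPTURE_OPS:
            if current:
                segments.append(("capture", current))
                current = []
            segments.append(("eager", [op]))
        else:
            current.append(op)
    if current:
        segments.append(("capture", current))
    return segments

print(partition(["matmul", "pp_recv_tensors", "layernorm", "gelu",
                 "pp_send_tensors"]))
```

Each `eager` segment becomes a graph-break point, so the NCCL send/recv always executes outside any captured CUDA graph, as the fix intends.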
/bot run --add-multi-gpu-test
PR_Github #44854 [ run ] triggered by Bot. Commit:

PR_Github #44854 [ run ] completed with state
```python
if (not stop_partition and is_call_function(node, [
    # PP send/recv must always be excluded from CUDA graph capture
    # regardless of stop_partition, because capturing NCCL point-to-point
    # communication in CUDA graphs can cause intermittent deadlocks.
```
The explanation does not make sense, as the normal CUDA graph path does not exclude pp_send and pp_recv. There could be a deeper issue here.
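To make the deadlock scenario from the PR description concrete, here is a toy model (plain Python, not NCCL semantics) of two ranks replaying fixed, blocking point-to-point schedules. The function name and schedule encoding are invented for illustration: each op is `("send", tag)` or `("recv", tag)`, and a pair completes only when the two ranks' next ops are a matching send/recv with the same tag; anything else leaves both ranks blocked, which is the risk when a captured graph replays a p2p op at the wrong moment.

```python
def would_deadlock(schedule_a, schedule_b):
    """Return True if two blocking p2p schedules get stuck.

    A step completes only when one rank's next op is a send and the
    other's is a recv on the same tag; otherwise both ranks block.
    """
    i = j = 0
    while i < len(schedule_a) and j < len(schedule_b):
        op_a, op_b = schedule_a[i], schedule_b[j]
        if op_a[0] != op_b[0] and op_a[1] == op_b[1]:
            i += 1  # matching send/recv pair completes on both ranks
            j += 1
        else:
            return True  # both ranks blocked on unmatched ops
    return False

# Complementary schedules make progress:
print(would_deadlock([("send", "x"), ("recv", "y")],
                     [("recv", "x"), ("send", "y")]))
# Two ranks both replaying a captured send first block forever:
print(would_deadlock([("send", "x")], [("send", "x")]))
```

Running the ops eagerly (outside the captured graph) lets the runtime order them correctly per iteration; a replayed graph cannot adapt, which is why the fix breaks the graph around send/recv.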
Closing, since we cannot reproduce this bug with the latest main; see #13891.