[https://nvbugs/6163030][fix] Replace moe_ep_size * max_tokens with max(moe_ep_size, dp_size) * max_tokens #14023
Conversation
When attention_dp is enabled with moe_tp_size > 1 (TP-MoE), the MoE communication strategy is AllGatherReduceScatter, whose dispatch output is [dp_size * max_tokens_per_rank, hidden]. The workspace/chunk sizing previously used moe_ep_size * max_tokens_per_rank, which under-allocates when moe_ep_size < dp_size (e.g. tp_size=4, ep_size=1 → moe_ep_size=1, dp_size=4). DeepGemmFusedMoE.run_moe then tripped set_strides with a storage size 4x smaller than required. Use max(moe_ep_size, dp_size) * max_tokens_per_rank in the three call sites (single-chunk workspace, multi-chunk workspace, and chunk count), so AllGatherReduceScatter is sized correctly while DeepEP-family comms (where moe_ep_size == dp_size under attention_dp) are unchanged.

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
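To make the shortfall concrete, here is a minimal, self-contained sketch of the arithmetic (the token and hidden sizes are illustrative assumptions, not values from the repo):

```python
# Minimal sketch of the sizing bug, assuming the failing configuration from
# the bug report; token and hidden sizes here are illustrative only.
moe_ep_size, dp_size = 1, 4        # tp_size=4, ep_size=1 under attention_dp
max_tokens_per_rank = 512          # illustrative
hidden = 4096                      # illustrative

# AllGatherReduceScatter dispatch output: [dp_size * max_tokens_per_rank, hidden]
dispatch_rows = dp_size * max_tokens_per_rank                 # 2048 rows produced

old_rows = moe_ep_size * max_tokens_per_rank                  # 512 rows allocated
new_rows = max(moe_ep_size, dp_size) * max_tokens_per_rank    # 2048 rows allocated

assert old_rows * 4 == dispatch_rows   # the 4x shortfall that tripped set_strides
assert new_rows == dispatch_rows       # fixed sizing covers the gathered tokens
```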
📝 Walkthrough

This PR consolidates MoE workspace sizing logic by introducing a centralized dispatch row-count helper method and applying it consistently across the scheduler and configurable MoE modules. The sizing formula now uniformly accounts for both data-parallel size and expert-parallel size when calculating buffer dimensions under different communication strategies.

Changes: MOE Workspace Sizing Alignment
Actionable comments posted: 1
📒 Files selected for processing (2)

- tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py
- tensorrt_llm/_torch/modules/fused_moe/moe_scheduler.py
```diff
        Uses ``max(ep_size, dp_size) * max(all_rank_num_tokens)`` when
        communication is active. AllGatherReduceScatter (selected when
        ``moe_tp_size != 1``) gathers ``dp_size * max_tokens_per_rank`` even
        though ``moe_ep_size`` may be 1, while alltoall-family comms produce
        ``ep_size * max_tokens_per_rank``. Using the max matches the
        workspace sizing in the MoE scheduler.
        """
        if self.use_dp and self.comm is not None:
-           num_rows = self.mapping.moe_ep_size * max(all_rank_num_tokens)
+           num_rows = max(self.mapping.moe_ep_size, self.mapping.dp_size) * max(
+               all_rank_num_tokens
+           )
```
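For context on how `num_rows` feeds chunking, a hedged sketch follows; the ceil-division chunk formula and the budget value are assumptions for illustration, not the actual body of `calculate_num_chunks()`:

```python
import math

# Hedged sketch: assumes chunking is a ceil division of dispatched rows by a
# per-chunk token budget. The real calculate_num_chunks() may differ.
def num_chunks(num_rows: int, moe_max_num_tokens: int) -> int:
    return max(1, math.ceil(num_rows / moe_max_num_tokens))

# tp_size=4, ep_size=1, max_tokens_per_rank=512, budget=1024 (illustrative):
old = num_chunks(1 * 512, 1024)          # 1 chunk -- under-counts the gathered rows
new = num_chunks(max(1, 4) * 512, 1024)  # 2 chunks -- covers dp_size * max_tokens
```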
Keep calculate_num_chunks() aligned with the low-latency dispatch shape.
ExternalCommMoEScheduler._dispatched_rows() now sizes DeepEPLowLatency as num_slots * max_tokens_per_rank, but this branch still uses max(moe_ep_size, dp_size) * ... for every comm mode. If num_slots is larger, chunking is still underestimated on the low-latency path, so chunk count and DeepGemm workspace sizing diverge again. Please route both call sites through the same dispatched-row helper instead of duplicating the non-low-latency formula.
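For illustration, here is a sketch of what such a shared helper could look like; the `num_slots * max_tokens_per_rank` low-latency sizing and the helper's purpose come from this thread, while the class name, signature, and `isinstance` dispatch below are hypothetical:

```python
# Sketch only: `DeepEPLowLatencyComm` and the attribute/argument names are
# hypothetical stand-ins, not TensorRT-LLM's real classes.
class DeepEPLowLatencyComm:
    def __init__(self, num_slots: int):
        self.num_slots = num_slots

def dispatched_rows(mapping, all_rank_num_tokens, comm) -> int:
    """Rows produced by dispatch, shared by workspace sizing and chunking."""
    max_tokens_per_rank = max(all_rank_num_tokens)
    if isinstance(comm, DeepEPLowLatencyComm):
        # Low-latency dispatch is sized per expert slot, not per rank.
        return comm.num_slots * max_tokens_per_rank
    # AllGatherReduceScatter gathers dp_size * max_tokens_per_rank rows even
    # when moe_ep_size == 1; alltoall-family comms produce ep_size rows.
    return max(mapping.moe_ep_size, mapping.dp_size) * max_tokens_per_rank
```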
🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py` around lines 395-405: The current branch in calculate_num_chunks() computes num_rows using max(self.mapping.moe_ep_size, self.mapping.dp_size) * max(all_rank_num_tokens), which diverges from the low-latency dispatch sizing; replace that duplicated logic by calling the shared helper ExternalCommMoEScheduler._dispatched_rows(...) (the same helper used to size DeepEPLowLatency) so both comm modes use num_slots * max_tokens_per_rank consistently; ensure you pass the same arguments (mapping, all_rank_num_tokens, comm/use_dp context) to _dispatched_rows() and remove the bespoke max(...) expression to keep chunk counts and DeepGemm workspace sizing aligned.
Summary

- Workspace/chunk sizing previously used `moe_ep_size * max_tokens_per_rank`, which under-allocates for AllGatherReduceScatter dispatch (shape is `dp_size * max_tokens_per_rank`) when `moe_tp_size > 1` and `moe_ep_size < dp_size`.
- Replace `moe_ep_size * max_tokens` with `max(moe_ep_size, dp_size) * max_tokens` in `_prepare_workspace_deepgemm`, `_prepare_workspaces_for_chunk`, and `calculate_num_chunks`; the DeepEPLowLatency branch is untouched, and DeepEP-family paths are unchanged (they already satisfy `moe_ep_size == dp_size` under attention_dp).

Test plan
Links
Summary by CodeRabbit
Bug Fixes
Refactor