[https://nvbugs/6163030][fix] Replace moe_ep_size * max_tokens with max(moe_ep_size, dp_size) * max_tokens #14023
Conversation
When attention_dp is enabled with moe_tp_size > 1 (TP-MoE), the MoE communication strategy is AllGatherReduceScatter, whose dispatch output is [dp_size * max_tokens_per_rank, hidden]. The workspace/chunk sizing previously used moe_ep_size * max_tokens_per_rank, which under-allocates when moe_ep_size < dp_size (e.g. tp_size=4, ep_size=1 → moe_ep_size=1, dp_size=4). DeepGemmFusedMoE.run_moe then tripped set_strides with a storage size 4x smaller than required. Use max(moe_ep_size, dp_size) * max_tokens_per_rank in the three call sites (single-chunk workspace, multi-chunk workspace, and chunk count), so AllGatherReduceScatter is sized correctly while DeepEP-family comms (where moe_ep_size == dp_size under attention_dp) are unchanged.

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
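To make the shortfall concrete, here is a minimal, self-contained sketch of the arithmetic (the token and hidden sizes are illustrative assumptions, not values from the repo):

```python
# Minimal sketch of the sizing bug, assuming the failing configuration from
# the bug report; token and hidden sizes here are illustrative only.
moe_ep_size, dp_size = 1, 4        # tp_size=4, ep_size=1 under attention_dp
max_tokens_per_rank = 512          # illustrative
hidden = 4096                      # illustrative

# AllGatherReduceScatter dispatch output: [dp_size * max_tokens_per_rank, hidden]
dispatch_rows = dp_size * max_tokens_per_rank                 # 2048 rows produced

old_rows = moe_ep_size * max_tokens_per_rank                  # 512 rows allocated
new_rows = max(moe_ep_size, dp_size) * max_tokens_per_rank    # 2048 rows allocated

assert old_rows * 4 == dispatch_rows   # the 4x shortfall that tripped set_strides
assert new_rows == dispatch_rows       # fixed sizing covers the gathered tokens
```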
📝 Walkthrough

This PR consolidates MoE workspace sizing logic by introducing a centralized dispatch row-count helper method and applying it consistently across the scheduler and configurable MoE modules. The sizing formula now uniformly accounts for both data-parallel size and expert-parallel size when calculating buffer dimensions under different communication strategies.

Changes: MOE Workspace Sizing Alignment
Actionable comments posted: 1
📒 Files selected for processing (2)

- tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py
- tensorrt_llm/_torch/modules/fused_moe/moe_scheduler.py
```diff
        Uses ``max(ep_size, dp_size) * max(all_rank_num_tokens)`` when
        communication is active. AllGatherReduceScatter (selected when
        ``moe_tp_size != 1``) gathers ``dp_size * max_tokens_per_rank`` even
        though ``moe_ep_size`` may be 1, while alltoall-family comms produce
        ``ep_size * max_tokens_per_rank``. Using the max matches the
        workspace sizing in the MoE scheduler.
        """
        if self.use_dp and self.comm is not None:
-           num_rows = self.mapping.moe_ep_size * max(all_rank_num_tokens)
+           num_rows = max(self.mapping.moe_ep_size, self.mapping.dp_size) * max(
+               all_rank_num_tokens
+           )
```
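For context on how `num_rows` feeds chunking, a hedged sketch follows; the ceil-division chunk formula and the budget value are assumptions for illustration, not the actual body of `calculate_num_chunks()`:

```python
import math

# Hedged sketch: assumes chunking is a ceil division of dispatched rows by a
# per-chunk token budget. The real calculate_num_chunks() may differ.
def num_chunks(num_rows: int, moe_max_num_tokens: int) -> int:
    return max(1, math.ceil(num_rows / moe_max_num_tokens))

# tp_size=4, ep_size=1, max_tokens_per_rank=512, budget=1024 (illustrative):
old = num_chunks(1 * 512, 1024)          # 1 chunk -- under-counts the gathered rows
new = num_chunks(max(1, 4) * 512, 1024)  # 2 chunks -- covers dp_size * max_tokens
```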
Keep calculate_num_chunks() aligned with the low-latency dispatch shape.
ExternalCommMoEScheduler._dispatched_rows() now sizes DeepEPLowLatency as num_slots * max_tokens_per_rank, but this branch still uses max(moe_ep_size, dp_size) * ... for every comm mode. If num_slots is larger, chunking is still underestimated on the low-latency path, so chunk count and DeepGemm workspace sizing diverge again. Please route both call sites through the same dispatched-row helper instead of duplicating the non-low-latency formula.
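For illustration, here is a sketch of what such a shared helper could look like; the `num_slots * max_tokens_per_rank` low-latency sizing and the helper's purpose come from this thread, while the class name, signature, and `isinstance` dispatch below are hypothetical:

```python
# Sketch only: `DeepEPLowLatencyComm` and the attribute/argument names are
# hypothetical stand-ins, not TensorRT-LLM's real classes.
class DeepEPLowLatencyComm:
    def __init__(self, num_slots: int):
        self.num_slots = num_slots

def dispatched_rows(mapping, all_rank_num_tokens, comm) -> int:
    """Rows produced by dispatch, shared by workspace sizing and chunking."""
    max_tokens_per_rank = max(all_rank_num_tokens)
    if isinstance(comm, DeepEPLowLatencyComm):
        # Low-latency dispatch is sized per expert slot, not per rank.
        return comm.num_slots * max_tokens_per_rank
    # AllGatherReduceScatter gathers dp_size * max_tokens_per_rank rows even
    # when moe_ep_size == 1; alltoall-family comms produce ep_size rows.
    return max(mapping.moe_ep_size, mapping.dp_size) * max_tokens_per_rank
```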
🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py` around lines 395-405: The current branch in calculate_num_chunks() computes num_rows using max(self.mapping.moe_ep_size, self.mapping.dp_size) * max(all_rank_num_tokens), which diverges from the low-latency dispatch sizing; replace that duplicated logic by calling the shared helper ExternalCommMoEScheduler._dispatched_rows(...) (the same helper used to size DeepEPLowLatency) so both comm modes use num_slots * max_tokens_per_rank consistently; ensure you pass the same arguments (mapping, all_rank_num_tokens, comm/use_dp context) to _dispatched_rows() and remove the bespoke max(...) expression to keep chunk counts and DeepGemm workspace sizing aligned.
Summary

- Workspace/chunk sizing previously used `moe_ep_size * max_tokens_per_rank`, which under-allocates for AllGatherReduceScatter dispatch (shape is `dp_size * max_tokens_per_rank`) when `moe_tp_size > 1` and `moe_ep_size < dp_size`.
- Replace `moe_ep_size * max_tokens` with `max(moe_ep_size, dp_size) * max_tokens` in `_prepare_workspace_deepgemm`, `_prepare_workspaces_for_chunk`, and `calculate_num_chunks`; the DeepEPLowLatency branch is untouched, and DeepEP-family paths are unchanged (they already satisfy `moe_ep_size == dp_size` under attention_dp).

Test plan
Links
Summary by CodeRabbit
Bug Fixes
Refactor