
[https://nvbugs/6163030][fix] Replace moe_ep_size * max_tokens with max(moe_ep_size, dp_size) * max_tokens #14023

Open

tensorrt-cicd wants to merge 1 commit into NVIDIA:main from tensorrt-cicd:repair-bot-bug6163030

Conversation


@tensorrt-cicd tensorrt-cicd commented May 12, 2026

Summary

  • Root cause: Workspace sizing used moe_ep_size * max_tokens_per_rank, which under-allocates for AllGatherReduceScatter dispatch (shape is dp_size * max_tokens_per_rank) when moe_tp_size > 1 and moe_ep_size < dp_size.
  • Fix: Replace moe_ep_size * max_tokens with max(moe_ep_size, dp_size) * max_tokens in _prepare_workspace_deepgemm, _prepare_workspaces_for_chunk, and calculate_num_chunks. The DeepEPLowLatency branch is untouched, and DeepEP-family paths behave as before because they already satisfy moe_ep_size == dp_size under attention_dp (see the sketch after this list).
  • Automated fix generated by repair-bot
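
A minimal sketch of the sizing change, using the simplified names from this summary rather than the real call sites:

```python
# Hedged illustration of the fix; moe_ep_size, dp_size and max_tokens_per_rank
# are the quantities named in the summary above, not actual module attributes.
def workspace_rows(moe_ep_size: int, dp_size: int, max_tokens_per_rank: int) -> int:
    # Old formula: moe_ep_size * max_tokens_per_rank under-allocates for
    # AllGatherReduceScatter, whose dispatch output has
    # dp_size * max_tokens_per_rank rows.
    # New formula: max() covers both AllGatherReduceScatter (dp_size rows per
    # token budget) and alltoall-family comms (moe_ep_size rows per token budget).
    return max(moe_ep_size, dp_size) * max_tokens_per_rank
```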

Test plan

  • Verify fix on the same GPU type as the original failure
  • Check for regressions in related tests

Links

  • Bug: https://nvbugs/6163030

Summary by CodeRabbit

  • Bug Fixes

    • Improved buffer and workspace sizing calculations for distributed MOE scenarios, ensuring correct allocation across different communication configurations.
  • Refactor

    • Reorganized internal workspace sizing logic to better account for varied distributed parallelism strategies.


When attention_dp is enabled with moe_tp_size > 1 (TP-MoE), the MoE
communication strategy is AllGatherReduceScatter, whose dispatch output
is [dp_size * max_tokens_per_rank, hidden]. The workspace/chunk sizing
previously used moe_ep_size * max_tokens_per_rank, which under-allocates
when moe_ep_size < dp_size (e.g. tp_size=4, ep_size=1 → moe_ep_size=1,
dp_size=4). DeepGemmFusedMoE.run_moe then tripped set_strides with a
storage size 4x smaller than required.

Use max(moe_ep_size, dp_size) * max_tokens_per_rank in the three
call sites (single-chunk workspace, multi-chunk workspace, and chunk
count), so AllGatherReduceScatter is sized correctly while DeepEP-family
comms (where moe_ep_size == dp_size under attention_dp) are unchanged.

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
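
To make the 4x gap described in the commit message concrete, a small worked check with illustrative values (max_tokens_per_rank and hidden are assumed numbers, not taken from the failing run):

```python
# tp_size=4, ep_size=1 as in the commit message; other values are illustrative.
moe_ep_size, dp_size = 1, 4
max_tokens_per_rank, hidden = 256, 7168

old_rows = moe_ep_size * max_tokens_per_rank                  # 256
needed_rows = dp_size * max_tokens_per_rank                   # 1024 rows in the AllGatherReduceScatter dispatch output
fixed_rows = max(moe_ep_size, dp_size) * max_tokens_per_rank  # 1024

assert needed_rows == 4 * old_rows   # old workspace storage is 4x too small
assert fixed_rows >= needed_rows     # new sizing covers the dispatch output
```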

coderabbitai Bot commented May 12, 2026

📝 Walkthrough

This PR consolidates MOE workspace sizing logic by introducing a centralized dispatch row-count helper method and applying it consistently across scheduler and configurable MOE modules. The sizing formula now uniformly accounts for both data-parallel size and expert-parallel size when calculating buffer dimensions under different communication strategies.

Changes

MOE Workspace Sizing Alignment

  • Dispatch row-count helper method (tensorrt_llm/_torch/modules/fused_moe/moe_scheduler.py): New _dispatched_rows(max_tokens_per_rank) method centralizes the dispatch row-count formula, using moe.num_slots for DeepEPLowLatency and max(moe.mapping.moe_ep_size, moe.mapping.dp_size) for other communication strategies (a hedged sketch follows this list).
  • DeepGemm workspace sizing with helper (tensorrt_llm/_torch/modules/fused_moe/moe_scheduler.py): Single-chunk and multi-chunk workspace sizing (num_rows and chunk_size_0) now derive row counts from _dispatched_rows() instead of duplicated comm-specific branching logic.
  • MOE chunk calculation alignment (tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py): calculate_num_chunks() now uses max(moe_ep_size, dp_size) for num_rows when data-parallel is active with communication, aligning with the centralized helper formula and updated docstring.
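
A hedged sketch of the row-count formula the new helper is described as centralizing. The function shape, the _Mapping type, and the is_deep_ep_low_latency flag are assumptions for illustration; only the two formulas come from the summary above.

```python
from dataclasses import dataclass

@dataclass
class _Mapping:  # assumed shape of moe.mapping, for illustration only
    moe_ep_size: int
    dp_size: int

def dispatched_rows(max_tokens_per_rank: int, mapping: _Mapping,
                    num_slots: int, is_deep_ep_low_latency: bool) -> int:
    """Sketch of the dispatch row-count formula centralized by _dispatched_rows()."""
    if is_deep_ep_low_latency:
        # DeepEPLowLatency dispatch is sized per expert slot.
        return num_slots * max_tokens_per_rank
    # AllGatherReduceScatter gathers dp_size ranks' tokens, while alltoall-family
    # comms produce moe_ep_size ranks' worth of rows; max() covers both cases.
    return max(mapping.moe_ep_size, mapping.dp_size) * max_tokens_per_rank
```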

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Title check: ✅ Passed. The title clearly summarizes the main change: replacing a sizing formula in workspace allocation from moe_ep_size * max_tokens to max(moe_ep_size, dp_size) * max_tokens, which is the core fix for the bug.
  • Description check: ✅ Passed. The PR description includes all required sections: Summary (root cause and fix details), Test plan (verification steps), and Links (bug reference). All core sections are adequately filled out.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py`:
- Around line 395-405: The current branch in calculate_num_chunks() computes
num_rows using max(self.mapping.moe_ep_size, self.mapping.dp_size) *
max(all_rank_num_tokens), which diverges from the low-latency dispatch sizing;
replace that duplicated logic by calling the shared helper
ExternalCommMoEScheduler._dispatched_rows(...) (the same helper used to size
DeepEPLowLatency) so both comm modes use num_slots * max_tokens_per_rank
consistently; ensure you pass the same arguments (mapping, all_rank_num_tokens,
comm/use_dp context) to _dispatched_rows() and remove the bespoke max(...)
expression to keep chunk counts and DeepGemm workspace sizing aligned.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a5c55e86-db31-419b-9705-6ba120839d3e

📥 Commits

Reviewing files that changed from the base of the PR and between 64260ba and 246e6a4.

📒 Files selected for processing (2)
  • tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py
  • tensorrt_llm/_torch/modules/fused_moe/moe_scheduler.py

Comment on lines +395 to +405
         Uses ``max(ep_size, dp_size) * max(all_rank_num_tokens)`` when
         communication is active. AllGatherReduceScatter (selected when
         ``moe_tp_size != 1``) gathers ``dp_size * max_tokens_per_rank`` even
         though ``moe_ep_size`` may be 1, while alltoall-family comms produce
         ``ep_size * max_tokens_per_rank``. Using the max matches the
         workspace sizing in the MoE scheduler.
         """
         if self.use_dp and self.comm is not None:
-            num_rows = self.mapping.moe_ep_size * max(all_rank_num_tokens)
+            num_rows = max(self.mapping.moe_ep_size, self.mapping.dp_size) * max(
+                all_rank_num_tokens
+            )


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep calculate_num_chunks() aligned with the low-latency dispatch shape.

ExternalCommMoEScheduler._dispatched_rows() now sizes DeepEPLowLatency as num_slots * max_tokens_per_rank, but this branch still uses max(moe_ep_size, dp_size) * ... for every comm mode. If num_slots is larger, chunking is still underestimated on the low-latency path, so chunk count and DeepGemm workspace sizing diverge again. Please route both call sites through the same dispatched-row helper instead of duplicating the non-low-latency formula.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py` around lines 395 -
405, The current branch in calculate_num_chunks() computes num_rows using
max(self.mapping.moe_ep_size, self.mapping.dp_size) * max(all_rank_num_tokens),
which diverges from the low-latency dispatch sizing; replace that duplicated
logic by calling the shared helper
ExternalCommMoEScheduler._dispatched_rows(...) (the same helper used to size
DeepEPLowLatency) so both comm modes use num_slots * max_tokens_per_rank
consistently; ensure you pass the same arguments (mapping, all_rank_num_tokens,
comm/use_dp context) to _dispatched_rows() and remove the bespoke max(...)
expression to keep chunk counts and DeepGemm workspace sizing aligned.
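
A minimal sketch of the refactor the reviewer is asking for, under stated assumptions: the scheduler handle, the non-DP branch, and moe_max_num_tokens are illustrative names, not the actual calculate_num_chunks() signature.

```python
import math

def calculate_num_chunks(scheduler, all_rank_num_tokens, use_dp, comm, moe_max_num_tokens):
    """Sketch: both comm modes take their row count from the shared helper."""
    if use_dp and comm is not None:
        # Reuse the scheduler's dispatched-row formula so the DeepEPLowLatency
        # (num_slots) and max(ep_size, dp_size) cases stay in sync with the
        # DeepGemm workspace sizing.
        num_rows = scheduler._dispatched_rows(max(all_rank_num_tokens))
    else:
        num_rows = sum(all_rank_num_tokens)
    return math.ceil(num_rows / moe_max_num_tokens)
```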

