[None][refactor] MoEScheduler split + MegaMoE EPLB / multi-chunk / CI integration #13908

xxi-nv wants to merge 2 commits into NVIDIA:main from
Conversation
📝 Walkthrough

This PR refactors TensorRT-LLM's MoE implementation from a monolithic ConfigurableMoE design to a modular scheduler-driven architecture. It introduces the MegaMoE DeepGEMM backend as a first-class fused-communication MoE implementation, adds workspace lifecycle management with reference counting for NVLink AllToAll operations, and includes comprehensive test infrastructure to validate the new design across multiple hardware configurations and EPLB scenarios.

Changes

MoE Scheduler Architecture Refactoring
MegaMoE DeepGEMM Backend & Test Coverage
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 10
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/modules/fused_moe/create_moe.py (1)
1-21: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add the required NVIDIA SPDX header to this modified module.
This file now has Python source changes but still lacks the repository-required copyright/license header at the top.
As per the coding guidelines: "All C++, Python, and other source files must contain NVIDIA copyright header with current modification year".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/modules/fused_moe/create_moe.py` around lines 1 - 21, This module (create_moe.py) is missing the required NVIDIA SPDX copyright/license header; add the repository-standard NVIDIA SPDX header (including the current modification year) as the very first lines of the file above all imports and definitions so the file that defines symbols like ConfigurableMoE, CuteDslFusedMoE, CutlassFusedMoE, DeepGemmFusedMoE, DenseGEMMFusedMoE, TritonFusedMoE, TRTLLMGenFusedMoE, VanillaMoE, WideEPMoE, MoE, MoEWeightLoadingMode, and MegaMoEDeepGemm contains the mandated header. Ensure the header text exactly matches the project's required NVIDIA SPDX format and include the current year.
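For orientation only, a header in the SPDX style NVIDIA repositories commonly use; the exact wording, license identifier, and modification year must match this repository's mandated format rather than this placeholder:

```python
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
```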
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tensorrt_llm/_torch/modules/fused_moe/mega_moe/mega_moe_deepgemm.py`:
- Around line 192-199: The constructor for MegaMoEDeepGEMM currently ignores the
activation_type parameter and always sets self.activation = "swiglu"; update the
constructor (and the other occurrence around lines 267-269) to map and validate
the ActivationType enum to the expected activation string (e.g.,
ActivationType.Swiglu -> "swiglu", ActivationType.GELU -> "gelu", etc.) and set
self.activation from activation_type instead of the hardcoded default; if an
unsupported ActivationType is passed, raise a ValueError or fallback explicitly
with a clear log message so callers of create_moe_backend() get the correct
fused activation behavior (see the first sketch after this file's prompts).
- Around line 125-173: can_implement() can return True in single-process builds
even though MegaMoEDeepGemm's constructor and _resolve_ep_pg() require
torch.distributed to be initialized; update can_implement() in class
MegaMoEDeepGemm to reject non-distributed runs by checking
torch.distributed.is_available() and torch.distributed.is_initialized() (use
both for safety) and return False with a clear message (e.g. "requires
torch.distributed to be available and initialized") so create_moe.get_moe_cls()
will fall back cleanly to other implementations like CutlassFusedMoE.
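A minimal sketch of the enum-to-string mapping the first prompt above asks for; the member names in the table and the helper name are assumptions, not the actual TensorRT-LLM ActivationType definitions:

```python
# Hypothetical enum-name -> kernel-string table; the real ActivationType
# members in TensorRT-LLM may differ from these.
_ACTIVATION_NAMES = {
    "Swiglu": "swiglu",
    "Gelu": "gelu",
}

def _resolve_activation(activation_type) -> str:
    """Map an ActivationType member to the string the fused kernel expects."""
    name = getattr(activation_type, "name", str(activation_type))
    if name not in _ACTIVATION_NAMES:
        raise ValueError(
            f"MegaMoEDeepGemm does not support activation_type={name!r}; "
            f"supported: {sorted(_ACTIVATION_NAMES)}")
    return _ACTIVATION_NAMES[name]

# In __init__, replace the hardcoded default:
# self.activation = _resolve_activation(activation_type)
```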
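And a sketch of the runtime gate the second prompt suggests; the signature and return convention are assumptions. Note that the second commit on this PR later removes exactly this kind of probe because can_implement also runs at pytest collection time (see the commit message further down):

```python
import torch.distributed as dist

class MegaMoEDeepGemm:  # illustrative stub; the real class carries much more state
    @classmethod
    def can_implement(cls, *args, **kwargs) -> bool:
        # Check both: is_available() guards builds compiled without
        # distributed support, is_initialized() guards single-process runs,
        # so create_moe.get_moe_cls() falls back to e.g. CutlassFusedMoE.
        if not (dist.is_available() and dist.is_initialized()):
            return False  # requires torch.distributed available and initialized
        # ... remaining static capability checks (SM / dtype / quant / shape)
        return True
```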
In `@tensorrt_llm/_torch/modules/fused_moe/MOE_DEVELOPER_GUIDE.md`:
- Around line 62-67: Update the documentation to match the actual
implementation: change the FUSED_COMM / FusedCommMoEScheduler description in
MOE_DEVELOPER_GUIDE.md (and the repeated block at lines 206-213) to state that
FusedCommMoEScheduler._forward_chunk() triggers an EP-wide AllReduce via the
load balancer helper (it calls _load_balancer_update_statistic(...,
ignore_allreduce=False)), so EPLB statistics are AllReduced across ranks by that
helper rather than being reduced purely inside the backend kernel; reference the
class/method names (FusedCommMoEScheduler._forward_chunk and
_load_balancer_update_statistic) and update the sentence that currently claims
"EPLB stats AllReduced internally" to reflect the real AllReduce routing
performed by the load balancer helper.
In `@tensorrt_llm/_torch/modules/fused_moe/moe_scheduler.py`:
- Around line 84-94: The abstract method declaration for forward currently
places the ellipsis on the same line and triggers flake8 E704; in the class
defining forward (method name forward) break the body so the ellipsis (...) is
on its own indented line under the signature (i.e., keep the def forward(...)
header unchanged but move the trailing ellipsis onto the next line indented to
the same level as the method body) so the abstractmethod body is a separate
statement and lint will pass (see the first sketch after this file's prompts).
- Around line 237-248: The multi-chunk workspace sizing in
_prepare_workspaces_for_chunk incorrectly always uses moe.mapping.moe_ep_size *
max_tokens; change it to mirror the DeepEPLowLatency formula used in
_prepare_workspace_deepgemm by using num_slots * max_tokens for
DeepEPLowLatency/DeepGemm paths: detect the DeepEPLowLatency/DeepGemm case (same
condition or type check used in _prepare_workspace_deepgemm), compute
chunk_size_0 as moe.mapping.num_slots * max(all_rank_num_tokens_list[0])
(falling back to chunk_size_list[0] as before), and ensure the appended second
workspace (when use_multi_stream) uses the same corrected size; keep references
to _prepare_workspace_deepgemm(), _prepare_workspaces_for_chunk(),
DeepEPLowLatency and DeepGemmFusedMoE.run_moe() to locate the relevant logic.
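The E704 fix from the first moe_scheduler.py prompt is mechanical; a minimal illustration with a placeholder class and signature:

```python
from abc import ABC, abstractmethod

class MoEScheduler(ABC):  # placeholder; the real base class lives in moe_scheduler.py
    @abstractmethod
    def forward(self, x, router_logits):
        # flake8 E704 fires on `def forward(...): ...` written on one line;
        # moving the ellipsis onto its own indented line keeps the abstract
        # body empty while satisfying the linter.
        ...
```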
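For the multi-chunk sizing issue in the second prompt, a hedged sketch of the corrected computation; the attribute names come from the prompt itself, while the surrounding signature and the alloc_workspace helper are hypothetical:

```python
def _prepare_workspaces_for_chunk(moe, all_rank_num_tokens_list, chunk_size_list,
                                  is_deepep_low_latency, use_multi_stream,
                                  alloc_workspace):
    # Mirror _prepare_workspace_deepgemm: the DeepEPLowLatency / DeepGemm path
    # must size by num_slots * max_tokens, not moe.mapping.moe_ep_size * max_tokens.
    if is_deepep_low_latency:
        chunk_size_0 = moe.mapping.num_slots * max(all_rank_num_tokens_list[0])
    else:
        chunk_size_0 = chunk_size_list[0]
    workspaces = [alloc_workspace(chunk_size_0)]
    if use_multi_stream:
        # The overlapped second chunk needs the same corrected size.
        workspaces.append(alloc_workspace(chunk_size_0))
    return workspaces
```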
In `@tensorrt_llm/_torch/modules/fused_moe/quantization.py`:
- Around line 4900-4916: The current implementation leaves module._t_l1/_t_l2
cached after the first call to _transform_main_weights(), so subsequent
load_weights() calls reuse stale transformed tensors while the original raw
MXFP4 tensors remain resident; update load_weights() (or
_transform_main_weights()) to either clear/invalidate module._t_l1 and
module._t_l2 before re-transforming or, after obtaining the transformed pairs
from _transform_weights_for_mega_moe(), replace the original raw tensors
(module.w3_w1_weight, module.w3_w1_weight_scale, module.w2_weight,
module.w2_weight_scale) with the transformed single-source copies and free the
raw ones so there is only one authoritative weight representation and no stale
DG-form tensors remain (see the first sketch after this file's prompts).
- Around line 4702-4725: The copy_() call arguments in the block that writes to
dst_w3_w1_weight, dst_w3_w1_weight_scale, dst_w2_weight, and dst_w2_weight_scale
are using hanging indents that trigger Flake8 E126; rewrap each call so
continuations are aligned with the opening parenthesis (or place the first
argument on the same line as the function name) and indent subsequent wrapped
lines to the same column as the first argument, ensuring calls to
self._to_weight_device_uint8(...) and the non_blocking=True kwarg are properly
aligned for each copy_() invocation.
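A sketch of the cache-invalidation option from the first quantization.py prompt; the load_weights signature and the exact attribute handling are assumptions:

```python
def load_weights(self, module, weights):  # signature is an assumption
    # Invalidate cached DG-form tensors so a second load_weights() cannot
    # silently reuse stale transforms of the old raw MXFP4 weights.
    module._t_l1 = None
    module._t_l2 = None
    ...  # copy the new raw weights into module.* as before
    module._t_l1, module._t_l2 = self._transform_weights_for_mega_moe(module)
    # Alternative from the prompt: overwrite the raw w3_w1 / w2 tensors with
    # the transformed copies and free the originals, so only one
    # authoritative weight representation stays resident.
```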
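And a runnable toy showing the E126 rewrap from the second prompt, with stand-ins for the real tensors and helper:

```python
import torch

def _to_weight_device_uint8(t: torch.Tensor) -> torch.Tensor:
    # Stand-in for the real helper; just makes the example runnable.
    return t.to(torch.uint8)

dst_w2_weight = torch.empty(4, 4, dtype=torch.uint8)
src_w2_weight = torch.zeros(4, 4)

# Continuation aligned with the opening parenthesis, so Flake8's E126
# (continuation line over-indented for hanging indent) no longer fires.
dst_w2_weight.copy_(_to_weight_device_uint8(src_w2_weight),
                    non_blocking=True)
```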
In `@tests/integration/test_lists/test-db/l0_dgx_b300.yml`:
- Around line 42-44: Add the MegaMoE EPLB selector for the MEGAMOE_DEEPGEMM
backend to the B300 post-merge matrix by inserting a counterpart entry for the
existing multi-GPU test selector: add a line for
unittest/_torch/modules/moe/test_moe_module.py::test_configurable_moe_multi_gpu_eplb[...]
that matches the MEGAMOE_DEEPGEMM parameters (e.g., routing=DeepSeekV3 and
quant=W4A8_MXFP4_MXFP8) so the EPLB happy-path multi-GPU case is scheduled
alongside the current test_configurable_moe_multi_gpu[...] and
test_configurable_moe_single_gpu[...] selectors.
In `@tests/unittest/_torch/modules/moe/test_moe_module.py`:
- Around line 128-142: Update _ensure_dist_for_megamoe to signal when it created
the NCCL default process group by returning True if it initialized the group and
False otherwise, and in tests that call it (the test(s) using MegaMoE) wrap the
test body in a try/finally: call _ensure_dist_for_megamoe(...), store its
boolean result, run the existing test logic in try, and in finally call
torch.distributed.destroy_process_group() only when the helper returned True;
reference the helper function _ensure_dist_for_megamoe and
torch.distributed.destroy_process_group to locate where to add the return value
and the try/finally cleanup (see the first sketch after this file's prompts).
- Around line 121-125: The current _get_free_tcp_port() opens, binds and
immediately closes the socket, causing a TOCTOU race with
dist.init_process_group(); instead, change the reservation pattern so the socket
stays open until workers call dist.init_process_group() (or implement a
retry-on-EADDRINUSE loop). Concretely, modify _get_free_tcp_port to bind and
listen but return both the reserved socket and port (or provide a context
manager reserve_tcp_port that yields the (sock, port)), update callers in the
test to keep the returned socket open across the rendezvous and only close it
after dist.init_process_group() completes; alternatively implement a retry loop
around _get_free_tcp_port + init_process_group to detect EADDRINUSE and retry
with a new port. Ensure references to _get_free_tcp_port and
dist.init_process_group are updated accordingly.
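A sketch of the ownership-tracking pattern from the first test_moe_module.py prompt, assuming a single-process NCCL group as in the test helper:

```python
import torch.distributed as dist

def _ensure_dist_for_megamoe(port: int) -> bool:
    """Return True only when this call created the default process group."""
    if dist.is_available() and dist.is_initialized():
        return False
    dist.init_process_group(backend="nccl",
                            init_method=f"tcp://127.0.0.1:{port}",
                            rank=0, world_size=1)
    return True

def test_megamoe_case(port):  # hypothetical test body wrapper
    created = _ensure_dist_for_megamoe(port)
    try:
        ...  # existing MegaMoE test logic
    finally:
        # Only destroy the group this test created, never one owned by a launcher.
        if created:
            dist.destroy_process_group()
```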
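And a sketch of the retry alternative from the second prompt, restricted to the single-process case (a multi-rank rendezvous would additionally need all ranks to agree on the port):

```python
import socket
import torch.distributed as dist

def init_pg_with_retry(attempts: int = 5) -> int:
    """Retry the rendezvous if another process steals the reserved port."""
    last_err = None
    for _ in range(attempts):
        # Reserve an ephemeral port, release it just before the rendezvous;
        # the retry loop absorbs the remaining TOCTOU window.
        with socket.socket() as s:
            s.bind(("127.0.0.1", 0))
            port = s.getsockname()[1]
        try:
            dist.init_process_group(backend="nccl",
                                    init_method=f"tcp://127.0.0.1:{port}",
                                    rank=0, world_size=1)
            return port
        except RuntimeError as err:  # torch surfaces EADDRINUSE as RuntimeError
            if "address already in use" not in str(err).lower():
                raise
            last_err = err
    raise RuntimeError("no free TCP port for rendezvous") from last_err
```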
---
Outside diff comments:
In `@tensorrt_llm/_torch/modules/fused_moe/create_moe.py`:
- Around line 1-21: same missing-SPDX-header prompt as quoted in full above.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: c89a4ddd-c5ca-4bd5-8818-46677f54c14f
📒 Files selected for processing (17)

- cpp/tensorrt_llm/deep_gemm/CMakeLists.txt
- tensorrt_llm/_torch/modules/fused_moe/MOE_DEVELOPER_GUIDE.md
- tensorrt_llm/_torch/modules/fused_moe/communication/nvlink_one_sided.py
- tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py
- tensorrt_llm/_torch/modules/fused_moe/create_moe.py
- tensorrt_llm/_torch/modules/fused_moe/interface.py
- tensorrt_llm/_torch/modules/fused_moe/mega_moe/__init__.py
- tensorrt_llm/_torch/modules/fused_moe/mega_moe/backend.py
- tensorrt_llm/_torch/modules/fused_moe/mega_moe/mega_moe_deepgemm.py
- tensorrt_llm/_torch/modules/fused_moe/moe_scheduler.py
- tensorrt_llm/_torch/modules/fused_moe/quantization.py
- tests/integration/test_lists/test-db/l0_dgx_b200.yml
- tests/integration/test_lists/test-db/l0_dgx_b300.yml
- tests/unittest/_torch/modules/moe/moe_test_utils.py
- tests/unittest/_torch/modules/moe/quantize_utils.py
- tests/unittest/_torch/modules/moe/test_moe_backend.py
- tests/unittest/_torch/modules/moe/test_moe_module.py
💤 Files with no reviewable changes (1)
- tensorrt_llm/_torch/modules/fused_moe/mega_moe/backend.py
… integration

- Split ConfigurableMoE forward into MoEScheduler (ExternalComm / FusedComm)
- Refactor MegaMoEDeepGemm into ConfigurableMoE quant-method path
- Add EPLB support (incl. dynamic), multi-chunk execution, and CI coverage
- Defer DG NVLink SymmBuffer allocation from backend __init__ to create_weights so it sees ConfigurableMoE's synced num_slots / expert_size_per_partition (rather than the placeholder num_slots = num_experts seeded under init_load_balancer=False), keeping sym_buffer.num_experts == num_experts_per_rank * num_ranks under EPLB and matching the DeepGEMM mega.hpp contract
- Add EPLB unit test

Signed-off-by: xxi <xxi@nvidia.com>
Force-pushed from e72cd0e to 87aa974
/bot run --disable-fail-fast

PR_Github #47410 [ run ] triggered by Bot. Commit:

PR_Github #47410 [ run ] completed with state
…lement
`can_implement` is called from `get_quick_skip_reason` at pytest parametrize
collection time, long before the launcher initializes torch.distributed. The
prior `dist.is_initialized()` gate marked every MEGAMOE parametrize case as
"cannot implement" and dropped them in `iter_base_test_configs`, so
`pytest -k "MEGAMOE_DEEPGEMM"` collected 0/1180 items and exited 5 on
Jenkins L0_MergeRequest_PR #37337 (test_configurable_moe_{single,multi}_gpu).
`can_implement` should only answer static capability questions (SM / dtype /
quant / shape); whether an EP ProcessGroup is live is a runtime concern that
`__init__`'s `_resolve_ep_pg` already surfaces with a clear error. Removing
the probe restores MEGAMOE test collection and keeps the production failure
mode explicit instead of silently falling back.
Signed-off-by: xxi <xxi@nvidia.com>
/bot run --disable-fail-fast

PR_Github #47449 [ run ] triggered by Bot. Commit:
Summary

- Split ConfigurableMoE.forward into a dedicated MoEScheduler with ExternalComm and FusedComm kinds; ConfigurableMoE becomes a thin wrapper that owns DWDP recording and EPLB repeat_idx rotation.
- Refactor MegaMoEDeepGemm from a standalone backend (mega_moe/backend.py, deleted) into a ConfigurableMoE quant-method backend (mega_moe/mega_moe_deepgemm.py), reusing _BACKEND_SYNC_ATTRS for layer / EPLB state mirroring.
- Defer SymmBuffer allocation from backend __init__ to create_weights, so the buffer is sized after ConfigurableMoE syncs the real num_slots / expert_size_per_partition instead of the placeholder num_slots = num_experts seeded under init_load_balancer=False. This keeps sym_buffer.num_experts == num_experts_per_rank * num_ranks under EPLB and matches the DeepGEMM mega.hpp host assertion contract.
- Add EPLB CI coverage (test_configurable_moe_multi_gpu_eplb).

Test plan

- pytest tests/unittest/_torch/modules/moe/test_moe_module.py -k "MEGAMOE_DEEPGEMM and multi_gpu" -vs on GB200 / 4-GPU (DEP): PASSED multi_gpu[e8 Renormalize], multi_gpu[e8 DeepSeekV3], multi_gpu_eplb[e8_k2 slots=16 dynamic]; 2 SKIPPED (e256 OOM threshold); 0 FAILED; no mega.hpp assertion.
- /bot run to exercise the rest of the MoE / MegaMoE CI matrix.

Summary by CodeRabbit
New Features
Improvements
Documentation