Skip to content

Add UCC/NCCL alltoallv, deepEP v1/v2 and moe benchmark#891

Open
ybenvidia wants to merge 9 commits into
NVIDIA:mainfrom
ybenvidia:main
Open

Add UCC/NCCL alltoallv, deepEP v1/v2 and moe benchmark#891
ybenvidia wants to merge 9 commits into
NVIDIA:mainfrom
ybenvidia:main

Conversation

@ybenvidia
Copy link
Copy Markdown
Contributor

Add UCC/NCCL alltoallv vs DeepEP

  • add ucc alltoallv test
  • add nccl alltoallv test
  • add report generation for DeepEP vs NCCL vs UCC

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 14, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

Adds a MoE benchmark workload with Slurm launch/config wiring, reporting and SVG throughput reporter; refactors DeepEP cmd-args and Slurm strategy for v1/v2; discovers DeepEP outputs to drive NCCL/UCC matrix-enabled tests; updates configs, registration, and tests.

Changes

MoE Benchmark Integration Feature

Layer / File(s) Summary
MoE benchmark core definition
src/cloudai/workloads/moe_benchmark/moe_benchmark.py, src/cloudai/workloads/moe_benchmark/__init__.py
MoEBenchmarkCmdArgs and MoEBenchmarkTestDefinition added with Docker image handling and cmd_args serialization.
MoE Slurm command strategy
src/cloudai/workloads/moe_benchmark/slurm_command_gen_strategy.py
Generates SBATCH/head-node detection, writes config.yaml, mounts config/results, selects benchmark script by mode, and provides success check.
DeepEP output discovery helpers
src/cloudai/workloads/moe_benchmark/combined_report.py
Adds dependency-chain traversal, MoE benchmark root detection via matrix/stdout marker, and results.json discovery.
MoE report generation
src/cloudai/workloads/moe_benchmark/report_generation_strategy.py
New report strategy that finds MoE results.json files, validates/enriches entries, and emits CSV including bus bandwidth columns.
MoE throughput reporter
src/cloudai/workloads/moe_benchmark/throughput_reporter.py
Parses MoE/UCC/NCCL outputs and renders <scenario>-moe-throughput.svg summarizing mean bus bandwidth.
DeepEP schema refactor
src/cloudai/workloads/deepep/deepep.py
Replaces DeepEPCmdArgs with v1/v2 schema (roots, subtest_name, execution fields); adds container_runtime_root and container_runtime_root_path; removes cmd_args_dict.
DeepEP Slurm strategy refactor
src/cloudai/workloads/deepep/slurm_command_gen_strategy.py
Exports MASTER_ADDR/MASTER_PORT, writes env_vars.sh, selects script by subtest roots, whitelists CLI fields, removes DeepEP mounts, and broadens success-check patterns.
DeepEP export cleanup
src/cloudai/workloads/deepep/__init__.py
Removes DeepEPReportGenerationStrategy from public exports.
NCCL alltoallv field definitions
src/cloudai/workloads/nccl_test/nccl.py
Adds alltoallv_perf_mpi/alltoallv_perf literals; adds use_deepep_matrix and alltoallv_matrix_container_path.
NCCL Slurm DeepEP matrix integration
src/cloudai/workloads/nccl_test/slurm_command_gen_strategy.py
Discovers nccl_matrix.txt under DeepEP outputs, mounts matrix+prev-output, injects ALLTOALLV_MATRIX_FILE, and special-cases alltoallv_perf_mpi command.
UCC matrix field and integration
src/cloudai/workloads/ucc_test/ucc.py, src/cloudai/workloads/ucc_test/slurm_command_gen_strategy.py
Adds use_deepep_matrix to UCCCmdArgs; resolves/mounts ucc_matrix.txt and appends --gen file:name=... when applicable.
Configuration files
conf/experimental/test/*
Updates deepep_standard tokens; adds multiple DeepEP and MoE TOML configs; updates NCCL/UCC configs for matrix mode; adds scenario TOMLs linking tests.
Registration and wiring
src/cloudai/registration.py
Wires MoE benchmark imports, registers MoEBenchmarkTestDefinition, Slurm strategy, report generation strategy, and enables moe_benchmark_throughput reporter.
Test suite and ref data
tests/*, tests/ref_data/*
Adds moe-benchmark test cases and expectations, updates registry tests, and adds moe-benchmark.sbatch reference.
Cleanup
.gitignore
Ignores slurm-* artifacts.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Suggested labels

enhancement

Suggested reviewers

  • srivatsankrishnan
  • jeffnvidia

Poem

🐰 In racks where experts mix and play,
I chased the matrices through night and day.
From DeepEP roots to MoE's bright chart,
I stitched the outputs, piece by part.
Now throughput blooms — a rabbit's art!

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Description check ❓ Inconclusive The description accurately mentions adding UCC and NCCL alltoallv tests with report generation for comparison; however, it does not mention the substantial addition of a new MoE benchmark workload which constitutes a major part of this PR. Update the description to include the MoE benchmark workload addition, such as: 'Adds UCC/NCCL alltoallv tests, DeepEP v1/v2 configurations, and a new MoE benchmark workload with report generation for cross-benchmark comparison.'
✅ Passed checks (3 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The title accurately reflects the main changes: addition of UCC/NCCL alltoallv tests and deepEP v1/v2 support plus MoE benchmark, which are the primary features across the changeset.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/cloudai/workloads/deepep/slurm_command_gen_strategy.py (1)

28-134: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix Ruff formatting for this file before merge.

CI is failing on ruff-format for this file; please run formatting and commit the normalized output to unblock the pipeline.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cloudai/workloads/deepep/slurm_command_gen_strategy.py` around lines 28 -
134, The file fails ruff-format checks; run the ruff formatter and commit the
normalized file so CI passes. Specifically, run your project's ruff/formatter
(e.g., ruff format) on the module containing methods like
_append_head_node_detection, _append_sbatch_directives, _container_mounts,
_generate_config_yaml and gen_srun_success_check, review the changes for any
unintended edits, and commit the formatted file to the branch.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cloudai/workloads/deepep/deepep_moe_throughput_reporter.py`:
- Line 15: The import of Reporter in deepep_moe_throughput_reporter.py currently
uses the private path cloudai._core.base_reporter; replace that with the allowed
public reporter interface used by other workloads (import the same Reporter
symbol from the public reporter module instead) so the file imports Reporter
from the public/allowed package rather than cloudai._core.base_reporter.
- Around line 23-25: The function _deepep_dispatch_combine_bars is too complex
and triggers C901 and has an unused loop variable causing B007; split it into
small helpers (e.g., deepep_results_json_files -> leave, add helpers like
_load_latest_results(path: Path) -> dict,
_extract_dispatch_combine_rows(results: dict) -> Iterable[dict], and
_row_to_bar(row: dict) -> tuple[str, float, str]) and have
_deepep_dispatch_combine_bars orchestrate these helpers to reduce cyclomatic
complexity; also rename any unused loop variable "lab" to "_lab" (or prefix with
underscore) to satisfy B007, and finally run ruff format to fix formatting
issues.

In `@src/cloudai/workloads/nccl_test/slurm_command_gen_strategy.py`:
- Around line 51-58: If use_deepep_matrix is true, don’t silently return None
when deepep_benchmark_root or the nccl matrix file is missing; instead fail fast
by raising a clear exception (e.g., RuntimeError/ValueError) from
_deepep_nccl_matrix_host_path that includes context (test/run id and that
use_deepep_matrix was requested) so the caller will stop the run; apply the same
change to the analogous methods referenced in the comment (the other
deepep-related path/resolution methods around lines 62-67 and 81-83, such as the
container-path resolver), and use deepep_benchmark_root and
_nccl_matrix_path_under_deepep_output in the error message to make the root
cause obvious.
- Around line 95-112: The alltoallv_perf_mpi branch only appends the fixed flags
and misses forwarding other configured NCCL options; update the branch in
slurm_command_gen_strategy.py (the block using _NCCL_TESTS_ALLTOALLV_PERF,
tdef.cmd_args and _nccl_cmd_scalar) to conditionally append the additional flags
present on tdef.cmd_args (e.g., stepfactor, nthreads, check, blocking and any
other optional fields) in the same way you handle minbytes/maxbytes/ngpus (use
_nccl_cmd_scalar where appropriate and add boolean flags when true), and keep
the existing inclusion of self.test_run.test.extra_args_str when
test.extra_cmd_args is set so TOML-configured options are not dropped.

In `@src/cloudai/workloads/ucc_test/slurm_command_gen_strategy.py`:
- Around line 55-64: If tdef.cmd_args.use_deepep_matrix is true, do not silently
return []; instead detect missing matrix by checking
deepep_benchmark_root(self.test_run) and self._deepep_ucc_matrix_host_path() and
raise a clear exception (or call fail-fast helper) when either is None and there
is no manual generation option provided (e.g. no tdef.cmd_args.gen or equivalent
flag). Update the logic in the block guarded by use_deepep_matrix (and the
analogous block later around the other check) to validate deepep_root and
matrix_host and raise an informative error mentioning the missing DeepEP matrix
and required --gen/manual generation flag rather than falling back to running
without a matrix.

---

Outside diff comments:
In `@src/cloudai/workloads/deepep/slurm_command_gen_strategy.py`:
- Around line 28-134: The file fails ruff-format checks; run the ruff formatter
and commit the normalized file so CI passes. Specifically, run your project's
ruff/formatter (e.g., ruff format) on the module containing methods like
_append_head_node_detection, _append_sbatch_directives, _container_mounts,
_generate_config_yaml and gen_srun_success_check, review the changes for any
unintended edits, and commit the formatted file to the branch.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 0416a4db-8f59-4498-afdb-cfa1d0fa7cce

📥 Commits

Reviewing files that changed from the base of the PR and between 2d672e1 and 12f6742.

📒 Files selected for processing (15)
  • conf/experimental/test/deepep_standard.toml
  • conf/experimental/test/nccl_test_alltoallv.toml
  • conf/experimental/test_scenario/deepep_with_nccl_alltoallv.toml
  • conf/experimental/test_scenario/deepep_with_ucc_alltoallv.toml
  • src/cloudai/registration.py
  • src/cloudai/workloads/deepep/__init__.py
  • src/cloudai/workloads/deepep/deepep.py
  • src/cloudai/workloads/deepep/deepep_combined_report.py
  • src/cloudai/workloads/deepep/deepep_moe_throughput_reporter.py
  • src/cloudai/workloads/deepep/report_generation_strategy.py
  • src/cloudai/workloads/deepep/slurm_command_gen_strategy.py
  • src/cloudai/workloads/nccl_test/nccl.py
  • src/cloudai/workloads/nccl_test/slurm_command_gen_strategy.py
  • src/cloudai/workloads/ucc_test/slurm_command_gen_strategy.py
  • src/cloudai/workloads/ucc_test/ucc.py

Comment thread src/cloudai/workloads/moe_benchmark/throughput_reporter.py Outdated
Comment on lines +23 to +25
def _deepep_dispatch_combine_bars(test_output: Path) -> list[tuple[str, float, str]]:
"""From latest ``results.json``: one bar per ``dispatch`` / ``combine`` row (``bus_bw_avg``)."""
paths = deepep_results_json_files(test_output)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Resolve Ruff blockers in this new reporter file.

CI is red on C901 (_deepep_dispatch_combine_bars complexity), B007 (unused lab), and formatting. Please split _deepep_dispatch_combine_bars into smaller helpers, rename unused loop vars (e.g., _lab), and run ruff format.

Also applies to: 164-166

🧰 Tools
🪛 GitHub Actions: CI / 2_Linting.txt

[error] 23-25: Ruff error (C901): _deepep_dispatch_combine_bars is too complex (11 > 10).

🪛 GitHub Actions: CI / Linting

[error] 23-25: Ruff check (C901): _deepep_dispatch_combine_bars is too complex (11 > 10).


[error] Ruff format check failed for this file (hook 'ruff-format' reformatted files, causing the hook to fail).


[error] import-linter contract broken: cloudai.workloads is not allowed to import cloudai._core (offending dependency: cloudai.workloads.deepep.deepep_moe_throughput_reporter -> cloudai._core.base_reporter).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cloudai/workloads/deepep/deepep_moe_throughput_reporter.py` around lines
23 - 25, The function _deepep_dispatch_combine_bars is too complex and triggers
C901 and has an unused loop variable causing B007; split it into small helpers
(e.g., deepep_results_json_files -> leave, add helpers like
_load_latest_results(path: Path) -> dict,
_extract_dispatch_combine_rows(results: dict) -> Iterable[dict], and
_row_to_bar(row: dict) -> tuple[str, float, str]) and have
_deepep_dispatch_combine_bars orchestrate these helpers to reduce cyclomatic
complexity; also rename any unused loop variable "lab" to "_lab" (or prefix with
underscore) to satisfy B007, and finally run ruff format to fix formatting
issues.

Comment on lines +51 to +58
def _deepep_nccl_matrix_host_path(self) -> Path | None:
tdef: NCCLTestDefinition = cast(NCCLTestDefinition, self.test_run.test)
if not tdef.cmd_args.use_deepep_matrix:
return None
root = deepep_benchmark_root(self.test_run)
if root is None:
return None
return _nccl_matrix_path_under_deepep_output(root)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast when use_deepep_matrix is enabled but matrix discovery fails.

Current flow silently skips mounts/env var when the DeepEP dependency or nccl_matrix.txt is missing, so the run proceeds without the requested matrix input and can produce misleading comparison data.

💡 Suggested fix
 def _deepep_nccl_matrix_host_path(self) -> Path | None:
     tdef: NCCLTestDefinition = cast(NCCLTestDefinition, self.test_run.test)
     if not tdef.cmd_args.use_deepep_matrix:
         return None
     root = deepep_benchmark_root(self.test_run)
-    if root is None:
-        return None
-    return _nccl_matrix_path_under_deepep_output(root)
+    if root is None:
+        raise FileNotFoundError(
+            "use_deepep_matrix=true but no start_post_comp DeepEP output was found"
+        )
+    matrix = _nccl_matrix_path_under_deepep_output(root)
+    if matrix is None:
+        raise FileNotFoundError(
+            f"use_deepep_matrix=true but nccl_matrix.txt was not found under {root}"
+        )
+    return matrix

Also applies to: 62-67, 81-83

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cloudai/workloads/nccl_test/slurm_command_gen_strategy.py` around lines
51 - 58, If use_deepep_matrix is true, don’t silently return None when
deepep_benchmark_root or the nccl matrix file is missing; instead fail fast by
raising a clear exception (e.g., RuntimeError/ValueError) from
_deepep_nccl_matrix_host_path that includes context (test/run id and that
use_deepep_matrix was requested) so the caller will stop the run; apply the same
change to the analogous methods referenced in the comment (the other
deepep-related path/resolution methods around lines 62-67 and 81-83, such as the
container-path resolver), and use deepep_benchmark_root and
_nccl_matrix_path_under_deepep_output in the error message to make the root
cause obvious.

Comment on lines +95 to +112
if tdef.cmd_args.subtest_name == "alltoallv_perf_mpi":
a = tdef.cmd_args
parts: List[str] = [
_NCCL_TESTS_ALLTOALLV_PERF,
"-b",
str(_nccl_cmd_scalar(a.minbytes)),
"-e",
str(_nccl_cmd_scalar(a.maxbytes)),
"-g",
str(_nccl_cmd_scalar(a.ngpus)),
"-w",
str(_nccl_cmd_scalar(a.warmup_iters)),
"-n",
str(_nccl_cmd_scalar(a.iters)),
]
if self.test_run.test.extra_cmd_args:
parts.append(self.test_run.test.extra_args_str)
return parts
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Dedicated alltoallv_perf_mpi path drops configured NCCL options.

This branch only forwards -b/-e/-g/-w/-n and ignores other configured fields (e.g. stepfactor, nthreads, check, blocking), so TOML values are silently ineffective for this subtest.

💡 Suggested fix
         if tdef.cmd_args.subtest_name == "alltoallv_perf_mpi":
             a = tdef.cmd_args
             parts: List[str] = [
                 _NCCL_TESTS_ALLTOALLV_PERF,
                 "-b",
                 str(_nccl_cmd_scalar(a.minbytes)),
                 "-e",
                 str(_nccl_cmd_scalar(a.maxbytes)),
                 "-g",
                 str(_nccl_cmd_scalar(a.ngpus)),
                 "-w",
                 str(_nccl_cmd_scalar(a.warmup_iters)),
                 "-n",
                 str(_nccl_cmd_scalar(a.iters)),
             ]
+            optional_flags = [
+                ("-f", a.stepfactor),
+                ("-t", a.nthreads),
+                ("-c", a.check),
+                ("-z", a.blocking),
+            ]
+            for flag, raw_value in optional_flags:
+                value = _nccl_cmd_scalar(raw_value)
+                if value is not None:
+                    parts.extend([flag, str(value)])
             if self.test_run.test.extra_cmd_args:
                 parts.append(self.test_run.test.extra_args_str)
             return parts
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if tdef.cmd_args.subtest_name == "alltoallv_perf_mpi":
a = tdef.cmd_args
parts: List[str] = [
_NCCL_TESTS_ALLTOALLV_PERF,
"-b",
str(_nccl_cmd_scalar(a.minbytes)),
"-e",
str(_nccl_cmd_scalar(a.maxbytes)),
"-g",
str(_nccl_cmd_scalar(a.ngpus)),
"-w",
str(_nccl_cmd_scalar(a.warmup_iters)),
"-n",
str(_nccl_cmd_scalar(a.iters)),
]
if self.test_run.test.extra_cmd_args:
parts.append(self.test_run.test.extra_args_str)
return parts
if tdef.cmd_args.subtest_name == "alltoallv_perf_mpi":
a = tdef.cmd_args
parts: List[str] = [
_NCCL_TESTS_ALLTOALLV_PERF,
"-b",
str(_nccl_cmd_scalar(a.minbytes)),
"-e",
str(_nccl_cmd_scalar(a.maxbytes)),
"-g",
str(_nccl_cmd_scalar(a.ngpus)),
"-w",
str(_nccl_cmd_scalar(a.warmup_iters)),
"-n",
str(_nccl_cmd_scalar(a.iters)),
]
optional_flags = [
("-f", a.stepfactor),
("-t", a.nthreads),
("-c", a.check),
("-z", a.blocking),
]
for flag, raw_value in optional_flags:
value = _nccl_cmd_scalar(raw_value)
if value is not None:
parts.extend([flag, str(value)])
if self.test_run.test.extra_cmd_args:
parts.append(self.test_run.test.extra_args_str)
return parts
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cloudai/workloads/nccl_test/slurm_command_gen_strategy.py` around lines
95 - 112, The alltoallv_perf_mpi branch only appends the fixed flags and misses
forwarding other configured NCCL options; update the branch in
slurm_command_gen_strategy.py (the block using _NCCL_TESTS_ALLTOALLV_PERF,
tdef.cmd_args and _nccl_cmd_scalar) to conditionally append the additional flags
present on tdef.cmd_args (e.g., stepfactor, nthreads, check, blocking and any
other optional fields) in the same way you handle minbytes/maxbytes/ngpus (use
_nccl_cmd_scalar where appropriate and add boolean flags when true), and keep
the existing inclusion of self.test_run.test.extra_args_str when
test.extra_cmd_args is set so TOML-configured options are not dropped.

Comment on lines +55 to +64
if not tdef.cmd_args.use_deepep_matrix:
return []

deepep_root = deepep_benchmark_root(self.test_run)
if deepep_root is None:
return []

matrix_host = self._deepep_ucc_matrix_host_path()
if matrix_host is None:
return []
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast when use_deepep_matrix is enabled but the matrix is missing.

Current flow silently falls back (no mount + no --gen) and can run a non-matrix UCC test while the scenario claims DeepEP-matrix-driven comparison. Please raise an explicit error when matrix resolution fails and no manual gen is provided.

Suggested fix
 def _container_mounts(self) -> List[str]:
     tdef: UCCTestDefinition = cast(UCCTestDefinition, self.test_run.test)
     if not tdef.cmd_args.use_deepep_matrix:
         return []
@@
     matrix_host = self._deepep_ucc_matrix_host_path()
-    if matrix_host is None:
-        return []
+    if matrix_host is None:
+        raise FileNotFoundError(
+            "use_deepep_matrix=true but no ucc_matrix.txt was found in DeepEP outputs"
+        )
@@
 def generate_test_command(self) -> List[str]:
@@
-    elif self._deepep_ucc_matrix_host_path() is not None:
-        srun_command_parts.append(f"--gen file:name={_UCC_GEN_MATRIX_CONTAINER}")
+    elif tdef_cmd_args.use_deepep_matrix:
+        matrix_host = self._deepep_ucc_matrix_host_path()
+        if matrix_host is None:
+            raise FileNotFoundError(
+                "use_deepep_matrix=true but no ucc_matrix.txt was found in DeepEP outputs"
+            )
+        srun_command_parts.append(f"--gen file:name={_UCC_GEN_MATRIX_CONTAINER}")

Also applies to: 85-86

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cloudai/workloads/ucc_test/slurm_command_gen_strategy.py` around lines 55
- 64, If tdef.cmd_args.use_deepep_matrix is true, do not silently return [];
instead detect missing matrix by checking deepep_benchmark_root(self.test_run)
and self._deepep_ucc_matrix_host_path() and raise a clear exception (or call
fail-fast helper) when either is None and there is no manual generation option
provided (e.g. no tdef.cmd_args.gen or equivalent flag). Update the logic in the
block guarded by use_deepep_matrix (and the analogous block later around the
other check) to validate deepep_root and matrix_host and raise an informative
error mentioning the missing DeepEP matrix and required --gen/manual generation
flag rather than falling back to running without a matrix.

@podkidyshev
Copy link
Copy Markdown
Contributor

podkidyshev commented May 15, 2026

please resolve coderabbit comments and make CI pass before review

ping me again directly once it's done please

@ybenvidia ybenvidia changed the title Add UCC/NCCL alltoallv vs DeepEP Add UCC/NCCL alltoallv, deepEP va/v2 and moe benchmark Jun 2, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
tests/test_init.py (1)

241-241: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Update test definition count assertion.

The test expects 26 test definitions, but with the addition of MoEBenchmarkTestDefinition, there are now 27. Please update the assertion to reflect the correct count.

🔧 Proposed fix
-    assert len(test_defs) == 26
+    assert len(test_defs) == 27
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_init.py` at line 241, Update the assertion that checks the number
of test definitions from 26 to the correct count to include the newly added
MoEBenchmarkTestDefinition; locate the assertion referencing test_defs (assert
len(test_defs) == 26) in tests/test_init.py and change the expected value to 27
so the test reflects the added MoEBenchmarkTestDefinition.
src/cloudai/workloads/moe_benchmark/throughput_reporter.py (1)

23-57: 🧹 Nitpick | 🔵 Trivial | ⚖️ Poor tradeoff

Consider reducing function complexity.

The function _moe_benchmark_dispatch_combine_bars has a cyclomatic complexity of 11, exceeding the threshold of 10. Consider extracting the row-filtering and value-extraction logic into smaller helper functions.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cloudai/workloads/moe_benchmark/throughput_reporter.py` around lines 23 -
57, The function _moe_benchmark_dispatch_combine_bars is too complex; extract
the row-filtering/value-extraction and file-reading into small helpers to reduce
cyclomatic complexity. Create a helper like _read_latest_results(test_output) to
return parsed rows (or [] on error) and another helper like _extract_bus_bw(row)
that validates a row dict, ensures "operation" is a str and contains
"bus_bw_avg", normalizes operation to lowercase and returns (op_l,
float(bus_bw_avg)) or None on failure; then simplify
_moe_benchmark_dispatch_combine_bars to call _read_latest_results and iterate
the rows using _extract_bus_bw, populate by_op and build the out list (keep
existing tuple labels "MoE dispatch"/"MoE combine" and colors).
tests/test_acceptance.py (1)

721-721: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Missing reference sbatch file for moe-benchmark test.

The test test_sbatch_generation[moe-benchmark] expects a reference file at tests/ref_data/moe-benchmark.sbatch, but the file does not exist. This causes test failures. Please add the expected reference sbatch file for the moe-benchmark workload.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_acceptance.py` at line 721, The test
test_sbatch_generation[moe-benchmark] fails because the expected reference
sbatch file is missing; add a new reference file named moe-benchmark.sbatch
under the tests/ref_data directory (the path constructed by
Path(__file__).parent / "ref_data" / test_req[1] in tests/test_acceptance.py)
containing the expected sbatch output for the moe-benchmark workload so the
assertion comparing generated SBATCH to the reference succeeds. Ensure the file
name exactly matches "moe-benchmark.sbatch" and the contents match the generator
output used by the test.
conf/experimental/test/moe_benchmark_standard.toml (1)

34-36: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Remove trailing blank lines.

The pre-commit hook end-of-file-fixer modified this file by removing trailing blank lines. Remove lines 34-36 to fix the CI failure.

🐛 Proposed fix
 [extra_env_vars]
 NUM_QPS_PER_RANK = "12"
 NUM_SMS = "24"
-
-
-
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@conf/experimental/test/moe_benchmark_standard.toml` around lines 34 - 36,
Remove the trailing blank lines at the end of
conf/experimental/test/moe_benchmark_standard.toml so the file ends immediately
after its last TOML entry (delete the extra blank lines at EOF that were added);
ensure there is no extra newline-only lines after the final content so the
end-of-file-fixer pre-commit hook is satisfied.
♻️ Duplicate comments (1)
src/cloudai/workloads/moe_benchmark/throughput_reporter.py (1)

15-15: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Fix the layering violation by using the public Reporter import path.

The import from cloudai._core.base_reporter violates the import-linter contract (cloudai.workloads must not import cloudai._core). Use the public Reporter interface instead—likely from cloudai.reporter import Reporter or the appropriate public module.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cloudai/workloads/moe_benchmark/throughput_reporter.py` at line 15, The
file imports Reporter from the private module cloudai._core.base_reporter which
breaks layering rules; update the import to use the public Reporter export
(e.g., replace "from cloudai._core.base_reporter import Reporter" with the
public import such as "from cloudai.reporter import Reporter" or the correct
public module that exposes Reporter) so that throughput_reporter.py only depends
on the public API.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@conf/experimental/test_scenario/moe_benchmark.toml`:
- Around line 22-29: The Tests chain currently makes Tests.nccl_alltoallv depend
on Tests.ucc_alltoallv (serial), causing NCCL to wait for UCC; if both only
require the MoE output, change the dependency of Tests.ucc_alltoallv and
Tests.nccl_alltoallv to depend directly on Tests.moe_benchmark (replace the
dependency id in the [[Tests.dependencies]] block for both Tests.ucc_alltoallv
and Tests.nccl_alltoallv to "Tests.moe_benchmark") so they start in parallel
after the MoE benchmark; if the serial ordering was intentional for resource or
debugging reasons, confirm and leave as-is.

In `@conf/experimental/test/nccl_test_alltoallv.toml`:
- Line 37: The file ends with the line 'NCCL_SHM_DISABLE = "1"' but is missing a
trailing newline; add a single newline character after the NCCL_SHM_DISABLE line
(i.e., ensure the file ends with a newline) so the end-of-file-fixer pre-commit
check passes and CI succeeds.

In `@src/cloudai/registration.py`:
- Line 321: The call to Registry().add_scenario_report(...) exceeds the 120-char
limit; reformat the invocation for MoEBenchmarkThroughputReporter and
ReportConfig(enable=True) so the line is <=120 chars (for example, break the
arguments across multiple lines or assign arguments to temporary variables)
while keeping the same call to Registry().add_scenario_report with the same
symbols: Registry, add_scenario_report, MoEBenchmarkThroughputReporter, and
ReportConfig.

In `@src/cloudai/workloads/deepep/slurm_command_gen_strategy.py`:
- Around line 182-189: generate_test_command currently hardcodes "torchrun
--nproc_per_node=1" and ignores DeepEPCmdArgs.python_executable; compute the
per-node process count from cmd_args.num_processes and num_nodes (e.g., per_node
= max(1, int(cmd_args.num_processes / num_nodes)) and validate evenly divisible
or handle remainder), then set "--nproc_per_node={per_node}" instead of the
fixed 1, and prepend cmd_args.python_executable when building the launcher
invocation (use cmd_args.python_executable + " " + self._script_path(cmd_args)
or equivalent) so generate_test_command and the launcher respect
DeepEPCmdArgs.num_processes and python_executable.
- Around line 169-170: The code calls _append_cli_field() which references an
undefined _BOOL_VALUE_FIELDS causing a NameError; define or import
_BOOL_VALUE_FIELDS (e.g., the set/list of field names treated as boolean flags)
and update _append_cli_field to use it, or replace the check with explicit
boolean-type handling (refer to the function name _append_cli_field). Also
update generate_test_command() to construct the launcher from DeepEPCmdArgs: use
DeepEPCmdArgs.python_executable instead of hardcoding "torchrun" (or document
and remove the unused field) and pass DeepEPCmdArgs.num_processes to the
launcher (replace the hardcoded "--nproc_per_node=1"), ensuring the launcher
flags match existing cmd arg names so DeepEP launcher aligns with the command
arguments.

In `@src/cloudai/workloads/moe_benchmark/slurm_command_gen_strategy.py`:
- Around line 60-64: Replace the simple if-else assignment for script_name with
a ternary expression to satisfy ruff SIM108: where the code currently checks
cmd_args.mode and sets script_name, change it to a single-line assignment using
"benchmark.py" if cmd_args.mode == "standard" else "benchmark_ll.py" (update the
assignment around the script_name variable).
- Line 69: The _generate_config_yaml method declares an unused parameter
cmd_args; remove the cmd_args parameter from the _generate_config_yaml signature
and all internal references (it isn't used because tdef = self.test_run.test and
tdef.cmd_args_dict supply the needed data), then update the caller that invokes
_generate_config_yaml (the callsite that currently passes a MoEBenchmarkCmdArgs)
to stop passing that argument so the call matches the new signature; ensure no
other callsites reference the removed parameter and run tests/lint to confirm.

In `@src/cloudai/workloads/moe_benchmark/throughput_reporter.py`:
- Line 159: The for-loop that zips centers, values, colors, and labels uses an
unused loop variable named `lab`; update that variable to `_lab` in the loop
header (for cx, val, col, _lab in zip(centers, values, colors, labels,
strict=True)) to signal it's intentionally unused while preserving the zip
ordering—locate this change in the loop inside throughput_reporter.py where
`centers`, `values`, `colors`, and `labels` are iterated.
- Line 157: The SVG f-string in throughput_reporter.py (e.g., the parts.append
call that builds the <text ...>{gv:.1f}</text>) exceeds the 120-char line
length; refactor these SVG generation lines (including the similar ones at the
locations around lines 169-171 and 173) by splitting the long f-strings into
smaller parts or intermediate variables (e.g., compute x_attr, y_attr, attrs,
and text_content separately) and then join/concatenate them in a short
parts.append call so each source line stays under the 120-character limit while
preserving the same output and formatting.

In `@tests/test_test_scenario.py`:
- Around line 47-50: Remove the trailing whitespace after the imported symbol
MoEBenchmarkReportGenerationStrategy in the import statement that also includes
MoEBenchmarkTestDefinition; edit the import line so it reads without any extra
spaces after MoEBenchmarkReportGenerationStrategy, preserving the same symbols
and line breaks to satisfy the pre-commit hook.

---

Outside diff comments:
In `@conf/experimental/test/moe_benchmark_standard.toml`:
- Around line 34-36: Remove the trailing blank lines at the end of
conf/experimental/test/moe_benchmark_standard.toml so the file ends immediately
after its last TOML entry (delete the extra blank lines at EOF that were added);
ensure there is no extra newline-only lines after the final content so the
end-of-file-fixer pre-commit hook is satisfied.

In `@src/cloudai/workloads/moe_benchmark/throughput_reporter.py`:
- Around line 23-57: The function _moe_benchmark_dispatch_combine_bars is too
complex; extract the row-filtering/value-extraction and file-reading into small
helpers to reduce cyclomatic complexity. Create a helper like
_read_latest_results(test_output) to return parsed rows (or [] on error) and
another helper like _extract_bus_bw(row) that validates a row dict, ensures
"operation" is a str and contains "bus_bw_avg", normalizes operation to
lowercase and returns (op_l, float(bus_bw_avg)) or None on failure; then
simplify _moe_benchmark_dispatch_combine_bars to call _read_latest_results and
iterate the rows using _extract_bus_bw, populate by_op and build the out list
(keep existing tuple labels "MoE dispatch"/"MoE combine" and colors).

In `@tests/test_acceptance.py`:
- Line 721: The test test_sbatch_generation[moe-benchmark] fails because the
expected reference sbatch file is missing; add a new reference file named
moe-benchmark.sbatch under the tests/ref_data directory (the path constructed by
Path(__file__).parent / "ref_data" / test_req[1] in tests/test_acceptance.py)
containing the expected sbatch output for the moe-benchmark workload so the
assertion comparing generated SBATCH to the reference succeeds. Ensure the file
name exactly matches "moe-benchmark.sbatch" and the contents match the generator
output used by the test.

In `@tests/test_init.py`:
- Line 241: Update the assertion that checks the number of test definitions from
26 to the correct count to include the newly added MoEBenchmarkTestDefinition;
locate the assertion referencing test_defs (assert len(test_defs) == 26) in
tests/test_init.py and change the expected value to 27 so the test reflects the
added MoEBenchmarkTestDefinition.

---

Duplicate comments:
In `@src/cloudai/workloads/moe_benchmark/throughput_reporter.py`:
- Line 15: The file imports Reporter from the private module
cloudai._core.base_reporter which breaks layering rules; update the import to
use the public Reporter export (e.g., replace "from cloudai._core.base_reporter
import Reporter" with the public import such as "from cloudai.reporter import
Reporter" or the correct public module that exposes Reporter) so that
throughput_reporter.py only depends on the public API.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 25bf978f-ad63-4710-8250-1199086f3041

📥 Commits

Reviewing files that changed from the base of the PR and between 3a743d3 and 72a6f84.

📒 Files selected for processing (28)
  • .gitignore
  • conf/experimental/test/deepep_low_latency.toml
  • conf/experimental/test/deepep_standard.toml
  • conf/experimental/test/deepep_test_ep_v2.toml
  • conf/experimental/test/deepep_test_internode.toml
  • conf/experimental/test/deepep_test_intranode.toml
  • conf/experimental/test/deepep_test_low_latency.toml
  • conf/experimental/test/moe_benchmark_low_latency.toml
  • conf/experimental/test/moe_benchmark_standard.toml
  • conf/experimental/test/nccl_test_alltoallv.toml
  • conf/experimental/test/ucc_alltoallv_deepep.toml
  • conf/experimental/test_scenario/deepep_official.toml
  • conf/experimental/test_scenario/moe_benchmark.toml
  • src/cloudai/registration.py
  • src/cloudai/workloads/deepep/__init__.py
  • src/cloudai/workloads/deepep/deepep.py
  • src/cloudai/workloads/deepep/slurm_command_gen_strategy.py
  • src/cloudai/workloads/moe_benchmark/__init__.py
  • src/cloudai/workloads/moe_benchmark/combined_report.py
  • src/cloudai/workloads/moe_benchmark/moe_benchmark.py
  • src/cloudai/workloads/moe_benchmark/report_generation_strategy.py
  • src/cloudai/workloads/moe_benchmark/slurm_command_gen_strategy.py
  • src/cloudai/workloads/moe_benchmark/throughput_reporter.py
  • src/cloudai/workloads/nccl_test/slurm_command_gen_strategy.py
  • src/cloudai/workloads/ucc_test/slurm_command_gen_strategy.py
  • tests/test_acceptance.py
  • tests/test_init.py
  • tests/test_test_scenario.py
💤 Files with no reviewable changes (3)
  • conf/experimental/test/deepep_low_latency.toml
  • conf/experimental/test/deepep_standard.toml
  • src/cloudai/workloads/deepep/init.py

Comment on lines +22 to +29
[[Tests]]
id = "Tests.nccl_alltoallv"
test_name = "nccl_test_alltoallv"
num_nodes = 2
time_limit = "00:30:00"
[[Tests.dependencies]]
type = "start_post_comp"
id = "Tests.ucc_alltoallv"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial | ⚖️ Poor tradeoff

Verify the dependency chain design: serial vs. parallel execution.

The current configuration chains tests serially: moe_benchmarkucc_alltoallvnccl_alltoallv. This means NCCL test waits for UCC test to complete, even though both tests independently consume matrices from the MoE benchmark output (as indicated by layers 10 and 12 in the stack context).

If both UCC and NCCL tests only need the MoE benchmark output and don't depend on each other, consider having both depend directly on Tests.moe_benchmark:

[[Tests]]
id = "Tests.ucc_alltoallv"
test_name = "ucc_alltoallv_deepep"
num_nodes = 2
time_limit = "00:30:00"
  [[Tests.dependencies]]
  type = "start_post_comp"
  id = "Tests.moe_benchmark"

[[Tests]]
id = "Tests.nccl_alltoallv"
test_name = "nccl_test_alltoallv"
num_nodes = 2
time_limit = "00:30:00"
  [[Tests.dependencies]]
  type = "start_post_comp"
  id = "Tests.moe_benchmark"  # Changed from Tests.ucc_alltoallv

This would allow UCC and NCCL tests to run in parallel after MoE benchmark completes, reducing total scenario runtime. The current serial chain may be intentional for resource management or debugging—please confirm this design choice.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@conf/experimental/test_scenario/moe_benchmark.toml` around lines 22 - 29, The
Tests chain currently makes Tests.nccl_alltoallv depend on Tests.ucc_alltoallv
(serial), causing NCCL to wait for UCC; if both only require the MoE output,
change the dependency of Tests.ucc_alltoallv and Tests.nccl_alltoallv to depend
directly on Tests.moe_benchmark (replace the dependency id in the
[[Tests.dependencies]] block for both Tests.ucc_alltoallv and
Tests.nccl_alltoallv to "Tests.moe_benchmark") so they start in parallel after
the MoE benchmark; if the serial ordering was intentional for resource or
debugging reasons, confirm and leave as-is.

Comment thread conf/experimental/test/nccl_test_alltoallv.toml Outdated
Comment thread src/cloudai/registration.py Outdated
Comment thread src/cloudai/workloads/deepep/slurm_command_gen_strategy.py Outdated
Comment on lines +182 to +189
parts: list[str] = [
"torchrun",
f"--nnodes={num_nodes}",
"--nproc_per_node=1",
"--rdzv_id=$RANDOM",
"--rdzv_id=\\${SLURM_JOB_ID:-0}",
"--rdzv_backend=c10d",
"--rdzv_endpoint=$head_node_ip:29500",
benchmark_script,
str(config_file_path.absolute()),
]

return command_parts

def _generate_config_yaml(self, config_path: Path, cmd_args: DeepEPCmdArgs) -> None:
"""
Generate YAML configuration file for DeepEP benchmark.

Args:
config_path: Path where to write the config file.
cmd_args: Command arguments for the benchmark.
"""
tdef: DeepEPTestDefinition = cast(DeepEPTestDefinition, self.test_run.test)

config_lines = [
"# DeepEP Benchmark Configuration",
"# Generated by CloudAI",
"",
"--rdzv_endpoint=\\${MASTER_ADDR}:\\${MASTER_PORT}",
self._script_path(cmd_args),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
printf '\n# DeepEP cmd args\n'
sed -n '24,75p' src/cloudai/workloads/deepep/deepep.py

printf '\n# Torch launcher construction\n'
sed -n '179,194p' src/cloudai/workloads/deepep/slurm_command_gen_strategy.py

printf '\n# Config surface using these fields\n'
rg -n --type=toml '\b(python_executable|num_processes)\b' conf

Repository: NVIDIA/cloudai

Length of output: 2706


Honor DeepEPCmdArgs launcher settings (python_executable/num_processes) in SLURM torchrun command.

In src/cloudai/workloads/deepep/slurm_command_gen_strategy.py (lines 182-189), generate_test_command() hardcodes torchrun --nproc_per_node=1 while DeepEPCmdArgs.num_processes is exposed (default 8, and set to 8 in the DeepEP test TOMLs) and is passed to the script as --num-processes. This means the configured process count isn’t reflected in the actual torch distributed worker topology. Also, DeepEPCmdArgs.python_executable is never used when constructing the launcher.

Update the torchrun command to derive --nproc_per_node from cmd_args.num_processes (with an explicit global-vs-per-node conversion) and to use cmd_args.python_executable when building the launcher.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cloudai/workloads/deepep/slurm_command_gen_strategy.py` around lines 182
- 189, generate_test_command currently hardcodes "torchrun --nproc_per_node=1"
and ignores DeepEPCmdArgs.python_executable; compute the per-node process count
from cmd_args.num_processes and num_nodes (e.g., per_node = max(1,
int(cmd_args.num_processes / num_nodes)) and validate evenly divisible or handle
remainder), then set "--nproc_per_node={per_node}" instead of the fixed 1, and
prepend cmd_args.python_executable when building the launcher invocation (use
cmd_args.python_executable + " " + self._script_path(cmd_args) or equivalent) so
generate_test_command and the launcher respect DeepEPCmdArgs.num_processes and
python_executable.

Comment thread src/cloudai/workloads/moe_benchmark/slurm_command_gen_strategy.py Outdated
Comment thread src/cloudai/workloads/moe_benchmark/slurm_command_gen_strategy.py Outdated
Comment thread src/cloudai/workloads/moe_benchmark/throughput_reporter.py Outdated
Comment thread src/cloudai/workloads/moe_benchmark/throughput_reporter.py Outdated
Comment thread tests/test_test_scenario.py Outdated
@ybenvidia ybenvidia changed the title Add UCC/NCCL alltoallv, deepEP va/v2 and moe benchmark Add UCC/NCCL alltoallv, deepEP v1/v2 and moe benchmark Jun 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants