Add UCC/NCCL alltoallv, deepEP v1/v2 and moe benchmark by ybenvidia · Pull Request #891 · NVIDIA/cloudai

ybenvidia · 2026-05-14T13:58:09Z

Add UCC/NCCL alltoallv vs DeepEP

add ucc alltoallv test
add nccl alltoallv test
add report generation for DeepEP vs NCCL vs UCC

coderabbitai · 2026-05-14T13:58:22Z

📝 Walkthrough

Walkthrough

Adds a MoE benchmark workload with Slurm launch/config wiring, reporting and SVG throughput reporter; refactors DeepEP cmd-args and Slurm strategy for v1/v2; discovers DeepEP outputs to drive NCCL/UCC matrix-enabled tests; updates configs, registration, and tests.

Changes

MoE Benchmark Integration Feature

Layer / File(s)	Summary
MoE benchmark core definition `src/cloudai/workloads/moe_benchmark/moe_benchmark.py`, `src/cloudai/workloads/moe_benchmark/__init__.py`	`MoEBenchmarkCmdArgs` and `MoEBenchmarkTestDefinition` added with Docker image handling and cmd_args serialization.
MoE Slurm command strategy `src/cloudai/workloads/moe_benchmark/slurm_command_gen_strategy.py`	Generates SBATCH/head-node detection, writes `config.yaml`, mounts config/results, selects benchmark script by `mode`, and provides success check.
DeepEP output discovery helpers `src/cloudai/workloads/moe_benchmark/combined_report.py`	Adds dependency-chain traversal, MoE benchmark root detection via matrix/stdout marker, and results.json discovery.
MoE report generation `src/cloudai/workloads/moe_benchmark/report_generation_strategy.py`	New report strategy that finds MoE results.json files, validates/enriches entries, and emits CSV including bus bandwidth columns.
MoE throughput reporter `src/cloudai/workloads/moe_benchmark/throughput_reporter.py`	Parses MoE/UCC/NCCL outputs and renders `<scenario>-moe-throughput.svg` summarizing mean bus bandwidth.
DeepEP schema refactor `src/cloudai/workloads/deepep/deepep.py`	Replaces DeepEPCmdArgs with v1/v2 schema (roots, `subtest_name`, execution fields); adds `container_runtime_root` and `container_runtime_root_path`; removes `cmd_args_dict`.
DeepEP Slurm strategy refactor `src/cloudai/workloads/deepep/slurm_command_gen_strategy.py`	Exports `MASTER_ADDR`/`MASTER_PORT`, writes `env_vars.sh`, selects script by subtest roots, whitelists CLI fields, removes DeepEP mounts, and broadens success-check patterns.
DeepEP export cleanup `src/cloudai/workloads/deepep/__init__.py`	Removes `DeepEPReportGenerationStrategy` from public exports.
NCCL alltoallv field definitions `src/cloudai/workloads/nccl_test/nccl.py`	Adds `alltoallv_perf_mpi`/`alltoallv_perf` literals; adds `use_deepep_matrix` and `alltoallv_matrix_container_path`.
NCCL Slurm DeepEP matrix integration `src/cloudai/workloads/nccl_test/slurm_command_gen_strategy.py`	Discovers `nccl_matrix.txt` under DeepEP outputs, mounts matrix+prev-output, injects `ALLTOALLV_MATRIX_FILE`, and special-cases `alltoallv_perf_mpi` command.
UCC matrix field and integration `src/cloudai/workloads/ucc_test/ucc.py`, `src/cloudai/workloads/ucc_test/slurm_command_gen_strategy.py`	Adds `use_deepep_matrix` to `UCCCmdArgs`; resolves/mounts `ucc_matrix.txt` and appends `--gen file:name=...` when applicable.
Configuration files `conf/experimental/test/*`	Updates `deepep_standard` tokens; adds multiple DeepEP and MoE TOML configs; updates NCCL/UCC configs for matrix mode; adds scenario TOMLs linking tests.
Registration and wiring `src/cloudai/registration.py`	Wires MoE benchmark imports, registers `MoEBenchmarkTestDefinition`, Slurm strategy, report generation strategy, and enables `moe_benchmark_throughput` reporter.
Test suite and ref data `tests/`, `tests/ref_data/`	Adds `moe-benchmark` test cases and expectations, updates registry tests, and adds `moe-benchmark.sbatch` reference.
Cleanup `.gitignore`	Ignores `slurm-*` artifacts.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Suggested labels

enhancement

Suggested reviewers

srivatsankrishnan
jeffnvidia

Poem

🐰 In racks where experts mix and play,
I chased the matrices through night and day.
From DeepEP roots to MoE's bright chart,
I stitched the outputs, piece by part.
Now throughput blooms — a rabbit's art!

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name	Status	Explanation	Resolution
Description check	❓ Inconclusive	The description accurately mentions adding UCC and NCCL alltoallv tests with report generation for comparison; however, it does not mention the substantial addition of a new MoE benchmark workload which constitutes a major part of this PR.	Update the description to include the MoE benchmark workload addition, such as: 'Adds UCC/NCCL alltoallv tests, DeepEP v1/v2 configurations, and a new MoE benchmark workload with report generation for cross-benchmark comparison.'

✅ Passed checks (3 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The title accurately reflects the main changes: addition of UCC/NCCL alltoallv tests and deepEP v1/v2 support plus MoE benchmark, which are the primary features across the changeset.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/cloudai/workloads/deepep/slurm_command_gen_strategy.py (1)
28-134: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix Ruff formatting for this file before merge.

CI is failing on ruff-format for this file; please run formatting and commit the normalized output to unblock the pipeline.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cloudai/workloads/deepep/slurm_command_gen_strategy.py` around lines 28 -
134, The file fails ruff-format checks; run the ruff formatter and commit the
normalized file so CI passes. Specifically, run your project's ruff/formatter
(e.g., ruff format) on the module containing methods like
_append_head_node_detection, _append_sbatch_directives, _container_mounts,
_generate_config_yaml and gen_srun_success_check, review the changes for any
unintended edits, and commit the formatted file to the branch.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cloudai/workloads/deepep/deepep_moe_throughput_reporter.py`:
- Line 15: The import of Reporter in deepep_moe_throughput_reporter.py currently
uses the private path cloudai._core.base_reporter; replace that with the allowed
public reporter interface used by other workloads (import the same Reporter
symbol from the public reporter module instead) so the file imports Reporter
from the public/allowed package rather than cloudai._core.base_reporter.
- Around line 23-25: The function _deepep_dispatch_combine_bars is too complex
and triggers C901 and has an unused loop variable causing B007; split it into
small helpers (e.g., deepep_results_json_files -> leave, add helpers like
_load_latest_results(path: Path) -> dict,
_extract_dispatch_combine_rows(results: dict) -> Iterable[dict], and
_row_to_bar(row: dict) -> tuple[str, float, str]) and have
_deepep_dispatch_combine_bars orchestrate these helpers to reduce cyclomatic
complexity; also rename any unused loop variable "lab" to "_lab" (or prefix with
underscore) to satisfy B007, and finally run ruff format to fix formatting
issues.

In `@src/cloudai/workloads/nccl_test/slurm_command_gen_strategy.py`:
- Around line 51-58: If use_deepep_matrix is true, don’t silently return None
when deepep_benchmark_root or the nccl matrix file is missing; instead fail fast
by raising a clear exception (e.g., RuntimeError/ValueError) from
_deepep_nccl_matrix_host_path that includes context (test/run id and that
use_deepep_matrix was requested) so the caller will stop the run; apply the same
change to the analogous methods referenced in the comment (the other
deepep-related path/resolution methods around lines 62-67 and 81-83, such as the
container-path resolver), and use deepep_benchmark_root and
_nccl_matrix_path_under_deepep_output in the error message to make the root
cause obvious.
- Around line 95-112: The alltoallv_perf_mpi branch only appends the fixed flags
and misses forwarding other configured NCCL options; update the branch in
slurm_command_gen_strategy.py (the block using _NCCL_TESTS_ALLTOALLV_PERF,
tdef.cmd_args and _nccl_cmd_scalar) to conditionally append the additional flags
present on tdef.cmd_args (e.g., stepfactor, nthreads, check, blocking and any
other optional fields) in the same way you handle minbytes/maxbytes/ngpus (use
_nccl_cmd_scalar where appropriate and add boolean flags when true), and keep
the existing inclusion of self.test_run.test.extra_args_str when
test.extra_cmd_args is set so TOML-configured options are not dropped.

In `@src/cloudai/workloads/ucc_test/slurm_command_gen_strategy.py`:
- Around line 55-64: If tdef.cmd_args.use_deepep_matrix is true, do not silently
return []; instead detect missing matrix by checking
deepep_benchmark_root(self.test_run) and self._deepep_ucc_matrix_host_path() and
raise a clear exception (or call fail-fast helper) when either is None and there
is no manual generation option provided (e.g. no tdef.cmd_args.gen or equivalent
flag). Update the logic in the block guarded by use_deepep_matrix (and the
analogous block later around the other check) to validate deepep_root and
matrix_host and raise an informative error mentioning the missing DeepEP matrix
and required --gen/manual generation flag rather than falling back to running
without a matrix.

---

Outside diff comments:
In `@src/cloudai/workloads/deepep/slurm_command_gen_strategy.py`:
- Around line 28-134: The file fails ruff-format checks; run the ruff formatter
and commit the normalized file so CI passes. Specifically, run your project's
ruff/formatter (e.g., ruff format) on the module containing methods like
_append_head_node_detection, _append_sbatch_directives, _container_mounts,
_generate_config_yaml and gen_srun_success_check, review the changes for any
unintended edits, and commit the formatted file to the branch.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 0416a4db-8f59-4498-afdb-cfa1d0fa7cce

📥 Commits

Reviewing files that changed from the base of the PR and between 2d672e1 and 12f6742.

📒 Files selected for processing (15)

conf/experimental/test/deepep_standard.toml
conf/experimental/test/nccl_test_alltoallv.toml
conf/experimental/test_scenario/deepep_with_nccl_alltoallv.toml
conf/experimental/test_scenario/deepep_with_ucc_alltoallv.toml
src/cloudai/registration.py
src/cloudai/workloads/deepep/__init__.py
src/cloudai/workloads/deepep/deepep.py
src/cloudai/workloads/deepep/deepep_combined_report.py
src/cloudai/workloads/deepep/deepep_moe_throughput_reporter.py
src/cloudai/workloads/deepep/report_generation_strategy.py
src/cloudai/workloads/deepep/slurm_command_gen_strategy.py
src/cloudai/workloads/nccl_test/nccl.py
src/cloudai/workloads/nccl_test/slurm_command_gen_strategy.py
src/cloudai/workloads/ucc_test/slurm_command_gen_strategy.py
src/cloudai/workloads/ucc_test/ucc.py

coderabbitai · 2026-05-14T14:02:36Z

+def _deepep_dispatch_combine_bars(test_output: Path) -> list[tuple[str, float, str]]:
+    """From latest ``results.json``: one bar per ``dispatch`` / ``combine`` row (``bus_bw_avg``)."""
+    paths = deepep_results_json_files(test_output)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Resolve Ruff blockers in this new reporter file.

CI is red on C901 (_deepep_dispatch_combine_bars complexity), B007 (unused lab), and formatting. Please split _deepep_dispatch_combine_bars into smaller helpers, rename unused loop vars (e.g., _lab), and run ruff format.

Also applies to: 164-166

🧰 Tools

🪛 GitHub Actions: CI / 2_Linting.txt

[error] 23-25: Ruff error (C901): _deepep_dispatch_combine_bars is too complex (11 > 10).

🪛 GitHub Actions: CI / Linting

[error] 23-25: Ruff check (C901): _deepep_dispatch_combine_bars is too complex (11 > 10).

[error] Ruff format check failed for this file (hook 'ruff-format' reformatted files, causing the hook to fail).

[error] import-linter contract broken: cloudai.workloads is not allowed to import cloudai._core (offending dependency: cloudai.workloads.deepep.deepep_moe_throughput_reporter -> cloudai._core.base_reporter).

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cloudai/workloads/deepep/deepep_moe_throughput_reporter.py` around lines 23 - 25, The function _deepep_dispatch_combine_bars is too complex and triggers C901 and has an unused loop variable causing B007; split it into small helpers (e.g., deepep_results_json_files -> leave, add helpers like _load_latest_results(path: Path) -> dict, _extract_dispatch_combine_rows(results: dict) -> Iterable[dict], and _row_to_bar(row: dict) -> tuple[str, float, str]) and have _deepep_dispatch_combine_bars orchestrate these helpers to reduce cyclomatic complexity; also rename any unused loop variable "lab" to "_lab" (or prefix with underscore) to satisfy B007, and finally run ruff format to fix formatting issues.

coderabbitai · 2026-05-14T14:02:36Z

+    def _deepep_nccl_matrix_host_path(self) -> Path | None:
+        tdef: NCCLTestDefinition = cast(NCCLTestDefinition, self.test_run.test)
+        if not tdef.cmd_args.use_deepep_matrix:
+            return None
+        root = deepep_benchmark_root(self.test_run)
+        if root is None:
+            return None
+        return _nccl_matrix_path_under_deepep_output(root)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast when use_deepep_matrix is enabled but matrix discovery fails.

Current flow silently skips mounts/env var when the DeepEP dependency or nccl_matrix.txt is missing, so the run proceeds without the requested matrix input and can produce misleading comparison data.

💡 Suggested fix

def _deepep_nccl_matrix_host_path(self) -> Path | None: tdef: NCCLTestDefinition = cast(NCCLTestDefinition, self.test_run.test) if not tdef.cmd_args.use_deepep_matrix: return None root = deepep_benchmark_root(self.test_run) - if root is None: - return None - return _nccl_matrix_path_under_deepep_output(root) + if root is None: + raise FileNotFoundError( + "use_deepep_matrix=true but no start_post_comp DeepEP output was found" + ) + matrix = _nccl_matrix_path_under_deepep_output(root) + if matrix is None: + raise FileNotFoundError( + f"use_deepep_matrix=true but nccl_matrix.txt was not found under {root}" + ) + return matrix

Also applies to: 62-67, 81-83

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cloudai/workloads/nccl_test/slurm_command_gen_strategy.py` around lines 51 - 58, If use_deepep_matrix is true, don’t silently return None when deepep_benchmark_root or the nccl matrix file is missing; instead fail fast by raising a clear exception (e.g., RuntimeError/ValueError) from _deepep_nccl_matrix_host_path that includes context (test/run id and that use_deepep_matrix was requested) so the caller will stop the run; apply the same change to the analogous methods referenced in the comment (the other deepep-related path/resolution methods around lines 62-67 and 81-83, such as the container-path resolver), and use deepep_benchmark_root and _nccl_matrix_path_under_deepep_output in the error message to make the root cause obvious.

coderabbitai · 2026-05-14T14:02:36Z

+        if tdef.cmd_args.subtest_name == "alltoallv_perf_mpi":
+            a = tdef.cmd_args
+            parts: List[str] = [
+                _NCCL_TESTS_ALLTOALLV_PERF,
+                "-b",
+                str(_nccl_cmd_scalar(a.minbytes)),
+                "-e",
+                str(_nccl_cmd_scalar(a.maxbytes)),
+                "-g",
+                str(_nccl_cmd_scalar(a.ngpus)),
+                "-w",
+                str(_nccl_cmd_scalar(a.warmup_iters)),
+                "-n",
+                str(_nccl_cmd_scalar(a.iters)),
+            ]
+            if self.test_run.test.extra_cmd_args:
+                parts.append(self.test_run.test.extra_args_str)
+            return parts


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Dedicated alltoallv_perf_mpi path drops configured NCCL options.

This branch only forwards -b/-e/-g/-w/-n and ignores other configured fields (e.g. stepfactor, nthreads, check, blocking), so TOML values are silently ineffective for this subtest.

💡 Suggested fix

if tdef.cmd_args.subtest_name == "alltoallv_perf_mpi": a = tdef.cmd_args parts: List[str] = [ _NCCL_TESTS_ALLTOALLV_PERF, "-b", str(_nccl_cmd_scalar(a.minbytes)), "-e", str(_nccl_cmd_scalar(a.maxbytes)), "-g", str(_nccl_cmd_scalar(a.ngpus)), "-w", str(_nccl_cmd_scalar(a.warmup_iters)), "-n", str(_nccl_cmd_scalar(a.iters)), ] + optional_flags = [ + ("-f", a.stepfactor), + ("-t", a.nthreads), + ("-c", a.check), + ("-z", a.blocking), + ] + for flag, raw_value in optional_flags: + value = _nccl_cmd_scalar(raw_value) + if value is not None: + parts.extend([flag, str(value)]) if self.test_run.test.extra_cmd_args: parts.append(self.test_run.test.extra_args_str) return parts

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if tdef.cmd_args.subtest_name == "alltoallv_perf_mpi":

a = tdef.cmd_args

parts: List[str] = [

_NCCL_TESTS_ALLTOALLV_PERF,

"-b",

str(_nccl_cmd_scalar(a.minbytes)),

"-e",

str(_nccl_cmd_scalar(a.maxbytes)),

"-g",

str(_nccl_cmd_scalar(a.ngpus)),

"-w",

str(_nccl_cmd_scalar(a.warmup_iters)),

"-n",

str(_nccl_cmd_scalar(a.iters)),

]

if self.test_run.test.extra_cmd_args:

parts.append(self.test_run.test.extra_args_str)

return parts

if tdef.cmd_args.subtest_name == "alltoallv_perf_mpi":

a = tdef.cmd_args

parts: List[str] = [

_NCCL_TESTS_ALLTOALLV_PERF,

"-b",

str(_nccl_cmd_scalar(a.minbytes)),

"-e",

str(_nccl_cmd_scalar(a.maxbytes)),

"-g",

str(_nccl_cmd_scalar(a.ngpus)),

"-w",

str(_nccl_cmd_scalar(a.warmup_iters)),

"-n",

str(_nccl_cmd_scalar(a.iters)),

]

optional_flags = [

("-f", a.stepfactor),

("-t", a.nthreads),

("-c", a.check),

("-z", a.blocking),

]

for flag, raw_value in optional_flags:

value = _nccl_cmd_scalar(raw_value)

if value is not None:

parts.extend([flag, str(value)])

if self.test_run.test.extra_cmd_args:

parts.append(self.test_run.test.extra_args_str)

return parts

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cloudai/workloads/nccl_test/slurm_command_gen_strategy.py` around lines 95 - 112, The alltoallv_perf_mpi branch only appends the fixed flags and misses forwarding other configured NCCL options; update the branch in slurm_command_gen_strategy.py (the block using _NCCL_TESTS_ALLTOALLV_PERF, tdef.cmd_args and _nccl_cmd_scalar) to conditionally append the additional flags present on tdef.cmd_args (e.g., stepfactor, nthreads, check, blocking and any other optional fields) in the same way you handle minbytes/maxbytes/ngpus (use _nccl_cmd_scalar where appropriate and add boolean flags when true), and keep the existing inclusion of self.test_run.test.extra_args_str when test.extra_cmd_args is set so TOML-configured options are not dropped.

coderabbitai · 2026-05-14T14:02:36Z

+        if not tdef.cmd_args.use_deepep_matrix:
+            return []
+
+        deepep_root = deepep_benchmark_root(self.test_run)
+        if deepep_root is None:
+            return []
+
+        matrix_host = self._deepep_ucc_matrix_host_path()
+        if matrix_host is None:
+            return []


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast when use_deepep_matrix is enabled but the matrix is missing.

Current flow silently falls back (no mount + no --gen) and can run a non-matrix UCC test while the scenario claims DeepEP-matrix-driven comparison. Please raise an explicit error when matrix resolution fails and no manual gen is provided.

Suggested fix

def _container_mounts(self) -> List[str]: tdef: UCCTestDefinition = cast(UCCTestDefinition, self.test_run.test) if not tdef.cmd_args.use_deepep_matrix: return [] @@ matrix_host = self._deepep_ucc_matrix_host_path() - if matrix_host is None: - return [] + if matrix_host is None: + raise FileNotFoundError( + "use_deepep_matrix=true but no ucc_matrix.txt was found in DeepEP outputs" + ) @@ def generate_test_command(self) -> List[str]: @@ - elif self._deepep_ucc_matrix_host_path() is not None: - srun_command_parts.append(f"--gen file:name={_UCC_GEN_MATRIX_CONTAINER}") + elif tdef_cmd_args.use_deepep_matrix: + matrix_host = self._deepep_ucc_matrix_host_path() + if matrix_host is None: + raise FileNotFoundError( + "use_deepep_matrix=true but no ucc_matrix.txt was found in DeepEP outputs" + ) + srun_command_parts.append(f"--gen file:name={_UCC_GEN_MATRIX_CONTAINER}")

Also applies to: 85-86

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cloudai/workloads/ucc_test/slurm_command_gen_strategy.py` around lines 55 - 64, If tdef.cmd_args.use_deepep_matrix is true, do not silently return []; instead detect missing matrix by checking deepep_benchmark_root(self.test_run) and self._deepep_ucc_matrix_host_path() and raise a clear exception (or call fail-fast helper) when either is None and there is no manual generation option provided (e.g. no tdef.cmd_args.gen or equivalent flag). Update the logic in the block guarded by use_deepep_matrix (and the analogous block later around the other check) to validate deepep_root and matrix_host and raise an informative error mentioning the missing DeepEP matrix and required --gen/manual generation flag rather than falling back to running without a matrix.

podkidyshev · 2026-05-15T13:11:07Z

please resolve coderabbit comments and make CI pass before review

ping me again directly once it's done please

coderabbitai

Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)

tests/test_init.py (1)
241-241: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Update test definition count assertion.

The test expects 26 test definitions, but with the addition of MoEBenchmarkTestDefinition, there are now 27. Please update the assertion to reflect the correct count.
🔧 Proposed fix
-    assert len(test_defs) == 26
+    assert len(test_defs) == 27
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_init.py` at line 241, Update the assertion that checks the number
of test definitions from 26 to the correct count to include the newly added
MoEBenchmarkTestDefinition; locate the assertion referencing test_defs (assert
len(test_defs) == 26) in tests/test_init.py and change the expected value to 27
so the test reflects the added MoEBenchmarkTestDefinition.
src/cloudai/workloads/moe_benchmark/throughput_reporter.py (1)
23-57: 🧹 Nitpick | 🔵 Trivial | ⚖️ Poor tradeoff

Consider reducing function complexity.

The function _moe_benchmark_dispatch_combine_bars has a cyclomatic complexity of 11, exceeding the threshold of 10. Consider extracting the row-filtering and value-extraction logic into smaller helper functions.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cloudai/workloads/moe_benchmark/throughput_reporter.py` around lines 23 -
57, The function _moe_benchmark_dispatch_combine_bars is too complex; extract
the row-filtering/value-extraction and file-reading into small helpers to reduce
cyclomatic complexity. Create a helper like _read_latest_results(test_output) to
return parsed rows (or [] on error) and another helper like _extract_bus_bw(row)
that validates a row dict, ensures "operation" is a str and contains
"bus_bw_avg", normalizes operation to lowercase and returns (op_l,
float(bus_bw_avg)) or None on failure; then simplify
_moe_benchmark_dispatch_combine_bars to call _read_latest_results and iterate
the rows using _extract_bus_bw, populate by_op and build the out list (keep
existing tuple labels "MoE dispatch"/"MoE combine" and colors).
tests/test_acceptance.py (1)
721-721: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Missing reference sbatch file for moe-benchmark test.

The test test_sbatch_generation[moe-benchmark] expects a reference file at tests/ref_data/moe-benchmark.sbatch, but the file does not exist. This causes test failures. Please add the expected reference sbatch file for the moe-benchmark workload.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_acceptance.py` at line 721, The test
test_sbatch_generation[moe-benchmark] fails because the expected reference
sbatch file is missing; add a new reference file named moe-benchmark.sbatch
under the tests/ref_data directory (the path constructed by
Path(__file__).parent / "ref_data" / test_req[1] in tests/test_acceptance.py)
containing the expected sbatch output for the moe-benchmark workload so the
assertion comparing generated SBATCH to the reference succeeds. Ensure the file
name exactly matches "moe-benchmark.sbatch" and the contents match the generator
output used by the test.
conf/experimental/test/moe_benchmark_standard.toml (1)
34-36: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Remove trailing blank lines.

The pre-commit hook end-of-file-fixer modified this file by removing trailing blank lines. Remove lines 34-36 to fix the CI failure.
🐛 Proposed fix
 [extra_env_vars]
 NUM_QPS_PER_RANK = "12"
 NUM_SMS = "24"
-
-
-
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@conf/experimental/test/moe_benchmark_standard.toml` around lines 34 - 36,
Remove the trailing blank lines at the end of
conf/experimental/test/moe_benchmark_standard.toml so the file ends immediately
after its last TOML entry (delete the extra blank lines at EOF that were added);
ensure there is no extra newline-only lines after the final content so the
end-of-file-fixer pre-commit hook is satisfied.

♻️ Duplicate comments (1)

src/cloudai/workloads/moe_benchmark/throughput_reporter.py (1)
15-15: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Fix the layering violation by using the public Reporter import path.

The import from cloudai._core.base_reporter violates the import-linter contract (cloudai.workloads must not import cloudai._core). Use the public Reporter interface instead—likely from cloudai.reporter import Reporter or the appropriate public module.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cloudai/workloads/moe_benchmark/throughput_reporter.py` at line 15, The
file imports Reporter from the private module cloudai._core.base_reporter which
breaks layering rules; update the import to use the public Reporter export
(e.g., replace "from cloudai._core.base_reporter import Reporter" with the
public import such as "from cloudai.reporter import Reporter" or the correct
public module that exposes Reporter) so that throughput_reporter.py only depends
on the public API.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@conf/experimental/test_scenario/moe_benchmark.toml`:
- Around line 22-29: The Tests chain currently makes Tests.nccl_alltoallv depend
on Tests.ucc_alltoallv (serial), causing NCCL to wait for UCC; if both only
require the MoE output, change the dependency of Tests.ucc_alltoallv and
Tests.nccl_alltoallv to depend directly on Tests.moe_benchmark (replace the
dependency id in the [[Tests.dependencies]] block for both Tests.ucc_alltoallv
and Tests.nccl_alltoallv to "Tests.moe_benchmark") so they start in parallel
after the MoE benchmark; if the serial ordering was intentional for resource or
debugging reasons, confirm and leave as-is.

In `@conf/experimental/test/nccl_test_alltoallv.toml`:
- Line 37: The file ends with the line 'NCCL_SHM_DISABLE = "1"' but is missing a
trailing newline; add a single newline character after the NCCL_SHM_DISABLE line
(i.e., ensure the file ends with a newline) so the end-of-file-fixer pre-commit
check passes and CI succeeds.

In `@src/cloudai/registration.py`:
- Line 321: The call to Registry().add_scenario_report(...) exceeds the 120-char
limit; reformat the invocation for MoEBenchmarkThroughputReporter and
ReportConfig(enable=True) so the line is <=120 chars (for example, break the
arguments across multiple lines or assign arguments to temporary variables)
while keeping the same call to Registry().add_scenario_report with the same
symbols: Registry, add_scenario_report, MoEBenchmarkThroughputReporter, and
ReportConfig.

In `@src/cloudai/workloads/deepep/slurm_command_gen_strategy.py`:
- Around line 182-189: generate_test_command currently hardcodes "torchrun
--nproc_per_node=1" and ignores DeepEPCmdArgs.python_executable; compute the
per-node process count from cmd_args.num_processes and num_nodes (e.g., per_node
= max(1, int(cmd_args.num_processes / num_nodes)) and validate evenly divisible
or handle remainder), then set "--nproc_per_node={per_node}" instead of the
fixed 1, and prepend cmd_args.python_executable when building the launcher
invocation (use cmd_args.python_executable + " " + self._script_path(cmd_args)
or equivalent) so generate_test_command and the launcher respect
DeepEPCmdArgs.num_processes and python_executable.
- Around line 169-170: The code calls _append_cli_field() which references an
undefined _BOOL_VALUE_FIELDS causing a NameError; define or import
_BOOL_VALUE_FIELDS (e.g., the set/list of field names treated as boolean flags)
and update _append_cli_field to use it, or replace the check with explicit
boolean-type handling (refer to the function name _append_cli_field). Also
update generate_test_command() to construct the launcher from DeepEPCmdArgs: use
DeepEPCmdArgs.python_executable instead of hardcoding "torchrun" (or document
and remove the unused field) and pass DeepEPCmdArgs.num_processes to the
launcher (replace the hardcoded "--nproc_per_node=1"), ensuring the launcher
flags match existing cmd arg names so DeepEP launcher aligns with the command
arguments.

In `@src/cloudai/workloads/moe_benchmark/slurm_command_gen_strategy.py`:
- Around line 60-64: Replace the simple if-else assignment for script_name with
a ternary expression to satisfy ruff SIM108: where the code currently checks
cmd_args.mode and sets script_name, change it to a single-line assignment using
"benchmark.py" if cmd_args.mode == "standard" else "benchmark_ll.py" (update the
assignment around the script_name variable).
- Line 69: The _generate_config_yaml method declares an unused parameter
cmd_args; remove the cmd_args parameter from the _generate_config_yaml signature
and all internal references (it isn't used because tdef = self.test_run.test and
tdef.cmd_args_dict supply the needed data), then update the caller that invokes
_generate_config_yaml (the callsite that currently passes a MoEBenchmarkCmdArgs)
to stop passing that argument so the call matches the new signature; ensure no
other callsites reference the removed parameter and run tests/lint to confirm.

In `@src/cloudai/workloads/moe_benchmark/throughput_reporter.py`:
- Line 159: The for-loop that zips centers, values, colors, and labels uses an
unused loop variable named `lab`; update that variable to `_lab` in the loop
header (for cx, val, col, _lab in zip(centers, values, colors, labels,
strict=True)) to signal it's intentionally unused while preserving the zip
ordering—locate this change in the loop inside throughput_reporter.py where
`centers`, `values`, `colors`, and `labels` are iterated.
- Line 157: The SVG f-string in throughput_reporter.py (e.g., the parts.append
call that builds the <text ...>{gv:.1f}</text>) exceeds the 120-char line
length; refactor these SVG generation lines (including the similar ones at the
locations around lines 169-171 and 173) by splitting the long f-strings into
smaller parts or intermediate variables (e.g., compute x_attr, y_attr, attrs,
and text_content separately) and then join/concatenate them in a short
parts.append call so each source line stays under the 120-character limit while
preserving the same output and formatting.

In `@tests/test_test_scenario.py`:
- Around line 47-50: Remove the trailing whitespace after the imported symbol
MoEBenchmarkReportGenerationStrategy in the import statement that also includes
MoEBenchmarkTestDefinition; edit the import line so it reads without any extra
spaces after MoEBenchmarkReportGenerationStrategy, preserving the same symbols
and line breaks to satisfy the pre-commit hook.

---

Outside diff comments:
In `@conf/experimental/test/moe_benchmark_standard.toml`:
- Around line 34-36: Remove the trailing blank lines at the end of
conf/experimental/test/moe_benchmark_standard.toml so the file ends immediately
after its last TOML entry (delete the extra blank lines at EOF that were added);
ensure there is no extra newline-only lines after the final content so the
end-of-file-fixer pre-commit hook is satisfied.

In `@src/cloudai/workloads/moe_benchmark/throughput_reporter.py`:
- Around line 23-57: The function _moe_benchmark_dispatch_combine_bars is too
complex; extract the row-filtering/value-extraction and file-reading into small
helpers to reduce cyclomatic complexity. Create a helper like
_read_latest_results(test_output) to return parsed rows (or [] on error) and
another helper like _extract_bus_bw(row) that validates a row dict, ensures
"operation" is a str and contains "bus_bw_avg", normalizes operation to
lowercase and returns (op_l, float(bus_bw_avg)) or None on failure; then
simplify _moe_benchmark_dispatch_combine_bars to call _read_latest_results and
iterate the rows using _extract_bus_bw, populate by_op and build the out list
(keep existing tuple labels "MoE dispatch"/"MoE combine" and colors).

In `@tests/test_acceptance.py`:
- Line 721: The test test_sbatch_generation[moe-benchmark] fails because the
expected reference sbatch file is missing; add a new reference file named
moe-benchmark.sbatch under the tests/ref_data directory (the path constructed by
Path(__file__).parent / "ref_data" / test_req[1] in tests/test_acceptance.py)
containing the expected sbatch output for the moe-benchmark workload so the
assertion comparing generated SBATCH to the reference succeeds. Ensure the file
name exactly matches "moe-benchmark.sbatch" and the contents match the generator
output used by the test.

In `@tests/test_init.py`:
- Line 241: Update the assertion that checks the number of test definitions from
26 to the correct count to include the newly added MoEBenchmarkTestDefinition;
locate the assertion referencing test_defs (assert len(test_defs) == 26) in
tests/test_init.py and change the expected value to 27 so the test reflects the
added MoEBenchmarkTestDefinition.

---

Duplicate comments:
In `@src/cloudai/workloads/moe_benchmark/throughput_reporter.py`:
- Line 15: The file imports Reporter from the private module
cloudai._core.base_reporter which breaks layering rules; update the import to
use the public Reporter export (e.g., replace "from cloudai._core.base_reporter
import Reporter" with the public import such as "from cloudai.reporter import
Reporter" or the correct public module that exposes Reporter) so that
throughput_reporter.py only depends on the public API.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 25bf978f-ad63-4710-8250-1199086f3041

📥 Commits

Reviewing files that changed from the base of the PR and between 3a743d3 and 72a6f84.

📒 Files selected for processing (28)

.gitignore
conf/experimental/test/deepep_low_latency.toml
conf/experimental/test/deepep_standard.toml
conf/experimental/test/deepep_test_ep_v2.toml
conf/experimental/test/deepep_test_internode.toml
conf/experimental/test/deepep_test_intranode.toml
conf/experimental/test/deepep_test_low_latency.toml
conf/experimental/test/moe_benchmark_low_latency.toml
conf/experimental/test/moe_benchmark_standard.toml
conf/experimental/test/nccl_test_alltoallv.toml
conf/experimental/test/ucc_alltoallv_deepep.toml
conf/experimental/test_scenario/deepep_official.toml
conf/experimental/test_scenario/moe_benchmark.toml
src/cloudai/registration.py
src/cloudai/workloads/deepep/__init__.py
src/cloudai/workloads/deepep/deepep.py
src/cloudai/workloads/deepep/slurm_command_gen_strategy.py
src/cloudai/workloads/moe_benchmark/__init__.py
src/cloudai/workloads/moe_benchmark/combined_report.py
src/cloudai/workloads/moe_benchmark/moe_benchmark.py
src/cloudai/workloads/moe_benchmark/report_generation_strategy.py
src/cloudai/workloads/moe_benchmark/slurm_command_gen_strategy.py
src/cloudai/workloads/moe_benchmark/throughput_reporter.py
src/cloudai/workloads/nccl_test/slurm_command_gen_strategy.py
src/cloudai/workloads/ucc_test/slurm_command_gen_strategy.py
tests/test_acceptance.py
tests/test_init.py
tests/test_test_scenario.py

💤 Files with no reviewable changes (3)

conf/experimental/test/deepep_low_latency.toml
conf/experimental/test/deepep_standard.toml
src/cloudai/workloads/deepep/init.py

coderabbitai · 2026-06-02T14:15:02Z

+[[Tests]]
+id = "Tests.nccl_alltoallv"
+test_name = "nccl_test_alltoallv"
+num_nodes = 2
+time_limit = "00:30:00"
+  [[Tests.dependencies]]
+  type = "start_post_comp"
+  id = "Tests.ucc_alltoallv"


🧹 Nitpick | 🔵 Trivial | ⚖️ Poor tradeoff

Verify the dependency chain design: serial vs. parallel execution.

The current configuration chains tests serially: moe_benchmark → ucc_alltoallv → nccl_alltoallv. This means NCCL test waits for UCC test to complete, even though both tests independently consume matrices from the MoE benchmark output (as indicated by layers 10 and 12 in the stack context).

If both UCC and NCCL tests only need the MoE benchmark output and don't depend on each other, consider having both depend directly on Tests.moe_benchmark:

[[Tests]] id = "Tests.ucc_alltoallv" test_name = "ucc_alltoallv_deepep" num_nodes = 2 time_limit = "00:30:00" [[Tests.dependencies]] type = "start_post_comp" id = "Tests.moe_benchmark" [[Tests]] id = "Tests.nccl_alltoallv" test_name = "nccl_test_alltoallv" num_nodes = 2 time_limit = "00:30:00" [[Tests.dependencies]] type = "start_post_comp" id = "Tests.moe_benchmark" # Changed from Tests.ucc_alltoallv

This would allow UCC and NCCL tests to run in parallel after MoE benchmark completes, reducing total scenario runtime. The current serial chain may be intentional for resource management or debugging—please confirm this design choice.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@conf/experimental/test_scenario/moe_benchmark.toml` around lines 22 - 29, The Tests chain currently makes Tests.nccl_alltoallv depend on Tests.ucc_alltoallv (serial), causing NCCL to wait for UCC; if both only require the MoE output, change the dependency of Tests.ucc_alltoallv and Tests.nccl_alltoallv to depend directly on Tests.moe_benchmark (replace the dependency id in the [[Tests.dependencies]] block for both Tests.ucc_alltoallv and Tests.nccl_alltoallv to "Tests.moe_benchmark") so they start in parallel after the MoE benchmark; if the serial ordering was intentional for resource or debugging reasons, confirm and leave as-is.

coderabbitai · 2026-06-02T14:15:03Z

+        parts: list[str] = [
            "torchrun",
            f"--nnodes={num_nodes}",
            "--nproc_per_node=1",
-            "--rdzv_id=$RANDOM",
+            "--rdzv_id=\\${SLURM_JOB_ID:-0}",
            "--rdzv_backend=c10d",
-            "--rdzv_endpoint=$head_node_ip:29500",
-            benchmark_script,
-            str(config_file_path.absolute()),
-        ]
-
-        return command_parts
-
-    def _generate_config_yaml(self, config_path: Path, cmd_args: DeepEPCmdArgs) -> None:
-        """
-        Generate YAML configuration file for DeepEP benchmark.
-
-        Args:
-            config_path: Path where to write the config file.
-            cmd_args: Command arguments for the benchmark.
-        """
-        tdef: DeepEPTestDefinition = cast(DeepEPTestDefinition, self.test_run.test)
-
-        config_lines = [
-            "# DeepEP Benchmark Configuration",
-            "# Generated by CloudAI",
-            "",
+            "--rdzv_endpoint=\\${MASTER_ADDR}:\\${MASTER_PORT}",
+            self._script_path(cmd_args),


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash printf '\n# DeepEP cmd args\n' sed -n '24,75p' src/cloudai/workloads/deepep/deepep.py printf '\n# Torch launcher construction\n' sed -n '179,194p' src/cloudai/workloads/deepep/slurm_command_gen_strategy.py printf '\n# Config surface using these fields\n' rg -n --type=toml '\b(python_executable|num_processes)\b' conf

Repository: NVIDIA/cloudai

Length of output: 2706

Honor DeepEPCmdArgs launcher settings (python_executable/num_processes) in SLURM torchrun command.

In src/cloudai/workloads/deepep/slurm_command_gen_strategy.py (lines 182-189), generate_test_command() hardcodes torchrun --nproc_per_node=1 while DeepEPCmdArgs.num_processes is exposed (default 8, and set to 8 in the DeepEP test TOMLs) and is passed to the script as --num-processes. This means the configured process count isn’t reflected in the actual torch distributed worker topology. Also, DeepEPCmdArgs.python_executable is never used when constructing the launcher.

Update the torchrun command to derive --nproc_per_node from cmd_args.num_processes (with an explicit global-vs-per-node conversion) and to use cmd_args.python_executable when building the launcher.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cloudai/workloads/deepep/slurm_command_gen_strategy.py` around lines 182 - 189, generate_test_command currently hardcodes "torchrun --nproc_per_node=1" and ignores DeepEPCmdArgs.python_executable; compute the per-node process count from cmd_args.num_processes and num_nodes (e.g., per_node = max(1, int(cmd_args.num_processes / num_nodes)) and validate evenly divisible or handle remainder), then set "--nproc_per_node={per_node}" instead of the fixed 1, and prepend cmd_args.python_executable when building the launcher invocation (use cmd_args.python_executable + " " + self._script_path(cmd_args) or equivalent) so generate_test_command and the launcher respect DeepEPCmdArgs.num_processes and python_executable.

ybenvidia and others added 6 commits March 5, 2026 13:09

add tuning option

e912a9c

Merge branch 'NVIDIA:main' into dp-benchmark

aee7ebb

Merge branch 'NVIDIA:main' into dp-benchmark

ed01ea8

Merge branch 'NVIDIA:main' into dp-benchmark

8b029c9

add ucc/nccl all2allv

1b85a6e

Merge branch 'NVIDIA:main' into dp-benchmark

12f6742

ybenvidia requested review from jeffnvidia, podkidyshev and srivatsankrishnan as code owners May 14, 2026 13:58

coderabbitai Bot reviewed May 14, 2026

View reviewed changes

ybenvidia and others added 2 commits June 2, 2026 12:01

Merge branch 'NVIDIA:main' into main

3a743d3

add deepep v1/v2 and first version of the moe-benchmark

72a6f84

ybenvidia changed the title ~~Add UCC/NCCL alltoallv vs DeepEP~~ Add UCC/NCCL alltoallv, deepEP va/v2 and moe benchmark Jun 2, 2026

coderabbitai Bot reviewed Jun 2, 2026

View reviewed changes

fix pytest

f81856e

ybenvidia changed the title ~~Add UCC/NCCL alltoallv, deepEP va/v2 and moe benchmark~~ Add UCC/NCCL alltoallv, deepEP v1/v2 and moe benchmark Jun 2, 2026

Conversation

ybenvidia commented May 14, 2026

Uh oh!

coderabbitai Bot commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Suggested labels

Suggested reviewers

Poem

❌ Failed checks (1 inconclusive)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

podkidyshev commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented May 14, 2026 •

edited

Loading

podkidyshev commented May 15, 2026 •

edited

Loading