AIDynamo: shared node disagg inference by podkidyshev · Pull Request #909 · NVIDIA/cloudai

podkidyshev · 2026-06-02T17:06:51Z

Summary

Adds fork-style shared-node disaggregated AIDynamo on Slurm: if top-level num_nodes is smaller than prefill_worker.num-nodes + decode_worker.num-nodes, prefill and decode role node lists overlap and share the allocated nodes.
Keeps existing separate-node behavior for configs that omit num_nodes; CloudAI still expands the allocation to the role-node sum for backward compatibility.
Splits GPUs per shared node by role: decode gets the first TP * PP GPUs, prefill gets the next TP * PP GPUs, with early validation when the combined role GPU count does not fit.
Offsets per-role worker ports in shared-node mode so decode/prefill workers on the same host do not collide.
Adds a vLLM shared-node scenario with one AIPerf phase and inherited accuracy run enabled.
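
The node-count rule in the summary can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not CloudAI's actual code: `resolve_allocation` and its return shape are hypothetical names, and the sketch does not model the case where more nodes are allocated than the roles need.

```python
from typing import Optional, Tuple


def resolve_allocation(
    num_nodes: Optional[int], prefill_nodes: int, decode_nodes: int
) -> Tuple[int, bool]:
    """Return (allocated_nodes, shared_node_mode) for a disaggregated run.

    Hypothetical helper: the names and return shape are illustrative only.
    """
    role_sum = prefill_nodes + decode_nodes
    if num_nodes is None:
        # num_nodes omitted: expand to the role-node sum (separate-node mode).
        return role_sum, False
    # An explicit num_nodes below the role sum forces the roles to overlap.
    return num_nodes, num_nodes < role_sum


print(resolve_allocation(None, 1, 1))  # separate-node default
print(resolve_allocation(2, 2, 2))     # shared-node on two physical nodes
```

The two calls mirror the separate-node and shared-node config examples in this description.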

Config Examples

Separate-node disagg, existing default behavior:

[[Tests]]
id = "test.disagg.separate"
test_name = "vLLM"

  [Tests.cmd_args.dynamo.prefill_worker]
  num-nodes = 1

  [Tests.cmd_args.dynamo.decode_worker]
  num-nodes = 1

Shared-node disagg on two physical nodes:

[[Tests]]
id = "test.disagg.shared"
test_name = "vLLM"
num_nodes = 2

  [Tests.cmd_args.dynamo.prefill_worker]
  num-nodes = 2
    [Tests.cmd_args.dynamo.prefill_worker.args]
    tensor-parallel-size = 4
    pipeline-parallel-size = 1

  [Tests.cmd_args.dynamo.decode_worker]
  num-nodes = 2
    [Tests.cmd_args.dynamo.decode_worker.args]
    tensor-parallel-size = 4
    pipeline-parallel-size = 1

With 8 GPUs per node, this runs decode on GPUs 0-3 and prefill on GPUs 4-7 on each allocated node.
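
The arithmetic behind that split can be written out as a short sketch. It assumes the decode-first ordering and the TP * PP sizing described in the summary; `split_gpus` is a hypothetical helper for illustration and is not taken from the PR.

```python
from typing import Dict, List


def split_gpus(
    gpus_per_node: int,
    decode_tp: int,
    decode_pp: int,
    prefill_tp: int,
    prefill_pp: int,
) -> Dict[str, List[int]]:
    """Assign GPU indices on one shared node: decode first, then prefill."""
    decode_gpus = decode_tp * decode_pp
    prefill_gpus = prefill_tp * prefill_pp
    if decode_gpus + prefill_gpus > gpus_per_node:
        # Early validation: both roles must fit on a single node.
        raise ValueError(
            f"roles need {decode_gpus + prefill_gpus} GPUs, "
            f"but the node has only {gpus_per_node}"
        )
    return {
        "decode": list(range(decode_gpus)),
        "prefill": list(range(decode_gpus, decode_gpus + prefill_gpus)),
    }


print(split_gpus(8, decode_tp=4, decode_pp=1, prefill_tp=4, prefill_pp=1))
```

With the values from the example above (TP 4, PP 1 for both roles, 8 GPUs per node), decode receives indices 0-3 and prefill receives 4-7.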

Explicit node allocation (placeholder hostnames):

[[Tests]]
id = "test.disagg.shared-explicit"
test_name = "vLLM"
nodes = ["node-a", "node-b"]

  [Tests.cmd_args.dynamo.prefill_worker]
  num-nodes = 2
  nodes = "node-a,node-b"

  [Tests.cmd_args.dynamo.decode_worker]
  num-nodes = 2
  nodes = "node-a,node-b"

Overlapping worker node lists select shared-node mode. Non-overlapping worker node lists remain in separate-node mode.
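
One way to picture the overlap check is the sketch below. It assumes plain comma-separated hostnames as in the example above and does not expand Slurm bracket ranges; `parse_node_list` and `is_shared_node_mode` are hypothetical names, not functions from the PR.

```python
from typing import List


def parse_node_list(nodes: str) -> List[str]:
    """Split a comma-separated node string, dropping blanks and duplicates."""
    seen = {}
    for name in nodes.split(","):
        name = name.strip()
        if name:
            seen.setdefault(name, None)  # dict keeps first-seen order
    return list(seen)


def is_shared_node_mode(prefill_nodes: str, decode_nodes: str) -> bool:
    """Roles share nodes when their node lists have at least one host in common."""
    return bool(set(parse_node_list(prefill_nodes)) & set(parse_node_list(decode_nodes)))


print(is_shared_node_mode("node-a,node-b", "node-a,node-b"))  # overlapping lists
print(is_shared_node_mode("node-a", "node-b"))                # disjoint lists
```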

Test Plan

Automated CI
Manual runs

coderabbitai · 2026-06-02T17:06:59Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: aec77906-a41d-4d29-93b2-a567546056b0

📥 Commits

Reviewing files that changed from the base of the PR and between 7f72c27 and ca2e0e2.

📒 Files selected for processing (2)

doc/workloads/ai_dynamo.rst
src/cloudai/workloads/ai_dynamo/ai_dynamo.sh

📝 Walkthrough

Adds shared-node disaggregation: track explicit num_nodes, detect overlapping prefill/decode node lists, compute and validate role node/GPU sizing, emit per-role node-list flags, and implement runtime GPU-slicing and per-worker system-port/metrics offsets.

Changes

Shared-Node Disaggregation Implementation

| Layer | File(s) | Summary |
|---|---|---|
| Data Model: Explicit Node Count Tracking | `src/cloudai/_core/test_scenario.py`, `src/cloudai/test_scenario_parser.py` | `TestRun` now records `num_nodes_explicit` to distinguish user-set vs. computed node counts. |
| Command Strategy: Node Allocation and Disaggregation Logic | `src/cloudai/workloads/ai_dynamo/slurm_command_gen_strategy.py` | Node-list parsing, deduplication, overlap detection, role-node computation, conditional `--prefill-node-list`/`--decode-node-list` emission, and revised cached-node spec validation for shared-node runs. |
| Constraint Validation: Shared-Node GPU Limits | `src/cloudai/workloads/ai_dynamo/ai_dynamo.py` | `constraint_check` enforces that when roles share a node, `prefill_tp * prefill_pp + decode_tp * decode_pp` must not exceed `gpus_per_node`. |
| Shell Script Runtime: Worker Orchestration and Port Management | `src/cloudai/workloads/ai_dynamo/ai_dynamo.sh` | Detect overlapping node lists to enable `SHARED_NODE_DISAGG`, compute GPU offsets and force `workers-per-node=1` for slices, select GPU lists with offsets, assign per-worker `system_port` from `DYN_SYSTEM_PORT`, and adjust aiperf metrics URLs. |
| Test Coverage: Node Disaggregation and Constraints | `tests/workloads/ai_dynamo/test_command_gen_strategy_slurm.py` | New unit tests for explicit smaller `num_nodes` preservation, role-sum resolution when omitted, overlapping node-list acceptance, rejection when extra nodes are allocated, and constraint checks for shared vs. separate-node GPU allocations. |
| Configuration and Documentation | `conf/experimental/ai_dynamo/test_scenario/vllm_slurm.toml`, `doc/workloads/ai_dynamo.rst` | Added a shared-node vLLM test scenario and a documentation subsection describing Slurm sizing, decode-first GPU assignment order, and the combined-role GPU-on-one-node constraint. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

NVIDIA/cloudai#907: related changes around AIPerf/server-metrics and multi-phase aiperf configuration.

Suggested labels

enhancement

Suggested reviewers

srivatsankrishnan
jeffnvidia
amaslenn

Poem

🐰 I hopped through nodes both near and wide,
Split GPUs and ports so roles could bide,
Prefill and decode now dance side by side,
One node, two jobs — no more collide,
Carrots for tests, configuration, and pride.

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title directly and concisely describes the main feature: adding shared-node disaggregated inference capability to AIDynamo. |
| Description check | ✅ Passed | The description is comprehensive and directly related to the changeset, providing clear context about shared-node vs separate-node behavior, GPU splitting strategy, port management, and concrete configuration examples. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@doc/workloads/ai_dynamo.rst`:
- Around line 306-313: Update the earlier Slurm node-count guidance to
explicitly document the shared-node disaggregation exception: state that
normally required node count equals num_prefill_nodes + num_decode_nodes, but if
you set top-level num_nodes lower than prefill_worker.num-nodes +
decode_worker.num-nodes you enter Shared-Node Disaggregated mode where CloudAI
assigns decode GPUs first then prefill GPUs based on each role's
tensor-parallel-size * pipeline-parallel-size; reference the parameters
num_nodes, prefill_worker.num-nodes, decode_worker.num-nodes,
tensor-parallel-size and pipeline-parallel-size and mention CloudAI’s behavior
so users know this is an explicit exception rather than a contradiction.

In `@src/cloudai/workloads/ai_dynamo/ai_dynamo.py`:
- Line 607: constraint_check may be called with WorkerConfig.num_nodes still
holding a list which causes an unclear TypeError when computing
role_total_nodes; in the constraint_check (or just before computing
role_total_nodes) validate that prefill_worker.num_nodes and
decode_worker.num_nodes are scalar (not list/tuple), and if they are list-like
raise a clear TypeError explaining that apply_params_set/unroll_dse must have
been run (mention WorkerConfig.num_nodes, apply_params_set, unroll_dse) and
include which role (prefill or decode) has a list value so callers get an
actionable error instead of relying on int(list) behavior.

In `@src/cloudai/workloads/ai_dynamo/ai_dynamo.sh`:
- Around line 1024-1025: The SC2155 warning is about declaring and assigning in
one statement: in the for-loop where you call _gpu_list_for_worker_offset,
change the single-line "local gpu_list=$(...)" to two statements so the local
declaration and the assignment are separate; locate the loop using the for i in
$(seq ...) and the variable gpu_list and update to first "local gpu_list" then
"gpu_list=$(_gpu_list_for_worker_offset
\"${prefill_config[\"gpus-per-worker\"]}\" \"$i\" \"$gpu_offset\")".
- Around line 540-547: The helper function _gpu_list_for_worker_offset uses an
unnecessary echo and an unquoted variable; replace the subshell echo with a
direct cut reading the CUDA_VISIBLE_DEVICES content and quote expansions to
avoid word-splitting. Update _gpu_list_for_worker_offset to compute start/end as
before but call cut like: cut -d',' -f${start}-${end} <<<"$CUDA_VISIBLE_DEVICES"
(or printf '%s' "$CUDA_VISIBLE_DEVICES" | cut ...), and ensure positional inputs
(per_worker, idx, offset) are referenced as quoted variables where used to
satisfy shellcheck SC2005/SC2086.
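
The scalar check requested for `ai_dynamo.py` above could look roughly like this. It is a sketch under assumptions: `require_scalar_num_nodes` is a hypothetical helper, and only the names `apply_params_set`, `unroll_dse`, and `num_nodes` come from the review comment.

```python
from typing import Any


def require_scalar_num_nodes(role: str, num_nodes: Any) -> int:
    """Reject list-like num_nodes with an actionable error (hypothetical helper)."""
    if isinstance(num_nodes, (list, tuple)):
        # A list here means the DSE parameter sweep has not been unrolled yet.
        raise TypeError(
            f"{role}_worker.num_nodes is still list-like ({num_nodes!r}); "
            "apply_params_set/unroll_dse must run before constraint_check"
        )
    return int(num_nodes)


role_total_nodes = (
    require_scalar_num_nodes("prefill", 2) + require_scalar_num_nodes("decode", 2)
)
print(role_total_nodes)
```

Naming the role in the message tells the caller which worker config still carries an unexpanded list.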

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 74871532-aacc-4d53-af88-c6e478a67f6f

📥 Commits

Reviewing files that changed from the base of the PR and between 4d8bbd3 and 7f72c27.

📒 Files selected for processing (8)

conf/experimental/ai_dynamo/test_scenario/vllm_slurm.toml
doc/workloads/ai_dynamo.rst
src/cloudai/_core/test_scenario.py
src/cloudai/test_scenario_parser.py
src/cloudai/workloads/ai_dynamo/ai_dynamo.py
src/cloudai/workloads/ai_dynamo/ai_dynamo.sh
src/cloudai/workloads/ai_dynamo/slurm_command_gen_strategy.py
tests/workloads/ai_dynamo/test_command_gen_strategy_slurm.py

shared disagg

7f72c27

podkidyshev marked this pull request as ready for review June 2, 2026 17:20

podkidyshev requested review from jeffnvidia and srivatsankrishnan as code owners June 2, 2026 17:20

coderabbitai Bot reviewed Jun 2, 2026

Comment thread doc/workloads/ai_dynamo.rst

Comment thread src/cloudai/workloads/ai_dynamo/ai_dynamo.py

Comment thread src/cloudai/workloads/ai_dynamo/ai_dynamo.sh

Comment thread src/cloudai/workloads/ai_dynamo/ai_dynamo.sh Outdated

resolve ai comments

ca2e0e2

amaslenn approved these changes Jun 2, 2026

podkidyshev merged commit 251e888 into main Jun 2, 2026
5 checks passed

podkidyshev deleted the ipod/dynamo-disagg-shared branch June 2, 2026 18:16

podkidyshev merged 2 commits into main from ipod/dynamo-disagg-shared
