AIDynamo: shared node disagg inference #909

Merged
podkidyshev merged 2 commits into main from ipod/dynamo-disagg-shared on Jun 2, 2026
Conversation


@podkidyshev podkidyshev commented Jun 2, 2026

Summary

  • Adds fork-style shared-node disaggregated AIDynamo on Slurm: if top-level num_nodes is smaller than prefill_worker.num-nodes + decode_worker.num-nodes, prefill and decode role node lists overlap and share the allocated nodes.
  • Keeps existing separate-node behavior for configs that omit num_nodes; CloudAI still expands the allocation to the role-node sum for backward compatibility.
  • Splits GPUs per shared node by role: decode gets the first TP * PP GPUs, prefill gets the next TP * PP GPUs, with early validation when the combined role GPU count does not fit.
  • Offsets per-role worker ports in shared-node mode so decode/prefill workers on the same host do not collide.
  • Adds a vLLM shared-node scenario with one AIPerf phase and inherited accuracy run enabled.
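The node-count rule in the first two bullets can be sketched as follows. This is a minimal illustration, not the CloudAI implementation: `Allocation` and `resolve_allocation` are hypothetical names, and the sketch does not model the case where more nodes are requested than the roles need.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Allocation:
    num_nodes: int  # nodes requested from Slurm
    shared: bool    # True when prefill and decode node lists must overlap


def resolve_allocation(
    prefill_nodes: int, decode_nodes: int, num_nodes: Optional[int] = None
) -> Allocation:
    """Hypothetical helper illustrating the sizing rule from the summary."""
    role_sum = prefill_nodes + decode_nodes
    if num_nodes is None:
        # Backward-compatible default: expand the allocation to the role-node sum.
        return Allocation(num_nodes=role_sum, shared=False)
    # An explicit num_nodes below the role sum means the roles share nodes.
    return Allocation(num_nodes=num_nodes, shared=num_nodes < role_sum)
```

Under these assumptions, `resolve_allocation(1, 1)` keeps the separate-node default on two nodes, while `resolve_allocation(2, 2, num_nodes=2)` selects shared-node mode on two nodes.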

Config Examples

Separate-node disagg, existing default behavior:

[[Tests]]
id = "test.disagg.separate"
test_name = "vLLM"

  [Tests.cmd_args.dynamo.prefill_worker]
  num-nodes = 1

  [Tests.cmd_args.dynamo.decode_worker]
  num-nodes = 1

Shared-node disagg on two physical nodes:

[[Tests]]
id = "test.disagg.shared"
test_name = "vLLM"
num_nodes = 2

  [Tests.cmd_args.dynamo.prefill_worker]
  num-nodes = 2
    [Tests.cmd_args.dynamo.prefill_worker.args]
    tensor-parallel-size = 4
    pipeline-parallel-size = 1

  [Tests.cmd_args.dynamo.decode_worker]
  num-nodes = 2
    [Tests.cmd_args.dynamo.decode_worker.args]
    tensor-parallel-size = 4
    pipeline-parallel-size = 1

With 8 GPUs per node, this runs decode on GPUs 0-3 and prefill on GPUs 4-7 on each allocated node.
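The decode-first split described above can be reproduced with a few lines of arithmetic. The function name `split_gpus` and its signature are assumptions made for this sketch; only the ordering (decode first, then prefill) and the fit check come from the PR description.

```python
from typing import List, Tuple


def split_gpus(
    gpus_per_node: int,
    decode_tp: int,
    decode_pp: int,
    prefill_tp: int,
    prefill_pp: int,
) -> Tuple[List[int], List[int]]:
    """Return (decode_gpus, prefill_gpus) for one shared node."""
    decode_count = decode_tp * decode_pp
    prefill_count = prefill_tp * prefill_pp
    if decode_count + prefill_count > gpus_per_node:
        # Early validation: both roles must fit on a single node.
        raise ValueError(
            f"decode ({decode_count}) + prefill ({prefill_count}) GPUs "
            f"exceed gpus_per_node ({gpus_per_node})"
        )
    decode_gpus = list(range(decode_count))
    prefill_gpus = list(range(decode_count, decode_count + prefill_count))
    return decode_gpus, prefill_gpus
```

For the scenario above (8 GPUs per node, TP=4 and PP=1 for both roles), `split_gpus(8, 4, 1, 4, 1)` gives decode GPUs 0-3 and prefill GPUs 4-7.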

Explicit node allocation (placeholder hostnames, not real ones):

[[Tests]]
id = "test.disagg.shared-explicit"
test_name = "vLLM"
nodes = ["node-a", "node-b"]

  [Tests.cmd_args.dynamo.prefill_worker]
  num-nodes = 2
  nodes = "node-a,node-b"

  [Tests.cmd_args.dynamo.decode_worker]
  num-nodes = 2
  nodes = "node-a,node-b"

Overlapping worker node lists select shared-node mode. Non-overlapping worker node lists remain in separate-node mode.
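A minimal sketch of the overlap test, assuming the per-role `nodes` value is a plain comma-separated string as in the example above. `parse_nodes` and `is_shared_node` are hypothetical names, and the sketch does not expand Slurm bracket notation such as `node[1-3]`.

```python
from typing import List


def parse_nodes(spec: str) -> List[str]:
    """Split a comma-separated node list, dropping blanks and duplicates."""
    names = (name.strip() for name in spec.split(","))
    # dict.fromkeys deduplicates while preserving the original order.
    return list(dict.fromkeys(name for name in names if name))


def is_shared_node(prefill_spec: str, decode_spec: str) -> bool:
    """True when at least one node appears in both role node lists."""
    return bool(set(parse_nodes(prefill_spec)) & set(parse_nodes(decode_spec)))
```

With the config above, `is_shared_node("node-a,node-b", "node-a,node-b")` is true, whereas `is_shared_node("node-a", "node-b")` is false and the run stays in separate-node mode.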

Test Plan

  • Automated CI
  • Manual runs


coderabbitai Bot commented Jun 2, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: aec77906-a41d-4d29-93b2-a567546056b0

📥 Commits

Reviewing files that changed from the base of the PR and between 7f72c27 and ca2e0e2.

📒 Files selected for processing (2)
  • doc/workloads/ai_dynamo.rst
  • src/cloudai/workloads/ai_dynamo/ai_dynamo.sh

📝 Walkthrough

Adds shared-node disaggregation: track explicit num_nodes, detect overlapping prefill/decode node lists, compute and validate role node/GPU sizing, emit per-role node-list flags, and implement runtime GPU-slicing and per-worker system-port/metrics offsets.

Changes

Shared-Node Disaggregation Implementation

Layer, file(s), and summary:

  • Data Model: Explicit Node Count Tracking
    Files: src/cloudai/_core/test_scenario.py, src/cloudai/test_scenario_parser.py
    Summary: TestRun now records num_nodes_explicit to distinguish user-set vs. computed node counts.
  • Command Strategy: Node Allocation and Disaggregation Logic
    Files: src/cloudai/workloads/ai_dynamo/slurm_command_gen_strategy.py
    Summary: Node-list parsing, deduplication, overlap detection, role-node computation, conditional --prefill-node-list/--decode-node-list emission, and revised cached-node spec validation for shared-node runs.
  • Constraint Validation: Shared-Node GPU Limits
    Files: src/cloudai/workloads/ai_dynamo/ai_dynamo.py
    Summary: constraint_check enforces that when roles share a node, (prefill_tp * prefill_pp + decode_tp * decode_pp) must not exceed gpus_per_node.
  • Shell Script Runtime: Worker Orchestration and Port Management
    Files: src/cloudai/workloads/ai_dynamo/ai_dynamo.sh
    Summary: Detect overlapping node lists to enable SHARED_NODE_DISAGG, compute GPU offsets and force workers-per-node=1 for slices, select GPU lists with offsets, assign per-worker system_port from DYN_SYSTEM_PORT, and adjust aiperf metrics URLs.
  • Test Coverage: Node Disaggregation and Constraints
    Files: tests/workloads/ai_dynamo/test_command_gen_strategy_slurm.py
    Summary: New unit tests for explicit smaller num_nodes preservation, role-sum resolution when omitted, overlapping node-list acceptance, rejection when extra nodes are allocated, and constraint checks for shared vs separate-node GPU allocations.
  • Configuration and Documentation
    Files: conf/experimental/ai_dynamo/test_scenario/vllm_slurm.toml, doc/workloads/ai_dynamo.rst
    Summary: Added a shared-node vLLM test scenario and documentation subsection describing Slurm sizing, decode-first GPU assignment order, and the combined-role GPU-on-one-node constraint.
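The runtime behavior summarized for ai_dynamo.sh (GPU lists selected with an offset, and a distinct system port per worker) can be illustrated in Python. This is a sketch under stated assumptions, not a port of the script: the function names, the zero-based slicing, and the additive port scheme are all hypothetical, since the exact formulas in ai_dynamo.sh are not shown in this conversation.

```python
def gpu_list_for_worker(visible: str, per_worker: int, idx: int, offset: int) -> str:
    """Pick one worker's slice from a CUDA_VISIBLE_DEVICES-style string.

    Assumed layout: `offset` GPUs are skipped first (for example, the decode
    slice when selecting prefill GPUs), then workers are packed back to back.
    """
    devices = [d for d in visible.split(",") if d]
    start = offset + idx * per_worker
    chosen = devices[start:start + per_worker]
    if len(chosen) != per_worker:
        raise ValueError("not enough visible GPUs for this worker slice")
    return ",".join(chosen)


def worker_system_port(base_port: int, role_offset: int, idx: int) -> int:
    """Assumed scheme: base port (e.g. from DYN_SYSTEM_PORT) plus a per-role
    offset plus the worker index, so co-located workers get distinct ports."""
    return base_port + role_offset + idx
```

With `visible="0,1,2,3,4,5,6,7"` and four GPUs per worker, an offset of 0 selects the first four devices and an offset of 4 selects the last four, which mirrors the decode-first ordering described in the PR.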

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • NVIDIA/cloudai#907: related changes around AIPerf/server-metrics and multi-phase aiperf configuration.

Suggested labels

enhancement

Suggested reviewers

  • srivatsankrishnan
  • jeffnvidia
  • amaslenn

Poem

🐰 I hopped through nodes both near and wide,
Split GPUs and ports so roles could bide,
Prefill and decode now dance side by side,
One node, two jobs — no more collide,
Carrots for tests, configuration, and pride.

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The title directly and concisely describes the main feature: adding shared-node disaggregated inference capability to AIDynamo.
Description check | ✅ Passed | The description is comprehensive and directly related to the changeset, providing clear context about shared-node vs separate-node behavior, GPU splitting strategy, port management, and concrete configuration examples.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.


@podkidyshev podkidyshev marked this pull request as ready for review June 2, 2026 17:20

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@doc/workloads/ai_dynamo.rst`:
- Around line 306-313: Update the earlier Slurm node-count guidance to
explicitly document the shared-node disaggregation exception: state that
normally required node count equals num_prefill_nodes + num_decode_nodes, but if
you set top-level num_nodes lower than prefill_worker.num-nodes +
decode_worker.num-nodes you enter Shared-Node Disaggregated mode where CloudAI
assigns decode GPUs first then prefill GPUs based on each role's
tensor-parallel-size * pipeline-parallel-size; reference the parameters
num_nodes, prefill_worker.num-nodes, decode_worker.num-nodes,
tensor-parallel-size and pipeline-parallel-size and mention CloudAI’s behavior
so users know this is an explicit exception rather than a contradiction.

In `@src/cloudai/workloads/ai_dynamo/ai_dynamo.py`:
- Line 607: constraint_check may be called with WorkerConfig.num_nodes still
holding a list which causes an unclear TypeError when computing
role_total_nodes; in the constraint_check (or just before computing
role_total_nodes) validate that prefill_worker.num_nodes and
decode_worker.num_nodes are scalar (not list/tuple), and if they are list-like
raise a clear TypeError explaining that apply_params_set/unroll_dse must have
been run (mention WorkerConfig.num_nodes, apply_params_set, unroll_dse) and
include which role (prefill or decode) has a list value so callers get an
actionable error instead of relying on int(list) behavior.

In `@src/cloudai/workloads/ai_dynamo/ai_dynamo.sh`:
- Around line 1024-1025: The SC2155 warning is about declaring and assigning in
one statement: in the for-loop where you call _gpu_list_for_worker_offset,
change the single-line "local gpu_list=$(...)" to two statements so the local
declaration and the assignment are separate; locate the loop using the for i in
$(seq ...) and the variable gpu_list and update to first "local gpu_list" then
"gpu_list=$(_gpu_list_for_worker_offset
\"${prefill_config[\"gpus-per-worker\"]}\" \"$i\" \"$gpu_offset\")".
- Around line 540-547: The helper function _gpu_list_for_worker_offset uses an
unnecessary echo and an unquoted variable; replace the subshell echo with a
direct cut reading the CUDA_VISIBLE_DEVICES content and quote expansions to
avoid word-splitting. Update _gpu_list_for_worker_offset to compute start/end as
before but call cut like: cut -d',' -f${start}-${end} <<<"$CUDA_VISIBLE_DEVICES"
(or printf '%s' "$CUDA_VISIBLE_DEVICES" | cut ...), and ensure positional inputs
(per_worker, idx, offset) are referenced as quoted variables where used to
satisfy shellcheck SC2005/SC2086.
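The ai_dynamo.py finding above asks for an explicit scalar check before role node counts are summed. One way this could look is sketched below; `require_scalar_num_nodes` is a hypothetical helper, and the real `constraint_check` and `WorkerConfig` signatures are not shown in this review, so the sketch takes plain values instead.

```python
def require_scalar_num_nodes(role: str, num_nodes: object) -> int:
    """Return num_nodes as an int, or raise a clear error for list-like values.

    `role` is expected to be "prefill" or "decode".
    """
    if isinstance(num_nodes, (list, tuple)):
        raise TypeError(
            f"{role}_worker.num_nodes is list-like ({num_nodes!r}); "
            "WorkerConfig.num_nodes must be a scalar here, so "
            "apply_params_set/unroll_dse must have been run first"
        )
    return int(num_nodes)


def role_total_nodes(prefill_num_nodes: object, decode_num_nodes: object) -> int:
    """Sum the role node counts only after both are confirmed scalar."""
    return require_scalar_num_nodes("prefill", prefill_num_nodes) + \
        require_scalar_num_nodes("decode", decode_num_nodes)
```

Naming the offending role in the message gives callers something actionable, which is the point of the review comment.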

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 74871532-aacc-4d53-af88-c6e478a67f6f

📥 Commits

Reviewing files that changed from the base of the PR and between 4d8bbd3 and 7f72c27.

📒 Files selected for processing (8)
  • conf/experimental/ai_dynamo/test_scenario/vllm_slurm.toml
  • doc/workloads/ai_dynamo.rst
  • src/cloudai/_core/test_scenario.py
  • src/cloudai/test_scenario_parser.py
  • src/cloudai/workloads/ai_dynamo/ai_dynamo.py
  • src/cloudai/workloads/ai_dynamo/ai_dynamo.sh
  • src/cloudai/workloads/ai_dynamo/slurm_command_gen_strategy.py
  • tests/workloads/ai_dynamo/test_command_gen_strategy_slurm.py

Comment thread doc/workloads/ai_dynamo.rst
Comment thread src/cloudai/workloads/ai_dynamo/ai_dynamo.py
Comment thread src/cloudai/workloads/ai_dynamo/ai_dynamo.sh
Comment thread src/cloudai/workloads/ai_dynamo/ai_dynamo.sh Outdated
@podkidyshev podkidyshev merged commit 251e888 into main Jun 2, 2026
5 checks passed
@podkidyshev podkidyshev deleted the ipod/dynamo-disagg-shared branch June 2, 2026 18:16