AIDynamo: Optional restart of DynamoRouter between AIPerf re-runs #908

Merged
podkidyshev merged 7 commits into main from ipod/dynamo-clear-cache on Jun 2, 2026

@podkidyshev
Contributor

@podkidyshev podkidyshev commented Jun 1, 2026

Summary

  • Add multi-phase AIPerf execution for AIDynamo with base config plus per-phase overrides.
  • Preserve single-run artifact layout while writing per-phase logs/reports for multi-run scenarios.
  • Add an explicit between-phase bash hook for cache cleanup or router restart; default is a no-op.
  • Document LMCache propagation, AIPerf phases, server-metrics/DCGM usage, and DSE exclusions.
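The "base config plus per-phase overrides" idea in the first bullet can be pictured with a minimal sketch. This is not the PR's implementation: the function name `resolve_phase_args`, the shallow-merge behavior, and the argument keys are all assumptions made for illustration.

```python
from typing import Any


def resolve_phase_args(base: dict[str, Any], overrides: dict[str, Any]) -> dict[str, Any]:
    """Return the effective arguments for one phase (hypothetical helper).

    Assumption: a shallow merge in which override values win over base values.
    """
    effective = dict(base)  # copy so the shared base config is never mutated
    effective.update(overrides)
    return effective


# Illustrative keys only; they are not taken from the PR.
base_args = {"concurrency": 8, "request-count": 100}
phase_overrides = [{}, {"concurrency": 32}]

# One effective argument set per phase, in execution order.
phases = [resolve_phase_args(base_args, o) for o in phase_overrides]
print(phases)
```

Under these assumptions, a phase with no overrides runs with the base config unchanged, and the base config itself is never modified by a later phase.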

Test Plan

  • Automated CI
  • Manual runs

Additional Notes

@podkidyshev podkidyshev self-assigned this Jun 1, 2026
@podkidyshev podkidyshev added the enhancement New feature or request label Jun 1, 2026
@coderabbitai
Contributor

coderabbitai Bot commented Jun 1, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 8ed96fb8-2a76-4ff4-a06f-355836dd323b

📥 Commits

Reviewing files that changed from the base of the PR and between 1d20608 and c385839.

📒 Files selected for processing (3)
  • conf/experimental/ai_dynamo/test/sglang.toml
  • conf/experimental/ai_dynamo/test/vllm.toml
  • doc/workloads/ai_dynamo.rst

📝 Walkthrough

Walkthrough

Adds optional between_phase_cmd fields, inserts logged between-phase bash blocks into generated multi-phase AIPerf scripts, refactors router lifecycle to a generated routerctl.sh with readiness polling and start/stop commands, and updates related configs, tests, and docs.
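The readiness polling mentioned in the walkthrough lives in the generated routerctl.sh, whose contents are not shown in this conversation. As a rough illustration of the general pattern only, here is a sketch in Python; `wait_until_ready`, its parameters, and the injected `probe` callable are hypothetical and stand in for whatever health check the real script performs.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 30.0,
    interval_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `probe` until it returns True or the timeout elapses (hypothetical helper)."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True  # service answered; the caller may proceed
        if clock() >= deadline:
            return False  # gave up; the caller decides how to fail
        sleep(interval_s)


# Usage with a fake probe that succeeds on the third attempt.
answers = iter([False, False, True])
ready = wait_until_ready(lambda: next(answers), sleep=lambda _s: None)
print(ready)
```

Running the start step synchronously and returning only after such a loop succeeds is what lets a caller gate later work on the router being reachable.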

Changes

Between-Phase AIPerf Execution

| Layer / File(s) | Summary |
|---|---|
| Configuration schema and benchmark examples: src/cloudai/workloads/ai_dynamo/ai_dynamo.py, conf/experimental/ai_dynamo/test/* | Adds between_phase_cmd: str \| None to AIPerf (default "true") and AIPerfPhase (default None) with aliasing; adds cmd_args.dynamo.ingress-cmd entries and updates --extra-inputs in test configs to include stop: ["\n"]. |
| Between-phase command script generation & tests: src/cloudai/workloads/ai_dynamo/slurm_command_gen_strategy.py, tests/ref_data/ai-dynamo-aiperf.sh, tests/workloads/ai_dynamo/test_command_gen_strategy_slurm.py | Adds _render_between_aiperf_phases_block to render logged bash snippets for a between-phase command; inserts these blocks between non-final phases in the generated script; reference script and unit tests updated to assert the inserted command. |
| Router control script infrastructure: src/cloudai/workloads/ai_dynamo/ai_dynamo.sh | Adds write_routerctl to produce routerctl.sh supporting start |
| Router startup and shutdown integration: src/cloudai/workloads/ai_dynamo/ai_dynamo.sh | launch_ingress now writes/uses routerctl.sh start and runs synchronously to gate on readiness; start_router delegates to controller; perform_exit stops the router via routerctl.sh stop on shutdown. |
| Documentation: doc/workloads/ai_dynamo.rst, doc/USER_GUIDE.rst | Adds 'AIPerf Multi-Phase Runs' section and clarifies dse_excluded_args behavior and scope. |
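To make the script-generation row concrete, here is a minimal sketch of inserting a logged between-phase command after every phase except the last. It is not the PR's `_render_between_aiperf_phases_block`: the function `render_script`, its signature, the log wording, and the rule that a phase-level value of None falls back to the top-level default are assumptions. Only the defaults ("true" at the top level, None per phase) come from the summary above.

```python
import shlex
from typing import Optional


def render_script(
    phase_cmds: list[str],
    default_between: str = "true",
    per_phase_between: Optional[list[Optional[str]]] = None,
) -> str:
    """Render a bash script that runs phases in order (hypothetical helper).

    Assumption: the hook after phase i comes from phase i, and None means
    "use the top-level default", which is the no-op `true`.
    """
    if per_phase_between is None:
        per_phase_between = [None] * len(phase_cmds)
    lines = ["#!/bin/bash"]
    for i, cmd in enumerate(phase_cmds):
        lines.append(cmd)
        if i < len(phase_cmds) - 1:  # no hook after the final phase
            hook = per_phase_between[i]
            if hook is None:
                hook = default_between
            # Log the hook, then run it; quoting keeps the command one argument.
            lines.append("echo " + shlex.quote("Running between-phase command: " + hook))
            lines.append("bash -c " + shlex.quote(hook))
    return "\n".join(lines) + "\n"


# Placeholder phase commands; a real script would invoke the benchmark here.
script = render_script(
    ["echo phase-1", "echo phase-2", "echo phase-3"],
    per_phase_between=[None, "echo reset", "echo never-used"],
)
print(script)
```

With three phases the sketch emits two hooks: the default no-op after the first phase and the explicit command after the second. The value attached to the final phase is ignored.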

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • NVIDIA/cloudai#907: Multi-phase AIPerf runner with inter-phase/phase handling; related to script-generation and inter-phase behavior.

Suggested reviewers

  • srivatsankrishnan
  • jeffnvidia
  • amaslenn

Poem

🐰 I hop between each AIPerf round,

I log the step and tap the ground.
routerctl hums, I wait for ping,
curl and bash — then onward spring.
A tiny hop, the next phase found.

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately summarizes the main feature: optional DynamoRouter restart between AIPerf re-runs, which aligns with the between-phase restart/cleanup capability added throughout the changeset. |
| Description check | ✅ Passed | The description clearly relates to the changeset, covering multi-phase AIPerf execution, artifact layout preservation, between-phase bash hooks, and documentation of new features. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@podkidyshev podkidyshev changed the title from "AIDynamo: Optional cleanup of DynamoRouter between AIPerf re-runs" to "AIDynamo: Optional restart of DynamoRouter between AIPerf re-runs" Jun 1, 2026
@podkidyshev podkidyshev marked this pull request as ready for review June 2, 2026 11:14
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@doc/workloads/ai_dynamo.rst`:
- Around line 124-126: Update the text for cmd_args.aiperf and
cmd_args.aiperf_phases to state clearly that phases run against the same live
Dynamo stack by default without restarting prefill, decode, or router processes
(i.e., “no restart unless explicitly configured”), remove the contradictory
recommendation to restart the router between phases, and replace the unsupported
example `routerctl.sh restart --reset-states` with a supported invocation such
as `routerctl.sh restart` (and mirror the same wording/command change at the
other occurrence referenced).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 7db27779-e5b1-467c-ba44-d6b471fdbd57

📥 Commits

Reviewing files that changed from the base of the PR and between 28a7355 and a10d29e.

📒 Files selected for processing (3)
  • doc/workloads/ai_dynamo.rst
  • src/cloudai/workloads/ai_dynamo/ai_dynamo.py
  • tests/ref_data/ai-dynamo-aiperf.sh

Comment thread doc/workloads/ai_dynamo.rst
@podkidyshev podkidyshev merged commit 4d8bbd3 into main Jun 2, 2026
5 checks passed
@podkidyshev podkidyshev deleted the ipod/dynamo-clear-cache branch June 2, 2026 15:29