fix: generate MAD_MULTI_NODE_RUNNER for Docker local deployment by coketaste · Pull Request #126 · ROCm/madengine

coketaste · 2026-05-19T22:38:02Z

Summary

Docker local mode had no mechanism to set MAD_MULTI_NODE_RUNNER, unlike SLURM and K8s which generate it in their deployment layers. This caused distributed training scripts (e.g. Megatron-LM train_7b.sh) to fail with Error: MULTI_NODE_RUNNER is not defined on local runs.
Adds _generate_local_launcher_command() to ContainerRunner and calls it in run_container() after GPU resolution, generating the appropriate single-node launcher command (torchrun, deepspeed, etc.) and injecting it as the MAD_MULTI_NODE_RUNNER env var.
No-op for models that don't use MAD_MULTI_NODE_RUNNER: the env var is simply unused.

Details

SLURM sets MAD_MULTI_NODE_RUNNER in job.sh.j2 and K8s sets it in kubernetes_launcher_mixin.py. Docker local had no equivalent, so any model script referencing $MAD_MULTI_NODE_RUNNER would fail.

The fix reuses the already-resolved launcher variable in run_container() and maps it to the correct single-node command:

torchrun/megatron/torchtitan → torchrun --standalone --nproc_per_node=N
deepspeed → deepspeed --num_gpus=N
vllm/sglang/primus → empty (these manage their own process spawning)
Defaults to torchrun when launcher is a deployment-level value like "docker" or "native"
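
The mapping above can be sketched as a small standalone function. This is an illustration under stated assumptions, not the code merged in this PR: the function name `sketch_local_launcher_command`, the `SELF_MANAGED` set, the case-insensitive matching, and the choice to fall back to torchrun for every unrecognized value (not only "docker" or "native") are hypothetical. Only the resulting command strings are taken from the description.

```python
# Hypothetical sketch; not the actual ContainerRunner implementation.

# Launchers that spawn their own worker processes, so no wrapper is needed.
SELF_MANAGED = {"vllm", "sglang", "primus"}


def sketch_local_launcher_command(launcher: str, nproc_per_node: int) -> str:
    """Map a launcher name to a single-node launcher command string."""
    name = (launcher or "").strip().lower()  # assumption: case-insensitive match

    if name in SELF_MANAGED:
        return ""
    if name == "deepspeed":
        return f"deepspeed --num_gpus={nproc_per_node}"

    # torchrun, megatron, torchtitan, and (by assumption) anything else,
    # including deployment-level values such as "docker" or "native".
    return f"torchrun --standalone --nproc_per_node={nproc_per_node}"
```

For example, `sketch_local_launcher_command("deepspeed", 4)` returns `"deepspeed --num_gpus=4"`, and `sketch_local_launcher_command("docker", 8)` falls back to the torchrun form.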

Only sets the env var if the user hasn't already provided it via docker_env_vars.
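
A minimal sketch of that "user value wins" rule, assuming docker_env_vars behaves like a plain dict of strings. The helper name `inject_runner` is hypothetical, and whether the merged code also injects an empty value for self-managed launchers is not stated in this PR.

```python
# Hypothetical sketch; helper name and copy semantics are assumptions.
def inject_runner(docker_env_vars: dict, runner_cmd: str) -> dict:
    """Return a copy with MAD_MULTI_NODE_RUNNER set unless already present."""
    env = dict(docker_env_vars)  # copy so the caller's dict is not mutated
    # setdefault only writes when the key is absent, so a user-provided
    # MAD_MULTI_NODE_RUNNER is never overwritten.
    env.setdefault("MAD_MULTI_NODE_RUNNER", runner_cmd)
    return env
```

Using `setdefault` keeps the precedence in one place: any value the user supplied, even an unusual one, is passed through untouched.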

Test plan

Run a model that uses $MAD_MULTI_NODE_RUNNER (e.g. Megatron-LM train_7b) locally — verify it no longer fails with MULTI_NODE_RUNNER is not defined
Run a model that doesn't use $MAD_MULTI_NODE_RUNNER (e.g. HuggingFace script) — verify no regression
Verify SLURM/K8s paths are unaffected (change is scoped to ContainerRunner.run_container())

Docker local mode had no mechanism to set MAD_MULTI_NODE_RUNNER, unlike SLURM (job.sh.j2) and K8s (kubernetes_launcher_mixin.py) which generate it in their deployment layers. This caused train_7b.sh (Megatron-LM) to fail with 'Error: MULTI_NODE_RUNNER is not defined' on local runs.

Add _generate_local_launcher_command() to ContainerRunner that generates the appropriate single-node distributed process launcher command, and call it in run_container() after GPU resolution. Reuses the already-resolved launcher variable (lines 327-372) to stay consistent with existing launcher parsing conventions. Defaults to torchrun for backward compatibility.

Supports torchrun, megatron-lm, torchtitan, deepspeed, vllm, sglang, and primus launchers. The env var is simply unused for models that hardcode their own launcher (e.g. HuggingFace scripts).

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>

Copilot

Pull request overview

This PR aims to align Docker local execution with SLURM/Kubernetes by generating and injecting MAD_MULTI_NODE_RUNNER for single-node distributed runs, preventing model scripts that rely on $MAD_MULTI_NODE_RUNNER from failing in local Docker mode.

Changes:

Add ContainerRunner._generate_local_launcher_command() to map launcher types (torchrun/deepspeed/etc.) to a single-node launcher command.
In ContainerRunner.run_container(), generate MAD_MULTI_NODE_RUNNER after GPU resolution and inject it into docker_env_vars if the user didn’t already provide it.


The MAD_MULTI_NODE_RUNNER generation block in run_container() referenced `launcher` from create_run_details_dict(), a different method's local scope. Resolve the launcher inline using the same priority chain (additional_context → model_info → MAD_LAUNCHER env).

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
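
The priority chain named in this commit message can be sketched as follows. The function name `resolve_launcher` and the flat `"launcher"` key on both mappings are hypothetical; the PR does not show how the value is stored in additional_context or model_info. Only the lookup order and the MAD_LAUNCHER variable name come from the commit message.

```python
import os
from typing import Mapping, Optional


# Hypothetical sketch; key names are assumptions, only the order is from the PR.
def resolve_launcher(
    additional_context: Mapping[str, object],
    model_info: Mapping[str, object],
    environ: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the first non-empty launcher value in priority order."""
    environ = os.environ if environ is None else environ
    candidates = (
        additional_context.get("launcher"),  # 1. per-run context
        model_info.get("launcher"),          # 2. model definition
        environ.get("MAD_LAUNCHER"),         # 3. environment variable
    )
    for candidate in candidates:
        if candidate:
            return str(candidate)
    return ""  # nothing configured; the caller decides on a default
```

Resolving the value inside the method that uses it avoids depending on a local variable that only exists in another method.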

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 4 comments.

Resolves the CHANGELOG conflict by adopting upstream's v2.0.3 release date (2026-05-26, finalized upstream via #122) and graduating this branch's Unreleased entries into a new v2.1.0 section dated 2026-05-28, since slurm_multi / --use-image / --build-on-compute are feature work.

Auto-merged from upstream:
- fix: generate MAD_MULTI_NODE_RUNNER for Docker local deployment (#126)
- docs/wiki/index.html (wiki path rename, #129)

coketaste self-assigned this May 19, 2026

Copilot AI review requested due to automatic review settings May 19, 2026 22:38

coketaste requested review from Cemberk, Rohan138, gargrahul and leconcio as code owners May 19, 2026 22:38

Copilot started reviewing on behalf of coketaste May 19, 2026 22:39

Copilot AI reviewed May 19, 2026


Comment thread src/madengine/execution/container_runner.py

Comment thread src/madengine/execution/container_runner.py

Comment thread src/madengine/execution/container_runner.py

Comment thread src/madengine/execution/container_runner.py

coketaste and others added 2 commits May 22, 2026 15:03

Merge branch 'develop' into coketaste/fix-local-launcher

e7e2e31

Copilot AI review requested due to automatic review settings May 22, 2026 19:49

Copilot started reviewing on behalf of coketaste May 22, 2026 19:49

Copilot AI reviewed May 22, 2026


Comment thread src/madengine/execution/container_runner.py

Comment thread src/madengine/execution/container_runner.py

Comment thread src/madengine/execution/container_runner.py

Comment thread src/madengine/execution/container_runner.py

coketaste added 2 commits May 26, 2026 16:31

Merge branch 'develop' into coketaste/fix-local-launcher

e7ded78

Updated CHANGELOG

2cd4f07

Copilot AI review requested due to automatic review settings May 27, 2026 00:13

Copilot started reviewing on behalf of coketaste May 27, 2026 00:13

coketaste merged commit 8d86f45 into develop May 27, 2026
1 check failed

coketaste review requested due to automatic review settings May 27, 2026 00:34

This was referenced May 29, 2026

refactor: canonicalize launcher aliases and extract MAD_MULTI_NODE_RUNNER resolver #132

Open

docs(wiki): comprehensive codebase wiki rewrite for v2.1.0 #135

Merged

coketaste merged 5 commits into develop from coketaste/fix-local-launcher
