fix: generate MAD_MULTI_NODE_RUNNER for Docker local deployment #126
Merged
Docker local mode had no mechanism to set MAD_MULTI_NODE_RUNNER, unlike SLURM (job.sh.j2) and K8s (kubernetes_launcher_mixin.py), which generate it in their deployment layers. This caused train_7b.sh (Megatron-LM) to fail with 'Error: MULTI_NODE_RUNNER is not defined' on local runs.

Add _generate_local_launcher_command() to ContainerRunner that generates the appropriate single-node distributed process launcher command, and call it in run_container() after GPU resolution. Reuses the already-resolved launcher variable (lines 327-372) to stay consistent with existing launcher parsing conventions.

Defaults to torchrun for backward compatibility. Supports torchrun, megatron-lm, torchtitan, deepspeed, vllm, sglang, and primus launchers. The env var is simply unused for models that hardcode their own launcher (e.g. HuggingFace scripts).

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
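For readers skimming the commit, here is a minimal sketch of the kind of mapping this commit describes. It is not the merged implementation: the function name `local_launcher_command` and its parameters are hypothetical, and the command strings are the ones listed in the PR description.

```python
def local_launcher_command(launcher: str, nproc_per_node: int) -> str:
    """Map a launcher name to a single-node launcher command string.

    Hypothetical helper for illustration only; the real method lives on
    ContainerRunner and may differ in signature and behavior.
    """
    name = (launcher or "").strip().lower()

    if name == "deepspeed":
        return f"deepspeed --num_gpus={nproc_per_node}"

    # These launchers manage their own process spawning, so no wrapper
    # command is generated for them.
    if name in {"vllm", "sglang", "primus"}:
        return ""

    # torchrun, megatron / megatron-lm, torchtitan, and any other value
    # (including deployment-level values such as "docker" or "native")
    # fall back to torchrun in standalone mode.
    return f"torchrun --standalone --nproc_per_node={nproc_per_node}"


if __name__ == "__main__":
    for launcher in ("torchrun", "deepspeed", "vllm", "docker"):
        print(f"{launcher!r:12} -> {local_launcher_command(launcher, 8)!r}")
```

Falling through to torchrun for unknown values mirrors the "defaults to torchrun for backward compatibility" behavior stated above.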
Pull request overview
This PR aims to align Docker local execution with SLURM/Kubernetes by generating and injecting MAD_MULTI_NODE_RUNNER for single-node distributed runs, preventing model scripts that rely on $MAD_MULTI_NODE_RUNNER from failing in local Docker mode.
Changes:
- Add `ContainerRunner._generate_local_launcher_command()` to map launcher types (torchrun/deepspeed/etc.) to a single-node launcher command.
- In `ContainerRunner.run_container()`, generate `MAD_MULTI_NODE_RUNNER` after GPU resolution and inject it into `docker_env_vars` if the user didn't already provide it.
The MAD_MULTI_NODE_RUNNER generation block in run_container() referenced `launcher` from create_run_details_dict(), a different method's local scope. Resolve the launcher inline using the same priority chain (additional_context → model_info → MAD_LAUNCHER env). Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
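The priority chain named in this commit (additional_context, then model_info, then the MAD_LAUNCHER env var) could look roughly like the sketch below. The function name `resolve_launcher` and the `"launcher"` dictionary key are assumptions made for illustration; the actual lookup keys in the repository are not shown in this PR page.

```python
import os
from typing import Mapping, Optional


def resolve_launcher(
    additional_context: Mapping[str, object],
    model_info: Mapping[str, object],
    environ: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the first non-empty launcher value in priority order.

    Hypothetical helper: the "launcher" key is an assumed layout, not
    confirmed by the source.
    """
    env = os.environ if environ is None else environ

    candidates = (
        additional_context.get("launcher"),  # highest priority
        model_info.get("launcher"),
        env.get("MAD_LAUNCHER"),             # lowest priority
    )
    for value in candidates:
        if value:
            return str(value)
    return None  # caller decides the default (torchrun, per the PR)
```

Resolving inline like this avoids depending on a local variable that only exists inside another method.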
coketaste added a commit that referenced this pull request on May 28, 2026
Resolves the CHANGELOG conflict by adopting upstream's v2.0.3 release date (2026-05-26, finalized upstream via #122) and graduating this branch's Unreleased entries into a new v2.1.0 section dated 2026-05-28, since slurm_multi / --use-image / --build-on-compute are feature work.

Auto-merged from upstream:
- fix: generate MAD_MULTI_NODE_RUNNER for Docker local deployment (#126)
- docs/wiki/index.html (wiki path rename, #129)
This was referenced May 29, 2026
Summary
- Docker local mode had no mechanism to set `MAD_MULTI_NODE_RUNNER`, unlike SLURM and K8s, which generate it in their deployment layers. This caused distributed training scripts (e.g. Megatron-LM `train_7b.sh`) to fail with `Error: MULTI_NODE_RUNNER is not defined` on local runs.
- Adds `_generate_local_launcher_command()` to `ContainerRunner` and calls it in `run_container()` after GPU resolution, generating the appropriate single-node launcher command (torchrun, deepspeed, etc.) and injecting it as the `MAD_MULTI_NODE_RUNNER` env var.
- Models that do not reference `MAD_MULTI_NODE_RUNNER` are unaffected: the env var is simply unused.

Details
SLURM sets `MAD_MULTI_NODE_RUNNER` in `job.sh.j2` and K8s sets it in `kubernetes_launcher_mixin.py`. Docker local had no equivalent, so any model script referencing `$MAD_MULTI_NODE_RUNNER` would fail.

The fix reuses the already-resolved `launcher` variable in `run_container()` and maps it to the correct single-node command:
- `torchrun` / `megatron` / `torchtitan` → `torchrun --standalone --nproc_per_node=N`
- `deepspeed` → `deepspeed --num_gpus=N`
- `vllm` / `sglang` / `primus` → empty (these manage their own process spawning)
- Defaults to `torchrun` when `launcher` is a deployment-level value like `"docker"` or `"native"`

Only sets the env var if the user hasn't already provided it via `docker_env_vars`.

Test plan
- Run a model whose script references `$MAD_MULTI_NODE_RUNNER` (e.g. Megatron-LM train_7b) locally and verify it no longer fails with `MULTI_NODE_RUNNER is not defined`
- Run a model whose script does not reference `$MAD_MULTI_NODE_RUNNER` (e.g. HuggingFace script) and verify no regression
- (… `ContainerRunner.run_container()`)
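The "only set it if the user hasn't already provided it" rule from Details can be sketched as below. This is an illustrative assumption, not the merged code: `inject_runner_env` is a hypothetical name, and whether the real implementation still injects the variable when the generated command is empty is not stated in the PR.

```python
from typing import Dict

RUNNER_VAR = "MAD_MULTI_NODE_RUNNER"


def inject_runner_env(
    docker_env_vars: Dict[str, str], generated_command: str
) -> Dict[str, str]:
    """Return a copy of docker_env_vars with the runner variable filled in.

    A value the user already supplied always wins; the generated command
    is used only when the key is absent. Hypothetical helper.
    """
    merged = dict(docker_env_vars)  # leave the caller's dict untouched
    merged.setdefault(RUNNER_VAR, generated_command)
    return merged


if __name__ == "__main__":
    generated = "torchrun --standalone --nproc_per_node=8"
    print(inject_runner_env({}, generated))
    print(inject_runner_env({RUNNER_VAR: "my-custom-launcher"}, generated))
```

This also doubles as a quick check for the test plan: the first call shows the variable being populated for scripts that need it, and the second shows a user override being preserved.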