ROCm · coketaste · May 27, 2026 · May 19, 2026 · May 22, 2026 · May 22, 2026
@@ -5,7 +5,7 @@ All notable changes to madengine will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
-## [2.0.3] - 2026-05-19
+## [2.0.3] - 2026-05-26
 
 ### Added
 
@@ -41,6 +41,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 - **Deployment monitor infinite loop on cancelled jobs**: `BaseDeployment._monitor_job` treated only `COMPLETED`/`FAILED` as terminal, so a `CANCELLED` job (manual `scancel`, K8s job deletion, etc.) would loop forever waiting for a state that never arrived. `CANCELLED` is now in the terminal-state set in `deployment/base.py`.
 
+- **Docker local: missing `MAD_MULTI_NODE_RUNNER`**: SLURM (`job.sh.j2`) and Kubernetes (`kubernetes_launcher_mixin.py`) already export `MAD_MULTI_NODE_RUNNER` with the appropriate distributed launcher command, but local Docker runs had no equivalent. Models that delegate process spawning to `$MAD_MULTI_NODE_RUNNER` (e.g. Megatron-LM `train_7b.sh`) failed on `madengine run` with `MULTI_NODE_RUNNER is not defined`. `ContainerRunner` now resolves the launcher from `--additional-context` → model `distributed.launcher` → `MAD_LAUNCHER` (same priority chain as elsewhere), treats any value that is not a recognized distributed launcher (including the deployment-level values `docker` and `native`) as `torchrun`, and sets `MAD_MULTI_NODE_RUNNER` via `_generate_local_launcher_command()` after GPU resolution (`MAD_RUNTIME_NGPUS`). torchrun, megatron-lm, and torchtitan map to `torchrun --standalone --nproc_per_node=N`; deepspeed maps to `deepspeed --num_gpus=N`; vllm, sglang, and primus manage their own process spawning, so the variable is left unset for them. Models that hardcode their own launcher (e.g. HuggingFace scripts) simply ignore the variable. Skipped when `MAD_MULTI_NODE_RUNNER` is already set in `docker_env_vars`.
+
 ### Security
 
 - **Shell injection hardening (extended)**: `shlex.quote()` is now applied to every shell interpolation of a user-controlled value across `core/docker.py`, `execution/container_runner.py`, `execution/docker_builder.py`, and `orchestration/run_orchestrator.py` (image names, paths, container names, build-args). A follow-up pass closed the last remaining sites in `docker_builder.py` (`grep`, `docker manifest inspect`, `docker tag`, `docker push`, `head`). This is a defence-in-depth extension of the v2.0.2 build-arg quoting work — values that flow through `--additional-context`, model configs, or registry credentials can no longer break out of the shell command they are embedded in.
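
For context on the `MAD_MULTI_NODE_RUNNER` entry in this changelog hunk, here is a minimal sketch of how a model entry point might consume the variable. The changelog only names a shell script (`train_7b.sh`) as a consumer; the Python helper `build_training_argv` and the fallback to the current interpreter when the variable is unset are assumptions made for illustration, not part of madengine.

```python
import os
import shlex
import sys
from typing import List, Mapping, Optional


def build_training_argv(
    script_args: List[str],
    environ: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Prefix a training command with the launcher from MAD_MULTI_NODE_RUNNER.

    Hypothetical helper: if the variable is unset or empty (as it would be
    for launchers that spawn their own processes), fall back to running the
    script with the current Python interpreter.
    """
    env = os.environ if environ is None else environ
    runner = env.get("MAD_MULTI_NODE_RUNNER", "")
    # shlex.split turns the launcher string into argv tokens without a shell.
    prefix = shlex.split(runner) if runner else [sys.executable]
    return prefix + list(script_args)


if __name__ == "__main__":
    print(build_training_argv(["train.py", "--epochs", "1"]))
```

Splitting the launcher string into an argument list keeps the consumer independent of a shell, which fits the quoting work described under Security in the same changelog.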

@@ -657,6 +657,31 @@ def get_cpu_arg(self) -> str:
         cpus = self.context.ctx["docker_cpus"].replace(" ", "")
         return f"--cpuset-cpus {cpus} "
 
+    def _generate_local_launcher_command(self, launcher_type: str, nproc_per_node: int) -> str:
+        """Generate distributed process launcher command for Docker local deployment.
+
+        Docker local is always single-node. This parallels
+        SlurmDeployer._generate_launcher_command() and
+        KubernetesLauncherMixin._generate_torchrun_command() for the
+        Docker local path.
+
+        Args:
+            launcher_type: Distributed launcher (torchrun, megatron, deepspeed, etc.)
+            nproc_per_node: Number of GPUs (processes) per node.
+
+        Returns:
+            Launcher command string, or empty string for launchers that
+            manage their own process spawning (vllm, sglang, sglang-disagg, primus).
+        """
+        if launcher_type in ("torchrun", "megatron", "megatron-lm", "torchtitan"):
+            return f"torchrun --standalone --nproc_per_node={nproc_per_node}"
+        elif launcher_type == "deepspeed":
+            return f"deepspeed --num_gpus={nproc_per_node}"
+        elif launcher_type in ("vllm", "sglang", "sglang-disagg", "primus"):
+            return ""
+        else:
+            return f"torchrun --standalone --nproc_per_node={nproc_per_node}"
+
     _ENV_KEY_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")
 
     def get_env_arg(self, run_env: typing.Dict) -> str:
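
The mapping in `_generate_local_launcher_command()` can be checked in isolation. The sketch below is a table-driven equivalent written as a free function; `local_launcher_command` and `_LOCAL_LAUNCHER_TEMPLATES` are hypothetical names, and only the launcher strings are taken from the hunk above.

```python
from typing import Dict

# Command template shared by the torchrun-based launchers and by the fallback.
_TORCHRUN = "torchrun --standalone --nproc_per_node={n}"

# Hypothetical lookup table mirroring the if/elif chain in the hunk above.
_LOCAL_LAUNCHER_TEMPLATES: Dict[str, str] = {
    "torchrun": _TORCHRUN,
    "megatron": _TORCHRUN,
    "megatron-lm": _TORCHRUN,
    "torchtitan": _TORCHRUN,
    "deepspeed": "deepspeed --num_gpus={n}",
    # These launchers spawn their own processes, so no wrapper command.
    "vllm": "",
    "sglang": "",
    "sglang-disagg": "",
    "primus": "",
}


def local_launcher_command(launcher_type: str, nproc_per_node: int) -> str:
    """Return the single-node launcher command, defaulting to torchrun."""
    template = _LOCAL_LAUNCHER_TEMPLATES.get(launcher_type, _TORCHRUN)
    return template.format(n=nproc_per_node)


if __name__ == "__main__":
    for name in ("torchrun", "deepspeed", "vllm", "something-else"):
        print(f"{name!r} -> {local_launcher_command(name, 8)!r}")
```

A table makes the fallback explicit: an unrecognized launcher type takes the same path as `torchrun`, which is what the final `else` branch in the method does.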
@@ -1089,7 +1114,35 @@ def run_container(
         resolved_gpu_count = resolve_runtime_gpus(model_info, self.additional_context)
         docker_options += self.get_gpu_arg(str(resolved_gpu_count))
         docker_options += self.get_cpu_arg()
-
+
+        # Generate MAD_MULTI_NODE_RUNNER for Docker local deployment.
+        # SLURM and K8s generate this in their deployment layers (slurm.py,
+        # kubernetes_launcher_mixin.py). Docker local has no such layer, so
+        # we generate it here after GPU resolution provides MAD_RUNTIME_NGPUS.
+        # Defaults to torchrun when launcher is a deployment-level value
+        # ("docker", "native") rather than a distributed launcher.
+        # For models that hardcode their own launcher (e.g. HuggingFace scripts
+        # calling torchrun directly), this env var is simply unused.
+        if "MAD_MULTI_NODE_RUNNER" not in self.context.ctx["docker_env_vars"]:
+            launcher = ""
+            if self.additional_context:
+                launcher = self.additional_context.get("distributed", {}).get("launcher", "")
+            if not launcher and model_info.get("distributed"):
+                launcher = model_info["distributed"].get("launcher", "")
+            if not launcher:
+                launcher = os.environ.get("MAD_LAUNCHER", "")
+            dist_launcher = launcher if launcher in (
+                "torchrun", "megatron", "megatron-lm", "torchtitan",
+                "deepspeed", "vllm", "sglang", "sglang-disagg", "primus",
+            ) else "torchrun"
+            runtime_ngpus = int(self.context.ctx["docker_env_vars"].get(
+                "MAD_RUNTIME_NGPUS", str(resolved_gpu_count)))
+            launcher_cmd = self._generate_local_launcher_command(dist_launcher, runtime_ngpus)
+            if launcher_cmd:
+                self.context.ctx["docker_env_vars"]["MAD_MULTI_NODE_RUNNER"] = launcher_cmd
+                print(f"ℹ️  Set MAD_MULTI_NODE_RUNNER for local deployment "
+                      f"(launcher={dist_launcher}): {launcher_cmd}")
+
         # Filter out MIOPEN_USER_DB_PATH from run_env if it exists
         # It should be passed via docker_env_vars in context instead
         if "MIOPEN_USER_DB_PATH" in run_env:
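
The launcher resolution in this hunk follows a three-step priority chain before normalizing to a known launcher. A minimal standalone sketch of that chain follows, assuming plain dictionaries for `additional_context` and `model_info`. The function name `resolve_local_launcher` and the injectable `environ` parameter are hypothetical, added so the logic can be exercised without a `ContainerRunner` instance.

```python
import os
from typing import Any, Mapping, Optional

# Launcher names accepted as-is; anything else is normalized to torchrun.
KNOWN_LAUNCHERS = frozenset({
    "torchrun", "megatron", "megatron-lm", "torchtitan",
    "deepspeed", "vllm", "sglang", "sglang-disagg", "primus",
})


def resolve_local_launcher(
    additional_context: Optional[Mapping[str, Any]],
    model_info: Mapping[str, Any],
    environ: Optional[Mapping[str, str]] = None,
) -> str:
    """Resolve the launcher: additional context, then model, then MAD_LAUNCHER."""
    env = os.environ if environ is None else environ
    launcher = ""
    if additional_context:
        launcher = additional_context.get("distributed", {}).get("launcher", "")
    if not launcher and model_info.get("distributed"):
        launcher = model_info["distributed"].get("launcher", "")
    if not launcher:
        launcher = env.get("MAD_LAUNCHER", "")
    # Deployment-level values such as "docker" or "native" are not
    # distributed launchers, so they fall back to torchrun.
    return launcher if launcher in KNOWN_LAUNCHERS else "torchrun"


if __name__ == "__main__":
    ctx = {"distributed": {"launcher": "deepspeed"}}
    model = {"distributed": {"launcher": "vllm"}}
    print(resolve_local_launcher(ctx, model, {}))   # additional context wins
    print(resolve_local_launcher(None, model, {}))  # model-level launcher
    print(resolve_local_launcher(None, {}, {"MAD_LAUNCHER": "docker"}))
```

Passing the environment as a parameter is a testing convenience in this sketch; the hunk itself reads `os.environ` directly.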