Deploy madengine workloads to Kubernetes or SLURM clusters for distributed execution.
madengine supports two deployment backends:
- Kubernetes - Cloud-native container orchestration
- SLURM - HPC cluster job scheduling
Deployment is configured via --additional-context and happens automatically during the run phase.
┌─────────────────────────────────────────────┐
│ 1. Build Phase (Local or CI/CD) │
│ madengine build --tags model │
│ → Creates Docker image │
│ → Pushes to registry │
│ → Generates build_manifest.json │
└─────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────┐
│ 2. Deploy Phase (Run with Context) │
│ madengine run │
│ --manifest-file build_manifest.json │
│ --additional-context '{"deploy":...}' │
│ → Detects deployment target │
│ → Creates K8s Job or SLURM script │
│ → Submits and monitors execution │
└─────────────────────────────────────────────┘
- Kubernetes cluster with GPU support
- GPU device plugin installed (AMD or NVIDIA)
- Kubeconfig configured (~/.kube/config or in-cluster)
- Docker registry accessible from cluster
{
"k8s": {
"gpu_count": 1
}
}

This automatically applies intelligent defaults for namespace, resources, image pull policy, etc.
# 1. Build image
madengine build --tags my_model \
--registry my-registry.io \
--additional-context-file k8s-config.json
# 2. Deploy to Kubernetes
madengine run \
--manifest-file build_manifest.json \
--timeout 3600

The deployment target is automatically detected from the k8s key in the config.
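The key-based detection described above can be sketched in a few lines of Python. This is only an illustration, not madengine's actual implementation: the function name `detect_deploy_target`, the "local" fallback, and the choice to reject a config that contains both keys are assumptions.

```python
def detect_deploy_target(context: dict) -> str:
    """Return "k8s", "slurm", or "local" based on the top-level config keys."""
    has_k8s = "k8s" in context
    has_slurm = "slurm" in context
    if has_k8s and has_slurm:
        # Assumption: an ambiguous config is rejected; madengine may behave differently.
        raise ValueError("config contains both 'k8s' and 'slurm' keys")
    if has_k8s:
        return "k8s"
    if has_slurm:
        return "slurm"
    # Assumption: with neither key present, the run stays on the local machine.
    return "local"


print(detect_deploy_target({"slurm": {"partition": "gpu"}}))  # prints "slurm"
```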
k8s-config.json:
{
"k8s": {
"gpu_count": 2,
"namespace": "ml-team",
"gpu_vendor": "AMD",
"memory": "32Gi",
"cpu": "16",
"service_account": "madengine-sa",
"image_pull_policy": "Always"
}
}

Configuration Priority:
- User config (--additional-context-file)
- Profile presets (single-gpu/multi-gpu)
- GPU vendor presets (AMD/NVIDIA)
- Base defaults
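
A layered configuration like this is commonly resolved with a recursive merge in which later layers override earlier ones. The sketch below assumes the list above is ordered from highest to lowest priority; the helper names (`deep_merge`, `resolve_config`) and all preset values are hypothetical and are not taken from madengine.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where values in `override` win, merging nested dicts."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def resolve_config(*layers: dict) -> dict:
    """Merge layers given from lowest to highest priority."""
    result: dict = {}
    for layer in layers:
        result = deep_merge(result, layer)
    return result


# Hypothetical layer contents, for illustration only.
base_defaults = {"k8s": {"namespace": "default", "image_pull_policy": "IfNotPresent"}}
vendor_preset = {"k8s": {"gpu_vendor": "AMD"}}
profile_preset = {"k8s": {"memory": "16Gi"}}
user_config = {"k8s": {"gpu_count": 2, "namespace": "ml-team"}}

resolved = resolve_config(base_defaults, vendor_preset, profile_preset, user_config)
print(resolved["k8s"]["namespace"])  # prints "ml-team"
```

The user layer only needs to name the keys it wants to change; everything else falls through from the lower layers.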
See examples/k8s-configs/ for complete examples.
By default (k8s.secrets.strategy: from_local_credentials), madengine run creates Kubernetes Secrets from a local credential.json when present: Docker Hub pull credentials (when configured) and an opaque Secret for runtime use. In that case, credentials are not embedded in the ConfigMap. For GitOps or clusters without client-side files, set the strategy to existing or omit, and set k8s.secrets.image_pull_secret_names / k8s.secrets.runtime_secret_name as needed. See Configuration and examples/k8s-configs/README.md.
With "debug": true in additional context, madengine run writes rendered manifests under ./k8s_manifests (or the path you configure). To lint those YAML files against the Kubernetes OpenAPI schema, install kubeconform and run from the repository root:
./tests/scripts/k8s_validate_manifests.sh ./k8s_manifests

The script exits successfully if kubeconform is missing (skip) or if validation passes.
For distributed training across multiple nodes:
{
"k8s": {
"gpu_count": 8
},
"distributed": {
"launcher": "torchrun",
"nnodes": 2,
"nproc_per_node": 4
}
}

This creates:
- Kubernetes Indexed Job with 2 completions
- Headless service for pod discovery
- Automatic rank assignment via JOB_COMPLETION_INDEX
- MAD_MULTI_NODE_RUNNER environment variable with torchrun command
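
To illustrate how a pod in an Indexed Job might turn its completion index into a torchrun invocation, here is a minimal sketch. Kubernetes sets JOB_COMPLETION_INDEX in each pod of an Indexed Job; everything else (the function name, the master address `madengine-job-0.madengine-svc`, port 29500, and `train.py`) is a hypothetical placeholder rather than what madengine actually generates.

```python
def build_torchrun_command(env, nnodes, nproc_per_node, master_addr,
                           master_port=29500, script="train.py"):
    """Build a torchrun argument list using the pod's completion index as node rank."""
    node_rank = int(env["JOB_COMPLETION_INDEX"])
    if not 0 <= node_rank < nnodes:
        raise ValueError(f"node rank {node_rank} is outside 0..{nnodes - 1}")
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={nproc_per_node}",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        script,
    ]


# Hypothetical environment for the second pod of a 2-node job.
cmd = build_torchrun_command(
    {"JOB_COMPLETION_INDEX": "1"},
    nnodes=2,
    nproc_per_node=4,
    master_addr="madengine-job-0.madengine-svc",  # assumed headless-service DNS name
)
print(" ".join(cmd))
```

In a real pod you would pass `os.environ` instead of a literal dict.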
Supported Launchers:
- torchrun - PyTorch DDP/FSDP
- deepspeed - ZeRO optimization
- megatron - Megatron-LM training
- torchtitan - LLM pre-training
- primus - Primus unified pretrain (Megatron / TorchTitan / MaxText YAML)
- vllm - LLM inference
- sglang - Structured generation
- sglang-disagg - Disaggregated SGLang (multi-node)
- slurm_multi / slurm-multi - Self-managed multi-container topologies (SLURM only, escape hatch)
See Launchers Guide for details.
# Check job status
kubectl get jobs -n your-namespace
# View pod logs
kubectl logs -f job/madengine-job-xxx -n your-namespace
# Check pod status
kubectl get pods -n your-namespace

Finished Jobs are not removed unless you set k8s.ttl_seconds_after_finished to a positive number of seconds; the Job manifest then includes ttlSecondsAfterFinished so the control plane can garbage-collect the Job after it finishes. The deploy step may still delete Secrets it created when cleaning up a failed or cancelled deploy; see runtime logs for details.
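
The TTL rule above amounts to a conditional field on the Job spec. The following sketch shows one way to express it; `job_spec_with_ttl` is a hypothetical helper, not madengine's code, and only the `ttlSecondsAfterFinished` field name comes from the Kubernetes Job API.

```python
def job_spec_with_ttl(job_spec: dict, k8s_config: dict) -> dict:
    """Return a copy of the Job spec, adding a TTL only for a positive integer setting."""
    spec = dict(job_spec)
    ttl = k8s_config.get("ttl_seconds_after_finished")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(ttl, int) and not isinstance(ttl, bool) and ttl > 0:
        spec["ttlSecondsAfterFinished"] = ttl
    return spec


# With no setting the field is absent, so finished Jobs stay until deleted manually.
print(job_spec_with_ttl({"completions": 1}, {}))
# With a positive value the control plane may remove the Job after that many seconds.
print(job_spec_with_ttl({"completions": 1}, {"ttl_seconds_after_finished": 600}))
```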
Manual cleanup:
kubectl delete job madengine-job-xxx -n your-namespace

- Access to SLURM login node
- SLURM commands available (sbatch, squeue, scontrol)
- Shared filesystem for MAD package and results
- Module system or container runtime (Singularity/Apptainer)
slurm-config.json:
{
"slurm": {
"partition": "gpu",
"gpus_per_node": 4,
"time": "02:00:00",
"account": "my_account"
}
}

# 1. Build image (on build node or locally)
madengine build --tags my_model \
--registry my-registry.io \
--additional-context-file slurm-config.json
# 2. SSH to SLURM login node
ssh user@hpc-login.example.com
# 3. Deploy to SLURM
cd /shared/workspace
madengine run \
--manifest-file build_manifest.json \
--timeout 7200

The deployment target is automatically detected from the slurm key in the config.
slurm-config.json:
{
"slurm": {
"partition": "gpu",
"account": "research_group",
"qos": "normal",
"gpus_per_node": 8,
"nodes": 1,
"time": "24:00:00",
"mail_user": "user@example.com",
"mail_type": "ALL"
}
}

Common SLURM Options:
- partition: SLURM partition name
- account: Billing account
- qos: Quality of Service
- gpus_per_node: Number of GPUs per node
- nodes: Number of nodes (for multi-node)
- nodelist: Comma-separated node names to run on (e.g. "node01,node02"); when set, the job runs only on these nodes and the node health preflight is skipped
- reservation: SLURM reservation name; forwarded to srun health/cleanup commands
- time: Wall time limit (HH:MM:SS)
- mem: Memory per node (e.g. "64G")
- exclusive: Exclusive node access (default: true)
- mail_user: Email for job notifications
- mail_type: Notification types (BEGIN, END, FAIL, ALL)
See examples/slurm-configs/ for complete examples.
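
As a rough picture of how these options relate to a batch script, the sketch below maps each config key to the corresponding standard sbatch option. The mapping table and the `sbatch_directives` helper are illustrative assumptions; the script madengine actually generates may differ in form and content.

```python
# Config key -> sbatch long option (all are standard sbatch flags).
SBATCH_FLAGS = {
    "partition": "--partition",
    "account": "--account",
    "qos": "--qos",
    "nodes": "--nodes",
    "gpus_per_node": "--gpus-per-node",
    "nodelist": "--nodelist",
    "reservation": "--reservation",
    "time": "--time",
    "mem": "--mem",
    "mail_user": "--mail-user",
    "mail_type": "--mail-type",
}


def sbatch_directives(slurm: dict) -> list[str]:
    """Render #SBATCH lines for the keys present in the slurm config."""
    lines = [
        f"#SBATCH {flag}={slurm[key]}"
        for key, flag in SBATCH_FLAGS.items()
        if key in slurm
    ]
    # exclusive defaults to true, as noted in the option list above.
    if slurm.get("exclusive", True):
        lines.append("#SBATCH --exclusive")
    return lines


config = {"partition": "gpu", "gpus_per_node": 4, "time": "02:00:00", "account": "my_account"}
print("\n".join(sbatch_directives(config)))
```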
For distributed training across SLURM nodes:
{
"slurm": {
"partition": "gpu",
"nodes": 4,
"gpus_per_node": 8,
"time": "48:00:00"
},
"distributed": {
"launcher": "torchrun",
"nnodes": 4,
"nproc_per_node": 8
}
}

SLURM automatically provides:
- Node list via $SLURM_JOB_NODELIST
- Master address detection
- Network interface configuration
- Rank assignment via $SLURM_PROCID
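
For orientation, here is a minimal sketch of how a per-node launcher script could derive its distributed settings from these variables. It assumes one SLURM task per node, so that SLURM_PROCID equals the node rank; the function name and returned keys are hypothetical. SLURM_NNODES is the standard variable holding the number of allocated nodes.

```python
def slurm_rank_info(env, nproc_per_node: int) -> dict:
    """Derive node rank and world size, assuming one SLURM task per node."""
    nnodes = int(env["SLURM_NNODES"])
    node_rank = int(env["SLURM_PROCID"])  # equals the node index under the assumption above
    return {
        "nnodes": nnodes,
        "node_rank": node_rank,
        "world_size": nnodes * nproc_per_node,
        # Global rank of the first worker process started on this node.
        "first_global_rank": node_rank * nproc_per_node,
    }


# Hypothetical environment for the third node of a 4-node, 8-GPU-per-node job.
info = slurm_rank_info({"SLURM_NNODES": "4", "SLURM_PROCID": "2"}, nproc_per_node=8)
print(info["world_size"], info["first_global_rank"])  # prints "32 16"
```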
# Check job queue
squeue -u $USER
# Monitor job progress
squeue -j <job_id>
# View job details
scontrol show job <job_id>
# Check output logs
tail -f slurm-<job_id>.out

For workloads that use externally maintained Docker images (e.g. SGLang, vLLM releases):
# Skip Docker build, use a pre-built image
madengine build --tags model --use-image lmsysorg/sglang:latest
# Auto-detect image from model card's DOCKER_IMAGE_NAME
madengine build --tags model --use-image
# Build on a SLURM compute node and push to registry
madengine build --tags model --build-on-compute --registry docker.io/myorg

The manifest generated by --use-image merges the model card's distributed and slurm config into deployment_config, so the run phase auto-detects SLURM deployment without additional --additional-context.
For workloads that orchestrate their own per-node Docker containers (e.g. SGLang Disaggregated proxy + prefill + decode topologies), use the slurm_multi launcher:
{
"distributed": {
"launcher": "slurm_multi"
},
"slurm": {
"partition": "gpu",
"nodes": 3,
"gpus_per_node": 8,
"reservation": "my-reservation"
}
}

Unlike templated launchers, slurm_multi runs the model's .slurm script directly on baremetal. The script manages its own Docker containers via srun internally. See Launchers Guide: slurm_multi for details.
When madengine run detects an existing SLURM allocation (SLURM_JOB_ID is set, e.g. inside salloc), the slurm_multi launcher runs the generated wrapper script synchronously with bash instead of nesting another sbatch. Other launchers continue to use sbatch even inside salloc.
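
The decision described above can be summarized as a small function. This is a sketch of the stated rule only; `choose_submission` is a hypothetical name, and madengine's real logic may take more conditions into account.

```python
def choose_submission(env, launcher: str) -> str:
    """Return "bash" for slurm_multi inside an existing allocation, else "sbatch"."""
    inside_allocation = bool(env.get("SLURM_JOB_ID"))
    is_slurm_multi = launcher in ("slurm_multi", "slurm-multi")
    if inside_allocation and is_slurm_multi:
        return "bash"  # run the wrapper script synchronously
    return "sbatch"    # submit a new batch job


print(choose_submission({"SLURM_JOB_ID": "12345"}, "slurm_multi"))  # prints "bash"
print(choose_submission({"SLURM_JOB_ID": "12345"}, "torchrun"))     # prints "sbatch"
print(choose_submission({}, "slurm_multi"))                         # prints "sbatch"
```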
salloc --nodes=3 --gpus-per-node=8 --partition=gpu
madengine run --manifest-file build_manifest.json
# → Detects salloc, runs synchronously

# Cancel job
scancel <job_id>
# Cancel all your jobs
scancel -u $USER

| Feature | Kubernetes | SLURM |
|---|---|---|
| Environment | Cloud, on-premise | HPC clusters |
| Orchestration | Automatic | Job scheduler |
| Dependencies | Python library (kubernetes) | CLI commands only |
| Multi-node Setup | Headless service + DNS | SLURM env vars |
| Resource Management | Declarative (YAML) | Batch script |
| Best For | Cloud deployments, microservices | Academic HPC, supercomputers |
{
"k8s": {
"gpu_count": 1,
"namespace": "dev"
}
}

{
"k8s": {
"gpu_count": 4,
"memory": "64Gi",
"cpu": "32"
},
"distributed": {
"launcher": "torchrun",
"nnodes": 1,
"nproc_per_node": 4
}
}

{
"k8s": {
"gpu_count": 8,
"namespace": "ml-training"
},
"distributed": {
"launcher": "torchtitan",
"nnodes": 4,
"nproc_per_node": 8
}
}

{
"slurm": {
"partition": "gpu",
"gpus_per_node": 8,
"time": "12:00:00"
}
}

{
"slurm": {
"partition": "gpu",
"nodes": 8,
"gpus_per_node": 8,
"time": "72:00:00",
"account": "research_proj"
},
"distributed": {
"launcher": "deepspeed",
"nnodes": 8,
"nproc_per_node": 8
}
}

Image Pull Failures:
# Check image exists
docker pull <registry>/<image>:<tag>
# Verify image pull secrets
kubectl get secrets -n your-namespace
# Check pod events
kubectl describe pod <pod-name> -n your-namespace

Node Reported as FAILED but Pod Succeeded:
In multi-node jobs, madengine may report a node as FAILED even though Kubernetes shows the pod as Succeeded. This occurs when the kubelet on the node becomes unreachable after the job completes, preventing madengine from collecting stdout logs (and therefore parsing performance metrics).
To verify:
# Check actual pod status — if Succeeded, the workload ran fine
kubectl describe pod <pod-name> | grep Status
# Check the node's kubelet health
kubectl get nodes
kubectl describe node <node-name> | grep -A5 Conditions

PVC artifacts are still collected in this scenario. Only the API-based pod log retrieval fails, which means performance metrics for that node will be missing from the results table.
Resource Issues:
# Check node resources
kubectl describe nodes | grep -A5 "Allocated resources"
# Check GPU availability
kubectl get nodes -o custom-columns=NAME:.metadata.name,GPU:.status.capacity.'amd\.com/gpu'

Job Pending:
# Check reason
squeue -j <job_id> -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R"
# Check partition status
sinfo -p gpu

Out of Resources:
# Check available resources
sinfo -o "%P %.5a %.10l %.6D %.6t %N"
# Adjust resource requests in config

- Use minimal configs with intelligent defaults
- Specify resource limits to prevent over-allocation
- Use appropriate namespaces for isolation
- Configure image pull policies based on registry location
- Monitor pod resource usage with kubectl top
- Start with conservative time limits
- Use appropriate QoS for priority
- Monitor job efficiency with seff <job_id>
- Use shared filesystem for input/output
- Test with single node before scaling
- Launchers Guide - Distributed training and inference launchers
- K8s Examples - Complete Kubernetes configurations
- SLURM Examples - Complete SLURM configurations
- Usage Guide - General usage instructions