Commit 6f63116
feat: add KubeflowExecutor for Kubeflow Training Operator on Kubernetes
Introduces KubeflowExecutor and a matching TorchX scheduler so users can
deploy distributed training jobs to any Kubernetes cluster running the
Kubeflow Training Operator via run.run() / run.Experiment.
Supported job kinds (toggled via job_kind field):
- PyTorchJob (Training Operator v1, kubeflow.org/v1)
- TrainJob (Training Operator v2, trainer.kubeflow.org/v1alpha1)
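The two kinds differ in API group, version, and kind. A minimal sketch of how a job_kind value could select the manifest header, assuming a hypothetical lookup table `JOB_KINDS` and helper `manifest_header` (neither name comes from this commit; only the apiVersion/kind pairs are taken from the list above):

```python
# Hypothetical lookup table. Only the apiVersion/kind values come from
# the commit message; the table and function names are illustrative.
JOB_KINDS = {
    "PyTorchJob": {"apiVersion": "kubeflow.org/v1", "kind": "PyTorchJob"},
    "TrainJob": {"apiVersion": "trainer.kubeflow.org/v1alpha1", "kind": "TrainJob"},
}


def manifest_header(job_kind: str, name: str, namespace: str = "default") -> dict:
    """Return the top-level fields of a custom resource for the given job kind."""
    try:
        header = dict(JOB_KINDS[job_kind])  # copy so the table is never mutated
    except KeyError:
        raise ValueError(f"unsupported job_kind: {job_kind!r}") from None
    header["metadata"] = {"name": name, "namespace": namespace}
    return header
```

Keeping the version strings in one table means the spec-building code for each kind only has to fill in `spec`, and an unsupported kind fails early with a clear error.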
Key features:
- Kubernetes config loaded automatically (local kubeconfig → in-cluster fallback)
- PyTorchJob: builds Master + Worker replica specs with nprocPerNode
- TrainJob: builds spec.trainer + merges all pod-level config (volumes,
  tolerations, affinity, imagePullSecrets, resourceClaims, etc.) into a
  single podTemplateOverrides entry targeting "node"
- env_list field supports full env var dicts (valueFrom / secretKeyRef)
- pod_spec_overrides merges arbitrary extra fields into the pod spec
- launch(wait=True) polls until RUNNING / SUCCEEDED / FAILED
- cancel(wait=True) polls until CR is gone and all pods are terminated
- TorchX scheduler persists job state in ~/.nemo_run/.kubeflow_jobs.json
  and maps KubeflowJobState → AppState (UNKNOWN/None → PENDING to avoid
  false failures on transient API errors)
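The state mapping and the wait=True polling described above can be sketched as follows. This is an illustration under stated assumptions, not the code from this commit: `JobState` is a stand-in for KubeflowJobState with an assumed set of members, `AppState` is a stand-in for the TorchX enum so the example stays self-contained, and `to_app_state` and `wait_for_state` are hypothetical names.

```python
import time
from enum import Enum
from typing import Callable, Optional, Set


class JobState(Enum):  # stand-in for KubeflowJobState; members are assumed
    RUNNING = "Running"
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"
    UNKNOWN = "Unknown"


class AppState(Enum):  # stand-in for the TorchX AppState enum
    PENDING = 0
    RUNNING = 1
    SUCCEEDED = 2
    FAILED = 3


_STATE_MAP = {
    JobState.RUNNING: AppState.RUNNING,
    JobState.SUCCEEDED: AppState.SUCCEEDED,
    JobState.FAILED: AppState.FAILED,
}


def to_app_state(state: Optional[JobState]) -> AppState:
    """Map a job state to an app state; UNKNOWN and None become PENDING."""
    # A transient API error yields no usable state. Reporting PENDING
    # instead of FAILED avoids declaring a healthy job dead.
    return _STATE_MAP.get(state, AppState.PENDING)


def wait_for_state(
    get_state: Callable[[], Optional[JobState]],
    targets: Set[JobState],
    timeout: float = 600.0,
    interval: float = 5.0,
) -> JobState:
    """Poll get_state until it returns one of targets or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        state = get_state()
        if state in targets:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job did not reach any of {targets} in {timeout}s")
        time.sleep(interval)
```

Passing the status lookup in as a callable keeps the loop independent of the Kubernetes client, so it can be exercised with a fake in tests.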
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
1 parent e04fe9d commit 6f63116
3 files changed
Lines changed: 47 additions & 1 deletion