fix(nrl-k8s): rewrite --config in entrypoint to honor CLI RECIPE arg#2449
Conversation
`nrl-k8s run RECIPE --infra INFRA` previously parsed RECIPE only on the client side (Hydra defaults + override merging). The pod ran whatever recipe the infra entrypoint hardcoded in its `--config <path>.yaml` — so `nrl-k8s run RECIPE_A` against an infra whose entrypoint hardcodes `--config RECIPE_B` would silently run RECIPE_B. Rewrite the train entrypoint so every `--config <path>.yaml` (or `--config=<path>.yaml`, with optional surrounding quotes) is substituted with the in-pod path of the user-supplied RECIPE before the entrypoint is sent to the submitter. Existing infras keep working unchanged; the CLI RECIPE arg is now authoritative without anyone having to edit YAMLs. The in-pod path is mode-aware: - `codeSource=upload` → `nrl_k8s_run.yaml` (the staged copy of the resolved recipe inside the working_dir). - `codeSource=image|lustre` → recipe path relative to repo root (entrypoints `cd /opt/nemo-rl` first, so the relative path resolves under the Lustre/in-image mount). - Recipe outside the repo root in image/lustre mode → no-op (we can't translate the path). Daemon entrypoints (gym/generation) are not rewritten — they have their own configs distinct from the train recipe and go through a separate submission path. Hydra's `--config-name` / `--config-dir` / `--config-path` are excluded from the regex so they don't false-fire. Tests: 12 new unit tests in `test_orchestrate.py` covering equals separator, quoting preservation, multi-`--config`, recipe-outside-repo, upload-vs-lustre mode, and the Hydra-flag false-positive case. Full orchestrate suite (32 tests) passes. Signed-off-by: Hemil Desai <hemild@nvidia.com>
8534191 to
af841cc
Compare
--config in entrypoint to honor CLI RECIPE arg
terrykong
left a comment
There was a problem hiding this comment.
Review: PR #2449 — fix(nrl-k8s): rewrite --config in entrypoint to honor CLI RECIPE arg
by @hemildesai | 3 files changed | 3 review agents
Clean PR — well-implemented fix with thorough test coverage (12 new tests covering all code paths). The regex is carefully designed to avoid matching Hydra's --config-name/--config-dir/--config-path, and the integration into submit_training is placed correctly.
All 32 unit tests pass. No guideline violations, no bugs found.
One minor suggestion below.
Generated by Claude Code
terrykong
left a comment
There was a problem hiding this comment.
LGTM. Clean fix for a real bug — the regex correctly discriminates --config from Hydra's --config-name/--config-dir/--config-path, path translation handles upload vs image/lustre vs outside-repo, and tests are thorough.
|
/ok to test af841cc |
CI Failure: Lint checkLint check — two pre-commit hooks failed: 1.
|
Fix D205 docstring violation in _rewrite_entrypoint_recipe and apply ruff formatting to orchestrate.py and test_orchestrate.py. Signed-off-by: Terry Kong <terryk@nvidia.com>
|
/ok to test 8add93d |
Summary
Rewrite every
--config <path>.yamlin the train entrypoint to point at the user-supplied RECIPE before submitting to the pod. Existing infras keep working unchanged; the CLI'sRECIPEargument is now authoritative without forcing infra YAMLs to be edited.Why
nrl-k8s run RECIPE --infra INFRApreviously parsedRECIPEonly on the client side (Hydra defaults + override merging). The pod ran whatever recipe the infra entrypoint hardcoded — sonrl-k8s run RECIPE_Aagainst an infra that hardcodes--config RECIPE_Bwould silently runRECIPE_Bwhile the user thought they were runningRECIPE_A.Behavior
In
submit_training, before handing the entrypoint to the submitter, rewrite every--config <path>.yamlto use the in-pod path of the user's RECIPE:codeSource=uploadnrl_k8s_run.yaml(the staged copy of the resolved recipe in the working_dir)codeSource=image|lustreQuoting and
=/space separator are preserved (--config "a.yaml"→--config "x.yaml",--config=a.yaml→--config=x.yaml). Hydra's--config-name/--config-dir/--config-pathare not matched. Daemon entrypoints (gym/generation) are not touched — they have their own configs distinct from the train recipe.When a substitution actually changes the path, it's logged:
[training] rewrote \--config a.yaml` -> `--config x.yaml` from CLI recipe arg` — so the rewrite is visible in submission logs.Test plan
pytest tests/unit/test_orchestrate.py— 32 tests pass (12 new inTestRecipePathInPod+TestRewriteEntrypointRecipe).--config infra/nrl_k8s/examples/qwen3_30b_if_fault_tolerant_full.yamlwhen invoked with a different RECIPE.🤖 Generated with Claude Code