Skip to content

fix(nrl-k8s): rewrite --config in entrypoint to honor CLI RECIPE arg#2449

Merged
terrykong merged 2 commits into
mainfrom
hemil/nrl-k8s-recipe-path-env
May 11, 2026
Merged

fix(nrl-k8s): rewrite --config in entrypoint to honor CLI RECIPE arg#2449
terrykong merged 2 commits into
mainfrom
hemil/nrl-k8s-recipe-path-env

Conversation

@hemildesai
Copy link
Copy Markdown
Contributor

@hemildesai hemildesai commented May 9, 2026

Summary

Rewrite every --config <path>.yaml in the train entrypoint to point at the user-supplied RECIPE before submitting to the pod. Existing infras keep working unchanged; the CLI's RECIPE argument is now authoritative without forcing infra YAMLs to be edited.

Why

nrl-k8s run RECIPE --infra INFRA previously parsed RECIPE only on the client side (Hydra defaults + override merging). The pod ran whatever recipe the infra entrypoint hardcoded — so nrl-k8s run RECIPE_A against an infra that hardcodes --config RECIPE_B would silently run RECIPE_B while the user thought they were running RECIPE_A.

Behavior

In submit_training, before handing the entrypoint to the submitter, rewrite every --config <path>.yaml to use the in-pod path of the user's RECIPE:

Mode In-pod path
codeSource=upload nrl_k8s_run.yaml (the staged copy of the resolved recipe in the working_dir)
codeSource=image|lustre RECIPE path relative to repo root
Recipe outside repo root No-op (can't translate path)

Quoting and =/space separator are preserved (--config "a.yaml"--config "x.yaml", --config=a.yaml--config=x.yaml). Hydra's --config-name / --config-dir / --config-path are not matched. Daemon entrypoints (gym/generation) are not touched — they have their own configs distinct from the train recipe.

When a substitution actually changes the path, it's logged: [training] rewrote \--config a.yaml` -> `--config x.yaml` from CLI recipe arg` — so the rewrite is visible in submission logs.

Test plan

  • pytest tests/unit/test_orchestrate.py — 32 tests pass (12 new in TestRecipePathInPod + TestRewriteEntrypointRecipe).
  • Verified the rewrite hits a real-world FT 30B infra entrypoint that hardcodes --config infra/nrl_k8s/examples/qwen3_30b_if_fault_tolerant_full.yaml when invoked with a different RECIPE.
  • Existing infras still work unchanged (back-compat — they hardcode the path, the rewrite either no-ops or substitutes to the same path).

🤖 Generated with Claude Code

@hemildesai hemildesai requested a review from a team as a code owner May 9, 2026 15:30
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 9, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

`nrl-k8s run RECIPE --infra INFRA` previously parsed RECIPE only on
the client side (Hydra defaults + override merging). The pod ran
whatever recipe the infra entrypoint hardcoded in its `--config
<path>.yaml` — so `nrl-k8s run RECIPE_A` against an infra whose
entrypoint hardcodes `--config RECIPE_B` would silently run RECIPE_B.

Rewrite the train entrypoint so every `--config <path>.yaml` (or
`--config=<path>.yaml`, with optional surrounding quotes) is
substituted with the in-pod path of the user-supplied RECIPE before
the entrypoint is sent to the submitter. Existing infras keep working
unchanged; the CLI RECIPE arg is now authoritative without anyone
having to edit YAMLs.

The in-pod path is mode-aware:
- `codeSource=upload` → `nrl_k8s_run.yaml` (the staged copy of the
  resolved recipe inside the working_dir).
- `codeSource=image|lustre` → recipe path relative to repo root
  (entrypoints `cd /opt/nemo-rl` first, so the relative path resolves
  under the Lustre/in-image mount).
- Recipe outside the repo root in image/lustre mode → no-op (we can't
  translate the path).

Daemon entrypoints (gym/generation) are not rewritten — they have
their own configs distinct from the train recipe and go through a
separate submission path. Hydra's `--config-name` / `--config-dir` /
`--config-path` are excluded from the regex so they don't false-fire.

Tests: 12 new unit tests in `test_orchestrate.py` covering equals
separator, quoting preservation, multi-`--config`, recipe-outside-repo,
upload-vs-lustre mode, and the Hydra-flag false-positive case. Full
orchestrate suite (32 tests) passes.

Signed-off-by: Hemil Desai <hemild@nvidia.com>
@hemildesai hemildesai force-pushed the hemil/nrl-k8s-recipe-path-env branch from 8534191 to af841cc Compare May 9, 2026 15:58
@hemildesai hemildesai changed the title fix(nrl-k8s): inject NRL_K8S_RECIPE_PATH and warn on entrypoint mismatch fix(nrl-k8s): rewrite --config in entrypoint to honor CLI RECIPE arg May 9, 2026
@terrykong terrykong requested review from terrykong and yuki-97 May 10, 2026 03:59
Copy link
Copy Markdown
Collaborator

@terrykong terrykong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: PR #2449 — fix(nrl-k8s): rewrite --config in entrypoint to honor CLI RECIPE arg

by @hemildesai | 3 files changed | 3 review agents

Clean PR — well-implemented fix with thorough test coverage (12 new tests covering all code paths). The regex is carefully designed to avoid matching Hydra's --config-name/--config-dir/--config-path, and the integration into submit_training is placed correctly.

All 32 unit tests pass. No guideline violations, no bugs found.

One minor suggestion below.

Generated by Claude Code

Comment thread infra/nrl_k8s/tests/unit/test_orchestrate.py
terrykong
terrykong previously approved these changes May 10, 2026
Copy link
Copy Markdown
Collaborator

@terrykong terrykong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Clean fix for a real bug — the regex correctly discriminates --config from Hydra's --config-name/--config-dir/--config-path, path translation handles upload vs image/lustre vs outside-repo, and tests are thorough.

@terrykong terrykong enabled auto-merge (squash) May 10, 2026 07:15
@terrykong terrykong added the CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) label May 11, 2026
@terrykong
Copy link
Copy Markdown
Collaborator

/ok to test af841cc

@terrykong
Copy link
Copy Markdown
Collaborator

CI Failure: Lint check

Lint check — two pre-commit hooks failed:

1. ruff — D205 (missing blank line in docstring)

infra/nrl_k8s/src/nrl_k8s/orchestrate.py:579:5: D205 1 blank line required between summary line and description

The _rewrite_entrypoint_recipe docstring has the summary line wrapping directly into the description body. Add a blank line after the summary:

    """Rewrite ``--config <path>.yaml`` flags in the train entrypoint to
    point at the CLI's RECIPE.
                                          # <-- needs blank line here
    Without this rewrite, ...

Or make the summary a single line, then blank line, then description.

2. ruff-format — 2 files reformatted

Formatting changes needed in orchestrate.py and test_orchestrate.py:

  • orchestrate.py: long _rewrite_entrypoint_recipe(...) call needs line-wrapping; _CONFIG_FLAG_RE = re.compile(...) should be a single line
  • test_orchestrate.py: a few method signatures and string literals can fit on one line
Full formatting diff
--- a/infra/nrl_k8s/src/nrl_k8s/orchestrate.py
+++ b/infra/nrl_k8s/src/nrl_k8s/orchestrate.py
@@ -505,7 +505,9 @@
     # not touched here; they have their own configs and go through a
     # separate submission path.
-    entrypoint = _rewrite_entrypoint_recipe(launch.entrypoint, loaded, repo_root, upload=upload, log=log)
+    entrypoint = _rewrite_entrypoint_recipe(
+        launch.entrypoint, loaded, repo_root, upload=upload, log=log
+    )

@@ -538,9 +540,7 @@
-_CONFIG_FLAG_RE = re.compile(
-    r"(--config(?:\s+|=))(['\"]?)([^\s'\"]+\.ya?ml)\2"
-)
+_CONFIG_FLAG_RE = re.compile(r"(--config(?:\s+|=))(['\"]?)([^\s'\"]+\.ya?ml)\2")

--- a/infra/nrl_k8s/tests/unit/test_orchestrate.py
+++ b/infra/nrl_k8s/tests/unit/test_orchestrate.py
@@ -518,9 +518,7 @@
-    def test_rewrites_hardcoded_config_to_cli_recipe(
-        self, tmp_path: Path, log
-    ) -> None:
+    def test_rewrites_hardcoded_config_to_cli_recipe(self, tmp_path: Path, log) -> None:

@@ -585,10 +583,7 @@
-        ep = (
-            "python train.py --config-name=foo --config-dir=bar "
-            "--config-path conf"
-        )
+        ep = "python train.py --config-name=foo --config-dir=bar --config-path conf"

Fix

@hemildesai — the quickest way to fix this is to set up pre-commit hooks locally so these get caught before push:

uv run --group dev pre-commit install
uv run --group dev pre-commit run --all-files

See the Contributing Guide for full setup instructions.

Fix D205 docstring violation in _rewrite_entrypoint_recipe and apply
ruff formatting to orchestrate.py and test_orchestrate.py.

Signed-off-by: Terry Kong <terryk@nvidia.com>
@terrykong
Copy link
Copy Markdown
Collaborator

/ok to test 8add93d

@terrykong terrykong merged commit 92496d6 into main May 11, 2026
27 checks passed
@terrykong terrykong deleted the hemil/nrl-k8s-recipe-path-env branch May 11, 2026 22:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants