
Tk/pr 2429 #2463

Draft
terrykong wants to merge 15 commits into main from tk/pr-2429

Conversation

@terrykong
Collaborator

Same as #2429. Removing the additional manifest, since each Dynamo deployment will need tinkering. Switched to the Karpenter node selector since the static CPU nodes were exhausted. Also includes a simpler example of an equivalence judge, which should be mergeable today.

FYI @jthomson04

jthomson04 and others added 15 commits May 5, 2026 18:49
Adds policy.generation.backend=dynamo, a Kubernetes-only generation
backend that forwards rollouts to an externally-managed DynamoGraphDeployment
frontend over HTTP. The class is a thin wrapper around the resolved frontend
URL — no etcd / NATS / worker subprocess management. The DGD owns the
inference stack; nemo-rl just points nemo-gym at it.

Two ways to specify the frontend (mutually exclusive in the config):
  * dgd_name (+ optional namespace, frontend_port) — the class derives the
    cluster-internal URL from the dynamo operator's stable Service naming
    convention (<dgd-name>-frontend). Requires running inside a pod.
  * frontend_url — explicit URL escape hatch for hand-rolled DGDs, external
    clusters, or non-K8s environments. Disables the in-pod assertion.

GRPO setup wiring:
  * is_dynamo flag forces colocated_inference=False and routes all GPUs to
    training (the DGD handles inference on its own pods).
  * Dispatch case mirrors the vllm/sglang branches via
    initialize_generation_with_policy.
  * NEED_REFIT=False for the dynamo backend in both grpo_train and
    async_grpo_train — refit isn't supported in this phase, so dynamo
    runs are effectively frozen-policy (eval / inference-only experiments).
    Live refit deferred to a later phase via DGD restart or in-place
    worker reload.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Teaches the nrl-k8s CLI to bring up a DynamoGraphDeployment (DGD)
alongside the training RayCluster. After this lands, a single
nrl-k8s run --raycluster brings up both, waits for the DGD to be
state=successful, and stamps the DGD's name into the recipe before
submitting the training Ray Job.

Schema (infra/nrl_k8s/src/nrl_k8s/schema.py):
  * New DynamoGraphSpec — references a standalone DGD manifest by path
    (typically one of dynamo/recipes/...). Supports name override and
    deep-merged overrides without forking the upstream recipe.
  * New infra.dynamo: dict[str, DynamoGraphSpec], parallel to
    kuberay and deployments.

DGD module (infra/nrl_k8s/src/nrl_k8s/dgd.py):
  * load_dgd_manifest — resolves repo-relative paths against the infra
    YAML's directory; picks the DGD doc out of multi-doc files (skipping
    benchmark Pods that ship alongside in dynamo/recipes/).
  * build_dgd_manifest — deep-merges DynamoGraphSpec.overrides onto the
    loaded .spec, applies metadata.name override, sets namespace, merges
    labels, and patches cross-cutting infra fields (image as default
    only, imagePullSecrets, serviceAccount) across services[*].extraPodSpec.
  * resolve_dgd_name — for the recipe-injection path, returns the
    post-override metadata.name.
  * apply_dgd / get_dgd / delete_dgd / wait_for_dgd_ready / wait_for_dgd_gone
    mirror the RayCluster helpers in k8s.py.
  * is_dgd_crd_installed — namespaced list probe; treats 403 as
    "installed, but the user lacks list RBAC", so restricted RBAC doesn't
    falsely trigger the install hint.
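The override merging described above can be sketched as a standard recursive deep-merge; `deep_merge` here is an illustrative stand-in for the real `build_dgd_manifest` internals:

```python
# Illustrative deep-merge with the usual semantics: nested dicts are merged
# recursively, while scalars and lists in the overrides replace the base
# value. This is an assumption about build_dgd_manifest, not its real code.
def deep_merge(base: dict, overrides: dict) -> dict:
    merged = dict(base)
    for key, value in overrides.items():
        if isinstance(merged.get(key), dict) and isinstance(value, dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

spec = {"services": {"Frontend": {"replicas": 1, "resources": {"cpu": "2"}}}}
overrides = {"services": {"Frontend": {"resources": {"cpu": "4"}}}}
print(deep_merge(spec, overrides))
# → {'services': {'Frontend': {'replicas': 1, 'resources': {'cpu': '4'}}}}
```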

Orchestrator (infra/nrl_k8s/src/nrl_k8s/orchestrate.py):
  * ensure_dgd / delete_dgd mirror ensure_deployment semantics
    (idempotent reuse on match, warn on drift, --recreate to replace).
  * _inject_dynamo_into_recipe stamps policy.generation.backend=dynamo and
    policy.generation.dynamo_cfg.dgd_name=<resolved-name> when exactly
    one DGD is declared. Multi-DGD configs leave the recipe alone.
  * run() brings up DGDs alongside Deployments before RayClusters.
  * LoadedConfig.infra_source_path tracks where the dynamo: block was
    declared so manifest paths resolve correctly in both bundled and
    split layouts.

CLI (infra/nrl_k8s/src/nrl_k8s/cli.py):
  * --target dynamo.<key> resolves alongside kuberay.<role> and
    deployments.<key>. cluster up/down dispatch through ensure_dgd /
    delete_dgd; --dry-run prints the rendered DGD manifest.
  * nrl-k8s check fails fast with a helmfile-install hint when
    infra.dynamo is set but the DGD CRD is missing.
  * _print_check_summary surfaces a DYNAMO section listing each DGD's
    services and ready-timeout.

Helmfile (infra/helm/):
  * dynamo-platform release added with values that mesh with our
    existing kai-scheduler install (install=false, enabled=true) and
    use bundled etcd + NATS for service discovery / event plane.

Examples:
  * Working recipe + infra pair targeting kind clusters
    (dynamo_qwen3_0.6b.{yaml,kind.infra.yaml}) plus a minimal DGD
    manifest (examples_dgd/qwen3_0.6b_kind.yaml).

Out of scope this PR (follow-ups):
  * --rayjob mode integration (needs ownerReferences on the DGD pointing
    at the RayJob).
  * Grove integration for multi-node DGDs.
  * Refit + planner autoscaling (Phase 3 — borrow slime/slynamo's
    external_discovery.py topology-fingerprint pattern).

Tests: 241 unit tests pass under Python 3.12. nrl-k8s check renders the
new example pair end-to-end against a live cluster.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
When nrl-k8s is the one applying a DGD (i.e. infra.dynamo.<key> is
declared), tie its lifetime to the training RayCluster via Kubernetes
ownerReferences. K8s GC cascades the DGD when the RayCluster is deleted,
so inference GPUs free at the same moment as the training GPUs without
hand-rolled cleanup logic.

Choice of owner: RayCluster, not RayJob — picking the cluster as parent
means DGD pods die the moment shutdownAfterJobFinishes fires, instead of
hanging around until ttlSecondsAfterFinished (default 1h).

User-managed DGDs (recipes that only set policy.generation.dynamo_cfg.frontend_url
without an infra.dynamo entry) are untouched: nrl-k8s never applies the
DGD, so it never sets an ownerReference, so it has nothing to cascade.

Implementation:
  * dgd.build_owner_reference helper sets controller=False
    (the dynamo operator already controls the DGD; we're a non-controlling
    owner solely for GC).
  * dgd.build_dgd_manifest accepts an owner_ref kwarg that lands in
    metadata.ownerReferences[0].
  * ensure_dgd threads owner_ref through to the builder.
  * k8s.wait_for_rayjob_raycluster_name polls the RayJob's
    .status.rayClusterName so the rayjob path can resolve KubeRay's
    auto-generated cluster name.
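A minimal sketch of the non-controlling owner reference described above, assuming a typical RayCluster manifest shape; the helper mirrors `build_owner_reference` but the exact fields are approximated:

```python
# Sketch of a non-controlling ownerReference used purely for K8s garbage
# collection; the dynamo operator remains the controlling owner of the DGD.
def build_owner_reference(raycluster: dict) -> dict:
    return {
        "apiVersion": raycluster["apiVersion"],
        "kind": raycluster["kind"],
        "name": raycluster["metadata"]["name"],
        "uid": raycluster["metadata"]["uid"],
        # controller=False: we only want cascade deletion, not adoption.
        "controller": False,
        "blockOwnerDeletion": False,
    }

rc = {"apiVersion": "ray.io/v1", "kind": "RayCluster",
      "metadata": {"name": "train-rc", "uid": "abc-123"}}
ref = build_owner_reference(rc)
# Lands in the DGD's metadata.ownerReferences[0]:
dgd_metadata = {"name": "my-dgd", "ownerReferences": [ref]}
```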

Long-lived path (orchestrate.run):
  * When any DGDs are declared, ensure the training RayCluster *first*,
    fetch its UID, and pass an ownerReference into the DGD apply loop.
  * When no DGDs are declared, behaviour is unchanged (no extra API call).

Rayjob path (cli._run_rayjob):
  * After applying the RayJob, poll for its .status.rayClusterName, look
    up the RayCluster's UID, then apply each DGD with an ownerReference
    pointing at the RayCluster. The DGD apply happens in parallel with
    KubeRay's own entrypoint submission — the entrypoint should
    tolerate a brief window where DGD pods are still coming up
    (a /health curl-loop in the recipe is the natural pattern).
  * --dry-run also renders the DGD manifest (without owner ref, since
    the UID isn't known until apply time).

Tests: 249 unit tests pass (8 new). Covers owner_ref attachment, the new
rayjob status poll, and end-to-end orchestrate.run wiring.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
The qwen3-30b infra entrypoint pipes `python ... 2>&1 | tee "$LOG"` under
/bin/dash, which has no `set -o pipefail` and no $PIPESTATUS. A Python
crash (e.g. ImportError before training starts) leaves the pipeline
exit 0 because tee succeeds, and KubeRay records the RayJob as
SUCCEEDED — the failure is silent until someone reads the log.

Route python's stdout/stderr through a fifo that tee drains, so the
shell sees python's real exit code and `exit "$RC"` propagates it.
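The masked-failure bug and the fifo fix can be reproduced with plain POSIX sh (a sketch, not the actual entrypoint script):

```python
import subprocess

# Reproduce the bug: without pipefail, a pipeline's exit status is the last
# command's, so tee masks the writer's failure. `exit 3` stands in for a
# crashing python process.
buggy = "sh -c 'exit 3' 2>&1 | tee /dev/null"
rc_buggy = subprocess.run(["sh", "-c", buggy]).returncode  # tee wins: 0

# The fix: route output through a fifo that tee drains in the background,
# so the shell sees the writer's real exit code and can propagate it.
fixed = """
FIFO="/tmp/fifo_demo_$$"; mkfifo "$FIFO"
tee /dev/null < "$FIFO" &             # drain the fifo into the log sink
sh -c 'exit 3' > "$FIFO" 2>&1; RC=$?  # shell sees the writer's real status
wait; rm -f "$FIFO"
exit "$RC"
"""
rc_fixed = subprocess.run(["sh", "-c", fixed]).returncode

print(rc_buggy, rc_fixed)  # → 0 3
```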

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: jthomson04 <jothomson@nvidia.com>
`nrl-k8s status` looked up the RayCluster by the bare `cluster.name`
(the value users put in `kuberay.<role>.name`). In `--rayjob` mode
KubeRay creates the RayCluster with a 5-char random suffix and writes
the suffixed name to the RayJob's `.status.rayClusterName`, so the bare
lookup 404s and every role rendered as "(not found)" for the entire
run lifetime.

When the bare cluster lookup misses, fall through to a RayJob lookup
on the same name and follow `.status.rayClusterName` to find the
suffixed cluster. The displayed `name` stays the configured name;
pod listing and daemon dashboard URLs use the resolved name.
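The fallback can be sketched as follows; `get_raycluster`/`get_rayjob` stand in for the real k8s helpers:

```python
# Illustrative sketch of the status fallback: try the RayCluster under the
# configured name first, then follow the RayJob's .status.rayClusterName.
def resolve_cluster_name(name, get_raycluster, get_rayjob):
    if get_raycluster(name) is not None:
        return name  # long-lived RayCluster path: bare name exists
    rayjob = get_rayjob(name)
    if rayjob is not None:
        # --rayjob mode: KubeRay writes the suffixed cluster name here.
        return rayjob.get("status", {}).get("rayClusterName", name)
    return name

clusters = {}  # the bare-name cluster lookup 404s in --rayjob mode
jobs = {"demo": {"status": {"rayClusterName": "demo-x7k2q"}}}
print(resolve_cluster_name("demo", clusters.get, jobs.get))  # → demo-x7k2q
```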

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: jthomson04 <jothomson@nvidia.com>
`_patch_dgd_service_account` was unconditionally overwriting
`serviceAccountName` on every service in a DynamoGraphDeployment, which
broke the dynamo operator's standard pattern of generating a per-DGD
`<dgd>-k8s-service-discovery` SA with RBAC for `endpointslices` and
`dynamoworkermetadatas`. With our infra's SA injected instead, the
worker pods 403 on their discovery reflectors and the DGD deadlocks
at state=pending.

The dynamo operator owns the SA wiring for DGD pods. nrl-k8s should
only honour an explicit `serviceAccountName` already declared in the
manifest's extraPodSpec; otherwise leave the field unset so the
operator can fill it in.
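A sketch of the corrected patching rule, with illustrative names:

```python
# Hypothetical sketch of the fixed _patch_dgd_service_account behaviour:
# honour only a serviceAccountName the manifest already declares, and never
# inject the infra SA, so the operator can wire its per-DGD discovery SA.
def patch_extra_pod_spec(extra_pod_spec: dict, infra_sa: str) -> dict:
    patched = dict(extra_pod_spec)
    # Old (broken) behaviour clobbered the operator's generated
    # <dgd>-k8s-service-discovery SA:
    #   patched["serviceAccountName"] = infra_sa
    # New behaviour: leave the field exactly as declared (or unset).
    return patched

explicit = {"serviceAccountName": "my-custom-sa"}
assert patch_extra_pod_spec(explicit, "infra-sa")["serviceAccountName"] == "my-custom-sa"
assert "serviceAccountName" not in patch_extra_pod_spec({}, "infra-sa")
```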

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: jthomson04 <jothomson@nvidia.com>
Two coupled changes that together let the dynamo backend run a full
GRPO step end-to-end through nemo-gym:

1. `_should_use_async_rollouts` and `_should_use_nemo_gym` in
   `algorithms/grpo.py` accept `backend == "dynamo"`. DynamoGeneration
   exposes `dp_openai_server_base_urls` (the DGD frontend URL) the same
   way an async-vLLM generator does, so the gym dispatch path works
   without further changes — the gates just hard-asserted vLLM only.
   The vllm-specific `expose_http_server` check is now scoped to the
   vllm branch (Dynamo always exposes a frontend; there's no analogous
   knob).

2. `print_performance_metrics` in `algorithms/utils.py` short-circuits
   to `training_num_gpus = total_num_gpus` and `generation_num_gpus = 0`
   when `backend == "dynamo"`. With dynamo, generation lives in a
   separate DGD outside the Ray cluster, so the existing read of
   `policy.generation.colocated.resources.gpus_per_node` (which is null
   on the dynamo path) was raising `TypeError: unsupported operand
   type(s) for *: 'NoneType' and 'int'` after a successful step.
   Also guarded the per-GPU generation-throughput division to avoid a
   ZeroDivisionError in this branch.
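The short-circuit and the division guard can be sketched as follows (field names approximated, not the real `print_performance_metrics` signature):

```python
# Illustrative sketch of the dynamo short-circuit: generation lives in an
# external DGD, so all Ray GPUs are training GPUs and gpus_per_node is None
# on this path (which previously raised TypeError).
def split_gpus(backend: str, total_num_gpus: int, gpus_per_node):
    if backend == "dynamo":
        return total_num_gpus, 0
    generation_num_gpus = gpus_per_node * 1  # simplified stand-in
    return total_num_gpus - generation_num_gpus, generation_num_gpus

def per_gpu_gen_throughput(tokens: float, generation_num_gpus: int):
    # Guard the division: zero generation GPUs on the dynamo path.
    return tokens / generation_num_gpus if generation_num_gpus else None

train, gen = split_gpus("dynamo", 8, None)
print(train, gen, per_gpu_gen_throughput(1000.0, gen))  # → 8 0 None
```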

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: jthomson04 <jothomson@nvidia.com>
`_postprocess_nemo_gym_to_nemo_rl_result` asserts that the policy's
`prompt_token_ids` for turn N+1 form a byte-identical extension of
the tokens accumulated through turn N. Against the Dynamo
`tokenize-endpoint` image (jwillthomson/dynamo-arm-rl-tokenize-endpoint-*)
this fires immediately on the first multi-turn rollout — the frontend
appears to re-tokenize the chat history rather than carry token IDs
verbatim across turns.

Downgrade the assert to a `RuntimeWarning` so the dynamo+nrl-k8s
integration smoke can validate the rest of the pipeline (gym dispatch,
reward computation, Megatron logprobs/training step, perf metrics
print, teardown). A `TODO(dynamo-smoke):` block marks the spot to
re-enable once the tokenize endpoint returns verbatim token IDs (or
nemo-gym is taught to re-derive contiguity from text + tokenizer).

Until then, advantages and logprobs computed on the dynamo path are
approximate — quality numbers from this rollout flow can't be trusted
yet.
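A sketch of the downgraded check, with an illustrative helper name:

```python
import warnings

# Hypothetical sketch of the contiguity check after the downgrade: the turn
# N+1 prompt must extend the tokens accumulated through turn N verbatim;
# a mismatch now warns instead of asserting.
def check_prompt_contiguity(prev_tokens: list[int], next_prompt: list[int]) -> bool:
    contiguous = next_prompt[: len(prev_tokens)] == prev_tokens
    if not contiguous:
        # TODO(dynamo-smoke): restore a hard assert once the tokenize
        # endpoint carries token IDs verbatim across turns.
        warnings.warn(
            "prompt_token_ids for this turn are not a byte-identical "
            "extension of the previous turn; logprobs are approximate",
            RuntimeWarning,
        )
    return contiguous

assert check_prompt_contiguity([1, 2, 3], [1, 2, 3, 4, 5])
```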

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: jthomson04 <jothomson@nvidia.com>
End-to-end smoke pair for the dynamo k8s integration on a GB300 NVL72:

  * recipe: examples/nemo_gym/grpo_workplace_assistant_dynamo_smoke.yaml
    Inherits from grpo_workplace_assistant_nemotron_nano_v2_9b.yaml,
    swaps in Qwen3-4B-Thinking-2507 with Megatron TP=1, clears the
    Nemotron MoE knobs, sets policy.generation.backend=dynamo, and
    trims to 4 rollouts/step × 1 step (no validation, no checkpoint,
    no wandb).

  * infra: infra/nrl_k8s/examples/grpo_workplace_assistant_dynamo_smoke.gb300.infra.yaml
    1-GPU Ray training cluster + a `dynamo:` block referencing the
    DGD manifest. The entrypoint passes `+policy.generation.dynamo_cfg.dgd_name=${user:}-dynamo-wpa-smoke`
    as a Hydra override since the orchestrator's auto-inject only fires
    in code_source: upload mode.

  * DGD: infra/nrl_k8s/examples_dgd/qwen3_4b_thinking_gb300.yaml
    Frontend on customer-cpu, 1× VllmDecodeWorker on a GB300 node, with
    `nvidia.com/kai-scheduler-queue: backfill` annotation (the
    operator's default queue `dynamo` is rejected by Kyverno on this
    cluster). vLLM worker uses --dyn-tool-call-parser hermes and
    --dyn-reasoning-parser qwen3.

Smoke validates: nrl-k8s cluster up + DGD apply + DGD ready + gym
3-server bring-up + Dynamo HTTP rollouts + Megatron policy step +
perf metrics print, in ~2m15s on a warm cluster.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: jthomson04 <jothomson@nvidia.com>
When credentials expire, `dev connect` misreported the problem as
missing RBAC because `kubectl auth can-i` returns non-"yes" on auth
failure. Add a cached `check_api_reachable()` that pings the version
endpoint first, giving a clear "credentials expired" message instead.
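A sketch of the probe-first ordering, assuming standard kubectl flags; names are illustrative, not the real dev-connect code:

```python
import functools
import subprocess

# Cached reachability probe: the /version endpoint fails outright on expired
# credentials, whereas `kubectl auth can-i` just answers non-"yes" and looks
# like missing RBAC.
@functools.lru_cache(maxsize=1)
def check_api_reachable() -> bool:
    result = subprocess.run(
        ["kubectl", "version", "--request-timeout=5s"],
        capture_output=True,
    )
    return result.returncode == 0

def classify_access(api_reachable: bool, can_i_yes: bool) -> str:
    # Probe first, so an auth failure isn't misreported as missing RBAC.
    if not api_reachable:
        return "credentials expired or API unreachable"
    return "ok" if can_i_yes else "missing RBAC"

assert classify_access(False, False) == "credentials expired or API unreachable"
assert classify_access(True, False) == "missing RBAC"
```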

Signed-off-by: Terry Kong <terryk@nvidia.com>
Add infra and training configs for running Nemotron-3-Nano-30B GRPO
training with LLM-as-a-judge (math_with_judge) on GB300 NVL72.

Two judge approaches, both tested end-to-end (10 steps each):

1. local_vllm_model: Qwen3-30B-A3B-Instruct judge spun up by Gym
   inside the Ray cluster (5 nodes, requires venv/torch/device.py
   workarounds on GB300 aarch64).

2. Dynamo: Same judge model served via DynamoGraphDeployment using
   nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.1.1 (4 training nodes +
   external DGD, no workarounds needed).

Both produce equivalent training results (~70-80s/step, KL 0.001-0.003).

Files:
- examples/nemo_gym/grpo_nanov3_judge_4n4g.yaml (local_vllm_model)
- examples/nemo_gym/grpo_nanov3_judge_dynamo_4n4g.yaml (Dynamo)
- infra/nrl_k8s/examples/nanov3_judge_4n4g.gb300.infra.yaml
- infra/nrl_k8s/examples/nanov3_judge_dynamo_4n4g.gb300.infra.yaml
- infra/nrl_k8s/examples_dgd/qwen3_30ba3b_instruct_judge_gb300.yaml

Signed-off-by: Terry Kong <terryk@nvidia.com>
…o, and autoscaling

Three variants of equivalence_llm_judge for nano v3 on GB300 NVL72:

1. local_vllm_model baseline (5 nodes: 4 train + 1 judge)
2. Dynamo static (4 train nodes + 1-replica DGD)
3. Dynamo + Grove autoscaling (4 train nodes + Planner + 1-4 replica DGD)

All three completed 10-step pipecleans. equivalence_llm_judge sends 100%
of samples through the LLM judge (vs math_with_judge ~30% fallback).

Autoscaling DGD adds a Planner service running dynamo.planner with
load-based scaling and scalingAdapter on VllmDecodeWorker replicas.

wandb runs:
  local_vllm: https://wandb.ai/nvidia/nemorl-nanov3-judge/runs/da3vxorh
  dynamo:     https://wandb.ai/nvidia/nemorl-nanov3-judge/runs/35fdqvw3
  autoscale:  https://wandb.ai/nvidia/nemorl-nanov3-judge/runs/2ve6yauj

Signed-off-by: Terry Kong <terryk@nvidia.com>
…ifests

Match the Dynamo vLLM config to what local_vllm_model uses:
- Remove --enforce-eager (enables CUDA graphs, significant perf gain)
- Remove --max-model-len 8192 (let vLLM use model default)

These were copied from jthomson's PR config but aren't needed on GB300
with vllm-runtime:1.1.1 and cause measurable throughput differences
vs the local_vllm_model baseline.

Signed-off-by: Terry Kong <terryk@nvidia.com>
DynamoGraphDeployment (DGD) configuration now lives inline in the infra
YAML alongside RayCluster specs, instead of referencing separate manifest
files. This lets DGD specs use YAML anchors from `_shared:` (node
selectors, env vars, volumes) and eliminates the need to fork DGD
manifests for per-user customization.

Schema: `DynamoGraphSpec` now has required `name` and `spec` fields,
matching `DeploymentSpec`. The `manifest`, `overrides` fields and
`load_dgd_manifest()` are removed — this feature only exists on this
branch, not on main.

Code: `build_dgd_manifest()` builds the K8s envelope (apiVersion/kind/
metadata) around the inline spec, same pattern as RayClusters.

All 5 Dynamo infra examples migrated. The `examples_dgd/` directory
is removed.

Signed-off-by: Terry Kong <terryk@nvidia.com>
The static customer-cpu ARM nodes (2 nodes) are often fully occupied by
Ray head pods. DGD Frontend and Planner pods now target Karpenter's
"cpu" NodePool (amd64 m6a instances) which provisions on demand. The
vllm-runtime:1.1.1 image is multi-arch so amd64 works without issues.

Signed-off-by: Terry Kong <terryk@nvidia.com>
@terrykong terrykong requested review from a team as code owners May 11, 2026 17:12
@copy-pr-bot

copy-pr-bot Bot commented May 11, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@terrykong terrykong marked this pull request as draft May 11, 2026 17:12
Contributor

@jthomson04 jthomson04 left a comment


This won't work in its current state. Can you cherry-pick the latest commits on my branch? We also need a change on the Nemo-Gym side to move the prefix substitution that vllm does into some shared logic. NVIDIA-NeMo/Gym#1294
