Dynamo Nemo-RL K8s integration #2429

Draft
jthomson04 wants to merge 15 commits into NVIDIA-NeMo:main from jthomson04:dynamo-k8s-integration

Conversation

@jthomson04
Contributor

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

@copy-pr-bot

copy-pr-bot Bot commented May 6, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

jthomson04 and others added 13 commits May 11, 2026 15:52
Adds policy.generation.backend=dynamo, a Kubernetes-only generation
backend that forwards rollouts to an externally-managed DynamoGraphDeployment
frontend over HTTP. The class is a thin wrapper around the resolved frontend
URL — no etcd / NATS / worker subprocess management. The DGD owns the
inference stack; nemo-rl just points nemo-gym at it.

Two ways to specify the frontend (mutually exclusive in the config):
  * dgd_name (+ optional namespace, frontend_port) — the class derives the
    cluster-internal URL from the dynamo operator's stable Service naming
    convention (<dgd-name>-frontend). Requires running inside a pod.
  * frontend_url — explicit URL escape hatch for hand-rolled DGDs, external
    clusters, or non-K8s environments. Disables the in-pod assertion.
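
As a rough illustration of the dgd_name path, the URL derivation might look like the following sketch. The function name, default port, and fully-qualified DNS form are assumptions for illustration, not the PR's actual code; only the `<dgd-name>-frontend` Service naming convention comes from the description above.

```python
def resolve_frontend_url(dgd_name: str, namespace: str = "default",
                         frontend_port: int = 8000) -> str:
    """Derive the cluster-internal URL of the operator-created frontend Service."""
    # The dynamo operator names the frontend Service "<dgd-name>-frontend".
    service = f"{dgd_name}-frontend"
    return f"http://{service}.{namespace}.svc.cluster.local:{frontend_port}"
```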

GRPO setup wiring:
  * is_dynamo flag forces colocated_inference=False and routes all GPUs to
    training (the DGD handles inference on its own pods).
  * Dispatch case mirrors the vllm/sglang branches via
    initialize_generation_with_policy.
  * NEED_REFIT=False for the dynamo backend in both grpo_train and
    async_grpo_train — refit isn't supported in this phase, so dynamo
    runs are effectively frozen-policy (eval / inference-only experiments).
    Live refit deferred to a later phase via DGD restart or in-place
    worker reload.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Teaches the nrl-k8s CLI to bring up a DynamoGraphDeployment (DGD)
alongside the training RayCluster. After this lands, a single
nrl-k8s run --raycluster brings up both, waits for the DGD to be
state=successful, and stamps the DGD's name into the recipe before
submitting the training Ray Job.

Schema (infra/nrl_k8s/src/nrl_k8s/schema.py):
  * New DynamoGraphSpec — references a standalone DGD manifest by path
    (typically one of dynamo/recipes/...). Supports name override and
    deep-merged overrides without forking the upstream recipe.
  * New infra.dynamo: dict[str, DynamoGraphSpec], parallel to
    kuberay and deployments.

DGD module (infra/nrl_k8s/src/nrl_k8s/dgd.py):
  * load_dgd_manifest — resolves repo-relative paths against the infra
    YAML's directory; picks the DGD doc out of multi-doc files (skipping
    benchmark Pods that ship alongside in dynamo/recipes/).
  * build_dgd_manifest — deep-merges DynamoGraphSpec.overrides onto the
    loaded .spec, applies metadata.name override, sets namespace, merges
    labels, and patches cross-cutting infra fields (image as default
    only, imagePullSecrets, serviceAccount) across services[*].extraPodSpec.
  * resolve_dgd_name — for the recipe-injection path, returns the
    post-override metadata.name.
  * apply_dgd / get_dgd / delete_dgd / wait_for_dgd_ready / wait_for_dgd_gone
    mirror the RayCluster helpers in k8s.py.
  * is_dgd_crd_installed — namespaced list probe; treats 403 as
    "installed, user lacks list-RBAC" so restricted RBAC doesn't false-
    positive trigger the install hint.
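
The deep-merge semantics that build_dgd_manifest applies to DynamoGraphSpec.overrides can be sketched as follows (a minimal illustration of recursive dict merging, not the module's actual implementation):

```python
def deep_merge(base: dict, overrides: dict) -> dict:
    """Recursively merge `overrides` onto `base`: nested dicts are merged
    key-by-key, while scalars and lists in `overrides` replace the base value."""
    merged = dict(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged
```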

Orchestrator (infra/nrl_k8s/src/nrl_k8s/orchestrate.py):
  * ensure_dgd / delete_dgd mirror ensure_deployment semantics
    (idempotent reuse on match, warn on drift, --recreate to replace).
  * _inject_dynamo_into_recipe stamps policy.generation.backend=dynamo and
    policy.generation.dynamo_cfg.dgd_name=<resolved-name> when exactly
    one DGD is declared. Multi-DGD configs leave the recipe alone.
  * run() brings up DGDs alongside Deployments before RayClusters.
  * LoadedConfig.infra_source_path tracks where the dynamo: block was
    declared so manifest paths resolve correctly in both bundled and
    split layouts.

CLI (infra/nrl_k8s/src/nrl_k8s/cli.py):
  * --target dynamo.<key> resolves alongside kuberay.<role> and
    deployments.<key>. cluster up/down dispatch through ensure_dgd /
    delete_dgd; --dry-run prints the rendered DGD manifest.
  * nrl-k8s check fails fast with a helmfile-install hint when
    infra.dynamo is set but the DGD CRD is missing.
  * _print_check_summary surfaces a DYNAMO section listing each DGD's
    services and ready-timeout.

Helmfile (infra/helm/):
  * dynamo-platform release added with values that mesh with our
    existing kai-scheduler install (install=false, enabled=true) and
    use bundled etcd + NATS for service discovery / event plane.

Examples:
  * Working recipe + infra pair targeting kind clusters
    (dynamo_qwen3_0.6b.{yaml,kind.infra.yaml}) plus a minimal DGD
    manifest (examples_dgd/qwen3_0.6b_kind.yaml).

Out of scope this PR (follow-ups):
  * --rayjob mode integration (needs ownerReferences on the DGD pointing
    at the RayJob).
  * Grove integration for multi-node DGDs.
  * Refit + planner autoscaling (Phase 3 — borrow slime/slynamo's
    external_discovery.py topology-fingerprint pattern).

Tests: 241 unit tests pass under Python 3.12. nrl-k8s check renders the
new example pair end-to-end against a live cluster.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
When nrl-k8s is the one applying a DGD (i.e. infra.dynamo.<key> is
declared), tie its lifetime to the training RayCluster via Kubernetes
ownerReferences. K8s GC cascades the DGD when the RayCluster is deleted,
so inference GPUs free at the same moment as the training GPUs without
hand-rolled cleanup logic.

Owner-of-choice: RayCluster, not RayJob — picking the cluster as parent
means DGD pods die the moment shutdownAfterJobFinishes fires, instead of
hanging around until ttlSecondsAfterFinished (default 1h).

User-managed DGDs (recipes that only set policy.generation.dynamo_cfg.frontend_url
without an infra.dynamo entry) are untouched: nrl-k8s never applies the
DGD, so it never sets an ownerReference, so it has nothing to cascade.

Implementation:
  * dgd.build_owner_reference helper sets controller=False
    (the dynamo operator already controls the DGD; we're a non-controlling
    owner solely for GC).
  * dgd.build_dgd_manifest accepts an owner_ref kwarg that lands in
    metadata.ownerReferences[0].
  * ensure_dgd threads owner_ref through to the builder.
  * k8s.wait_for_rayjob_raycluster_name polls the RayJob's
    .status.rayClusterName so the rayjob path can resolve KubeRay's
    auto-generated cluster name.
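
A non-controlling ownerReference of the kind described above might be built like this (field sources are assumptions about the RayCluster object's shape; the real helper's signature may differ):

```python
def build_owner_reference(raycluster: dict) -> dict:
    """Build an ownerReference pointing at the training RayCluster.

    controller=False: the dynamo operator remains the DGD's controller;
    this reference exists solely so K8s garbage collection cascades the
    DGD when the RayCluster is deleted."""
    return {
        "apiVersion": raycluster["apiVersion"],
        "kind": raycluster["kind"],
        "name": raycluster["metadata"]["name"],
        "uid": raycluster["metadata"]["uid"],
        "controller": False,
        "blockOwnerDeletion": False,
    }
```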

Long-lived path (orchestrate.run):
  * When any DGDs are declared, ensure the training RayCluster *first*,
    fetch its UID, and pass an ownerReference into the DGD apply loop.
  * When no DGDs are declared, behaviour is unchanged (no extra API call).

Rayjob path (cli._run_rayjob):
  * After applying the RayJob, poll for its .status.rayClusterName, look
    up the RayCluster's UID, then apply each DGD with an ownerReference
    pointing at the RayCluster. The DGD apply happens in parallel with
    KubeRay's own entrypoint submission — the entrypoint should
    tolerate a brief window where DGD pods are still coming up
    (a /health curl-loop in the recipe is the natural pattern).
  * --dry-run also renders the DGD manifest (without owner ref, since
    the UID isn't known until apply time).

Tests: 249 unit tests pass (8 new). Covers owner_ref attachment, the new
rayjob status poll, and end-to-end orchestrate.run wiring.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
The qwen3-30b infra entrypoint pipes `python ... 2>&1 | tee "$LOG"` under
/bin/dash, which has no `set -o pipefail` and no $PIPESTATUS. A Python
crash (e.g. ImportError before training starts) leaves the pipeline
exit 0 because tee succeeds, and KubeRay records the RayJob as
SUCCEEDED — the failure is silent until someone reads the log.

Route python's stdout/stderr through a fifo that tee drains, so the
shell sees python's real exit code and `exit "$RC"` propagates it.
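
The fifo pattern can be sketched as below; the simulated crash and temp paths are illustrative, not the recipe's actual entrypoint. The key point is that the shell's `$?` now reflects the python process, not tee:

```shell
LOG=$(mktemp)
FIFO=$(mktemp -u)
mkfifo "$FIFO"
tee "$LOG" < "$FIFO" &   # tee drains the fifo in the background
TEE_PID=$!
# Simulate a crashing python process; its output still reaches the log via tee.
python3 -c 'print("starting"); raise SystemExit(3)' > "$FIFO" 2>&1
RC=$?                     # python's real exit code (3), not tee's
wait "$TEE_PID"
rm -f "$FIFO"
echo "exit code: $RC"
```

Under dash this is the portable alternative to `set -o pipefail`, which only bash/zsh provide.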

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: jthomson04 <jothomson@nvidia.com>
`nrl-k8s status` looked up the RayCluster by the bare `cluster.name`
(the value users put in `kuberay.<role>.name`). In `--rayjob` mode
KubeRay creates the RayCluster with a 5-char random suffix and writes
the suffixed name to the RayJob's `.status.rayClusterName`, so the bare
lookup 404s and every role rendered as "(not found)" for the entire
run lifetime.

When the bare cluster lookup misses, fall through to a RayJob lookup
on the same name and follow `.status.rayClusterName` to find the
suffixed cluster. The displayed `name` stays the configured name;
pod listing and daemon dashboard URLs use the resolved name.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: jthomson04 <jothomson@nvidia.com>
`_patch_dgd_service_account` was unconditionally overwriting
`serviceAccountName` on every service in a DynamoGraphDeployment, which
broke the dynamo operator's standard pattern of generating a per-DGD
`<dgd>-k8s-service-discovery` SA with RBAC for `endpointslices` and
`dynamoworkermetadatas`. With our infra's SA injected instead, the
worker pods 403 on their discovery reflectors and the DGD deadlocks
at state=pending.

The dynamo operator owns the SA wiring for DGD pods. nrl-k8s should
only honour an explicit `serviceAccountName` already declared in the
manifest's extraPodSpec; otherwise leave the field unset so the
operator can fill it in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: jthomson04 <jothomson@nvidia.com>
Two coupled changes that together let the dynamo backend run a full
GRPO step end-to-end through nemo-gym:

1. `_should_use_async_rollouts` and `_should_use_nemo_gym` in
   `algorithms/grpo.py` accept `backend == "dynamo"`. DynamoGeneration
   exposes `dp_openai_server_base_urls` (the DGD frontend URL) the same
   way an async-vLLM generator does, so the gym dispatch path works
   without further changes — the gates just hard-asserted vLLM only.
   The vllm-specific `expose_http_server` check is now scoped to the
   vllm branch (Dynamo always exposes a frontend; there's no analogous
   knob).

2. `print_performance_metrics` in `algorithms/utils.py` short-circuits
   to `training_num_gpus = total_num_gpus` and `generation_num_gpus = 0`
   when `backend == "dynamo"`. With dynamo, generation lives in a
   separate DGD outside the Ray cluster, so the existing read of
   `policy.generation.colocated.resources.gpus_per_node` (which is null
   on the dynamo path) was raising `TypeError: unsupported operand
   type(s) for *: 'NoneType' and 'int'` after a successful step.
   Also guarded the per-GPU generation-throughput division to avoid a
   ZeroDivisionError in this branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: jthomson04 <jothomson@nvidia.com>
`_postprocess_nemo_gym_to_nemo_rl_result` asserts that the policy's
`prompt_token_ids` for turn N+1 form a byte-identical extension of
the tokens accumulated through turn N. Against the Dynamo
`tokenize-endpoint` image (jwillthomson/dynamo-arm-rl-tokenize-endpoint-*)
this fires immediately on the first multi-turn rollout — the frontend
appears to re-tokenize the chat history rather than carry token IDs
verbatim across turns.

Downgrade the assert to a `RuntimeWarning` so the dynamo+nrl-k8s
integration smoke can validate the rest of the pipeline (gym dispatch,
reward computation, Megatron logprobs/training step, perf metrics
print, teardown). A `TODO(dynamo-smoke):` block marks the spot to
re-enable once the tokenize endpoint returns verbatim token IDs (or
nemo-gym is taught to re-derive contiguity from text + tokenizer).

Until then, advantages and logprobs computed on the dynamo path are
approximate — quality numbers from this rollout flow can't be trusted
yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: jthomson04 <jothomson@nvidia.com>
End-to-end smoke pair for the dynamo k8s integration on a GB300 NVL72:

  * recipe: examples/nemo_gym/grpo_workplace_assistant_dynamo_smoke.yaml
    Inherits from grpo_workplace_assistant_nemotron_nano_v2_9b.yaml,
    swaps in Qwen3-4B-Thinking-2507 with Megatron TP=1, clears the
    Nemotron MoE knobs, sets policy.generation.backend=dynamo, and
    trims to 4 rollouts/step × 1 step (no validation, no checkpoint,
    no wandb).

  * infra: infra/nrl_k8s/examples/grpo_workplace_assistant_dynamo_smoke.gb300.infra.yaml
    1-GPU Ray training cluster + a `dynamo:` block referencing the
    DGD manifest. The entrypoint passes `+policy.generation.dynamo_cfg.dgd_name=${user:}-dynamo-wpa-smoke`
    as a Hydra override since the orchestrator's auto-inject only fires
    in code_source: upload mode.

  * DGD: infra/nrl_k8s/examples_dgd/qwen3_4b_thinking_gb300.yaml
    Frontend on customer-cpu, 1× VllmDecodeWorker on a GB300 node, with
    `nvidia.com/kai-scheduler-queue: backfill` annotation (the
    operator's default queue `dynamo` is rejected by Kyverno on this
    cluster). vLLM worker uses --dyn-tool-call-parser hermes and
    --dyn-reasoning-parser qwen3.

Smoke validates: nrl-k8s cluster up + DGD apply + DGD ready + gym
3-server bring-up + Dynamo HTTP rollouts + Megatron policy step +
perf metrics print, in ~2m15s on a warm cluster.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: jthomson04 <jothomson@nvidia.com>
`mini_swe_agent → litellm → policy_model → Dynamo` JSON-serializes token
IDs and logprobs as floats; `torch.tensor(...)` then infers FloatTensor,
which downstream embedding lookups reject ("Expected indices to have
scalar types Long, Int; got FloatTensor"). Pin dtype=long on the
prompt/generation token_ids and dtype=float32 on generation_logprobs.
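
A minimal sketch of the dtype pinning (field names here are illustrative of the result payload's shape, not the exact nemo-rl schema):

```python
import torch

# JSON round-trips coerce token IDs to floats; torch.tensor would then infer
# FloatTensor, which embedding lookups reject. Pin the dtypes explicitly.
payload = {"prompt_token_ids": [1.0, 5.0, 9.0],
           "generation_logprobs": [-0.2, -1.3, -0.07]}
token_ids = torch.tensor(payload["prompt_token_ids"], dtype=torch.long)
logprobs = torch.tensor(payload["generation_logprobs"], dtype=torch.float32)
```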

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…odule

Reverts the assertion-to-warning downgrade from a847616 ("chore(nemo-gym):
downgrade token-contiguity assert to a warning"). That downgrade was a
temporary unblock for Dynamo smokes while the underlying multi-turn
re-tokenization issue was diagnosed.

Root cause: Dynamo's chat preprocessor and tokenize endpoint honor a
required_prefix_token_ids field that splices verbatim model-emitted tokens
into the template-tokenized prefix at the EOS turn boundary, but only when
the field is explicitly populated (no auto-derive on Dynamo's side, unlike
the NeMoRLOpenAIChatRequestMixin in vllm_worker_async.py).

Fix lives in the gym submodule (bumped here): vllm_model app.py now
auto-derives required_prefix_token_ids from the latest assistant message's
prompt_token_ids + generation_token_ids and forwards it to both /chat and
/tokenize. With that in place, the contiguity invariant holds on Dynamo
just as it always has on vLLM, so the strict assert is correct again.
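
The auto-derive described above could look roughly like this (message-dict field names are assumptions about the gym's internal representation, not the submodule's actual code):

```python
def derive_required_prefix_token_ids(messages):
    """Take the latest assistant turn's verbatim token IDs (prompt +
    generation) as the required prefix, so Dynamo's tokenize endpoint
    splices them in instead of re-tokenizing the chat history."""
    for msg in reversed(messages):
        if msg.get("role") == "assistant" and "prompt_token_ids" in msg:
            return msg["prompt_token_ids"] + msg["generation_token_ids"]
    return None  # first turn: no assistant history, nothing to pin
```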

Verified: 80/80 multi-turn rollouts on Dynamo (Qwen3-4B-Thinking +
workplace_assistant gym, 5 steps x 16 rollouts), KL 0.0005-0.0042,
reward learning healthy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The dynamo operator marks .status.state=successful once it has
reconciled the desired-vs-observed pod count, but pods may still be in
ContainerCreating / image-pull / probe-failing at that moment. The CLI
would return from wait_for_dgd_ready while pods were still warming up,
and the first training-side request would hit connection-refused.

Tighten the gate to three conditions instead of one:
  * operator state == successful (existing behavior)
  * every pod owned by the DGD has all containers Ready (kubelet view)
  * the frontend service answers an HTTP request via the K8s API
    server proxy (proves the python dynamo.frontend process has
    actually bound to :8000, not just that the pod is Ready)

The frontend HTTP probe goes through connect_get_namespaced_service_proxy
so it works without standing up a port-forward. 404 from /v1/models is
treated as ready (listener is up, just doesn't recognize the path);
503 / 502 / connection-refused all fall back to "not ready".
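
The status-code interpretation for the frontend probe can be isolated as a small predicate (a sketch under the semantics described above; in the real flow the status would come from the API-server service proxy call, with `None` standing in for connection-refused):

```python
def status_means_ready(status):
    """Interpret an HTTP status from the frontend probe.

    2xx or 404 => the listener has bound to its port (404 just means the
    path isn't recognized); 502/503 or no connection at all => not ready."""
    if status is None:          # connection refused / no response
        return False
    return 200 <= status < 300 or status == 404
```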

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two independent smoke artifacts:

1. Nemotron-3-Nano-30B-A3B-BF16 proof of life on Dynamo + mini_swe_agent
   (3 files). The 30B Mamba-hybrid MoE is the first non-Qwen3 model run
   through the Dynamo path on k8s. Working shape: 2 GB300 worker pods
   (TP=4 × DP=2, AdamW state sharded across all 8 ranks via the
   distributed optimizer — no host offload). 1 single-GPU Dynamo
   decoder serves the rollouts, with --dyn-tool-call-parser nemotron_deci
   + --dyn-reasoning-parser nemotron_nano and mamba_ssm_cache=float32.
   Recipe sets moe_per_layer_logging=false explicitly (parent V2 9B
   recipe doesn't ship it; the policy worker reads the key and would
   KeyError without it) and make_sequence_length_divisible_by=4 (TP=4
   + sequence_parallel requires the seq-len-divides constraint).

2. WPA planner load source recipe (1 file). 32 prompts × 8 generations
   = 256 in-flight WPA rollouts that drive KV utilization past the
   Dynamo planner's static `latency` scale-up threshold (40%). Pure
   chat traffic — no singularity / SWE-bench CPU overhead — so the load
   lands almost entirely on the decoder's KV cache. Pair this with a
   planner-enabled DGD (single decoder, max_gpu_budget=4) and watch the
   planner scale replicas up toward the cap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jthomson04 force-pushed the dynamo-k8s-integration branch from f929d7d to 582c243 on May 11, 2026 16:28
The previous order (apply RayJob → wait for RayCluster → apply DGD with
ownerRef) lost a race against KubeRay: the driver entrypoint is fired
as soon as the RayCluster head is Ready, but DGD pods are still being
admitted, image-pulled, and probed at that point. The gym's first
generation request would hit connection-refused on the dynamo frontend.

Invert the order so inference is already serving before the driver
starts:
  1. apply each DGD (no ownerRef yet — the RayCluster doesn't exist)
  2. wait for DGDs to reach the three-gate readiness state
  3. apply the RayJob
  4. wait for KubeRay to spawn the RayCluster, get its UID
  5. PATCH the ownerReference back onto each DGD

K8s GC reconciles ownerReferences continuously, so adding the ref late
still produces cascade-delete-on-RayCluster-shutdown — the DGD is just
not anchored during the brief window between steps 1 and 5.

Rollback paths cover both failure modes: if a DGD apply fails partway
through, any siblings already up are deleted (they have no parent yet
so GC won't reap them); if the RayJob apply fails, all DGDs that were
brought up are deleted so the next attempt isn't blocked by name
collision on the admission webhook.

Works across KubeRay versions — no dependency on .spec.suspend or any
other feature gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@terrykong mentioned this pull request May 11, 2026
…ds fix

The gym branch jthomson04/required-prefix-token-ids was rebased onto
current NVIDIA-NeMo/Gym main and dropped three unrelated tool-call /
reasoning-content fixes that had accumulated on the original commit.
The submodule now points at the scoped fix: one commit, one file, just
the required_prefix_token_ids auto-derive on /chat + tokenize allowlist.

Functionally equivalent to the previous pointer for the contiguity
assert.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

✅ Submodule Fast-Forward Check Results

Check based on commit: 0159c21 (PR #2429 from dynamo-k8s-integration)

✅ Submodules that are properly updated:

Gym: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

@jthomson04 force-pushed the dynamo-k8s-integration branch from 2e5307d to 0159c21 on May 11, 2026 23:19
