Deepseek v4 support - Automodel path #2460

Draft

sharonyu-115 wants to merge 20 commits into NVIDIA-NeMo:main from sharonyu-115:deepseek-v4-support

Conversation

@sharonyu-115
Contributor

What does this PR do?

Supports GRPO training for DeepSeek-V4-Flash in the Automodel path.

Issues

Addresses issue #2331.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

sharonyu-115 and others added 20 commits April 24, 2026 01:51
Swap the vllm extra to a prebuilt DSV4 wheel extracted from
vllm/vllm-openai:deepseekv4-cu129 (commit 306b63f67) and bump deep_gemm
to the exact commit vllm-DSV4 vendors (7f2a703), so the top-level
DeepGEMM install provides the DSV4-specific kernels
(tf32_hc_prenorm_gemm, fp8_fp4_*, etc.) on Python 3.13 and vllm's
_import_deep_gemm picks it up instead of the cpython-312 vendored _C.so.

pyproject changes:
- torch 2.10.0 -> 2.11.0 (project deps, build group, and uv override;
  the DSV4 wheel's Requires-Dist hard-pins torch==2.11.0 and its
  .so files link against torch 2.11's libtorch)
- torchvision 0.25.0 -> 0.26.0 (project deps)
- torchaudio 2.10.0 -> 2.11.0 (uv override)
- deep_gemm commit 7b6b5563... -> 7f2a703e... to match vllm-DSV4's
  cmake/external_projects/deepgemm.cmake
- vllm==0.17.1 -> vllm @ file:// extracted DSV4 wheel (1.2 GB,
  cp38-abi3-linux_x86_64, 8 of 9 .so files are abi3-stable so they
  load on Python 3.13 despite being built against 3.12)

requires-python stays >=3.13.13. Existing transformers==5.3.0 override
already covers the wheel's transformers<5 metadata pin.

Verified end-to-end:
- uv sync --extra vllm resolves cleanly on Python 3.13 (472 packages)
- vllm._C loads against torch 2.11 in VllmGenerationWorker venv
- DeepseekV4ForCausalLM present in ModelRegistry
- llm.generate on DeepSeek-V4-Flash returns "Paris" for
  "The capital of France is" (matches Stage A `vllm serve` output
  byte-for-byte)

Validation script at tools/test_vllm_dsv4_inference.py (separate commit).
Readiness doc at docs/model-readiness/deepseek-v4/ (separate commit).
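
A minimal smoke-test sketch in the spirit of tools/test_vllm_dsv4_inference.py,
assuming the standard vllm.LLM entrypoint; the checkpoint path below is a
placeholder:

```python
# Hedged sketch: greedy generation against a local DSV4-Flash checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(model="/path/to/DeepSeek-V4-Flash", trust_remote_code=True)  # placeholder path
params = SamplingParams(temperature=0.0, max_tokens=8)  # greedy, as in the verification above
outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)  # expected to begin with " Paris"
```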

Signed-off-by: Shuang Yu <shuangy@nvidia.com>
…sformers to 5.5

Vendor in-flight DSV4-Flash support from NVIDIA-NeMo/Automodel#2039.

.gitmodules + submodule: Automodel -> khazic/Automodel_lao @
feat/deepseek-v4-flash, gitlink at ab2d7a08 (PR NVIDIA-NeMo#2039 head, 24
commits). The PR registers DeepseekV4ForCausalLM natively in
nemo_automodel/_transformers/registry.py, ships FP4-expert + FP8-
attention loaders, and a state_dict_adapter with
convert_single_tensor_to_hf so refit-to-vLLM works through the
existing dtensor_params_generator path.

transformers 5.3.0 -> 5.5.0 (pyproject + uv.lock). Required
because Automodel main forwarded past our previous pin and now
imports transformers.models.gemma4.modeling_gemma4 unconditionally
in components/distributed/parallelizer.py:49 (gemma4 ships in
transformers >= 5.5.0.dev). vLLM stays pinned to the local DSV4
wheel; the override on transformers in [tool.uv] supersedes the
wheel's <5 metadata declaration. Runtime smoke on the rebaked sqsh
confirms vllm._C loads clean against transformers 5.5 and DSV4
returns "Paris." end-to-end via the standard NeMo-RL vllm_worker
init path.

Signed-off-by: Shuang Yu <shuangy@nvidia.com>
Signed-off-by: Zhaopeng Qiu <alexq@nvidia.com>
vLLM wheel: replace internal-build g62d441ee8 with teammate's local build
off vllm-project/vllm main @ c0879d948 (post #40860 "[Feat] DeepSeek V4
Rebased" merge). The new wheel includes the dummy-load fix (lazy
finalize_weights() in MoE forward), so the load_format=auto recipe
override can be revisited once a quick smoke validates dummy-load works.

deep_gemm: 7f2a703e -> 891d57b4 to match vllm main's
cmake/external_projects/deepgemm.cmake pin.

flashinfer-python / flashinfer-cubin: 0.6.4 -> 0.6.8.post1; cutlass-dsl
>=4.4.0.dev1 -> >=4.4.2. Matches the new vllm wheel's Requires-Dist
declarations exactly.

requires-python: add upper bound <3.14. The new wheel is cp313-cp313 only
(not cp38-abi3 like the prior internal builds), so 3.14 splits in uv lock
fail to match.

override-dependencies: tighten flashinfer pins from >=0.5.0 to
==0.6.8.post1. uv overrides REPLACE constraints rather than merge them,
so a loose floor was letting the resolver pick newer post-releases (0.6.9)
instead of the wheel's exact pin. sglang's 0.6.7.post2 is intentionally
superseded since this branch is vllm-only.

Companion to bring-up status update at
docs/model-readiness/deepseek-v4/bring-up-status.md (will follow once
Automodel branch swap lands).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Shuang Yu <shuangy@nvidia.com>
Validated in jobs 11363285 (bake) + 11363286 (inference):
greedy generation against DeepSeek-V4-Flash-Base produced coherent text
("The capital of France is" -> " Paris. The capital of Germany is
Berlin..."). The original `_load_w13: size of tensor a (2048) must
match tensor b (16) at dim 0` blocker is now cleared on the vLLM side;
the Automodel side is still pending the teammate's branch.

## What changed

### pyproject.toml + uv.lock — torch 2.10 rollback

Larkz's vllm wheel `vllm-0.19.2rc1.dev219+gc0879d948.cu129-cp313-cp313`
ships METADATA `Requires-Dist: torch==2.11.0` but its `_C.abi3.so` was
compiled in a `torch==2.10.0` venv (per larkz/nemorl-ds4/pyproject.toml
pin) and references the OLD 4-arg `MessageLogger(char const*, int, int,
bool)` constructor that torch 2.11's libc10 dropped. First bake (job
11362567) failed at venv rebuild with `undefined symbol:
_ZN3c1013MessageLoggerC1EPKciib`.

Reverted in 5 sites:
- project.dependencies torch 2.11.0 -> 2.10.0
- project.dependencies torchvision 0.26.0 -> 0.25.0
- dependency-groups.build torch 2.11.0 -> 2.10.0
- tool.uv.override-dependencies torch / torchaudio -> 2.10.0
- tool.uv.override-dependencies + explicit torchvision==0.25.0

uv lock confirmed: torch / torchaudio / torchvision rolled to 2.10.0+cu129
/ 2.10.0+cu129 / 0.25.0+cu129; nccl 2.28.9 -> 2.27.5 transitively.

### tools/patch_vllm_dsv4_base_fp8_quick.sh — post-#40860 anchors

Anchor #7 rewritten: upstream's rewritten `load_weights` now captures
the result before calling `self.model.finalize_mega_moe_weights()`, so
"return loader.load_weights(...)" no longer matches; new anchor
"loaded_params = loader.load_weights(...)".

Added 8th anchor: forces `use_mega_moe = False` when
`VLLM_DSV4_BASE_FP8=1`. Without this, post-load
`finalize_mega_moe_weights()` would invoke `experts.finalize_weights()`
on a non-MegaMoE FusedMoE layer (Base routes through `Fp8MoEMethod`,
not `DeepseekV4MegaMoEExperts`). Self-contained env gating means
recipes don't also need to override `moe_backend`.

Verified against the larkz wheel: 9 anchors apply cleanly,
`py_compile` clean.
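
For context, a hedged Python sketch of the anchor-style patching the shell
script relies on (the real tool is a bash script; this helper and its names
are illustrative). Each anchor must match exactly once, which is why the
upstream `load_weights` rewrite above required re-anchoring:

```python
# Hedged sketch of single-match anchor patching; not the actual script.
from pathlib import Path

def apply_anchor_patch(path: Path, anchor: str, replacement: str) -> None:
    text = path.read_text()
    if text.count(anchor) != 1:
        # Fail loudly instead of silently mis-patching a rewritten upstream file.
        raise RuntimeError(f"anchor not found exactly once in {path}: {anchor!r}")
    path.write_text(text.replace(anchor, replacement))

# e.g. re-anchoring on the new capture site:
# apply_anchor_patch(deepseek_v4_py, "loaded_params = loader.load_weights(", patched_block)
```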

### env_refresh_dsv4.sh — bake the Base patch

Now applies two patches over the freshly-built worker venv before
`--container-save`:
1. `tools/vllm_deepseek_v4_config_patch.py` (defensive superset of
   upstream main's narrow rope kwargs fix)
2. `tools/patch_vllm_dsv4_base_fp8_quick.sh` (env-gated Base FP8
   support; idempotent, no-op when `VLLM_DSV4_BASE_FP8` unset)

### tools/test_vllm_dsv4_base_inference.py (new)

Counterpart to test_vllm_dsv4_inference.py for Flash. Sets
`VLLM_DSV4_BASE_FP8=1` before vllm import, re-applies the patch script
defensively (idempotent), runs greedy generation through standalone
`vllm.LLM`. Ran clean against the new sqsh.

### slurm_jobs/dsv4_base_{bake,inference_test}.sub (new)

Submit scripts for the chained bake + inference flow:
- bake: 1 node 8 GPU 3:55 walltime, --container-save
- inference: 1 node 8 GPU 1:00 walltime, --dependency=afterok

### docs/model-readiness/deepseek-v4* (new + updated)

Tracks the full DSV4 bring-up. Bring-up status doc now reflects:
- vLLM half of Base blocker CLEARED
- Torch ABI rollback chronicled
- New sqsh artifact path
  (.../nemo-rl-dsv4-vllm-c0879d-base-torch210-2026-04-27.sqsh ~99.8 GiB)
- Inference job 11363286 success, generated text logged
- Old broken sqsh marked for forensics

## What still blocks

Automodel-side Base FP8-block-quant state_dict_adapter is pending the
teammate's new branch. Once that lands, sqsh rebake + GRPO smoke can
follow. The vLLM-side recipe override `load_format: auto` may now be
removable since the new wheel includes the upstream dummy-load fix
(`finalize_weights()` 3 hits in main's deepseek_v4.py); validation
deferred until first GRPO smoke attempt.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Shuang Yu <shuangy@nvidia.com>
Re-applies the in-scope subset of reverted ef7bfb4 (kept the
docs/scripts out of git per user request).

pyproject.toml + uv.lock — torch 2.11 -> 2.10:
larkz's vllm wheel (vllm-0.19.2rc1.dev219+gc0879d948) ships METADATA
"Requires-Dist: torch==2.11.0" but its _C.abi3.so was compiled in a
torch==2.10 venv (per larkz/nemorl-ds4/pyproject.toml pin) and references
the OLD 4-arg MessageLogger(char const*, int, int, bool) constructor that
torch 2.11's libc10 dropped.  Bake job 11362567 failed at venv rebuild
with `undefined symbol _ZN3c1013MessageLoggerC1EPKciib`.  Reverted in
5 sites (project deps, build group, override-deps).  uv lock confirms
torch / torchaudio / torchvision rolled to 2.10.0+cu129 / 2.10.0+cu129
/ 0.25.0+cu129; nccl 2.28.9 -> 2.27.5 transitively.  Bake job 11363285
+ inference job 11363286 verified the fix end-to-end.

tools/patch_vllm_dsv4_base_fp8_quick.sh — post-#40860 anchors:
- Anchor #7 rewritten: upstream's rewritten load_weights now captures
  the result before calling self.model.finalize_mega_moe_weights(), so
  "return loader.load_weights(...)" no longer matches; new anchor
  "loaded_params = loader.load_weights(...)".
- Added 8th anchor: forces use_mega_moe = False when
  VLLM_DSV4_BASE_FP8=1.  Without this, post-load
  finalize_mega_moe_weights() would invoke experts.finalize_weights()
  on a non-MegaMoE FusedMoE layer (Base routes through Fp8MoEMethod,
  not DeepseekV4MegaMoEExperts).  Self-contained env gating means
  recipes don't also need to override moe_backend.
Verified: 9 anchors apply cleanly against the larkz wheel,
py_compile clean, end-to-end inference passed in job 11363286.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Shuang Yu <shuangy@nvidia.com>
Signed-off-by: Zhaopeng Qiu <alexq@nvidia.com>
Enable end-to-end GRPO for DeepSeek V4 Base under FP8 block-quant
generation by adding fp8_ds_mla as a recognized kv_cache_dtype and
bypassing the KV-scale refit machinery that does not apply to DSV4's
MLA cache layout (scales are packed inline; no Parameter-style
k_scale/v_scale exists).

- vllm/config.py, vllm/quantization/fp8.py: extend kv_cache_dtype
  literal and validation to include fp8_ds_mla.
- vllm/quantization/fp8.py: skip the BaseKVCacheMethod
  process_weights_after_loading patch for fp8_ds_mla. Adapt to the
  upstream should_use_deepgemm_for_fp8_linear API change (now expects
  weight_shape tuple) and handle the BMM 3D layout used by DeepseekV4
  wo_a (rebind .data to preserve Parameter identity / weight_loader).
  Mirror weight_scale into weight_scale_inv so DeepseekV4MLAAttention
  reads the post-DeepGEMM-transform scale.
- vllm/vllm_backend.py, vllm/vllm_generation.py,
  policy/workers/megatron_policy_worker.py: skip the fp8 KV scale
  sync / process_weights / metadata-key paths for fp8_ds_mla.
- algorithms/grpo.py: drop the DTensor / async-rollouts /
  pipeline_model_parallel_size==1 asserts for fp8_ds_mla; those
  constraints come from KV-scale refit which fp8_ds_mla does not
  perform.
- tools/patch_vllm_dsv4_base_fp8_quick.sh: fall back to "ue8m0" when
  scale_fmt is missing from quantization_config (vLLM's
  Fp8Config.from_config strips it).
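
A hedged sketch of the skip pattern shared by those call sites (the function
and attribute names are illustrative, not the actual NeMo-RL code):

```python
# Hypothetical gating helper; fp8_ds_mla packs its scales inline, so there
# are no Parameter-style k_scale/v_scale tensors to refit.
def maybe_sync_kv_scales(worker, kv_cache_dtype: str) -> None:
    if kv_cache_dtype == "fp8_ds_mla":
        return  # no KV-scale sync / process_weights / metadata-key work applies
    worker.sync_fp8_kv_scales()  # hypothetical name for the existing refit path
```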

Signed-off-by: larkzhang-nv <larkz@nvidia.com>
Register vLLM-shipped DeepseekV4Config at module import time so
AutoConfig.from_pretrained resolves the deepseek_v4 model_type without
requiring trust_remote_code paths to fetch it lazily.
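
A minimal sketch of that import-time registration, using transformers'
public AutoConfig.register API; the import location of DeepseekV4Config
inside vLLM is an assumption:

```python
from transformers import AutoConfig
from vllm.transformers_utils.configs import DeepseekV4Config  # assumed path

# Runs at module import time, so AutoConfig.from_pretrained resolves
# model_type "deepseek_v4" without trust_remote_code fetching it lazily.
AutoConfig.register("deepseek_v4", DeepseekV4Config, exist_ok=True)
```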

Signed-off-by: larkzhang-nv <larkz@nvidia.com>
Signed-off-by: Zhaopeng Qiu <alexq@nvidia.com>
DeepSeek-V4's vLLM model does not expose packed_modules_mapping, and several HF parameter names need to be resolved through vLLM's mapper plus DSV4-specific fused module names before NeMo-RL can tell which weights should be block-FP8 cast during refit.

The DSV4 DeepGEMM BMM path also stores some parameters, such as attention wo_a, as 3D local tensors in vLLM while the policy stream provides the 2D global BF16 weight and block scales. Add a refit-time loader wrapper that slices the local TP rows, reshapes BMM weights, transforms block scales into the layout vLLM expects, and restores the original loaders afterward.

When a vLLM loader still fails, print the target parameter and loaded tensor shape/dtype/device so future FP8 refit mismatches are visible in the Ray worker logs.
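
A hedged sketch of the wrap-then-restore shape of that loader wrapper (the
slicing/reshaping details are DSV4-specific and omitted; names here are
illustrative, not the PR's code):

```python
import contextlib

@contextlib.contextmanager
def temporarily_wrapped_loaders(named_params, make_wrapper):
    # Save the original vLLM weight_loader callables.
    originals = {
        name: p.weight_loader
        for name, p in named_params.items()
        if hasattr(p, "weight_loader")
    }
    try:
        for name, loader in originals.items():
            # make_wrapper would slice local TP rows, reshape BMM weights,
            # and transform block scales before delegating to `loader`.
            named_params[name].weight_loader = make_wrapper(name, loader)
        yield
    finally:
        # Restore the original loaders afterward, as the commit describes.
        for name, loader in originals.items():
            named_params[name].weight_loader = loader
```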

Signed-off-by: Zhaopeng Qiu <alexq@nvidia.com>
Pin Automodel to e2564e22, which adds an env-controlled chunked scatter path for grouped MoE FP32 accumulation. This lets DSV4 experiments set NEMO_AUTOMODEL_MOE_SCATTER_CHUNK_ROWS to reduce the output2.float() peak during policy training.
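
A hedged sketch of the chunking idea (using index_add_ for simplicity; the
actual Automodel scatter path differs, and only the env var name comes from
the commit):

```python
import os
import torch

def accumulate_rows_fp32(out_fp32: torch.Tensor, row_idx: torch.Tensor,
                         src_bf16: torch.Tensor) -> None:
    chunk = int(os.environ.get("NEMO_AUTOMODEL_MOE_SCATTER_CHUNK_ROWS", "0"))
    if chunk <= 0:
        # Original behavior: one big .float() upcast of the whole output.
        out_fp32.index_add_(0, row_idx, src_bf16.float())
        return
    for start in range(0, src_bf16.shape[0], chunk):
        sl = slice(start, start + chunk)
        # Only `chunk` rows are upcast to fp32 at a time, capping the peak.
        out_fp32.index_add_(0, row_idx[sl], src_bf16[sl].float())
```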

Signed-off-by: Zhaopeng Qiu <alexq@nvidia.com>
Add an 8-node GRPO recipe for DeepSeek-V4-Flash-Base using Automodel BF16 training with EP64 and vLLM FP8 rollouts. The config carries the DSV4 Base expert-layout override and MoE scatter chunking env var needed by the current bringup path.

Signed-off-by: Zhaopeng Qiu <alexq@nvidia.com>
Resolve string-valued torch dtypes from YAML before constructing the Automodel optimizer so TE FusedAdam can receive kwargs such as exp_avg_dtype: torch.bfloat16.
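
A minimal sketch of that resolution step (the helper name is illustrative):

```python
import torch

def resolve_dtype(value):
    # Turn YAML strings like "torch.bfloat16" into real torch dtypes.
    if isinstance(value, str) and value.startswith("torch."):
        return getattr(torch, value.split(".", 1)[1])
    return value

optim_kwargs = {"exp_avg_dtype": "torch.bfloat16"}  # as parsed from YAML
optim_kwargs = {k: resolve_dtype(v) for k, v in optim_kwargs.items()}
assert optim_kwargs["exp_avg_dtype"] is torch.bfloat16
```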

Remove the old DTensor AdamW foreach/fused false defaults from grpo_math_1B.yaml; those kwargs are not accepted by TE FusedAdam and are no longer required by the current Automodel optimizer setup.

Signed-off-by: Zhaopeng Qiu <alexq@nvidia.com>
init_fp8's hf_overrides was overwriting the on-disk quantization_config
with the FP8_BLOCK_QUANT_KWARGS constant, dropping DSV4-specific keys like
scale_fmt='ue8m0' that deepseek_v4.py reads directly via
config.quantization_config['scale_fmt']. Merge the on-disk
quantization_config under our constant so DSV4-only keys survive while
shared keys (block_size, fmt, etc.) still take our values.
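
A minimal sketch of the merge order (values here are illustrative; only the
constant's name and scale_fmt come from the commit):

```python
FP8_BLOCK_QUANT_KWARGS = {"fmt": "e4m3", "block_size": [128, 128]}  # illustrative
disk_qc = {"scale_fmt": "ue8m0", "fmt": "e4m3"}  # as read from the checkpoint

# Disk config goes underneath: DSV4-only keys survive, shared keys take ours.
merged = {**disk_qc, **FP8_BLOCK_QUANT_KWARGS}
assert merged["scale_fmt"] == "ue8m0"
```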

Applies on top of 579cf37a7.

Signed-off-by: Shuang Yu <shuangy@nvidia.com>
vLLM's DeepGemmExperts._act_mul_quant drops swiglu_limit on the FP8 MoE
path (only DeepGemmFP4Experts propagates gemm1_clamp_limit). For
DSV4-Flash-Base (swiglu_limit=10) this lets routed-expert SwiGLU outputs
go unbounded and rare-vocab tokens win argmax at clause boundaries.

Add a runtime monkey-patch that pre-clamps gate/up halves of the kernel
input, gated by NRL_SWIGLU_LIMIT (no-op when unset). Wire it into both
apply_fp8_patches (Ray worker) and init_fp8 (driver, for the non-Ray mp
executor; Ray workers re-apply idempotently via collective_rpc).
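
A hedged sketch of the clamp itself (the real patch wraps the vLLM kernel
input; the gate/up split and clamp bounds here are illustrative, not the
exact patched semantics):

```python
import os
import torch

def preclamp_swiglu_input(x: torch.Tensor) -> torch.Tensor:
    limit = os.environ.get("NRL_SWIGLU_LIMIT")
    if limit is None:
        return x  # no-op when unset
    lim = float(limit)  # e.g. 10 for DSV4-Flash-Base
    gate, up = x.chunk(2, dim=-1)  # assumes gate/up concatenated on the last dim
    return torch.cat([gate.clamp(max=lim), up.clamp(min=-lim, max=lim)], dim=-1)
```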

Signed-off-by: Shuang Yu <shuangy@nvidia.com>
Use rewards from before reward shaping when computing the per-prompt baseline and std used by dynamic sampling. This prevents shaping penalties, such as DAPO overlong penalties, from making otherwise constant task rewards look valid for dynamic sampling.
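
A minimal sketch of the distinction (tensor values illustrative):

```python
import torch

raw = torch.tensor([[1.0, 1.0, 1.0, 1.0]])  # constant task reward for one prompt group
shaped = raw - torch.tensor([[0.0, 0.4, 0.0, 0.2]])  # e.g. DAPO overlong penalties

# Dynamic sampling should filter this prompt (zero reward variance); shaping
# penalties would otherwise fake variance and keep it.
keep_from_raw = raw.std(dim=-1) > 0        # tensor([False]), correct
keep_from_shaped = shaped.std(dim=-1) > 0  # tensor([True]), the bug being fixed
```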

Issue: NVIDIA-NeMo#2431
Signed-off-by: Zhaopeng Qiu <alexq@nvidia.com>
Bump the Automodel pin from e2564e22 to f6f4b1a on the dsv4-base-fp8 line.

This brings in DSV4 HC and attention numeric fixes: compressed RoPE YaRN correction, fp32 attention-sink softmax, HC residual combine transpose, duplicate sparse-index mask handling, and fp32 FSDP2 sharding for HC modules.

It also includes the lower-memory optimizer resume path that loads optimizer state on CPU, plus the shared-expert SwiGLU fp32 clamp to match the DSV4 reference implementation.

Signed-off-by: Zhaopeng Qiu <alexq@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented May 11, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.
