Deepseek v4 support - Automodel path #2460

Draft

sharonyu-115 wants to merge 20 commits into NVIDIA-NeMo:main from sharonyu-115:deepseek-v4-support

Conversation

@sharonyu-115
Contributor

What does this PR do?

Supports GRPO training for DeepSeek-V4-Flash in the Automodel path.

Issues

Addresses issue #2331.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

sharonyu-115 and others added 20 commits April 24, 2026 01:51
Swap the vllm extra to a prebuilt DSV4 wheel extracted from
vllm/vllm-openai:deepseekv4-cu129 (commit 306b63f67) and bump deep_gemm
to the exact commit vllm-DSV4 vendors (7f2a703), so the top-level
DeepGEMM install provides the DSV4-specific kernels
(tf32_hc_prenorm_gemm, fp8_fp4_*, etc.) on Python 3.13 and vllm's
_import_deep_gemm picks it up instead of the cpython-312 vendored _C.so.

pyproject changes:
- torch 2.10.0 -> 2.11.0 (project deps, build group, and uv override;
  the DSV4 wheel's Requires-Dist hard-pins torch==2.11.0 and its
  .so files link against torch 2.11's libtorch)
- torchvision 0.25.0 -> 0.26.0 (project deps)
- torchaudio 2.10.0 -> 2.11.0 (uv override)
- deep_gemm commit 7b6b5563... -> 7f2a703e... to match vllm-DSV4's
  cmake/external_projects/deepgemm.cmake
- vllm==0.17.1 -> vllm @ file:// extracted DSV4 wheel (1.2 GB,
  cp38-abi3-linux_x86_64, 8 of 9 .so files are abi3-stable so they
  load on Python 3.13 despite being built against 3.12)

requires-python stays >=3.13.13. Existing transformers==5.3.0 override
already covers the wheel's transformers<5 metadata pin.

Verified end-to-end:
- uv sync --extra vllm resolves cleanly on Python 3.13 (472 packages)
- vllm._C loads against torch 2.11 in VllmGenerationWorker venv
- DeepseekV4ForCausalLM present in ModelRegistry
- llm.generate on DeepSeek-V4-Flash returns "Paris" for
  "The capital of France is" (matches Stage A `vllm serve` output
  byte-for-byte)

Validation script at tools/test_vllm_dsv4_inference.py (separate commit).
Readiness doc at docs/model-readiness/deepseek-v4/ (separate commit).
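
A minimal smoke-test sketch in the spirit of tools/test_vllm_dsv4_inference.py,
assuming the standard vllm.LLM entrypoint; the checkpoint path below is a
placeholder:

```python
# Hedged sketch: greedy generation against a local DSV4-Flash checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(model="/path/to/DeepSeek-V4-Flash", trust_remote_code=True)  # placeholder path
params = SamplingParams(temperature=0.0, max_tokens=8)  # greedy, as in the verification above
outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)  # expected to begin with " Paris"
```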

Signed-off-by: Shuang Yu <shuangy@nvidia.com>
…sformers to 5.5

Vendor in-flight DSV4-Flash support from NVIDIA-NeMo/Automodel#2039.

.gitmodules + submodule: Automodel -> khazic/Automodel_lao @
feat/deepseek-v4-flash, gitlink at ab2d7a08 (PR NVIDIA-NeMo#2039 head, 24
commits). The PR registers DeepseekV4ForCausalLM natively in
nemo_automodel/_transformers/registry.py, ships FP4-expert + FP8-
attention loaders, and a state_dict_adapter with
convert_single_tensor_to_hf so refit-to-vLLM works through the
existing dtensor_params_generator path.

transformers 5.3.0 -> 5.5.0 (pyproject + uv.lock). Required
because Automodel main forwarded past our previous pin and now
imports transformers.models.gemma4.modeling_gemma4 unconditionally
in components/distributed/parallelizer.py:49 (gemma4 ships in
transformers >= 5.5.0.dev). vLLM stays pinned to the local DSV4
wheel; the override on transformers in [tool.uv] supersedes the
wheel's <5 metadata declaration. Runtime smoke on the rebaked sqsh
confirms vllm._C loads clean against transformers 5.5 and DSV4
returns "Paris." end-to-end via the standard NeMo-RL vllm_worker
init path.

Signed-off-by: Shuang Yu <shuangy@nvidia.com>
Signed-off-by: Zhaopeng Qiu <alexq@nvidia.com>
vLLM wheel: replace internal-build g62d441ee8 with teammate's local build
off vllm-project/vllm main @ c0879d948 (post #40860 "[Feat] DeepSeek V4
Rebased" merge). The new wheel includes the dummy-load fix (lazy
finalize_weights() in MoE forward), so the load_format=auto recipe
override can be revisited once a quick smoke validates dummy-load works.

deep_gemm: 7f2a703e -> 891d57b4 to match vllm main's
cmake/external_projects/deepgemm.cmake pin.

flashinfer-python / flashinfer-cubin: 0.6.4 -> 0.6.8.post1; cutlass-dsl
>=4.4.0.dev1 -> >=4.4.2. Matches the new vllm wheel's Requires-Dist
declarations exactly.

requires-python: add upper bound <3.14. The new wheel is cp313-cp313 only
(not cp38-abi3 like the prior internal builds), so 3.14 splits in uv lock
fail to match.

override-dependencies: tighten flashinfer pins from >=0.5.0 to
==0.6.8.post1. uv overrides REPLACE constraints rather than merge them,
so a loose floor was letting the resolver pick newer post-releases (0.6.9)
instead of the wheel's exact pin. sglang's 0.6.7.post2 is intentionally
superseded since this branch is vllm-only.

Companion to bring-up status update at
docs/model-readiness/deepseek-v4/bring-up-status.md (will follow once
Automodel branch swap lands).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Shuang Yu <shuangy@nvidia.com>
Validated in jobs 11363285 (bake) + 11363286 (inference):
greedy generation against DeepSeek-V4-Flash-Base produced coherent text
("The capital of France is" -> " Paris. The capital of Germany is
Berlin..."). The original `_load_w13: size of tensor a (2048) must
match tensor b (16) at dim 0` blocker is now cleared on the vLLM side;
the Automodel side is still pending the teammate's branch.

## What changed

### pyproject.toml + uv.lock — torch 2.10 rollback

Larkz's vllm wheel `vllm-0.19.2rc1.dev219+gc0879d948.cu129-cp313-cp313`
ships METADATA `Requires-Dist: torch==2.11.0` but its `_C.abi3.so` was
compiled in a `torch==2.10.0` venv (per larkz/nemorl-ds4/pyproject.toml
pin) and references the OLD 4-arg `MessageLogger(char const*, int, int,
bool)` constructor that torch 2.11's libc10 dropped. First bake (job
11362567) failed at venv rebuild with `undefined symbol:
_ZN3c1013MessageLoggerC1EPKciib`.

Reverted in 5 sites:
- project.dependencies torch 2.11.0 -> 2.10.0
- project.dependencies torchvision 0.26.0 -> 0.25.0
- dependency-groups.build torch 2.11.0 -> 2.10.0
- tool.uv.override-dependencies torch / torchaudio -> 2.10.0
- tool.uv.override-dependencies + explicit torchvision==0.25.0

uv lock confirmed: torch / torchaudio / torchvision rolled to 2.10.0+cu129
/ 2.10.0+cu129 / 0.25.0+cu129; nccl 2.28.9 -> 2.27.5 transitively.

### tools/patch_vllm_dsv4_base_fp8_quick.sh — post-#40860 anchors

Anchor #7 rewritten: upstream's rewritten `load_weights` now captures
the result before calling `self.model.finalize_mega_moe_weights()`, so
"return loader.load_weights(...)" no longer matches; new anchor
"loaded_params = loader.load_weights(...)".

Added 8th anchor: forces `use_mega_moe = False` when
`VLLM_DSV4_BASE_FP8=1`. Without this, post-load
`finalize_mega_moe_weights()` would invoke `experts.finalize_weights()`
on a non-MegaMoE FusedMoE layer (Base routes through `Fp8MoEMethod`,
not `DeepseekV4MegaMoEExperts`). Self-contained env gating means
recipes don't also need to override `moe_backend`.

Verified against the larkz wheel: 9 anchors apply cleanly,
`py_compile` clean.
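
For context, a hedged Python sketch of the anchor-style patching the shell
script relies on (the real tool is a bash script; this helper and its names
are illustrative). Each anchor must match exactly once, which is why the
upstream `load_weights` rewrite above required re-anchoring:

```python
# Hedged sketch of single-match anchor patching; not the actual script.
from pathlib import Path

def apply_anchor_patch(path: Path, anchor: str, replacement: str) -> None:
    text = path.read_text()
    if text.count(anchor) != 1:
        # Fail loudly instead of silently mis-patching a rewritten upstream file.
        raise RuntimeError(f"anchor not found exactly once in {path}: {anchor!r}")
    path.write_text(text.replace(anchor, replacement))

# e.g. re-anchoring on the new capture site:
# apply_anchor_patch(deepseek_v4_py, "loaded_params = loader.load_weights(", patched_block)
```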

### env_refresh_dsv4.sh — bake the Base patch

Now applies two patches over the freshly-built worker venv before
`--container-save`:
1. `tools/vllm_deepseek_v4_config_patch.py` (defensive superset of
   upstream main's narrow rope kwargs fix)
2. `tools/patch_vllm_dsv4_base_fp8_quick.sh` (env-gated Base FP8
   support; idempotent, no-op when `VLLM_DSV4_BASE_FP8` unset)

### tools/test_vllm_dsv4_base_inference.py (new)

Counterpart to test_vllm_dsv4_inference.py for Flash. Sets
`VLLM_DSV4_BASE_FP8=1` before vllm import, re-applies the patch script
defensively (idempotent), runs greedy generation through standalone
`vllm.LLM`. Ran clean against the new sqsh.

### slurm_jobs/dsv4_base_{bake,inference_test}.sub (new)

Submit scripts for the chained bake + inference flow:
- bake: 1 node 8 GPU 3:55 walltime, --container-save
- inference: 1 node 8 GPU 1:00 walltime, --dependency=afterok

### docs/model-readiness/deepseek-v4* (new + updated)

Tracks the full DSV4 bring-up. Bring-up status doc now reflects:
- vLLM half of Base blocker CLEARED
- Torch ABI rollback chronicled
- New sqsh artifact path
  (.../nemo-rl-dsv4-vllm-c0879d-base-torch210-2026-04-27.sqsh ~99.8 GiB)
- Inference job 11363286 success, generated text logged
- Old broken sqsh marked for forensics

## What still blocks

Automodel-side Base FP8-block-quant state_dict_adapter is pending the
teammate's new branch. Once that lands, sqsh rebake + GRPO smoke can
follow. The vLLM-side recipe override `load_format: auto` may now be
removable since the new wheel includes the upstream dummy-load fix
(`finalize_weights()` 3 hits in main's deepseek_v4.py); validation
deferred until first GRPO smoke attempt.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Shuang Yu <shuangy@nvidia.com>
Re-applies the in-scope subset of reverted ef7bfb4 (kept the
docs/scripts out of git per user request).

pyproject.toml + uv.lock — torch 2.11 -> 2.10:
larkz's vllm wheel (vllm-0.19.2rc1.dev219+gc0879d948) ships METADATA
"Requires-Dist: torch==2.11.0" but its _C.abi3.so was compiled in a
torch==2.10 venv (per larkz/nemorl-ds4/pyproject.toml pin) and references
the OLD 4-arg MessageLogger(char const*, int, int, bool) constructor that
torch 2.11's libc10 dropped.  Bake job 11362567 failed at venv rebuild
with `undefined symbol _ZN3c1013MessageLoggerC1EPKciib`.  Reverted in
5 sites (project deps, build group, override-deps).  uv lock confirms
torch / torchaudio / torchvision rolled to 2.10.0+cu129 / 2.10.0+cu129
/ 0.25.0+cu129; nccl 2.28.9 -> 2.27.5 transitively.  Bake job 11363285
+ inference job 11363286 verified the fix end-to-end.

tools/patch_vllm_dsv4_base_fp8_quick.sh — post-#40860 anchors:
- Anchor #7 rewritten: upstream's rewritten load_weights now captures
  the result before calling self.model.finalize_mega_moe_weights(), so
  "return loader.load_weights(...)" no longer matches; new anchor
  "loaded_params = loader.load_weights(...)".
- Added 8th anchor: forces use_mega_moe = False when
  VLLM_DSV4_BASE_FP8=1.  Without this, post-load
  finalize_mega_moe_weights() would invoke experts.finalize_weights()
  on a non-MegaMoE FusedMoE layer (Base routes through Fp8MoEMethod,
  not DeepseekV4MegaMoEExperts).  Self-contained env gating means
  recipes don't also need to override moe_backend.
Verified: 9 anchors apply cleanly against the larkz wheel,
py_compile clean, end-to-end inference passed in job 11363286.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Shuang Yu <shuangy@nvidia.com>
Signed-off-by: Zhaopeng Qiu <alexq@nvidia.com>
Enable end-to-end GRPO for DeepSeek V4 Base under FP8 block-quant
generation by adding fp8_ds_mla as a recognized kv_cache_dtype and
bypassing the KV-scale refit machinery that does not apply to DSV4's
MLA cache layout (scales are packed inline; no Parameter-style
k_scale/v_scale exists).

- vllm/config.py, vllm/quantization/fp8.py: extend kv_cache_dtype
  literal and validation to include fp8_ds_mla.
- vllm/quantization/fp8.py: skip the BaseKVCacheMethod
  process_weights_after_loading patch for fp8_ds_mla. Adapt to the
  upstream should_use_deepgemm_for_fp8_linear API change (now expects
  weight_shape tuple) and handle the BMM 3D layout used by DeepseekV4
  wo_a (rebind .data to preserve Parameter identity / weight_loader).
  Mirror weight_scale into weight_scale_inv so DeepseekV4MLAAttention
  reads the post-DeepGEMM-transform scale.
- vllm/vllm_backend.py, vllm/vllm_generation.py,
  policy/workers/megatron_policy_worker.py: skip the fp8 KV scale
  sync / process_weights / metadata-key paths for fp8_ds_mla.
- algorithms/grpo.py: drop the DTensor / async-rollouts /
  pipeline_model_parallel_size==1 asserts for fp8_ds_mla; those
  constraints come from KV-scale refit which fp8_ds_mla does not
  perform.
- tools/patch_vllm_dsv4_base_fp8_quick.sh: fall back to "ue8m0" when
  scale_fmt is missing from quantization_config (vLLM's
  Fp8Config.from_config strips it).
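
A hedged sketch of the skip pattern shared by those call sites (the function
and attribute names are illustrative, not the actual NeMo-RL code):

```python
# Hypothetical gating helper; fp8_ds_mla packs its scales inline, so there
# are no Parameter-style k_scale/v_scale tensors to refit.
def maybe_sync_kv_scales(worker, kv_cache_dtype: str) -> None:
    if kv_cache_dtype == "fp8_ds_mla":
        return  # no KV-scale sync / process_weights / metadata-key work applies
    worker.sync_fp8_kv_scales()  # hypothetical name for the existing refit path
```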

Signed-off-by: larkzhang-nv <larkz@nvidia.com>
Register vLLM-shipped DeepseekV4Config at module import time so
AutoConfig.from_pretrained resolves the deepseek_v4 model_type without
requiring trust_remote_code paths to fetch it lazily.
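
A minimal sketch of that import-time registration, using transformers'
public AutoConfig.register API; the import location of DeepseekV4Config
inside vLLM is an assumption:

```python
from transformers import AutoConfig
from vllm.transformers_utils.configs import DeepseekV4Config  # assumed path

# Runs at module import time, so AutoConfig.from_pretrained resolves
# model_type "deepseek_v4" without trust_remote_code fetching it lazily.
AutoConfig.register("deepseek_v4", DeepseekV4Config, exist_ok=True)
```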

Signed-off-by: larkzhang-nv <larkz@nvidia.com>
Signed-off-by: Zhaopeng Qiu <alexq@nvidia.com>
DeepSeek-V4's vLLM model does not expose packed_modules_mapping, and several HF parameter names need to be resolved through vLLM's mapper plus DSV4-specific fused module names before NeMo-RL can tell which weights should be block-FP8 cast during refit.

The DSV4 DeepGEMM BMM path also stores some parameters, such as attention wo_a, as 3D local tensors in vLLM while the policy stream provides the 2D global BF16 weight and block scales. Add a refit-time loader wrapper that slices the local TP rows, reshapes BMM weights, transforms block scales into the layout vLLM expects, and restores the original loaders afterward.

When a vLLM loader still fails, print the target parameter and loaded tensor shape/dtype/device so future FP8 refit mismatches are visible in the Ray worker logs.
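
A hedged sketch of the wrap-then-restore shape of that loader wrapper (the
slicing/reshaping details are DSV4-specific and omitted; names here are
illustrative, not the PR's code):

```python
import contextlib

@contextlib.contextmanager
def temporarily_wrapped_loaders(named_params, make_wrapper):
    # Save the original vLLM weight_loader callables.
    originals = {
        name: p.weight_loader
        for name, p in named_params.items()
        if hasattr(p, "weight_loader")
    }
    try:
        for name, loader in originals.items():
            # make_wrapper would slice local TP rows, reshape BMM weights,
            # and transform block scales before delegating to `loader`.
            named_params[name].weight_loader = make_wrapper(name, loader)
        yield
    finally:
        # Restore the original loaders afterward, as the commit describes.
        for name, loader in originals.items():
            named_params[name].weight_loader = loader
```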

Signed-off-by: Zhaopeng Qiu <alexq@nvidia.com>
Pin Automodel to e2564e22, which adds an env-controlled chunked scatter path for grouped MoE FP32 accumulation. This lets DSV4 experiments set NEMO_AUTOMODEL_MOE_SCATTER_CHUNK_ROWS to reduce the output2.float() peak during policy training.
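
A hedged sketch of the chunking idea (using index_add_ for simplicity; the
actual Automodel scatter path differs, and only the env var name comes from
the commit):

```python
import os
import torch

def accumulate_rows_fp32(out_fp32: torch.Tensor, row_idx: torch.Tensor,
                         src_bf16: torch.Tensor) -> None:
    chunk = int(os.environ.get("NEMO_AUTOMODEL_MOE_SCATTER_CHUNK_ROWS", "0"))
    if chunk <= 0:
        # Original behavior: one big .float() upcast of the whole output.
        out_fp32.index_add_(0, row_idx, src_bf16.float())
        return
    for start in range(0, src_bf16.shape[0], chunk):
        sl = slice(start, start + chunk)
        # Only `chunk` rows are upcast to fp32 at a time, capping the peak.
        out_fp32.index_add_(0, row_idx[sl], src_bf16[sl].float())
```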

Signed-off-by: Zhaopeng Qiu <alexq@nvidia.com>
Add an 8-node GRPO recipe for DeepSeek-V4-Flash-Base using Automodel BF16 training with EP64 and vLLM FP8 rollouts. The config carries the DSV4 Base expert-layout override and MoE scatter chunking env var needed by the current bringup path.

Signed-off-by: Zhaopeng Qiu <alexq@nvidia.com>
Resolve string-valued torch dtypes from YAML before constructing the Automodel optimizer so TE FusedAdam can receive kwargs such as exp_avg_dtype: torch.bfloat16.
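
A minimal sketch of that resolution step (the helper name is illustrative):

```python
import torch

def resolve_dtype(value):
    # Turn YAML strings like "torch.bfloat16" into real torch dtypes.
    if isinstance(value, str) and value.startswith("torch."):
        return getattr(torch, value.split(".", 1)[1])
    return value

optim_kwargs = {"exp_avg_dtype": "torch.bfloat16"}  # as parsed from YAML
optim_kwargs = {k: resolve_dtype(v) for k, v in optim_kwargs.items()}
assert optim_kwargs["exp_avg_dtype"] is torch.bfloat16
```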

Remove the old DTensor AdamW foreach/fused false defaults from grpo_math_1B.yaml; those kwargs are not accepted by TE FusedAdam and are no longer required by the current Automodel optimizer setup.

Signed-off-by: Zhaopeng Qiu <alexq@nvidia.com>
init_fp8's hf_overrides was overwriting the on-disk quantization_config
with the FP8_BLOCK_QUANT_KWARGS constant, dropping DSV4-specific keys like
scale_fmt='ue8m0' that deepseek_v4.py reads directly via
config.quantization_config['scale_fmt']. Merge the on-disk
quantization_config under our constant so DSV4-only keys survive while
shared keys (block_size, fmt, etc.) still take our values.
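
A minimal sketch of the merge order (values here are illustrative; only the
constant's name and scale_fmt come from the commit):

```python
FP8_BLOCK_QUANT_KWARGS = {"fmt": "e4m3", "block_size": [128, 128]}  # illustrative
disk_qc = {"scale_fmt": "ue8m0", "fmt": "e4m3"}  # as read from the checkpoint

# Disk config goes underneath: DSV4-only keys survive, shared keys take ours.
merged = {**disk_qc, **FP8_BLOCK_QUANT_KWARGS}
assert merged["scale_fmt"] == "ue8m0"
```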

Applies on top of 579cf37a7.

Signed-off-by: Shuang Yu <shuangy@nvidia.com>
vLLM's DeepGemmExperts._act_mul_quant drops swiglu_limit on the FP8 MoE
path (only DeepGemmFP4Experts propagates gemm1_clamp_limit). For
DSV4-Flash-Base (swiglu_limit=10) this lets routed-expert SwiGLU outputs
go unbounded and rare-vocab tokens win argmax at clause boundaries.

Add a runtime monkey-patch that pre-clamps gate/up halves of the kernel
input, gated by NRL_SWIGLU_LIMIT (no-op when unset). Wire it into both
apply_fp8_patches (Ray worker) and init_fp8 (driver, for the non-Ray mp
executor; Ray workers re-apply idempotently via collective_rpc).
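
A hedged sketch of the clamp itself (the real patch wraps the vLLM kernel
input; the gate/up split and clamp bounds here are illustrative, not the
exact patched semantics):

```python
import os
import torch

def preclamp_swiglu_input(x: torch.Tensor) -> torch.Tensor:
    limit = os.environ.get("NRL_SWIGLU_LIMIT")
    if limit is None:
        return x  # no-op when unset
    lim = float(limit)  # e.g. 10 for DSV4-Flash-Base
    gate, up = x.chunk(2, dim=-1)  # assumes gate/up concatenated on the last dim
    return torch.cat([gate.clamp(max=lim), up.clamp(min=-lim, max=lim)], dim=-1)
```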

Signed-off-by: Shuang Yu <shuangy@nvidia.com>
Use rewards from before reward shaping when computing the per-prompt baseline and std used by dynamic sampling. This prevents shaping penalties, such as DAPO overlong penalties, from making otherwise constant task rewards look valid for dynamic sampling.
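
A minimal sketch of the distinction (tensor values illustrative):

```python
import torch

raw = torch.tensor([[1.0, 1.0, 1.0, 1.0]])  # constant task reward for one prompt group
shaped = raw - torch.tensor([[0.0, 0.4, 0.0, 0.2]])  # e.g. DAPO overlong penalties

# Dynamic sampling should filter this prompt (zero reward variance); shaping
# penalties would otherwise fake variance and keep it.
keep_from_raw = raw.std(dim=-1) > 0        # tensor([False]), correct
keep_from_shaped = shaped.std(dim=-1) > 0  # tensor([True]), the bug being fixed
```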

Issue: NVIDIA-NeMo#2431
Signed-off-by: Zhaopeng Qiu <alexq@nvidia.com>
Bump the Automodel pin from e2564e22 to f6f4b1a on the dsv4-base-fp8 line.

This brings in DSV4 HC and attention numeric fixes: compressed RoPE YaRN correction, fp32 attention-sink softmax, HC residual combine transpose, duplicate sparse-index mask handling, and fp32 FSDP2 sharding for HC modules.

It also includes the lower-memory optimizer resume path that loads optimizer state on CPU, plus the shared-expert SwiGLU fp32 clamp to match the DSV4 reference implementation.

Signed-off-by: Zhaopeng Qiu <alexq@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented May 11, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.
