fix(automodel): support combined projection tensor sync by taivu1998 · Pull Request #2457 · NVIDIA-NeMo/RL

taivu1998 · 2026-05-10T11:19:18Z

Problem

Fixes #2072.

LoRA weight syncing with the DTensor v2 Automodel path fails for Qwen2/Llama-style custom Automodels because the sync path streams individual tensors through _maybe_adapt_tensor_to_hf(), but Automodel's combined-projection adapter did not provide convert_single_tensor_to_hf().

That left affected runs with two bad options:

fail when a custom model adapter was used for streamed weight sync, or
force the HF implementation with policy.dtensor_cfg.automodel_kwargs.force_hf=true.
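
The failure mode can be sketched as follows. This is a minimal illustration, not NeMo-RL's actual code: `stream_hf_tensors`, `PassThroughAdapter`, and `FullStateDictOnlyAdapter` are hypothetical names, and the assumption that `convert_single_tensor_to_hf()` takes a name and a tensor and returns a list of `(name, tensor)` pairs is made for the example only.

```python
from typing import Any, Iterable, Iterator, List, Tuple


class PassThroughAdapter:
    """Hypothetical adapter that implements the single-tensor hook."""

    def convert_single_tensor_to_hf(self, name: str, tensor: Any) -> List[Tuple[str, Any]]:
        # A real adapter would rename or split combined projections here.
        return [(name, tensor)]


class FullStateDictOnlyAdapter:
    """Hypothetical adapter that only supports whole state-dict conversion."""

    def to_hf(self, state_dict: dict) -> dict:
        return dict(state_dict)


def stream_hf_tensors(
    named_tensors: Iterable[Tuple[str, Any]], adapter: Any
) -> Iterator[Tuple[str, Any]]:
    """Yield HF-named tensors one at a time without building a full state dict."""
    convert = getattr(adapter, "convert_single_tensor_to_hf", None)
    if convert is None:
        # Streaming cannot fall back to to_hf(), which needs every tensor at once.
        raise NotImplementedError(
            f"{type(adapter).__name__} lacks convert_single_tensor_to_hf()"
        )
    for name, tensor in named_tensors:
        # One input tensor may map to several HF tensors (e.g. qkv_proj -> q/k/v).
        yield from convert(name, tensor)
```

Because `stream_hf_tensors` is a generator, the error surfaces when the stream is first consumed, which matches a failure during weight sync rather than at model setup.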

Root Cause

CombinedProjectionStateDictAdapter already handled full state-dict conversion in to_hf(), including:

qkv_proj -> q_proj, k_proj, v_proj
gate_up_proj -> gate_proj, up_proj
LoRA-A duplication
output-side/base/LoRA-B splitting
1-D bias gather/restore handling

However, the adapter lacked the equivalent single-tensor conversion hook required by NeMo-RL's streamed weight update path.
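
As a rough illustration of what a single-tensor version of these conversions involves, the sketch below splits one combined tensor into per-projection tensors. It rests on several assumptions that the PR text does not confirm: the combined weight is a plain concatenation along the output (row) dimension, LoRA keys contain `lora_A`, and only 2-D tensors are handled (the 1-D bias gather/restore path is omitted). Nested lists stand in for tensors so the example runs without PyTorch, and `split_combined` is a hypothetical helper, not the Automodel method.

```python
from typing import Dict, List

Matrix = List[List[float]]  # rows are output features; stands in for a 2-D tensor


def split_rows(weight: Matrix, sizes: List[int]) -> List[Matrix]:
    """Split a matrix along the output (row) dimension into chunks of the given sizes."""
    if sum(sizes) != len(weight):
        raise ValueError(f"sizes {sizes} do not cover {len(weight)} rows")
    chunks: List[Matrix] = []
    start = 0
    for size in sizes:
        chunks.append(weight[start:start + size])
        start += size
    return chunks


def split_combined(key: str, weight: Matrix, q_rows: int, kv_rows: int) -> Dict[str, Matrix]:
    """Map one combined-projection tensor to per-projection HF-style tensors."""
    if "qkv_proj" in key:
        fused, names = "qkv_proj", ["q_proj", "k_proj", "v_proj"]
        sizes = [q_rows, kv_rows, kv_rows]
    elif "gate_up_proj" in key:
        fused, names = "gate_up_proj", ["gate_proj", "up_proj"]
        half = len(weight) // 2  # assumes gate and up have equal output size
        sizes = [half, half]
    else:
        return {key: weight}  # pass-through tensor, nothing to split
    if "lora_A" in key:
        # LoRA-A acts on the shared input, so every split projection gets a copy.
        return {key.replace(fused, name): [row[:] for row in weight] for name in names}
    # Base weights and LoRA-B carry the output dimension, so they are split by rows.
    parts = split_rows(weight, sizes)
    return {key.replace(fused, name): part for name, part in zip(names, parts)}
```

The point of the sketch is that each rule needs only the tensor in hand plus static shape information, which is what makes a per-tensor hook feasible alongside the full `to_hf()` conversion.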

Changes

Update the Automodel submodule to the commit from companion draft PR NVIDIA-NeMo/Automodel#2202 ("fix: add combined projection single-tensor conversion").
Add CombinedProjectionStateDictAdapter.convert_single_tensor_to_hf() in Automodel.
Cover single-tensor QKV/gate-up conversion for base weights, biases, LoRA-A, LoRA-B, excluded _extra_state keys, and pass-through tensors.
Keep NeMo-RL's force_hf compatibility fallback for older or incomplete future adapters.
Update NeMo-RL tests so Qwen2/Llama adapters no longer auto-enable force_hf, while synthetic incomplete adapters still do.
Add NeMo-RL DTensor worker regression tests using the real combined-projection adapter through _maybe_adapt_tensor_to_hf() and dtensor_params_generator().
Refresh exemplar config comments so Qwen2/Llama are no longer documented as requiring force_hf.
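
The compatibility fallback described in the list above can be pictured roughly like this. It is a sketch under assumptions: `resolve_force_hf` and the two adapter classes are hypothetical names, and detecting support via the presence of a callable `convert_single_tensor_to_hf` attribute is a guess at the approach, not a description of NeMo-RL's actual check.

```python
from typing import Any, Dict, Optional


class CompleteAdapter:
    """Hypothetical adapter that supports streamed single-tensor conversion."""

    def convert_single_tensor_to_hf(self, name: str, tensor: Any) -> list:
        return [(name, tensor)]


class IncompleteAdapter:
    """Hypothetical adapter that only converts whole state dicts."""

    def to_hf(self, state_dict: dict) -> dict:
        return dict(state_dict)


def resolve_force_hf(adapter: Optional[Any], automodel_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the kwargs, enabling force_hf only when streaming would fail."""
    resolved = dict(automodel_kwargs)  # never mutate the caller's config
    if adapter is None:
        return resolved  # no custom adapter, so there is nothing to adapt
    if not callable(getattr(adapter, "convert_single_tensor_to_hf", None)):
        # The adapter cannot convert tensors one at a time; use the HF implementation.
        resolved["force_hf"] = True
    return resolved
```

Under this scheme an adapter with the hook leaves the user's configuration untouched, while an adapter without it is routed to the HF implementation automatically.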

Validation

Passed:

uvx ruff check 3rdparty/Automodel-workspace/Automodel/nemo_automodel/components/models/common/combined_projection/state_dict_adapter.py 3rdparty/Automodel-workspace/Automodel/tests/unit_tests/models/common/test_combined_projection_state_dict_adapter.py tests/unit/models/automodel/test_automodel_setup.py tests/unit/models/policy/test_dtensor_worker_v2.py
uvx ruff format --check 3rdparty/Automodel-workspace/Automodel/nemo_automodel/components/models/common/combined_projection/state_dict_adapter.py 3rdparty/Automodel-workspace/Automodel/tests/unit_tests/models/common/test_combined_projection_state_dict_adapter.py tests/unit/models/automodel/test_automodel_setup.py tests/unit/models/policy/test_dtensor_worker_v2.py
git diff --check in the RL repo and Automodel submodule
/usr/local/bin/python3.13 -m py_compile ... on the touched Python files
pytest 3rdparty/Automodel-workspace/Automodel/tests/unit_tests/models/common/test_combined_projection_state_dict_adapter.py

The full Automodel combined-projection adapter test file passed locally: 22 passed.

Partially blocked locally:

pytest tests/unit/models/automodel/test_automodel_setup.py -k MaybeSetForceHf collected after adding ad-hoc optional deps/stubs, but the session-wide Ray fixture failed before any test body ran because Ray dashboard dependencies (aiohttp_cors) were missing in this ad-hoc macOS environment. This is environment setup noise, not a failure in the changed logic.

Draft Note

This is a draft because the parent RL PR advances the Automodel submodule to a commit currently published on the contributor fork and represented by companion Automodel draft PR NVIDIA-NeMo/Automodel#2202. Before marking ready, the submodule pointer should be reconciled with the final Automodel upstream commit/merge strategy.

Signed-off-by: taivu1998 <46636857+taivu1998@users.noreply.github.com>

copy-pr-bot · 2026-05-10T11:19:22Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

fix(automodel): support combined projection tensor sync

c1d1836

Signed-off-by: taivu1998 <46636857+taivu1998@users.noreply.github.com>

github-actions Bot added the community-request label May 10, 2026

taivu1998 wants to merge 1 commit into NVIDIA-NeMo:main from taivu1998:tdv/issue-2072-automodel-single-tensor
