fix(automodel): support combined projection tensor sync #2457

Draft
taivu1998 wants to merge 1 commit into NVIDIA-NeMo:main from taivu1998:tdv/issue-2072-automodel-single-tensor

@taivu1998

Problem

Fixes #2072.

LoRA weight syncing with the DTensor v2 Automodel path fails for Qwen2/Llama-style custom Automodels because the sync path streams individual tensors through _maybe_adapt_tensor_to_hf(), but Automodel's combined-projection adapter did not provide convert_single_tensor_to_hf().

That left affected runs with two bad options:

  • fail when a custom model adapter was used for streamed weight sync, or
  • force the HF implementation with policy.dtensor_cfg.automodel_kwargs.force_hf=true.
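
The failure mode above can be sketched with plain Python stand-ins. This is a minimal illustration, not NeMo-RL's code: `FullStateDictOnlyAdapter`, `maybe_adapt_tensor_to_hf`, and `stream_params` are hypothetical names, and the sketch assumes the per-tensor helper calls `convert_single_tensor_to_hf()` directly on whatever adapter is present.

```python
from typing import Any, Iterable, Iterator, List, Optional, Tuple


class FullStateDictOnlyAdapter:
    """Hypothetical adapter that converts whole state dicts but has no per-tensor hook."""

    def to_hf(self, state_dict: dict) -> dict:
        return dict(state_dict)


def maybe_adapt_tensor_to_hf(
    adapter: Optional[Any], fqn: str, tensor: Any
) -> List[Tuple[str, Any]]:
    """Stand-in for the per-tensor step of a streamed weight sync (assumed behavior)."""
    if adapter is None:
        # No custom adapter: the tensor is already in HF layout.
        return [(fqn, tensor)]
    # Assumption: the hook is called unconditionally, so an adapter that only
    # implements to_hf() fails here with AttributeError.
    return list(adapter.convert_single_tensor_to_hf(fqn, tensor))


def stream_params(
    adapter: Optional[Any], named_params: Iterable[Tuple[str, Any]]
) -> Iterator[Tuple[str, Any]]:
    """Yield HF-named tensors one at a time instead of building a full state dict."""
    for fqn, tensor in named_params:
        yield from maybe_adapt_tensor_to_hf(adapter, fqn, tensor)
```

With `adapter=None` the stream passes tensors through unchanged; with `FullStateDictOnlyAdapter()` the first streamed tensor raises, which is the situation the `force_hf=true` workaround avoided.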

Root Cause

CombinedProjectionStateDictAdapter already handled full state-dict conversion in to_hf(), including:

  • qkv_proj -> q_proj, k_proj, v_proj
  • gate_up_proj -> gate_proj, up_proj
  • LoRA-A duplication
  • output-side/base/LoRA-B splitting
  • 1-D bias gather/restore handling

However, the adapter lacked the equivalent single-tensor conversion hook required by NeMo-RL's streamed weight update path.
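
As a rough sketch of what a single-tensor hook has to do for the cases listed above, the following uses nested Python lists in place of tensors. It is not the Automodel implementation: the function name, the `q_size`/`kv_size`/`inter_size` parameters, the `lora_A`/`lora_B` key naming, and the simple concatenated `[q; k; v]` and `[gate; up]` layout along the output dimension are all assumptions made for illustration.

```python
from typing import Any, List, Sequence, Tuple


def split_rows(tensor: Sequence, sizes: Sequence[int]) -> List[list]:
    """Split along dim 0 into consecutive chunks of the given sizes."""
    if sum(sizes) != len(tensor):
        raise ValueError(f"expected {sum(sizes)} rows, got {len(tensor)}")
    chunks, start = [], 0
    for size in sizes:
        chunks.append(list(tensor[start:start + size]))
        start += size
    return chunks


def convert_single_tensor_to_hf_sketch(
    fqn: str, tensor: Any, q_size: int, kv_size: int, inter_size: int
) -> List[Tuple[str, Any]]:
    """Map one combined-projection tensor to its HF-named pieces (hypothetical)."""
    if fqn.endswith("_extra_state"):
        return []  # excluded keys produce no output
    if ".qkv_proj." in fqn:
        src, targets = "qkv_proj", ("q_proj", "k_proj", "v_proj")
        sizes = (q_size, kv_size, kv_size)
    elif ".gate_up_proj." in fqn:
        src, targets = "gate_up_proj", ("gate_proj", "up_proj")
        sizes = (inter_size, inter_size)
    else:
        return [(fqn, tensor)]  # pass-through tensor
    names = [fqn.replace(f".{src}.", f".{t}.") for t in targets]
    if ".lora_A." in fqn:
        # LoRA-A acts on the shared input, so every split projection reuses it.
        return [(name, tensor) for name in names]
    # Base weights, 1-D biases, and LoRA-B carry the fused output dimension on dim 0.
    return list(zip(names, split_rows(tensor, sizes)))
```

For example, a fused weight with 8 rows and `q_size=4`, `kv_size=2` yields three entries named `q_proj`, `k_proj`, and `v_proj` with 4, 2, and 2 rows. A real adapter would also need to account for sharded tensors and the bias gather/restore handling, which this sketch leaves out.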

Changes

  • Update the Automodel submodule to the commit from companion draft PR NVIDIA-NeMo/Automodel#2202 ("fix: add combined projection single-tensor conversion").
  • Add CombinedProjectionStateDictAdapter.convert_single_tensor_to_hf() in Automodel.
  • Cover single-tensor QKV/gate-up conversion for base weights, biases, LoRA-A, LoRA-B, excluded _extra_state keys, and pass-through tensors.
  • Keep NeMo-RL's force_hf compatibility fallback for older adapters and for any future adapter that still lacks the single-tensor hook.
  • Update NeMo-RL tests so Qwen2/Llama adapters no longer auto-enable force_hf, while synthetic incomplete adapters still do.
  • Add NeMo-RL DTensor worker regression tests using the real combined-projection adapter through _maybe_adapt_tensor_to_hf() and dtensor_params_generator().
  • Refresh exemplar config comments so Qwen2/Llama are no longer documented as requiring force_hf.
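
The retained fallback can be pictured as a capability check on the adapter. The sketch below is an assumption about the shape of that logic, not the NeMo-RL code: `needs_force_hf`, `maybe_set_force_hf`, and the two example adapter classes are hypothetical names.

```python
from typing import Any, Optional


def needs_force_hf(adapter: Optional[Any]) -> bool:
    """True when a custom adapter exists but cannot convert tensors one at a time."""
    if adapter is None:
        return False
    return not callable(getattr(adapter, "convert_single_tensor_to_hf", None))


def maybe_set_force_hf(automodel_kwargs: dict, adapter: Optional[Any]) -> dict:
    """Return a copy of the kwargs with force_hf enabled only for incomplete adapters."""
    updated = dict(automodel_kwargs)
    if needs_force_hf(adapter):
        updated["force_hf"] = True
    return updated


class CompleteAdapter:
    """Hypothetical adapter that provides the single-tensor hook."""

    def convert_single_tensor_to_hf(self, fqn: str, tensor: Any) -> list:
        return [(fqn, tensor)]


class IncompleteAdapter:
    """Hypothetical adapter that only supports full state-dict conversion."""

    def to_hf(self, state_dict: dict) -> dict:
        return dict(state_dict)
```

Under this check, an adapter like `CompleteAdapter` leaves the kwargs untouched, while `IncompleteAdapter` still gets `force_hf` switched on, matching the test expectations described in the list above.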

Validation

Passed:

  • uvx ruff check 3rdparty/Automodel-workspace/Automodel/nemo_automodel/components/models/common/combined_projection/state_dict_adapter.py 3rdparty/Automodel-workspace/Automodel/tests/unit_tests/models/common/test_combined_projection_state_dict_adapter.py tests/unit/models/automodel/test_automodel_setup.py tests/unit/models/policy/test_dtensor_worker_v2.py
  • uvx ruff format --check 3rdparty/Automodel-workspace/Automodel/nemo_automodel/components/models/common/combined_projection/state_dict_adapter.py 3rdparty/Automodel-workspace/Automodel/tests/unit_tests/models/common/test_combined_projection_state_dict_adapter.py tests/unit/models/automodel/test_automodel_setup.py tests/unit/models/policy/test_dtensor_worker_v2.py
  • git diff --check in the RL repo and Automodel submodule
  • /usr/local/bin/python3.13 -m py_compile ... on the touched Python files
  • pytest 3rdparty/Automodel-workspace/Automodel/tests/unit_tests/models/common/test_combined_projection_state_dict_adapter.py

The full Automodel combined-projection adapter test file passed locally (22 passed).

Partially blocked locally:

  • pytest tests/unit/models/automodel/test_automodel_setup.py -k MaybeSetForceHf collected successfully after adding ad-hoc optional deps/stubs, but the session-wide Ray fixture failed before any test body ran because the Ray dashboard dependencies (aiohttp_cors) are missing in this ad-hoc macOS environment. This is environment setup noise, not a failure in the changed logic.

Draft Note

This is a draft because the parent RL PR advances the Automodel submodule to a commit currently published on the contributor fork and represented by companion Automodel draft PR NVIDIA-NeMo/Automodel#2202. Before marking ready, the submodule pointer should be reconciled with the final Automodel upstream commit/merge strategy.

Signed-off-by: taivu1998 <46636857+taivu1998@users.noreply.github.com>
@copy-pr-bot

copy-pr-bot Bot commented May 10, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Development

Successfully merging this pull request may close these issues.

Automodel: CombinedProjectionStateDictAdapter missing convert_single_tensor_to_hf
