fix(automodel): support combined projection tensor sync#2457
Draft
taivu1998 wants to merge 1 commit into
Draft
Conversation
Signed-off-by: taivu1998 <46636857+taivu1998@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Fixes #2072.
LoRA weight syncing with the DTensor v2 Automodel path fails for Qwen2/Llama-style custom Automodels because the sync path streams individual tensors through
_maybe_adapt_tensor_to_hf(), but Automodel's combined-projection adapter did not provideconvert_single_tensor_to_hf().That left affected runs with two bad options:
policy.dtensor_cfg.automodel_kwargs.force_hf=true.Root Cause
CombinedProjectionStateDictAdapteralready handled full state-dict conversion into_hf(), including:qkv_proj->q_proj,k_proj,v_projgate_up_proj->gate_proj,up_projHowever, the adapter lacked the equivalent single-tensor conversion hook required by NeMo-RL's streamed weight update path.
Changes
CombinedProjectionStateDictAdapter.convert_single_tensor_to_hf()in Automodel._extra_statekeys, and pass-through tensors.force_hfcompatibility fallback for older or incomplete future adapters.force_hf, while synthetic incomplete adapters still do._maybe_adapt_tensor_to_hf()anddtensor_params_generator().force_hf.Validation
Passed:
uvx ruff check 3rdparty/Automodel-workspace/Automodel/nemo_automodel/components/models/common/combined_projection/state_dict_adapter.py 3rdparty/Automodel-workspace/Automodel/tests/unit_tests/models/common/test_combined_projection_state_dict_adapter.py tests/unit/models/automodel/test_automodel_setup.py tests/unit/models/policy/test_dtensor_worker_v2.pyuvx ruff format --check 3rdparty/Automodel-workspace/Automodel/nemo_automodel/components/models/common/combined_projection/state_dict_adapter.py 3rdparty/Automodel-workspace/Automodel/tests/unit_tests/models/common/test_combined_projection_state_dict_adapter.py tests/unit/models/automodel/test_automodel_setup.py tests/unit/models/policy/test_dtensor_worker_v2.pygit diff --checkin the RL repo and Automodel submodule/usr/local/bin/python3.13 -m py_compile ...on the touched Python filespytest 3rdparty/Automodel-workspace/Automodel/tests/unit_tests/models/common/test_combined_projection_state_dict_adapter.pyThe full Automodel combined-projection adapter file passed locally: 22 passed.
Partially blocked locally:
pytest tests/unit/models/automodel/test_automodel_setup.py -k MaybeSetForceHfcollected after adding ad-hoc optional deps/stubs, but the session-wide Ray fixture failed before test bodies due missing Ray dashboard dependencies in this macOS ad-hoc environment (aiohttp_cors). This is environment setup noise, not a failure in the changed logic.Draft Note
This is a draft because the parent RL PR advances the Automodel submodule to a commit currently published on the contributor fork and represented by companion Automodel draft PR NVIDIA-NeMo/Automodel#2202. Before marking ready, the submodule pointer should be reconciled with the final Automodel upstream commit/merge strategy.