
feat(vllm): add delta-compressed collective refit #2444

Open

HollowMan6 wants to merge 3 commits into NVIDIA-NeMo:main from HollowMan6:delta_weight_transfer

Conversation

@HollowMan6 (Member) commented May 8, 2026

What does this PR do ?

Adds optional delta-compressed weight transfer for non-colocated vLLM collective refit.

This introduces a delta-aware packed weight transfer protocol that can send either full weights or additive deltas, with support for dense, sparse_indices, and sparse_bitmask delta encodings. The trainer source rank keeps a pinned CPU baseline of the last successfully synced HF-format weights, computes deltas against that baseline, and periodically sends full syncs based on full_sync_interval.

The feature is disabled by default and only applies to non-colocated vLLM refit. Colocated CUDA IPC, vLLM FP8 weights, and ModelOpt quantized vLLM paths are rejected.
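The baseline-and-interval logic described above can be sketched roughly as follows. This is an illustrative simplification, not the PR's actual API: the names `DeltaTracker`, `make_update`, and `commit` are assumptions, and the real implementation works on packed HF-format chunks rather than single tensors.

```python
import torch

class DeltaTracker:
    """Hypothetical sketch: track a CPU baseline and decide full vs. delta sync."""

    def __init__(self, full_sync_interval: int):
        self.full_sync_interval = full_sync_interval
        self.step = 0
        self.baseline: dict[str, torch.Tensor] = {}  # last successfully synced weights

    def should_full_sync(self) -> bool:
        # Full weights on the first sync and every `full_sync_interval`-th sync after.
        return self.step % self.full_sync_interval == 0

    def make_update(self, name: str, weight: torch.Tensor) -> torch.Tensor:
        if self.should_full_sync() or name not in self.baseline:
            return weight.detach().cpu()
        # Additive delta against the last synced baseline.
        return weight.detach().cpu() - self.baseline[name]

    def commit(self, name: str, weight: torch.Tensor) -> None:
        # Record the synced weight as the new baseline; pin it for fast
        # host-to-device copies when CUDA is available.
        cpu = weight.detach().cpu()
        self.baseline[name] = cpu.pin_memory() if torch.cuda.is_available() else cpu

    def advance(self) -> None:
        self.step += 1
```

The consumer side would then either load the payload directly (full sync) or add it onto the current weights (delta sync).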

Issues

N/A

Usage

Enable under the vLLM generation config:

policy:
  generation:
    backend: vllm
    colocated:
      enabled: false
    delta_compression:
      enabled: true
      dtype: ${policy.precision}
      transport: sparse_indices  # dense, sparse_indices, or sparse_bitmask
      full_sync_interval: 20

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • Adds DeltaCompressionTracker and delta-aware packed transfer utilities.
  • Uses only the source trainer rank for pinned CPU baseline tracking.
  • Reuses the full weight transfer API shape: producers can send full or delta chunks.
  • Applies deltas in vLLM through the existing weight loaders under an additive load context.
  • Wires DTensor v1, DTensor v2, and Megatron policy workers through a shared dispatch_packed_weight_transfer(...) helper.
  • Adds shared dtype and transfer type aliases/constants.
  • Adds config examples for GRPO and distillation.
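To make the transports above concrete, here is a minimal sketch of a `sparse_indices`-style encoding: only the nonzero positions and values of a delta are sent, and the consumer scatters them back into a dense tensor. The function names are illustrative, not the PR's actual helpers; a `sparse_bitmask` variant would send a packed boolean mask in place of the index tensor.

```python
import torch

def encode_sparse_indices(delta: torch.Tensor):
    # Flatten, keep only nonzero entries: (indices, values, original shape).
    flat = delta.reshape(-1)
    idx = flat.nonzero(as_tuple=False).reshape(-1)
    return idx, flat[idx], delta.shape

def decode_sparse_indices(idx: torch.Tensor, values: torch.Tensor, shape) -> torch.Tensor:
    # Scatter the values back into a dense tensor of the original shape.
    flat = torch.zeros(torch.Size(shape).numel(), dtype=values.dtype)
    flat[idx] = values
    return flat.reshape(shape)
```

This pays off when a delta is mostly zeros (e.g. after a low-magnitude update with a bf16 round-trip); for dense deltas the `dense` transport avoids the index overhead.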

E2E test results with:

uv run --no-sync bash tests/functional/grpo_non_colocated.sh \
  policy.generation.delta_compression.enabled=true \
  policy.generation.delta_compression.dtype=bfloat16 \
  policy.generation.delta_compression.transport=sparse_indices \
  policy.generation.delta_compression.full_sync_interval=20 \
  logger.monitor_gpus=true
transfer_and_update_weights:
  step 1: 0.7228s
  step 2: 0.1147s

prepare_for_generation/total:
  step 1: 0.7229s
  step 2: 0.1148s

total_step_time:
  step 1: 6.6503s
  step 2: 4.1538s

valid_tokens_per_sec_per_gpu:
  step 1: 121.05
  step 2: 187.78

train/token_mult_prob_error:
  step 1: 1.0161
  step 2: 1.0110

And with Qwen3-8B-Base:

bash tests/functional/grpo_non_colocated.sh \
  policy.model_name=Qwen/Qwen3-8B-Base \
  cluster.gpus_per_node=5 \
  policy.generation.colocated.resources.gpus_per_node=1 \
  policy.generation.delta_compression.enabled=true \
  policy.generation.delta_compression.dtype=bfloat16 \
  policy.generation.delta_compression.transport=sparse_indices \
  policy.generation.delta_compression.full_sync_interval=20 \
  logger.monitor_gpus=true
transfer_and_update_weights:
  step 1: 6.0736s   # full sync
  step 2: 0.7874s   # delta sync

total_step_time:
  step 1: 17.9783s
  step 2: 8.0348s

valid_tokens_per_sec_per_gpu:
  step 1: 11.95
  step 2: 34.31

token_mult_prob_error:
  step 1: 1.0193
  step 2: 1.0163

Copilot AI review requested due to automatic review settings May 8, 2026 18:14
@HollowMan6 HollowMan6 requested review from a team as code owners May 8, 2026 18:14
@copy-pr-bot Bot commented May 8, 2026

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Copilot AI (Contributor) left a comment
Pull request overview

Adds an optional delta-compressed weight transfer protocol for non-colocated vLLM collective refit, enabling the trainer source rank to send full weights or additive deltas (dense / sparse_indices / sparse_bitmask) and apply deltas additively through existing vLLM weight loaders.

Changes:

  • Introduces a delta-aware packed weight transfer protocol (full/delta/done) with sparse delta encodings and a trainer-side DeltaCompressionTracker baseline.
  • Integrates the new transfer path into DTensor v1/v2 and Megatron policy workers via a shared dispatch_packed_weight_transfer(...) helper.
  • Updates vLLM collective refit to optionally consume the new full/delta protocol and adds unit tests + example configs.
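The "apply deltas additively through existing vLLM weight loaders" point can be illustrated with a small context-manager sketch. This is a hedged guess at the shape of such a mechanism, not the PR's actual code: the existing loader keeps overwriting parameters as usual, and the context turns that overwrite into an addition by restoring a snapshot on exit.

```python
from contextlib import contextmanager
import torch

@contextmanager
def additive_load(params: dict[str, torch.Tensor]):
    """Illustrative: make an overwrite-style weight load behave additively.

    Snapshot current values; the loader then overwrites each parameter with
    the incoming delta, and on exit we add the snapshot back, so the final
    value is `old + delta` rather than `delta`.
    """
    snapshot = {name: p.clone() for name, p in params.items()}
    try:
        yield
    finally:
        for name, p in params.items():
            p.add_(snapshot[name])
```

The appeal of this pattern is that the consumer reuses the loader's existing name-mapping and sharding logic unchanged; only the load context differs between full and delta syncs.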

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.

Summary per file:

  • tests/unit/utils/test_weight_transfer.py — Adds unit coverage for delta tracker behavior, sparse transports, additive load context, and producer/consumer roundtrips.
  • nemo_rl/utils/weight_transfer.py — Implements delta tracking, sparse encodings, packed full/delta broadcast protocol, and additive load context.
  • nemo_rl/utils/weight_transfer_types.py — Defines shared literal types/constants for delta compression and transfer kinds.
  • nemo_rl/utils/torch_dtypes.py — Centralizes dtype string→torch.dtype mappings (canonical + aliases).
  • nemo_rl/models/policy/workers/megatron_policy_worker.py — Switches collective weight broadcast to the delta-aware dispatcher when enabled.
  • nemo_rl/models/policy/workers/dtensor_policy_worker.py — Switches collective weight broadcast to the delta-aware dispatcher when enabled.
  • nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py — Switches collective weight broadcast to the delta-aware dispatcher when enabled.
  • nemo_rl/models/generation/vllm/vllm_worker.py — Determines whether to use delta transfer and forwards that flag to the vLLM worker extension.
  • nemo_rl/models/generation/vllm/vllm_worker_async.py — Forwards the delta-transfer enablement flag in the async prepare_refit_info path.
  • nemo_rl/models/generation/vllm/vllm_backend.py — Adds delta-aware collective consumer path and additive-delta loading through existing loaders.
  • nemo_rl/models/generation/vllm/config.py — Extends vLLM generation config typing with delta_compression settings.
  • nemo_rl/models/automodel/setup.py — Reuses canonical dtype mapping from torch_dtypes instead of duplicating it.
  • examples/configs/grpo_math_1B.yaml — Documents/introduces the new delta_compression config block (disabled by default).
  • examples/configs/distillation_math.yaml — Documents/introduces the new delta_compression config block (disabled by default).


Comment thread: nemo_rl/utils/weight_transfer.py (Outdated)

@ZhiyuLi-Nvidia (Contributor) commented:

Awesome @HollowMan6

I found that the delta weight transfer has its own transfer function, which seems duplicated compared with the full weight transfer.

It is out of scope for this PR, but is there anything blocking delta and full weight transfer from sharing the same communication function, while keeping their own independent protocols to pack and unpack the model weights?
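The split suggested here can be sketched as one shared transfer loop that is parameterized by protocol-specific pack/unpack callables. All names below (`transfer_weights`, `pack`, `unpack`, `send`) are hypothetical illustrations of the design, not the repository's actual API, and `send` stands in for the collective broadcast.

```python
from typing import Callable, Iterable
import torch

def transfer_weights(
    named_weights: Iterable[tuple[str, torch.Tensor]],
    pack: Callable[[str, torch.Tensor], dict],
    unpack: Callable[[dict], tuple[str, torch.Tensor]],
    send: Callable[[dict], dict],  # shared communication function (e.g. a broadcast)
) -> dict[str, torch.Tensor]:
    out = {}
    for name, weight in named_weights:
        chunk = send(pack(name, weight))  # only pack/unpack vary per protocol
        rname, tensor = unpack(chunk)
        out[rname] = tensor
    return out

# A "full" protocol packs the tensor as-is; a "delta" protocol could pack
# sparse indices instead - transfer_weights itself stays unchanged.
full_pack = lambda name, w: {"name": name, "data": w}
full_unpack = lambda chunk: (chunk["name"], chunk["data"])
```

Under this shape, the full and delta paths share one communication function and differ only in their packing protocol, which is what the refactor in the following commits appears to aim for.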

@HollowMan6 (Member, Author) replied:

Thank you @ZhiyuLi-Nvidia for pointing this out. I did some refactoring according to your suggestion, and it looks fine.

…nication function

Signed-off-by: Hollow Man <hollowman@opensuse.org>
@HollowMan6 HollowMan6 force-pushed the delta_weight_transfer branch from dce52fe to 857e72f Compare May 8, 2026 23:28
…lling load_delta_weights_func

Signed-off-by: Hollow Man <hollowman@opensuse.org>