
[dev] fix no_shard training convergency and add unittest for no_shard #3835

Open
wplf wants to merge 2 commits into NVIDIA:dev from wplf:jinliang/fix-fsdp-no-shard-dev

Conversation

@wplf
Member

@wplf wplf commented Mar 12, 2026

What does this PR do ?

main PR #3754

  • Fixes no_shard convergence by correcting the grad_norm and all_reduce logic.
  • Fix the no_shard grad_norm calculation (see the sketch after this list).
  • Fix the no_shard all_reduce `continue` bug for gradients.
  • Make no_shard compatible without overlap_param_gather and overlap_grad_reduce.
  • Add a unit test for no_shard.
  • Add an assert False for the no_shard + init_model_with_meta_device combination, following fully_shard.py:326.
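
A minimal sketch of the grad_norm point above, not the actual Megatron-Core implementation (the helper name `clip_grad_norm_no_shard` and the `dp_group` / `is_sharded` arguments are hypothetical): with no_shard, every data-parallel rank holds the full gradient after gradient all-reduce, so summing the squared norm again across the data-parallel group would count it dp_world_size times and inflate the reported norm.

```python
import torch
import torch.distributed as dist


def clip_grad_norm_no_shard(params, max_norm, dp_group=None, is_sharded=False):
    """Hypothetical helper illustrating the no_shard grad-norm fix."""
    grads = [p.grad for p in params if p.grad is not None]
    device = grads[0].device

    # Local sum of squared gradient elements.
    total_sq = torch.zeros(1, dtype=torch.float32, device=device)
    for g in grads:
        total_sq += g.detach().float().pow(2).sum()

    if is_sharded and dp_group is not None:
        # Only sharded strategies hold partial gradients per rank, so only
        # they need to accumulate the squared norm across the DP group.
        # Under no_shard this all-reduce would double-count the gradient.
        dist.all_reduce(total_sq, op=dist.ReduceOp.SUM, group=dp_group)

    total_norm = total_sq.sqrt()
    clip_coef = (max_norm / (total_norm + 1.0e-6)).clamp(max=1.0)
    for g in grads:
        g.mul_(clip_coef.to(g.dtype))
    return total_norm
```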

add unittest for no_shard and add empty cache to avoid OOM

add meta_device_check for no_shard following fully_shard.py:326

Signed-off-by: jinliangl <jinliangl@nvidia.com>
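
For the meta-device guard, a minimal sketch under stated assumptions (the function name `validate_no_shard_config` is hypothetical; `no_shard` and `init_model_with_meta_device` come from the PR description, and the actual check follows fully_shard.py:326):

```python
def validate_no_shard_config(sharding_strategy: str, init_model_with_meta_device: bool) -> None:
    # Hypothetical guard mirroring the check referenced at fully_shard.py:326:
    # meta-device initialization relies on sharded parameter materialization,
    # which no_shard does not perform, so reject the combination early.
    if sharding_strategy == "no_shard":
        assert not init_model_with_meta_device, (
            "no_shard is incompatible with init_model_with_meta_device; "
            "materialize parameters normally or choose a sharded strategy."
        )
```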
@wplf wplf requested review from a team as code owners March 12, 2026 15:29
@copy-pr-bot

copy-pr-bot Bot commented Mar 12, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@wplf wplf changed the title fix no_shard training convergency and add unittest for no_shard [dev] fix no_shard training convergency and add unittest for no_shard Mar 12, 2026
@wplf wplf requested review from cspades and yaoyu-33 March 12, 2026 15:31
Contributor

@FDecaYed FDecaYed left a comment


left comments

Comment thread on megatron/core/optimizer/__init__.py (Outdated)

Labels: None yet

Projects: None yet


3 participants