Fix W&B step mismatch by consolidating log calls #15235

Open
yurekami wants to merge 1 commit into NVIDIA-NeMo:main from yurekami:fix/wandb-step-mismatch

@yurekami
Contributor

Summary

Fixes #15204

The W&B _step counter showed approximately 2x the actual global_step because separate log() calls in the training step caused WandbLogger to increment its internal counter more than once per step.

Root cause: Each call to self.lightning_module.log() triggers WandbLogger to auto-increment its _step counter. Because global_step and step were logged in two separate calls, the counter advanced twice per training step.
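
The mechanism described above can be illustrated with a toy model. The sketch below is not WandbLogger or Lightning code: `ToyStepLogger` and `run_with_separate_calls` are hypothetical names, and the class simply assumes that every logging call advances an internal `_step`, which is the behavior this PR attributes to the W&B logger.

```python
class ToyStepLogger:
    """Hypothetical stand-in for a logger whose internal step advances on every call."""

    def __init__(self):
        self._step = 0
        self.history = []

    def log(self, name, value):
        # Assumption: each call commits one row and then advances the counter.
        self.history.append({name: value, "_step": self._step})
        self._step += 1


def run_with_separate_calls(num_steps):
    """Log global_step and step in two separate calls per training step."""
    logger = ToyStepLogger()
    for global_step in range(num_steps):
        logger.log("global_step", global_step)
        logger.log("step", global_step)
    return logger._step


# Under the stated assumption, the counter ends at twice the number of training steps.
separate_steps = run_with_separate_calls(100)
```

In this toy model the internal counter reaches 200 after 100 training steps, matching the roughly 2x drift reported in the issue.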

Fix: Consolidated the separate log() calls into a single log_dict() call in:

  • megatron_strategy.py
  • fsdp_strategy.py
  • fsdp2_strategy.py
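
A minimal sketch of the consolidation pattern, assuming that each logging call advances the logger's counter once. `ToyModule`, `training_step_before`, and `training_step_after` are hypothetical; `log` and `log_dict` only mirror the names of Lightning's `LightningModule.log` and `LightningModule.log_dict`, not their full signatures, and the actual diff in the three strategy files may look different.

```python
class ToyModule:
    """Hypothetical stand-in for the module the strategies log through."""

    def __init__(self):
        self.logger_step = 0
        self.rows = []

    def log(self, name, value):
        # A single-metric call is treated as a one-entry log_dict call.
        self.log_dict({name: value})

    def log_dict(self, metrics):
        # Assumption: one counter increment per call, however many metrics it carries.
        self.rows.append(dict(metrics))
        self.logger_step += 1


def training_step_before(module, global_step):
    # Two calls per training step: the counter advances twice.
    module.log("global_step", global_step)
    module.log("step", global_step)


def training_step_after(module, global_step):
    # One consolidated call per training step: the counter advances once.
    module.log_dict({"global_step": global_step, "step": global_step})


before, after = ToyModule(), ToyModule()
for s in range(50):
    training_step_before(before, s)
    training_step_after(after, s)
```

After 50 simulated steps, `before.logger_step` is 100 while `after.logger_step` is 50, and each row in `after.rows` carries both metrics together.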

Test plan

  • Verified syntax check passes
  • The fix ensures W&B step aligns with global_step by using a single log_dict() call

🤖 Generated with Claude Code

The W&B _step counter was showing approximately 2x the actual
global_step because `log('global_step')` and `log('step')` were
called separately, causing WandbLogger's internal _step counter
to increment twice per training step.

This fix consolidates both metrics into a single `log_dict()` call
in all three strategy files, ensuring W&B step aligns correctly
with trainer.global_step for accurate cross-run comparisons.

Files modified:
- nemo/lightning/pytorch/strategies/megatron_strategy.py
- nemo/lightning/pytorch/strategies/fsdp_strategy.py
- nemo/lightning/pytorch/strategies/fsdp2_strategy.py

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: yurekami <yurekami@users.noreply.github.com>
@yurekami force-pushed the fix/wandb-step-mismatch branch from b48eb36 to 5a02c95 on December 29, 2025 14:29
Development

Successfully merging this pull request may close these issues.

W&B _step does not match trainer/global_step