feat: add SFT entropy logging and validation loss monitoring #1925

Open
none0663 wants to merge 3 commits into THUDM:main from none0663:feature/sft-entropy-and-val-loss

@none0663
Contributor

Add two monitoring features for SFT training to detect overfitting:

1. Training entropy (--log-sft-entropy):
   - Computes token-level entropy under no_grad to avoid OOM
   - Logged as train/entropy to TensorBoard/WandB

2. Validation loss (--val-data + --val-interval):
   - Full DP-parallel val loss computation with dynamic batching
   - Token-weighted aggregation across ranks (not rank-mean)
   - CP-correct reduction via get_sum_of_sample_mean
   - Deadlock-safe: all ranks synchronize before collective ops
   - Runs initial val before training for baseline
   - Also logs val/entropy when --log-sft-entropy is set
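
For readers unfamiliar with the metric, here is a minimal sketch of what "token-level entropy" means: the Shannon entropy of the softmax distribution at each token position, averaged over the tokens that count toward the loss. It is plain Python for illustration only. The PR computes this on tensors under no_grad; the function names and the use of a loss mask to select tokens are assumptions, not taken from the diff.

```python
import math


def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) for one token position."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    log_z = math.log(z)
    # H = -sum_i p_i * log p_i, where log p_i = (x_i - m) - log z
    return -sum((e / z) * ((x - m) - log_z) for e, x in zip(exps, logits))


def masked_mean_entropy(per_token_logits, loss_mask):
    """Average entropy over positions where loss_mask is truthy (assumed masking)."""
    total, count = 0.0, 0
    for logits, keep in zip(per_token_logits, loss_mask):
        if keep:
            total += token_entropy(logits)
            count += 1
    return total / count if count else 0.0
```

A uniform distribution over V tokens gives entropy ln(V), and a sharply peaked one gives a value near 0, so a steadily falling train/entropy is one hint that the model is becoming overconfident on the training set.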

none0663 and others added 3 commits May 19, 2026 20:20
Add two monitoring features for SFT training to detect overfitting:

1. Training entropy (--log-sft-entropy):
   - Computes token-level entropy under no_grad to avoid OOM
   - Logged as train/entropy to TensorBoard/WandB

2. Validation loss (--val-data + --val-interval):
   - Full DP-parallel val loss computation with dynamic batching
   - Token-weighted aggregation across ranks (not rank-mean)
   - CP-correct reduction via get_sum_of_sample_mean
   - Deadlock-safe: all ranks synchronize before collective ops
   - Runs initial val before training for baseline
   - Also logs val/entropy when --log-sft-entropy is set

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
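
To illustrate why the commit stresses token-weighted aggregation over a rank-mean, here is a small simulation in plain Python. Each rank is represented as a hypothetical (loss_sum, token_count) pair, and the cross-rank sum stands in for what a SUM all-reduce would produce. None of these names come from the diff.

```python
def rank_mean(per_rank):
    """Naive aggregation: average each rank's mean loss, ignoring token counts."""
    means = [loss_sum / n for loss_sum, n in per_rank if n > 0]
    return sum(means) / len(means)


def token_weighted_mean(per_rank):
    """Sum losses and token counts across ranks first, then divide once."""
    total_loss = sum(loss_sum for loss_sum, _ in per_rank)
    total_tokens = sum(n for _, n in per_rank)
    return total_loss / total_tokens if total_tokens else 0.0


# Rank 0 saw 10 tokens with mean loss 1.0; rank 1 saw 90 tokens with mean loss 2.0.
per_rank = [(10.0, 10), (180.0, 90)]
naive = rank_mean(per_rank)               # treats both ranks equally
weighted = token_weighted_mean(per_rank)  # matches the true per-token mean
```

With dynamic batching, ranks can end up with very different token counts, so the two numbers diverge (1.5 versus 1.9 here). A rank with nothing to evaluate can still contribute (0.0, 0) to the sums, which is one way every rank can enter the same collective and avoid a deadlock.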
1. Divide local_tokens by cp_size before all_reduce to avoid overcounting
   when loss_masks are replicated across CP ranks.
2. Use max(0, start_rollout_id - 1) for baseline val step to avoid
   discontinuity and step collision on training resume.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
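
The two fixes in this commit can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: it assumes the token-count reduction spans every DP x CP rank and that each CP rank in a group holds the same full loss mask. The function names are hypothetical.

```python
def global_token_count(local_tokens_per_rank, cp_size):
    """Simulate a SUM all-reduce where each rank first divides by cp_size.

    Without the division, tokens replicated on cp_size CP ranks would be
    counted cp_size times.
    """
    return sum(n / cp_size for n in local_tokens_per_rank)


def baseline_val_step(start_rollout_id):
    """Step used to log the baseline validation run, clamped at 0."""
    return max(0, start_rollout_id - 1)


# dp_size=2, cp_size=2: one DP group has 100 loss tokens, the other 60,
# and each count appears on both CP ranks of its group.
local_counts = [100, 100, 60, 60]
overcounted = sum(local_counts)
corrected = global_token_count(local_counts, cp_size=2)
```

Here the uncorrected sum is 320 while the corrected count is 160, the true number of distinct tokens. On resume from rollout 5, the baseline lands on step 4, just before the first resumed training step, and a fresh run clamps to step 0.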
1. When val_data has fewer samples than dp_size, replicate to all ranks
   instead of leaving empty shards (which would skip val entirely).
2. Skip baseline val when it would collide with the first periodic val
   at the same step (val_interval=1 + start_rollout_id=0).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
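
A rough sketch of both edge cases, for illustration only. The strided sharding and the periodic schedule are assumptions: the sketch supposes periodic validation fires after rollout r when (r + 1) % val_interval == 0 and is logged at step r. The real schedule lives in the diff and may differ; all function names are hypothetical.

```python
def shard_val_samples(samples, dp_rank, dp_size):
    """Give each DP rank its validation shard.

    If there are fewer samples than ranks, replicate the full set to every
    rank rather than leaving some shards empty.
    """
    if len(samples) < dp_size:
        return list(samples)
    return samples[dp_rank::dp_size]  # assumed strided split


def first_periodic_val_step(start_rollout_id, val_interval):
    """First step at which periodic val fires under the assumed schedule."""
    r = start_rollout_id
    while (r + 1) % val_interval != 0:
        r += 1
    return r


def should_run_baseline_val(start_rollout_id, val_interval):
    """Skip the baseline when it would log at the same step as periodic val."""
    baseline_step = max(0, start_rollout_id - 1)
    return baseline_step != first_periodic_val_step(start_rollout_id, val_interval)
```

Replication scales the summed loss and the summed token count by the same factor, so a token-weighted mean is unchanged by it. Under the assumed schedule, the only collision is the case the commit names: val_interval=1 with start_rollout_id=0.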