feat: add --max-checkpoint-count to limit saved checkpoints #1914

Open
JIANG54864 wants to merge 2 commits into THUDM:main from JIANG54864:feat/max-checkpoint-count


@JIANG54864

Add a new training argument --max-checkpoint-count to automatically prune old Megatron checkpoint directories during training. When set, only the N most recent iter_* directories are retained after each save, preventing unbounded disk usage on long training runs.
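The pruning step described above could look roughly like the following. This is a minimal sketch, not the code from this PR: the signature of `cleanup_old_checkpoints()`, its return value, and the assumption that checkpoint directories are named `iter_` followed by the iteration number (as in Megatron's usual `iter_0000020` layout) are all illustrative guesses.

```python
import re
import shutil
from pathlib import Path

# Assumed naming: "iter_" followed by digits, e.g. iter_0000020.
_ITER_DIR = re.compile(r"^iter_(\d+)$")


def cleanup_old_checkpoints(save_dir, max_checkpoint_count):
    """Keep the newest `max_checkpoint_count` iter_* directories under
    `save_dir` and delete the rest. Returns the list of removed paths.

    Hypothetical signature; the PR's actual function may differ.
    """
    if max_checkpoint_count is None:
        return []  # feature disabled, keep everything
    if max_checkpoint_count < 1:
        raise ValueError("max_checkpoint_count must be >= 1")

    root = Path(save_dir)
    if not root.is_dir():
        return []

    # Collect (iteration, path) pairs, ignoring files and unrelated directories.
    found = []
    for child in root.iterdir():
        match = _ITER_DIR.match(child.name)
        if match and child.is_dir():
            found.append((int(match.group(1)), child))

    # Sort by the parsed iteration number rather than by name or mtime.
    found.sort(key=lambda item: item[0])

    # Everything except the last N entries is stale.
    stale = found[:-max_checkpoint_count]

    removed = []
    for _, path in stale:
        shutil.rmtree(path, ignore_errors=True)
        removed.append(path)
    return removed
```

Sorting on the parsed iteration number avoids depending on zero-padding or on filesystem timestamps. In a multi-rank job the deletion would presumably run on a single rank after the save has completed; that coordination is left out of this sketch.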

Changes:

  • slime/utils/arguments.py: add --max-checkpoint-count argument and validation
  • slime/backends/megatron_utils/checkpoint.py: add cleanup_old_checkpoints()
  • slime/backends/megatron_utils/actor.py: call cleanup after save_model()

Usage: --save /path/to/ckpt --save-interval 20 --max-checkpoint-count 5
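For reference, the argument and its validation might be wired up along these lines. This is a standalone sketch using plain `argparse`; the helper names `build_parser()` and `validate_args()` are hypothetical, and the specific checks (a positive count, and `--save` being set) are assumptions about what "validation" covers rather than what the PR actually enforces.

```python
import argparse


def build_parser():
    """Hypothetical parser containing only the flags from the usage line."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--save", type=str, default=None)
    parser.add_argument("--save-interval", type=int, default=None)
    parser.add_argument(
        "--max-checkpoint-count",
        type=int,
        default=None,
        help="Keep only the N most recent iter_* directories; unset keeps all.",
    )
    return parser


def validate_args(args):
    """Assumed checks: the count must be positive and needs a save directory."""
    if args.max_checkpoint_count is not None:
        if args.max_checkpoint_count < 1:
            raise ValueError("--max-checkpoint-count must be >= 1")
        if args.save is None:
            raise ValueError("--max-checkpoint-count requires --save")
    return args


if __name__ == "__main__":
    args = validate_args(
        build_parser().parse_args(
            ["--save", "/path/to/ckpt", "--save-interval", "20", "--max-checkpoint-count", "5"]
        )
    )
    print(args.max_checkpoint_count)
```

Leaving the default at `None` keeps the existing behaviour (no pruning) for runs that do not pass the flag.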

JIANG54864 and others added 2 commits May 16, 2026 12:20

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>