Description
Is your feature request related to a problem? Please describe.
When training large language models on local GPUs, users often face interruptions due to:
- Power outages or system crashes
- Accidental terminal closure
- Need to free up GPU resources temporarily
- Long training sessions that span multiple days
Currently, if training is interrupted, users must restart from scratch, losing hours or days of progress and wasting GPU time. This is particularly frustrating for users with only 4-6 GB of VRAM, who are already working with constrained hardware.
Describe the solution you'd like
Implement a comprehensive checkpoint management system with the following features:
- Automatic Checkpoint Saving
- Save training checkpoints at configurable intervals (e.g., every N epochs or every X minutes)
- Include model weights, optimizer state, training metrics, and hyperparameters
- Store checkpoints in a dedicated directory with timestamps
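As a rough sketch (not existing ModelForge code), the periodic save could look something like the Python below; the `save_checkpoint` helper, the `checkpoints/epochNNNN-timestamp` directory layout, and the `meta.json` fields are all assumptions for illustration:

```python
import json
import time
from pathlib import Path

import torch


def save_checkpoint(model, optimizer, epoch, metrics, hparams, root="checkpoints"):
    """Write model/optimizer state plus metadata to a timestamped directory.

    Hypothetical helper -- names, layout, and fields are illustrative only.
    """
    stamp = time.strftime("%Y%m%d-%H%M%S")
    ckpt_dir = Path(root) / f"epoch{epoch:04d}-{stamp}"
    ckpt_dir.mkdir(parents=True, exist_ok=True)

    # Model weights and optimizer state go into a single torch archive.
    torch.save(
        {
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
            "epoch": epoch,
        },
        ckpt_dir / "state.pt",
    )

    # Metrics and hyperparameters are stored as JSON so the UI can show them
    # without loading the (large) tensor file.
    (ckpt_dir / "meta.json").write_text(
        json.dumps(
            {"epoch": epoch, "metrics": metrics, "hparams": hparams, "saved_at": stamp},
            indent=2,
        )
    )
    return ckpt_dir
```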
- Resume Training from Checkpoint
- UI option to resume interrupted training sessions
- Automatically detect available checkpoints for incomplete training runs
- Display checkpoint information (epoch number, loss, timestamp) in the UI
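A minimal sketch of how resuming might work, assuming the directory layout from the saving sketch above; `find_checkpoints` and `resume_from_latest` are hypothetical helpers, not part of any existing API:

```python
import json
from pathlib import Path

import torch


def find_checkpoints(root="checkpoints"):
    """Return (directory, metadata) pairs for all checkpoints, newest first."""
    root = Path(root)
    if not root.exists():
        return []
    dirs = [d for d in root.glob("epoch*-*") if (d / "meta.json").exists()]
    dirs.sort(key=lambda d: d.stat().st_mtime, reverse=True)
    return [(d, json.loads((d / "meta.json").read_text())) for d in dirs]


def resume_from_latest(model, optimizer, root="checkpoints"):
    """Load the newest checkpoint into the given model/optimizer, if one exists."""
    found = find_checkpoints(root)
    if not found:
        return 0  # nothing to resume, start from epoch 0
    ckpt_dir, _meta = found[0]
    state = torch.load(ckpt_dir / "state.pt", map_location="cpu")
    model.load_state_dict(state["model_state"])
    optimizer.load_state_dict(state["optimizer_state"])
    return state["epoch"] + 1  # next epoch to run
```

The metadata returned by `find_checkpoints` is also what the UI could use to list resumable runs with their epoch, loss, and timestamp.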
- Checkpoint Management Dashboard
- View all saved checkpoints in the React UI
- Show checkpoint metadata: creation time, epoch, validation loss, model size
- Delete old/unwanted checkpoints to free up disk space
- Export specific checkpoints for backup or sharing
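The metadata the dashboard would display could be collected by a small backend helper along these lines; `checkpoint_summaries` and the `meta.json` fields are assumptions, and how the result is exposed to the React UI is left open:

```python
import json
from pathlib import Path


def checkpoint_summaries(root="checkpoints"):
    """Collect the per-checkpoint metadata the dashboard would display."""
    rows = []
    for d in Path(root).glob("epoch*-*"):
        meta = json.loads((d / "meta.json").read_text())
        size_mb = sum(f.stat().st_size for f in d.rglob("*") if f.is_file()) / 1e6
        rows.append(
            {
                "path": str(d),
                "created": meta.get("saved_at"),
                "epoch": meta.get("epoch"),
                "val_loss": meta.get("metrics", {}).get("val_loss"),
                "size_mb": round(size_mb, 1),
            }
        )
    return sorted(rows, key=lambda r: r["epoch"] or 0)
```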
- Smart Storage Management
- Configurable retention policy (keep last N checkpoints, keep best checkpoint, etc.)
- Automatic cleanup of old checkpoints to prevent disk space issues
- Disk space usage indicator in the UI
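One possible shape for the retention policy, again assuming the timestamped-directory layout sketched earlier; the policy itself (keep the newest N plus the best validation loss) is only an example:

```python
import json
import shutil
from pathlib import Path


def apply_retention(root="checkpoints", keep_last=3, keep_best=True):
    """Delete checkpoints that fall outside a simple retention policy."""
    dirs = sorted(
        (d for d in Path(root).glob("epoch*-*") if (d / "meta.json").exists()),
        key=lambda d: d.stat().st_mtime,
    )
    keep = set(dirs[-keep_last:])  # newest N by modification time
    if keep_best:
        scored = [
            (d, json.loads((d / "meta.json").read_text()).get("metrics", {}).get("val_loss"))
            for d in dirs
        ]
        scored = [(d, loss) for d, loss in scored if loss is not None]
        if scored:
            keep.add(min(scored, key=lambda t: t[1])[0])  # best validation loss
    for d in dirs:
        if d not in keep:
            shutil.rmtree(d)  # free up disk space
```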
Describe alternatives you've considered
- Manual checkpoint saving: require users to manually save checkpoints, but this is error-prone and doesn't help with unexpected interruptions
- Only save the final model: the current approach, but it doesn't address the core problem of interrupted training
- External checkpoint tools: users could use Hugging Face Trainer callbacks, but this defeats the "no-code" philosophy of ModelForge
Additional context
This feature would significantly improve the user experience, especially for:
- Users in regions with unstable power supply
- Long training sessions on consumer hardware
- Users experimenting with different hyperparameters who want to resume from specific points
- Educational settings where shared machines may need to be freed up
Implementation considerations:
- Leverage Hugging Face Transformers' built-in checkpoint functionality
- Add checkpoint configuration options in the React UI
- Implement background checkpoint saving to minimize training interruption
- Consider compression for checkpoint files to save disk space
- Add progress indicators showing time until next checkpoint
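Hugging Face Transformers already exposes most of the saving and retention knobs through `TrainingArguments`, and `Trainer.train(resume_from_checkpoint=...)` handles resuming; how these get wired into ModelForge's backend is not shown here and would be up to the implementation:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="checkpoints/run-001",  # hypothetical per-run directory
    save_strategy="steps",             # or "epoch"
    save_steps=500,                    # checkpoint every 500 optimizer steps
    save_total_limit=3,                # built-in retention: keep only the newest 3
)

# The Trainer itself (model, datasets, etc.) would come from ModelForge's
# existing pipeline; resuming is then a single call:
#     trainer.train(resume_from_checkpoint=True)
# which picks up the most recent checkpoint in output_dir (a specific
# checkpoint path can also be passed instead of True).
```

Because `save_total_limit` already gives a basic "keep the newest N" policy for free, custom cleanup would mainly be needed for richer policies such as "always keep the best checkpoint".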
Benefits:
- Improved reliability and user confidence
- Better resource utilization (don't waste GPU hours on interrupted training)
- More experimentation-friendly (users can try different approaches without fear)
- Aligns with the "beginner-friendly" and "hackathon-ready" goals of ModelForge