[Optimization] Incremental checkpoint save for dcp on torch 2.7.x (ARM CPU optimization) #1525

Open
tina-wen wants to merge 4 commits into InternLM:main from tina-wen:dcp_save

tina-wen commented Mar 3, 2026

Description

This PR optimizes dcp.save performance on ARM CPUs by implementing incremental metadata saving for torch 2.7.1.

Implementation

  • Incremental save: Only save metadata changes after first checkpoint
  • xtuner framework patch: Added patch_for_dcp_finish config flag
  • API update: Switch to storage_writer/planner for dcp.save
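The incremental-save idea in the list above can be illustrated with a toy model. This is a minimal sketch, not the PR's code: it does not use torch, and `CachedMetadataPlanner`, `metadata_for`, and the shape-only `Layout` type are hypothetical names chosen for illustration. It assumes the expensive step is rebuilding checkpoint metadata, and that the metadata can be reused as long as the tensor layout has not changed since the previous save.

```python
import math
from typing import Dict, Optional, Tuple

# Hypothetical layout description: tensor name -> shape.
Layout = Dict[str, Tuple[int, ...]]


class CachedMetadataPlanner:
    """Toy stand-in for a save planner (assumption, not the PR's class)."""

    def __init__(self) -> None:
        self._cached_layout: Optional[Layout] = None
        self._cached_metadata: Optional[dict] = None
        self.rebuild_count = 0  # how often the expensive path ran

    def _build_metadata(self, layout: Layout) -> dict:
        # Stands in for the costly planning/metadata step of a real save.
        self.rebuild_count += 1
        return {
            name: {"shape": shape, "numel": math.prod(shape)}
            for name, shape in layout.items()
        }

    def metadata_for(self, layout: Layout) -> dict:
        # Rebuild only on the first save or when the layout changed.
        if self._cached_metadata is None or layout != self._cached_layout:
            self._cached_layout = dict(layout)  # copy so later mutation is detected
            self._cached_metadata = self._build_metadata(layout)
        return self._cached_metadata


planner = CachedMetadataPlanner()
layout = {"weight": (4, 8), "bias": (8,)}
first = planner.metadata_for(layout)   # first checkpoint: metadata is built
second = planner.metadata_for(layout)  # later checkpoint: cached metadata is reused
```

In this sketch the second call returns the same object as the first, so the rebuild cost is paid once per layout rather than once per checkpoint.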

Performance

Checkpoint saving performance improved by up to 85%

Compatibility

✅ Works with existing ckpt_save
✅ No precision issues on recovery
✅ No PyTorch/PTA source changes

Copilot AI left a comment

Pull request overview

This PR targets faster distributed checkpoint (DCP) saves on ARM CPUs (torch 2.7.1) by introducing incremental/cached planning and write-result handling, plus an optional monkeypatch to reduce finish-time overhead.

Changes:

  • Add a patch_for_dcp_finish config flag to optionally monkeypatch torch DCP internals.
  • Switch TrainEngine.save_dcp() to use storage_writer + planner on torch 2.7.x via new XtunnerWriter and XtunerCacheSavePlanner.
  • Introduce new engine utilities for caching save plans/metadata and write results.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| xtuner/v1/train/trainer.py | Adds patch_for_dcp_finish config/plumbing to enable a DCP finish monkeypatch. |
| xtuner/v1/patch/torch_dcp_planner.py | Adds a patched _save_state_dict implementation and a function to apply the monkeypatch. |
| xtuner/v1/patch/__init__.py | Exposes the new patch function from the patch package. |
| xtuner/v1/engine/xtuner_storage.py | New FileSystemWriter subclass that can cache write results to reduce repeated overhead. |
| xtuner/v1/engine/xtuner_cache_planner.py | New DefaultSavePlanner subclass that caches global plan/metadata to support incremental saves. |
| xtuner/v1/engine/train_engine.py | Uses the new writer/planner for torch 2.7.x DCP saves. |
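Since the new writer/planner path applies only to torch 2.7.x, some form of version gate is implied. The following is a hedged sketch of one way such a check could look; `is_torch_2_7` is a hypothetical helper, not a function from the PR, and it takes the version as a plain string (for example the value of `torch.__version__`) so that it runs without torch installed.

```python
import re


def is_torch_2_7(version: str) -> bool:
    """Return True for any 2.7.x version string, e.g. '2.7.1' or '2.7.0+cpu'."""
    match = re.match(r"^(\d+)\.(\d+)", version)
    if match is None:
        return False  # unparseable version: stay on the default save path
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) == (2, 7)


use_incremental_path = is_torch_2_7("2.7.1")
```

Comparing parsed integers rather than string prefixes avoids treating a version such as 2.70.0 as 2.7.x.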

tina-wen force-pushed the dcp_save branch 2 times, most recently from 42b4328 to 26e7aa7, on March 10, 2026 07:59
HAOCHENYE added the npu label Mar 11, 2026
4 participants