
Add async checkpoint feature#1703

Open
VincentCheungKokomo wants to merge 1 commit into InternLM:main from VincentCheungKokomo:feature/async-checkpoint

Conversation

@VincentCheungKokomo

Add async DCP checkpoint support

This change adds async checkpoint saving for XTuner v1 training. The trainer
now supports an async_checkpoint option, starts merged async DCP saves for model
and optimizer state, and defers checkpoint metadata finalization until the
background staging/upload futures complete.

The async path writes model and optimizer state into a merged weights/
checkpoint format, while resume keeps compatibility with both the new merged
format and the existing model/optimizer DCP format. Checkpoint metadata is only
registered after async save completion, so failed async saves are not exposed as
resumable checkpoints.

The training engine now creates a dedicated process group for async checkpoint
work, supports merged async save/load helpers, and cleans up the async process
group at trainer shutdown.

Tests and benchmark configs are added to cover async checkpoint intervals and
provide reproducible verification runs for 8B and 30B models.

@VincentCheungKokomo VincentCheungKokomo force-pushed the feature/async-checkpoint branch 2 times, most recently from 7a7136b to 302b6ec on April 23, 2026 at 03:47
Comment thread xtuner/v1/engine/train_engine.py Outdated
from xtuner.v1.utils.grad_norm import cal_grad_norm


if BlockingAsyncStager is not None:
Collaborator

In [2]: fw = FileSystemWriter("./")

In [3]: from torch.distributed.checkpoint.staging import AsyncStager, BlockingAsyncStager

In [4]: isinstance(fw, AsyncStager)
Out[4]: True

is _CachingStagingWriter necessary?

Comment thread xtuner/v1/engine/train_engine.py Outdated
options=_set_options,
)

def load_dcp_merged(
Collaborator

The state dict format should be consistent between async_save and save. If merged_state_dict performs better, just replace the current implementation.

Comment thread xtuner/v1/train/trainer.py Outdated
Comment on lines +540 to +543
self._async_checkpoint = async_checkpoint
self._pending_staging_futures: list[Future] | None = None
self._pending_upload_futures: list[Future] | None = None
self._pending_checkpoint_finalize: _CheckpointFinalize | None = None
Collaborator

Following dcp.async_save, the async interface should return an awaitable future. We can assume there is at most one in-flight async save future in the trainer at any time, and the trainer will always wait for the previous async save to finish before issuing a new one.
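The at-most-one-in-flight contract the reviewer describes can be sketched with a plain `concurrent.futures` future. This is an illustrative stand-in, not XTuner's actual interface: the `Trainer`, `async_save`, and `_do_save` names are hypothetical, and `time.sleep` stands in for DCP staging/upload work.

```python
from __future__ import annotations

import time
from concurrent.futures import Future, ThreadPoolExecutor


class Trainer:
    """Illustrative sketch: at most one in-flight async save at a time."""

    def __init__(self) -> None:
        self._executor = ThreadPoolExecutor(max_workers=1)
        self._save_future: Future | None = None  # at most one in flight
        self.saved: list[int] = []

    def async_save(self, step: int) -> Future:
        # Always wait for the previous async save before issuing a new one,
        # matching the invariant the reviewer proposes.
        if self._save_future is not None:
            self._save_future.result()
        self._save_future = self._executor.submit(self._do_save, step)
        return self._save_future

    def _do_save(self, step: int) -> int:
        time.sleep(0.01)  # stand-in for DCP staging/upload
        self.saved.append(step)
        return step


trainer = Trainer()
for step in (1, 2, 3):
    trainer.async_save(step)
trainer._save_future.result()
print(trainer.saved)  # [1, 2, 3]
```

Because each launch first awaits the previous future, saves complete strictly in order and the trainer never holds more than one pending checkpoint.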

Comment thread xtuner/v1/train/trainer.py Outdated
ckpt_saved = self._maybe_save(is_snapshot=False)
if not ckpt_saved:
_ = self._maybe_save(is_snapshot=True)
checkpoint_time = time.time() - time_before_checkpoint
Collaborator

Just log the checkpoint time in train_engine

@VincentCheungKokomo VincentCheungKokomo force-pushed the feature/async-checkpoint branch 6 times, most recently from 695d2b3 to b6701ef on April 30, 2026 at 08:08
self._async_checkpoint_pg: dist.ProcessGroup | None = None
self._async_state_dict_cache: dict[str, Any] | None = None
if async_checkpoint:
self._async_checkpoint_pg = dist.new_group(backend="gloo")
Collaborator

Please leave a comment to describe why we need a gloo process group here.

Comment thread xtuner/v1/engine/train_engine.py Outdated
Comment on lines +347 to +351
if not hasattr(dcp, "async_save"):
raise RuntimeError(
"dcp.async_save is not available in this PyTorch version. "
"Please upgrade PyTorch or set async_checkpoint=False."
)
Collaborator

unnecessary check.

Comment thread xtuner/v1/engine/train_engine.py Outdated
Comment on lines +359 to +361
cached_has_optim = "optimizer" in self._async_state_dict_cache
if cached_has_optim != save_optimizer:
self._async_state_dict_cache = None
Collaborator

when will this branch be triggered?

Comment thread xtuner/v1/engine/train_engine.py Outdated
if cached_has_optim != save_optimizer:
self._async_state_dict_cache = None
storage_writer = FileSystemWriter(weights_dir, cache_staged_state_dict=True)
storage_writer.state_dict_cache = self._async_state_dict_cache
Collaborator

Is this injection necessary?

Author

cache_staged_state_dict keeps pinned staging buffers on the FileSystemWriter instance. XTuner creates one writer per checkpoint path, so the cache is carried across writers to preserve steady-state async_save launch performance.

Comment on lines +376 to +381
def destroy_async_checkpoint_pg(self) -> None:
"""Destroy the dedicated gloo process group used for async checkpoint."""
self._async_state_dict_cache = None
if self._async_checkpoint_pg is not None:
dist.destroy_process_group(self._async_checkpoint_pg)
self._async_checkpoint_pg = None
Collaborator
@HAOCHENYE HAOCHENYE May 7, 2026
call it in __del__
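The suggestion to tie cleanup to object lifetime can be sketched as an idempotent destroy method that `__del__` also invokes. This is a minimal sketch: the `Engine` class is hypothetical, and a plain `object()` plus a counter stand in for the gloo process group and the real `dist.destroy_process_group` call.

```python
class Engine:
    """Illustrative sketch: idempotent async-checkpoint cleanup wired to __del__."""

    def __init__(self, async_checkpoint: bool):
        # Stand-in for the dedicated gloo process group handle.
        self._async_checkpoint_pg = object() if async_checkpoint else None
        self._async_state_dict_cache: dict | None = {}
        self.destroy_calls = 0

    def destroy_async_checkpoint_pg(self) -> None:
        self._async_state_dict_cache = None
        if self._async_checkpoint_pg is not None:
            self.destroy_calls += 1  # dist.destroy_process_group(pg) in real code
            self._async_checkpoint_pg = None  # guard against double destroy

    def __del__(self):
        # Safe even if cleanup already ran: the None guard makes it idempotent.
        self.destroy_async_checkpoint_pg()


engine = Engine(async_checkpoint=True)
engine.destroy_async_checkpoint_pg()
engine.destroy_async_checkpoint_pg()  # second call is a no-op
print(engine.destroy_calls)  # 1
```

Making the method idempotent matters here because `__del__` may fire after an explicit shutdown call, and destroying a process group twice would be an error in real code.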

Comment thread xtuner/v1/train/trainer.py Outdated
future = self._engine.async_save_dcp(weights_dir=weights_path, save_optimizer=save_optimizer)
t_dcp = time.time() - t_dcp
# Defer metadata save until async save completes.
self._pending_checkpoint = _PendingCheckpoint(
Collaborator

Trainer shouldn't need to know about _CheckpointFinalize. Instead, you can call Future.add_done_callback in TrainEngine so that the future tracks timing correctly. The trainer just needs to wait for it (or await it).
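The callback-based timing the reviewer proposes can be sketched with `concurrent.futures`. The helper name `launch_async_save` is hypothetical, and `time.sleep` stands in for DCP staging plus upload; real code would attach the callback to the future returned by the engine's async save.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor


def launch_async_save(executor: ThreadPoolExecutor, timings: dict) -> Future:
    """Submit a fake save and attach a done-callback that records total time."""
    start = time.perf_counter()
    future = executor.submit(time.sleep, 0.05)  # stand-in for staging + upload

    def _record(fut: Future) -> None:
        # Runs when the save completes, so the measured duration covers the
        # full async save rather than just the launch call on the trainer side.
        timings["async_save_s"] = time.perf_counter() - start

    future.add_done_callback(_record)
    return future


timings: dict = {}
with ThreadPoolExecutor(max_workers=1) as pool:
    fut = launch_async_save(pool, timings)
    fut.result()  # the trainer only waits; the engine owns the timing
print(timings["async_save_s"] >= 0.05)  # True
```

With this shape the trainer never touches timing or finalization state; it just holds one future and waits on it, which is exactly the division of responsibility the review asks for.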

@VincentCheungKokomo VincentCheungKokomo force-pushed the feature/async-checkpoint branch from b6701ef to b8c953d on May 7, 2026 at 09:39
Comment thread xtuner/v1/train/trainer.py Outdated
cur_epoch = self._cur_epoch
train_time_offset = self._train_time + self._train_time_offset

def finalize_checkpoint_metadata() -> None:
Collaborator

Consider keeping the original implementation. Even if dcp hasn't finished saving, it should be fine to save the meta information first. Try to avoid increasing code complexity just to introduce the asynchronous save feature.

Comment thread xtuner/v1/train/trainer.py Outdated
Comment on lines +1245 to +1247
dcp_label = "async_save_dcp"
future = self._engine.async_save_dcp(weights_dir=weights_path, save_optimizer=save_optimizer)
t_dcp = time.time() - t_dcp
Collaborator

The time spent on asynchronous saving should be collected and printed by the engine.

@VincentCheungKokomo VincentCheungKokomo force-pushed the feature/async-checkpoint branch from b8c953d to 1eb91e5 on May 7, 2026 at 10:44
@VincentCheungKokomo VincentCheungKokomo force-pushed the feature/async-checkpoint branch from 1eb91e5 to 962cc16 on May 8, 2026 at 03:19