
Add Feature Universal Checkpoint for AutoTP #7908

Open
nathon-lee wants to merge 15 commits into deepspeedai:master from nathon-lee:feat_uc_autotp

Conversation

@nathon-lee
Contributor

Hi DeepSpeed team — thanks for your time reviewing this PR.

Summary

Add Universal Checkpoint (UC) metadata support for DeepSpeed AutoTP to enable saving and resuming from Universal Checkpoints.

Motivation

AutoTP partitions parameters across TP ranks. To make checkpoints portable and restorable, we need a stable UC metadata representation that can be collected at save time and consumed at restore time.

What’s in this PR

  • Collect AutoTP-specific Universal Checkpoint metadata for TP-partitioned parameters.
  • Provide restore/merge helpers that normalize shapes and correctly interpret the saved conversion/partition view.
  • Keep existing (non-AutoTP / non-UC) checkpoint paths unchanged (no behavior change expected for other users).
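
To make the schema discussion concrete, here is a minimal sketch of what the per-parameter metadata could look like. The field names (partition_type, partition_dim, logical_shape, output_shape, original_shape, replicated, is_bias) are taken from the diff hunks quoted in the review comments below; make_uc_meta itself is a hypothetical helper, and the exact schema is defined in the PR diff.

```python
def make_uc_meta(partition_type, partition_dim, logical_shape,
                 output_shape=None, original_shape=None,
                 replicated=False, is_bias=False):
    """Illustrative: the metadata dict attached to a TP-partitioned parameter."""
    return {
        'partition_type': partition_type,   # 'row' or 'column'
        'partition_dim': partition_dim,     # None when the tensor is replicated
        'logical_shape': logical_shape,     # full, unsharded shape
        'output_shape': output_shape,       # e.g. (out_dim,) for a bias
        'original_shape': original_shape,   # shape before partitioning
        'replicated': replicated,           # True for biases copied to all ranks
        'is_bias': is_bias,
    }

# A column-parallel 4096x1024 weight partitioned along dim 0:
meta = make_uc_meta('column', 0, (4096, 1024), original_shape=(4096, 1024))
assert meta['partition_dim'] == 0
```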

Testing

  • pytest -q tests/unit/runtime/tensor_parallel/test_autotp_universal_checkpoint.py
  • pytest -q tests/unit/checkpoint/test_autotp_universal_checkpoint.py

Request for feedback

Could you please take a look at the UC metadata schema and let me know if you’d prefer any changes to naming, field placement, or compatibility expectations? I’m happy to iterate quickly based on your guidance.


Copilot AI and others added 10 commits February 27, 2026 06:30

  • Revert "fix: update 1 file reformatted." (reverts commit ff88670)
  • Revert accidental Muon optimizer code re-introduction from copilot PRs

Co-authored-by: nathon-lee <248585198+nathon-lee@users.noreply.github.com>
Signed-off-by: nathon-lee <leejianwoo@gmail.com>
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5ab684d950


Comment on lines +462 to +464
conversion_meta = _get_param_uc_conversion_meta(param)
if not conversion_meta:
    continue


P1 Badge Include sharded embeddings in UC metadata collection

AutoTP shards nn.Embedding weights in _slice_embedding() along the hidden dimension (dim=1), but this collector skips any parameter without ds_autotp_universal_checkpoint_meta, so embedding parameters never enter UNIVERSAL_CHECKPOINT_INFO. During conversion they then fall back to the default merge behavior (cat_dim=0 in ds_to_universal.merge_tp_slices), which reconstructs embeddings along the wrong axis and breaks restores when the TP degree changes.

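The axis mismatch described in this finding can be shown with a pure-Python toy. Lists stand in for tensors, and merge_dim0/merge_dim1 are illustrative helpers, not DeepSpeed code.

```python
def merge_dim0(shards):
    # Default merge (cat_dim=0): stacks shards vertically -> wrong vocab size.
    return [row for shard in shards for row in shard]

def merge_dim1(shards):
    # Correct merge for dim=1-sharded embeddings: concatenate each row's slices.
    return [sum((shard[i] for shard in shards), []) for i in range(len(shards[0]))]

# vocab=2, hidden=4, TP=2: each shard holds hidden/2 columns.
shard0 = [[1, 2], [5, 6]]
shard1 = [[3, 4], [7, 8]]

assert merge_dim1([shard0, shard1]) == [[1, 2, 3, 4], [5, 6, 7, 8]]  # (2, 4): correct
assert len(merge_dim0([shard0, shard1])) == 4                        # (4, 2): vocab doubled
```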

Comment on lines +732 to +737
self._set_param_uc_meta(self.weight,
                        partition_type='column',
                        partition_dim=0,
                        logical_shape=original_weight_shape,
                        output_shape=(original_out_dim, ),
                        original_shape=original_weight_shape)


P1 Badge Mark fused-QKV layers with sub-parameter UC metadata

fused_LinearLayer partitions weights with prepare_tp_fused_qkvw() (model-specific Q/K/V reordering), but it inherits this generic LinearLayer metadata path, which records only plain column partitioning and no sub-parameter schema. Converter logic then treats these tensors as simple concat-able shards, producing interleaved QKV layouts that are not portable across different TP sizes.

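A pure-Python toy of the interleaving problem this finding describes. merge_subparams is a hypothetical helper, not DeepSpeed API; 1-D lists stand in for the row dimension of a fused QKV weight.

```python
# Q has 4 rows, K and V have 2 each; TP=2 packs each rank's Q/K/V slices back-to-back.
Q, K, V = list(range(0, 4)), list(range(10, 12)), list(range(20, 22))

rank0 = Q[:2] + K[:1] + V[:1]   # [0, 1, 10, 20]
rank1 = Q[2:] + K[1:] + V[1:]   # [2, 3, 11, 21]

# Naive merge treats the shards as one column-parallel block -> interleaved QKV:
naive = rank0 + rank1
assert naive != Q + K + V

def merge_subparams(shards, sizes_per_rank):
    """Re-join each sub-param (Q, then K, then V) across ranks before concatenating."""
    out = []
    for idx in range(len(sizes_per_rank)):
        start = sum(sizes_per_rank[:idx])
        end = start + sizes_per_rank[idx]
        out += [x for shard in shards for x in shard[start:end]]
    return out

assert merge_subparams([rank0, rank1], [2, 1, 1]) == Q + K + V
```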

@nathon-lee nathon-lee changed the title from "Add Feature Universal Checkpoint autotp" to "Add Feature Universal Checkpoint for AutoTP" on Mar 17, 2026
@PawnOfDelock PawnOfDelock left a comment

📝 Review for PR #7908: AutoTP Universal Checkpoint Support

✅ Overall Impression

This is a well-structured and well-documented PR that adds critical functionality for AutoTP universal checkpoint support. The implementation is clean, tests are comprehensive, and CI passes.


🔍 Code Review

1. Design & Architecture 🎯

  • ✅ Good separation of concerns between restore-time and conversion-time metadata
  • ✅ The dual-view metadata design (top-level for restore, nested conversion for model-level aggregation) is elegant
  • ✅ Backward compatibility is properly preserved (non-AutoTP paths unchanged)

2. Implementation Details 💻

universal_checkpoint.py:

  • _resolve_autotp_partition handles various partition scenarios correctly (row/column, sub-params, replicated)
  • ✅ Clean integration with existing load_hp_checkpoint_state - minimal changes to existing logic
  • ✅ Proper shape normalization and error handling

layers.py:

  • ✅ Consistent _mark_uc_metadata implementation across all TP layer types
  • collect_autotp_universal_checkpoint_info properly aggregates parameter-level metadata into model-level schema
  • ✅ Regex pattern generation is clean and correct

Optimizers (bf16_optimizer.py, stage_1_and_2.py):

  • _enable_universal_checkpoint properly caches UC info from parameters
  • ✅ State dict integration is consistent across both optimizer types

engine.py:

  • ✅ Properly collects and attaches UC info to model after AutoTP partitioning
  • ✅ Checkpoint saving includes UC info

3. Testing 🧪

  • ✅ Test coverage is excellent - both unit and integration tests
  • ✅ Tests cover key scenarios:
    • Row/column parallel weights
    • Subparam partitioning
    • Replicated biases
    • Metadata aggregation
    • Optimizer state handling
  • ✅ Mocking is well-implemented

4. Documentation 📖

  • ✅ Function docstrings are clear and helpful
  • ✅ Code comments explain non-obvious logic
  • ✅ PR description is comprehensive with motivation, implementation details, and testing instructions

🔮 Suggestions & Questions

  1. Edge Cases:

    • Consider adding a test for empty sub_param_sizes scenario in _resolve_autotp_partition
    • Could add a test for mismatched logical_shape vs output_shape expectations
  2. Performance:

    • The metadata collection happens at partition time (good)
    • Consider if collect_autotp_universal_checkpoint_info should be cached if called multiple times
  3. Error Handling:

    • What happens if AUTOTP_UC_META_KEY exists but is malformed? Could add validation
  4. Future Compatibility:

    • The schema design is flexible for future additions - good job

⚠️ Minor Nitpicks

  1. In universal_checkpoint.py, the function signature
     def _resolve_autotp_partition(self, ckpt_dict, full_hp_param, tp_rank, tp_world_size):
     takes self, but _resolve_autotp_partition is not a method; consider making this clearer.
  2. In layers.py, could the regex pattern generation be a utility function to avoid repetition?

These are very minor and don't block approval.
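For the second nitpick, one possible shape for such a utility. uc_param_pattern is a hypothetical name and pattern; the actual regexes generated in layers.py may differ.

```python
import re

def uc_param_pattern(module_name, param_name):
    """Build a regex that matches exactly this module's parameter name,
    escaping dots in the module path so they match literally."""
    return re.compile(rf"^{re.escape(module_name)}\.{re.escape(param_name)}$")

pat = uc_param_pattern("model.layers.0.self_attn.q_proj", "weight")
assert pat.match("model.layers.0.self_attn.q_proj.weight")
assert not pat.match("model.layers.10.self_attn.q_proj.weight")
```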


✅ CI Status

  • ✅ All CI checks pass
  • ✅ Tests run successfully
  • ✅ DCO check passes

🎯 Recommendation

LGTM! 🚀

This PR is ready for merge. The implementation is solid, tests are comprehensive, and it properly integrates with existing checkpoint infrastructure.

Suggested follow-ups (post-merge):

  1. Add e2e integration tests with actual model training/saving/loading
  2. Document the UC metadata schema in user-facing docs
  3. Consider adding a migration guide for existing AutoTP checkpoints

from .constants import (FP32_WEIGHT_KEY, PARAM, VOCAB_TENSOR, CAT_DIM, PARAM_N_SUB_PARAMS, SUB_PARAM_SHAPE)


AUTOTP_UC_META_KEY = 'ds_autotp_universal_checkpoint_meta'
Collaborator

This is not the only place that references the string 'ds_autotp_universal_checkpoint_meta'. The other place is in layers.py, under a different name, DS_AUTOTP_UC_META; the naming is inconsistent (DS_AUTOTP_UC_META looks better). If the string needs to be the same everywhere, it should have a single definition (i.e. in this file) that the other places import (i.e. from deepspeed.checkpoint.universal_checkpoint import DS_AUTOTP_UC_META). There shouldn't be a second place with the content of the string, including tests.

    return getattr(param, AUTOTP_UC_META_KEY, None)


def _resolve_autotp_partition(self, ckpt_dict, full_hp_param, tp_rank, tp_world_size):
Collaborator

The parameter name self is not appropriate for a function that is not a class member (this is also true for load_hp_checkpoint_state, which is already there). If self means the current parameter, can we change its name to current_param or another proper name? Thanks!

is_bias = meta.get('is_bias', False)
replicated = meta.get('replicated', False)

if replicated or partition_dim is None:
Collaborator

Is there a scenario where one of them is True and the other is False? It would probably be better to write the following instead:

if replicated:
    assert(partition_dim is None)
    ...

if replicated or partition_dim is None:
    slice_tensor = full_hp_param
else:
    target_shape = output_shape if is_bias else logical_shape
Collaborator

This line implies that output_shape can differ from logical_shape when is_bias is True. Will this really happen?

if target_shape is None:
    return None

full_view = full_hp_param.view(target_shape)
Collaborator

full_view is not used in this if statement; better to move it down, just before the next if statement.

    return None

full_view = full_hp_param.view(target_shape)
if sub_param_sizes is None and sub_param_shape is not None:
Collaborator

Is there a case where sub_param_sizes cannot be deduced from sub_param_shape, so that the metadata must save sub_param_sizes rather than sub_param_shape? One of them feels redundant.
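If sub_param_shape packs the per-sub-param extents along the concatenation dimension first, as in the test case discussed later in the thread (sub_param_shape = ((2, 2, 2), 4)), the sizes could indeed be deduced. A sketch under that assumption, which may not match the PR's actual encoding:

```python
def sizes_from_sub_param_shape(sub_param_shape):
    """Deduce per-sub-param sizes along the concat dim from the packed shape.
    Assumes the first entry lists each sub-param's extent on that dim."""
    cat_dim_extents = sub_param_shape[0]
    return tuple(cat_dim_extents)

assert sizes_from_sub_param_shape(((2, 2, 2), 4)) == (2, 2, 2)
assert sizes_from_sub_param_shape(((4096, 1024, 1024), 128)) == (4096, 1024, 1024)
```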


for module_name, module in model.named_modules():
    marker = getattr(module, '_mark_uc_metadata', None)
    if callable(marker):
Collaborator

Is it necessary to check whether marker is callable? I don't see how that check could fail unless there is an implementation error.

Contributor Author

Good point. _mark_uc_metadata is intended to be a callable hook; using callable() could hide an implementation error. I’ll remove the callable() guard and simply check for None so incorrect types fail fast.

@@ -0,0 +1,241 @@
import types
Contributor

Thanks for the PR! The implementation looks good overall. Could you please add some documentation and usage descriptions?

Contributor Author

Thanks @inkcherry! Sure — I’ll add documentation and usage descriptions and update the PR shortly.

Signed-off-by: nathon-lee <leejianwoo@gmail.com>
@nathon-lee
Contributor Author

Hi Delock, thanks for the suggestion. I’ve centralized the AutoTP universal-checkpoint metadata attribute name into deepspeed/checkpoint/constants.py as DS_AUTOTP_UC_META, and updated both universal_checkpoint.py and module_inject/layers.py to import and use it (so we no longer hardcode the string in multiple places).

Please let me know if you’d prefer I split the other _resolve_autotp_partition changes into a separate PR to keep this one focused.
@delock

Signed-off-by: nathon-lee <leejianwoo@gmail.com>


def _write_tp_states(base_dir, param_name, tp_idx, fp32_tensor):
    # merge_tp_slices will try to merge these three states, so the test must write all of them out
Collaborator

This might not matter, but I would suggest keeping comments in plain English for consistency.

partition_dim=0,
name="packed")

weight_meta = getattr(layer.weight, "ds_autotp_universal_checkpoint_meta")
Collaborator

I would suggest importing these strings rather than hard-coding them.

chunks = [sub.chunk(2, dim=0)[0] for sub in full_hp_param.view(3, 2, 4)]
expected = torch.cat(chunks, dim=0).flatten()
assert torch.equal(slice_flat, expected)

Collaborator

Is it possible to add a test where the sub-param sizes are not equal to each other? This would cover the GQA case.

  sub_param_sizes describes the case where multiple sub-parameters have unequal sizes along the concatenation dimension.

  The typical scenario is a fused QKV linear where a model's Q/K/V head counts differ:
  Q: 32 heads × 128 dim = 4096
  K: 8 heads × 128 dim = 1024
  V: 8 heads × 128 dim = 1024
  Concatenated, sub_param_shape = (4096, 1024, 1024), 6144 in total.

  When slicing for TP, naively dividing 6144 into N equal parts would cut through the Q/K/V boundaries. The correct approach is to split each sub-parameter evenly on its own:
  At TP=2:
    Q: 4096/2 = 2048
    K: 1024/2 = 512
    V: 1024/2 = 512
    Each rank gets 3072.

  sub_param_sizes tells the restore logic how large each sub-parameter is, so the slicing can be done per sub-parameter instead of evenly over the whole tensor.

  In _resolve_autotp_partition, when sub_param_sizes is not None, the code chunks/splits each sub-parameter separately, takes the slice for the given tp_rank, and concatenates the pieces back together. When sub_param_sizes is None (all sub-parameters are the same size, or there are no sub-parameters at all), a plain whole-tensor chunk is enough.

  So the missing test case is: when sub-parameter sizes are unequal, does restore split each one by its own size correctly? Existing test 2 uses sub_param_shape = ((2, 2, 2), 4), where the three sub-parameters are equal in size (all 2), so the unequal-size case is not covered.
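The per-sub-param slicing described above can be sketched in a few lines. tp_slice_subparams is an illustrative helper, not the PR's implementation; a 1-D list stands in for the flattened fused parameter.

```python
def tp_slice_subparams(full, sub_param_sizes, tp_rank, tp_world_size):
    """Split each sub-param evenly across ranks, then concat this rank's pieces."""
    out, offset = [], 0
    for size in sub_param_sizes:
        assert size % tp_world_size == 0, "each sub-param must divide evenly"
        per_rank = size // tp_world_size
        start = offset + tp_rank * per_rank
        out += full[start:start + per_rank]
        offset += size
    return out

# Q=4, K=2, V=2 elements (stand-ins for 4096/1024/1024), TP=2:
full = ['q0', 'q1', 'q2', 'q3', 'k0', 'k1', 'v0', 'v1']
assert tp_slice_subparams(full, [4, 2, 2], tp_rank=0, tp_world_size=2) == ['q0', 'q1', 'k0', 'v0']
assert tp_slice_subparams(full, [4, 2, 2], tp_rank=1, tp_world_size=2) == ['q2', 'q3', 'k1', 'v1']
```

Naively chunking the same list into two halves would instead give rank 0 ['q0', 'q1', 'q2', 'q3'], cutting through the Q/K/V boundaries.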

@delock
Collaborator

delock commented Mar 18, 2026

Hi @nathon-lee , thanks for the timely response! No need to split.

I have given my comments. Overall it LGTM. I agree with @inkcherry that documentation better be updated, specifically, universal-checkpointing.md, autotp-training.md, and model-checkpointing.rst.


Signed-off-by: nathon-lee <leejianwoo@gmail.com>
Signed-off-by: nathon-lee <leejianwoo@gmail.com>

fix: update some logic for _resolve_autotp_partition

Signed-off-by: nathon-lee <leejianwoo@gmail.com>

fix: update some logic for test_load_hp_checkpoint_state_prefers_autotp_metadata

Signed-off-by: nathon-lee <leejianwoo@gmail.com>
@nathon-lee
Contributor Author

Hi @delock, thank you for the review and LGTM!

Totally agree — I’ll update universal-checkpointing.md, autotp-training.md, and model-checkpointing.rst accordingly and push the doc changes shortly.
@delock

Signed-off-by: nathon-lee <leejianwoo@gmail.com>

docs: update universal checkpointing and AutoTP checkpoint docs

Signed-off-by: nathon-lee <leejianwoo@gmail.com>