[Training Camp] Checkpoint loading tool by ArcaLunar · Pull Request #129 · InfiniTensor/InfiniTrain

ArcaLunar · 2026-03-17T02:20:03Z

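Main parameters of the checkpoint loading tool:

--checkpoint_dir: directory where checkpoints are saved during training
--save_steps: save a checkpoint every N steps; 0 disables saving
--max_checkpoint_keep: keep at most K checkpoints
--save_optimizer_state: whether to save the optimizer state
--resume_from: resume training from the specified checkpoint directory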
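
How `--save_steps` and `--max_checkpoint_keep` interact can be sketched as below. This is only an illustration and not the PR's implementation: the helper names `ShouldSave`, `CheckpointDirName`, and `StepsToPrune` are hypothetical, and the directory naming merely mirrors the `checkpoint_step_000040` pattern that appears in the experiment commands. The sketch also does not model any special meaning the real flag may give to `--max_checkpoint_keep 0`.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: true when a checkpoint should be written at `step`.
// save_steps == 0 disables saving, as described for --save_steps.
bool ShouldSave(std::int64_t step, std::int64_t save_steps) {
    return save_steps > 0 && step > 0 && step % save_steps == 0;
}

// Hypothetical helper: directory name for a step, zero-padded to six digits
// (assumed from the "checkpoint_step_000040" path used in the experiment).
std::string CheckpointDirName(std::int64_t step) {
    char buf[64];
    std::snprintf(buf, sizeof(buf), "checkpoint_step_%06lld",
                  static_cast<long long>(step));
    return std::string(buf);
}

// Hypothetical helper: given the steps that already have a checkpoint,
// return the oldest ones that exceed the max_keep budget (oldest first).
std::vector<std::int64_t> StepsToPrune(std::vector<std::int64_t> saved_steps,
                                       std::size_t max_keep) {
    std::sort(saved_steps.begin(), saved_steps.end());
    if (saved_steps.size() <= max_keep) {
        return {};
    }
    // Keep the newest max_keep entries; everything before them is pruned.
    saved_steps.resize(saved_steps.size() - max_keep);
    return saved_steps;
}
```

With `--save_steps 20` and `--max_checkpoint_keep 3` over 100 steps, this sketch would save at steps 20, 40, 60, 80, and 100 and mark steps 20 and 40 for removal.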

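Checkpoint files can be produced by training from the original model parameters under /data/shared/....../llmc/gpt2 (or llama3); see REPORT.md in the repository for an example. (The experiment also tested llama3, but only the GPT2 training commands were recorded.) model.bin, optimizer.bin, and trainer_state.json can all be obtained from a training run, so they are not provided as attachments.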

Experiment

CUDA_VISIBLE_DEVICES=5,6,7 ./gpt2 --input_bin ../../data/llmc/gpt2/tinyshakespeare/tiny_shakespeare_train.bin --llmc_filepath ../../data/llmc/gpt2/gpt2_124M.bin --checkpoint_dir ../ckpt2/gpt2-noresume/ --num_iteration 100 --save_steps 20 --save_optimizer_state true --max_checkpoint_keep 10

CUDA_VISIBLE_DEVICES=5,6,7 ./gpt2 --input_bin ../../data/llmc/gpt2/tinyshakespeare/tiny_shakespeare_train.bin --llmc_filepath ../../data/llmc/gpt2/gpt2_124M.bin --checkpoint_dir ../ckpt2/gpt2-resumefrom40/ --num_iteration 100 --save_steps 20 --save_optimizer_state true --max_checkpoint_keep 10 --resume_from ../ckpt2/gpt2-noresume/checkpoint_step_000040/ > ../ckpt2/gpt2-resumefrom40/gpt2-resume.log 2>&1

(The same two training commands were also run with llama3.)

Running compare_loss.py on the llama3 model: because training was resumed from step 40, the data for steps 1~40 is missing, while the loss of the remaining 60 steps matches under both FP32 and BF16.

  Summary: 60/100 steps matched

==================================================
Overall Summary:
  fp32:    0/1 test cases passed (threshold: 1e-05)
  bfloat16: 0/0 test cases passed (threshold: 1e-02)
  Total:   0/1 test cases passed
==================================================

==================================================
Overall Summary:
  fp32:    0/0 test cases passed (threshold: 1e-05)
  bfloat16: 0/1 test cases passed (threshold: 1e-02)
  Total:   0/1 test cases passed
==================================================
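
One reading consistent with the output above is that a step counts as matched only when it appears in both logs and the two losses agree within the threshold, so the 40 steps missing from the resumed run keep the test case from passing. The sketch below illustrates that reading in C++. It is not compare_loss.py itself: the names `CompareResult` and `CompareLosses` are hypothetical, and treating the threshold as an absolute difference is an assumption.

```cpp
#include <cmath>
#include <cstddef>
#include <map>

// Hypothetical result type for a per-step loss comparison.
struct CompareResult {
    std::size_t matched = 0;  // steps present in both logs and within threshold
    std::size_t total = 0;    // steps present in the reference log
    bool Passed() const { return total > 0 && matched == total; }
};

// Compare two step -> loss maps. A step missing from `candidate` is counted
// as not matched. The threshold is applied to the absolute difference
// (an assumption; the real script may use a different criterion).
CompareResult CompareLosses(const std::map<int, double>& reference,
                            const std::map<int, double>& candidate,
                            double threshold) {
    CompareResult result;
    result.total = reference.size();
    for (const auto& entry : reference) {
        const auto it = candidate.find(entry.first);
        if (it == candidate.end()) {
            continue;  // no data for this step in the resumed run
        }
        if (std::fabs(it->second - entry.second) <= threshold) {
            ++result.matched;
        }
    }
    return result;
}
```

Under these assumptions, a 100-step reference log compared against a run that only logged steps 41 to 100 gives 60 matched steps out of 100, and the test case is reported as not passed even though every overlapping step agrees.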

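For GPT2, the model-saving logic was wrong: during training lm_head and wte are not truly shared, yet the LLMC save/load path treats them as shared, so after a resume lm_head can easily differ from the noresume run. The fix is to switch the training checkpoint from the LLMC callback path to the native StateDict binary path and to explicitly rebuild the weight-tying semantics after loading (example/gpt2/main.cc). With this fix, the GPT2 comparison also passes.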
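
The difference between two parameters that merely hold equal values and two parameters that are actually tied can be shown with a toy model. This is a minimal sketch, not InfiniTrain code: `ToyGPT2`, `LoadStateDict`, `TieWeights`, and the `"wte"` / `"lm_head"` keys are hypothetical stand-ins, and tensors are reduced to `std::vector<float>`.

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

using Tensor = std::vector<float>;
using StateDict = std::map<std::string, Tensor>;

// Toy stand-in for a model with an embedding (wte) and an output head.
struct ToyGPT2 {
    std::shared_ptr<Tensor> wte = std::make_shared<Tensor>();
    std::shared_ptr<Tensor> lm_head = std::make_shared<Tensor>();

    // A plain load gives every parameter its own storage, so wte and
    // lm_head end up independent even if their values are identical.
    void LoadStateDict(const StateDict& sd) {
        wte = std::make_shared<Tensor>(sd.at("wte"));
        lm_head = std::make_shared<Tensor>(sd.at("lm_head"));
    }

    // Explicitly rebuild the tying: lm_head now aliases wte's storage.
    void TieWeights() { lm_head = wte; }

    bool IsTied() const { return wte == lm_head; }
};
```

After `LoadStateDict` alone, an update applied to `wte` is not visible through `lm_head`; after `TieWeights`, both names refer to the same storage, so the two can no longer drift apart.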

JYMiracle305 · 2026-03-17T02:28:12Z

Please squash all commits into a single one via rebase.
Also, please resolve the current code conflicts.

ArcaLunar · 2026-03-17T03:17:09Z

Cleaned up the commit history (keeping the upstream merge) and resolved the conflicts.

JYMiracle305 · 2026-03-17T06:39:58Z

Please fix the format check failure.

ArcaLunar · 2026-03-17T09:41:43Z

Formatted example/ and infini_train/.

ArcaLunar · 2026-03-17T13:29:36Z

It passes locally, so I am not sure why the check failed. Could it be a clang-format version issue?

format: use clang-format-16 instead

ArcaLunar · 2026-03-17T13:53:45Z

Found the problem: it really was clang-format. After installing clang-format-16 via pip, there were indeed files that had not been formatted. I have re-committed, so it should be fine now.

JYMiracle305 · 2026-03-18T07:36:55Z


+    int start_step = 0;
+    float best_loss = std::numeric_limits<float>::infinity();
+    if (!FLAGS_resume_from.empty()) {


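Consider extracting the restore, save, and old-checkpoint cleanup logic from the main flow into shared functions, so that the main flow stays concise and the individual training entry points can reuse them.

If there are too many parameters, bundle them in a struct.

Extracted the resume logic in main.cc into infini_train::ResumeFromCheckpoint(); start_step and the other values are retrieved via std::tie(), and the remaining state is restored through reference parameters. A quick test with llama3 shows that the loss can be reproduced.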

JYMiracle305 · 2026-03-18T13:40:35Z

+DEFINE_string(checkpoint_dir, "./checkpoints", "root directory used to store checkpoints");
+DEFINE_uint32(max_checkpoint_keep, 3, "max number of checkpoint steps to keep");
+DEFINE_bool(save_optimizer_state, true, "whether optimizer state is persisted in checkpoints");
+DEFINE_string(checkpoint_format, "bin", "checkpoint format: bin|pth");


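use_llmc_checkpoint_io is not set in the llama3 training. Was that deliberate, or was it missed?

Ideally use_llmc_checkpoint_io would not exist at all. However, the TODO in GPT2 FromLLMC() notes that wte and lm_head are not truly shared weights, and FromLLMC() cannot be changed because it has to read the original gpt2_124M.bin data. SaveAsLLMC() likewise does not write lm_head, to stay compatible with the original LLMC file. As a result, resuming training restores incorrect weights, which makes the loss inconsistent. That is why this flag was added for GPT2 training, to use the fallback save & load.

With the current design, LLaMA3 save and load depend mainly on FLAGS_checkpoint_format, while GPT2 depends on both FLAGS_checkpoint_format and FLAGS_use_llmc_checkpoint_io. Given that, is FLAGS_use_llmc_checkpoint_io really necessary? Using the file-format flag as the single control should be enough. For now, add the same explanation to GPT2::SaveAsLLMC and FromLLMC().

Removed use_llmc_checkpoint_io from gpt2/main.cc.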
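
Driving save and load from the format flag alone could look roughly like the sketch below. It is an illustration under assumptions, not the PR's code: `CheckpointFormat`, `ParseCheckpointFormat`, and `ModelFileName` are hypothetical names, the accepted values `bin` and `pth` come from the flag's help text, and `model.pth` is an assumed file name (only `model.bin` is mentioned in the PR description).

```cpp
#include <string>

// Hypothetical enum mirroring the "bin|pth" values of --checkpoint_format.
enum class CheckpointFormat { kBin, kPth };

// Hypothetical helper: parse the flag value. Returns false for an
// unrecognized value and leaves *out untouched in that case.
bool ParseCheckpointFormat(const std::string& text, CheckpointFormat* out) {
    if (text == "bin") {
        *out = CheckpointFormat::kBin;
        return true;
    }
    if (text == "pth") {
        *out = CheckpointFormat::kPth;
        return true;
    }
    return false;
}

// Hypothetical helper: model file name for a format. "model.pth" is an
// assumption made for this sketch.
std::string ModelFileName(CheckpointFormat format) {
    return format == CheckpointFormat::kBin ? "model.bin" : "model.pth";
}
```

With a single parsed value, both the GPT2 and LLaMA3 entry points can branch on the same enum instead of consulting a second model-specific flag.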

remove redundent arguments

JYMiracle305 · 2026-03-25T14:39:10Z

    ifs.seekg(base + std::streamoff(len * sizeof(float)));
 }

+std::tuple<int, float, size_t> ResumeFromCheckpoint(


Use a struct for both the parameters and the return value here.

Done: the inputs and outputs are now packed into structs.
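
The shape of such a struct-based interface might look like the sketch below. Everything here is hypothetical and simplified: `ResumeArgs`, `ResumeResult`, `SavedTrainerState`, and `ResumeFromCheckpointSketch` are illustrative names, the model, optimizer, and data loader arguments of the real function are omitted, and the saved state is passed in directly instead of being read from disk. The defaults follow the diff above (`start_step = 0`, `best_loss = infinity`, and no resume when `resume_from` is empty).

```cpp
#include <cstddef>
#include <limits>
#include <string>

// Hypothetical stand-in for what a checkpoint's trainer state would hold.
struct SavedTrainerState {
    int global_step = 0;
    float best_loss = 0.0f;
    std::size_t data_batch_idx = 0;
};

// Hypothetical input bundle (the real one would also carry model,
// optimizer, data loader, and load options).
struct ResumeArgs {
    std::string resume_from;  // empty means "start from scratch"
    int rank = 0;
};

// Named fields replace the positional std::tuple<int, float, size_t>.
struct ResumeResult {
    int start_step = 0;
    float best_loss = std::numeric_limits<float>::infinity();
    std::size_t saved_data_batch_idx = 0;
};

ResumeResult ResumeFromCheckpointSketch(const ResumeArgs& args,
                                        const SavedTrainerState& saved) {
    ResumeResult result;
    if (args.resume_from.empty()) {
        return result;  // fresh run: keep the defaults
    }
    result.start_step = saved.global_step;
    result.best_loss = saved.best_loss;
    result.saved_data_batch_idx = saved.data_batch_idx;
    return result;
}
```

Compared with a tuple unpacked through `std::tie()`, named fields make the call site independent of field order and let new values be added without touching every caller.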

JYMiracle305 · 2026-03-25T14:42:03Z

+    std::tie(start_step, best_loss, saved_data_batch_idx) = infini_train::ResumeFromCheckpoint(
+        FLAGS_resume_from, rank, model, optimizer, train_loader, state, train_iter, load_options);
+
+    auto save_checkpoint = [&](const std::filesystem::path &save_dir, int64_t global_step,


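The body of this lambda could also be extracted into a function (something like SaveCheckpoint) in utils.cc; the lambda would then only build a parameter struct and call SaveCheckpoint.

Extracted the save_checkpoint logic into utils.h:SaveCheckpoint(); the lambda in the main program now only gathers information, builds the arguments, and calls the function.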
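
The resulting split between a thin lambda and a shared function can be sketched as follows. This is not the code in utils.h: `SaveCheckpointArgs` and its fields are hypothetical, and the stand-in `SaveCheckpoint` only returns the paths a real implementation would be expected to write (the file names model.bin, optimizer.bin, and trainer_state.json come from the PR description).

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical argument bundle; the field names are illustrative only.
struct SaveCheckpointArgs {
    std::string save_dir;
    std::int64_t global_step = 0;
    float last_loss = 0.0f;
    bool save_optimizer_state = true;
};

// Stand-in for the shared function: instead of serializing tensors, it
// lists the files that would be written for this checkpoint.
std::vector<std::string> SaveCheckpoint(const SaveCheckpointArgs& args) {
    std::vector<std::string> files;
    files.push_back(args.save_dir + "/model.bin");
    if (args.save_optimizer_state) {
        files.push_back(args.save_dir + "/optimizer.bin");
    }
    files.push_back(args.save_dir + "/trainer_state.json");
    return files;
}

// Mimics a training entry point: the lambda only gathers local state,
// builds the argument struct, and delegates to SaveCheckpoint().
std::vector<std::string> TrainingEntrySketch(bool save_optimizer_state) {
    float last_loss = 3.25f;  // placeholder for a value from the training loop
    auto save_checkpoint = [&](const std::string& save_dir,
                               std::int64_t global_step) {
        SaveCheckpointArgs args;
        args.save_dir = save_dir;
        args.global_step = global_step;
        args.last_loss = last_loss;
        args.save_optimizer_state = save_optimizer_state;
        return SaveCheckpoint(args);
    };
    return save_checkpoint("ckpt/checkpoint_step_000020", 20);
}
```

Keeping the lambda this small means each training entry point only decides what to capture, while the serialization logic lives in one place.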

format files

ArcaLunar added 2 commits March 17, 2026 11:03

feat: checkpoint save & load

146bd1d

merge upstream/master

3b13af4

ArcaLunar force-pushed the master branch from d51daf3 to 3b13af4 Compare March 17, 2026 03:11

kilinchange requested a review from JYMiracle305 March 17, 2026 06:16

kilinchange assigned JYMiracle305 Mar 17, 2026

format: format files in examples and infini_train

4b248b6

format: use clang-format-16 instead

ArcaLunar force-pushed the master branch from 9c3e38c to 4b248b6 Compare March 17, 2026 13:52

JYMiracle305 reviewed Mar 18, 2026


ArcaLunar force-pushed the master branch from b453988 to 2537e48 Compare March 23, 2026 11:35

feat: extract resuming to utils

08ed56b

remove redundent arguments

ArcaLunar force-pushed the master branch from 2537e48 to 08ed56b Compare March 23, 2026 11:40

JYMiracle305 reviewed Mar 25, 2026


feat: extract similar logic in ckpt_save

97bd747

format files

ArcaLunar force-pushed the master branch from f7e5cd7 to 97bd747 Compare March 27, 2026 06:29
