
Conversation

@JYMiracle305
Contributor

No description provided.

@JYMiracle305 JYMiracle305 force-pushed the add_1F1B branch 3 times, most recently from 496bbfd to 7108a12 Compare December 16, 2025 14:54
@JYMiracle305 JYMiracle305 force-pushed the add_1F1B branch 2 times, most recently from 3726518 to 9af4751 Compare December 22, 2025 09:04
@JYMiracle305
Contributor Author

JYMiracle305 commented Dec 22, 2025

Added a new hyperparameter virtual_pipeline_parallel (vpp_size), the number of virtual chunks each pipeline stage is split into in the PP scenario: the model is split into pp_size * vpp_size chunks, which are assigned to the corresponding devices. The refactor also unifies the interface the different scheduling strategies expose to the upper layer. When the PipelineParallelScheduler is constructed, the task table is filled according to the chosen strategy; each entry in the table is a sub-task (associated with a chunk, a micro-batch, and whether it is a forward or backward pass). During training the upper layer calls StepMicroBatches, which internally iterates over the task table.

When virtual_pipeline_parallel is 1, the schedule looks like this:
[image: schedule diagram]

When virtual_pipeline_parallel is greater than 1, the schedule looks like this:
[image: schedule diagram]
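
A minimal standalone sketch of the task-table idea described above. The names (Task, TaskType, BuildTaskTable) are illustrative, not the identifiers used in this PR, and the order shown is a plain all-forward-then-all-backward schedule rather than the 1F1B order the scheduler actually emits.

// Illustrative sketch only: each table entry ties together a chunk index,
// a micro-batch index, and the pass direction, as described above.
#include <cstdio>
#include <vector>

enum class TaskType { kForward, kBackward };

struct Task {
    TaskType type;       // forward or backward pass
    int chunk_idx;       // local chunk on this stage (0 .. vpp_size - 1)
    int micro_batch_idx; // micro-batch this sub-task operates on
};

// Fill a task table with a simple all-forward-then-all-backward order.
std::vector<Task> BuildTaskTable(int num_micro_batches, int vpp_size) {
    std::vector<Task> table;
    for (int c = 0; c < vpp_size; ++c) {
        for (int m = 0; m < num_micro_batches; ++m) { table.push_back({TaskType::kForward, c, m}); }
    }
    for (int c = vpp_size - 1; c >= 0; --c) {
        for (int m = 0; m < num_micro_batches; ++m) { table.push_back({TaskType::kBackward, c, m}); }
    }
    return table;
}

int main() {
    // StepMicroBatches would walk a table like this one, dispatching each sub-task.
    for (const auto &t : BuildTaskTable(/*num_micro_batches=*/4, /*vpp_size=*/2)) {
        std::printf("%s chunk=%d mb=%d\n", t.type == TaskType::kForward ? "F" : "B", t.chunk_idx, t.micro_batch_idx);
    }
    return 0;
}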

float lossf = StepMicroBatches(micro_batches, target_mbs, loss_fn, dtype);
LOG(INFO) << "=== Schedule Table ===";
LOG(INFO) << "n=" << n << ", stages=" << num_stages << ", vpp=" << vpp_size
          << ", total_chunks=" << total_global_chunks;
Collaborator

To improve readability, please use format to build the string.
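
A minimal sketch of this suggestion, assuming C++20 std::format is available (fmt::format would be the drop-in alternative); in the actual code the formatted string would be streamed into LOG(INFO).

// Illustrative sketch: build the whole message in one format call
// instead of chaining stream insertions.
#include <format>
#include <iostream>

int main() {
    // Illustrative values; in the PR these come from the scheduler.
    int n = 8, num_stages = 4, vpp_size = 2, total_global_chunks = 8;
    std::cout << std::format("n={}, stages={}, vpp={}, total_chunks={}", n, num_stages, vpp_size, total_global_chunks)
              << '\n';
    return 0;
}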

Contributor Author

OK

struct StageInfo {
    bool is_first_stage;
    bool is_last_stage;
    std::vector<std::pair<int, int>> layer_chunks;
Collaborator

Add a comment explaining that this vector stores the start/end layer positions of each chunk, and consider a more intuitive name, e.g. chunk_layer_ranges.
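
A small declaration sketch of the proposed rename and documenting comment; only the field name and the comment are new, everything else mirrors the struct under review.

// Illustrative sketch of the suggested change.
#include <utility>
#include <vector>

struct StageInfo {
    bool is_first_stage = false;
    bool is_last_stage = false;
    // One (begin, end) pair of layer indices for each chunk owned by this stage.
    std::vector<std::pair<int, int>> chunk_layer_ranges;
};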

Contributor Author

OK


std::vector<std::shared_ptr<nn::Module>> chunk_blocks;
int current_index = 0;
for (auto it = h_layers->begin(); it != h_layers->end(); ++it, ++current_index) {
Collaborator

Overload an index operator for the ModuleList type, so the corresponding layer can be obtained here directly by index.
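
A standalone sketch of what is being asked for: an operator[] on a ModuleList-like container so the loop can fetch a layer by index. Module and ModuleList here are simplified stand-ins, not the framework classes.

// Illustrative sketch with stand-in types.
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

struct Module {
    virtual ~Module() = default;
};

class ModuleList {
public:
    void Append(std::shared_ptr<Module> m) { modules_.push_back(std::move(m)); }

    // Index access: return the idx-th sub-module directly.
    std::shared_ptr<Module> operator[](std::size_t idx) const { return modules_.at(idx); }

    std::size_t size() const { return modules_.size(); }

private:
    std::vector<std::shared_ptr<Module>> modules_;
};

int main() {
    ModuleList h_layers;
    h_layers.Append(std::make_shared<Module>());
    // Instead of advancing an iterator, callers can now index directly.
    auto layer = h_layers[0];
    (void)layer;
    return 0;
}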

Contributor Author

OK

Forward(const std::vector<std::shared_ptr<infini_train::Tensor>> &x) override;
};

class GPT2Chunk {
Collaborator

Have GPT2Chunk inherit from Module:

class GPT2Chunk : public Module {
public:
  GPT2Chunk(
    GPT2* parent,
    int layer_begin,
    int layer_end,
    bool has_embedding,
    bool has_lm_head
  );

  std::vector<std::shared_ptr<Tensor>>
  Forward(const std::vector<std::shared_ptr<Tensor>>& x) override;

private:
  GPT2* parent_ = nullptr;
  int layer_begin_ = 0;
  int layer_end_ = 0;
  bool has_embedding_ = false;
  bool has_lm_head_ = false;
};

Contributor

This definition is actually similar for all Transformer structures. It feels like it could be factored out into a class TransformerChunk : public Module, with the definition placed in the pp folder; then each model's net.cc would define a class GPT2 : public TransformerChunk and only need to override Forward.
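
A sketch of the proposed refactor, using simplified stand-in Module/Tensor types: a generic TransformerChunk base that would live under the pp folder, with the model-specific chunk in net.cc overriding only Forward.

// Illustrative sketch; not the framework's actual class hierarchy.
#include <memory>
#include <vector>

struct Tensor {};

struct Module {
    virtual ~Module() = default;
    virtual std::vector<std::shared_ptr<Tensor>> Forward(const std::vector<std::shared_ptr<Tensor>> &x) = 0;
};

// Generic chunk: holds the layer range and whether this chunk owns the embedding / LM head.
class TransformerChunk : public Module {
public:
    TransformerChunk(int layer_begin, int layer_end, bool has_embedding, bool has_lm_head)
        : layer_begin_(layer_begin), layer_end_(layer_end), has_embedding_(has_embedding), has_lm_head_(has_lm_head) {}

protected:
    int layer_begin_ = 0;
    int layer_end_ = 0;
    bool has_embedding_ = false; // first chunk carries the embedding
    bool has_lm_head_ = false;   // last chunk carries the LM head
};

// Model-specific chunk: only Forward needs to be overridden.
class GPT2Chunk : public TransformerChunk {
public:
    using TransformerChunk::TransformerChunk;

    std::vector<std::shared_ptr<Tensor>> Forward(const std::vector<std::shared_ptr<Tensor>> &x) override {
        // GPT-2-specific forward over layers [layer_begin_, layer_end_) would go here.
        return x;
    }
};

int main() {
    GPT2Chunk chunk(/*layer_begin=*/0, /*layer_end=*/6, /*has_embedding=*/true, /*has_lm_head=*/false);
    std::vector<std::shared_ptr<Tensor>> x;
    auto y = chunk.Forward(x);
    (void)y;
    return 0;
}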

Contributor Author

OK

}

std::tuple<bool, bool, int, int> PipelineParallel::GetStageInfo(int total_layers, int pp_size, int pp_rank) {
StageInfo PipelineParallel::GetStageInfo(int total_layers, int pp_size, int chunks_per_stage) {
Collaborator

pp_rank should still be passed in from net.cc; try to keep the scope in which thread_local variables are used as small as possible.

Storing pp_rank/tp_rank in thread_local variables is only a temporary solution: if a thread spawns child threads, those threads do not inherit these variables, so this approach is unsafe. Everywhere else in the framework we obtain this information from the rank data structure stored in the device whenever possible, but at model-initialization time the device has not been created yet, so we only do it this way here. To avoid the unsafety brought by having thread_local variables everywhere, we will later need to develop a thread pool that takes over all threads the framework spawns and, in a unified way, propagates these thread_local variables to the threads that need them.
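
A sketch of the requested change with illustrative signatures: net.cc reads the thread_local pp_rank once and passes it in explicitly, so GetStageInfo itself never touches the thread_local.

// Illustrative sketch; signatures and the StageInfo layout are assumptions.
#include <utility>
#include <vector>

struct StageInfo {
    bool is_first_stage = false;
    bool is_last_stage = false;
    std::vector<std::pair<int, int>> chunk_layer_ranges;
};

namespace nn::parallel {
thread_local int pp_rank = 0; // temporary mechanism, per the discussion above

// pp_rank is an explicit parameter rather than being read from the thread_local here.
StageInfo GetStageInfo(int total_layers, int pp_size, int pp_rank, int vpp_size) {
    StageInfo info;
    info.is_first_stage = (pp_rank == 0);
    info.is_last_stage = (pp_rank == pp_size - 1);
    // ... fill chunk_layer_ranges from total_layers, pp_size and vpp_size ...
    (void)total_layers;
    (void)vpp_size;
    return info;
}
} // namespace nn::parallel

// In net.cc: read the thread_local once at the call site and pass it down.
int main() {
    const int pp_rank = nn::parallel::pp_rank;
    auto info = nn::parallel::GetStageInfo(/*total_layers=*/12, /*pp_size=*/4, pp_rank, /*vpp_size=*/2);
    (void)info;
    return 0;
}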

Contributor Author

OK

std::vector<std::shared_ptr<infini_train::Tensor>>
Forward(const std::vector<std::shared_ptr<infini_train::Tensor>> &x) override;

void BuildChunks();
Collaborator

BuildChunks should return all the chunks obtained from splitting the stage; when constructing the PipelineStage, call the module's BuildChunks method and store all the chunks inside the PipelineStage.
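
A sketch of this flow with simplified stand-in types: the module's BuildChunks returns the chunk list and PipelineStage stores it at construction time.

// Illustrative sketch; not the framework's actual classes.
#include <cstddef>
#include <memory>
#include <vector>

struct Module {
    virtual ~Module() = default;
    // Returns the chunks this stage owns after virtual-pipeline splitting.
    virtual std::vector<std::shared_ptr<Module>> BuildChunks() = 0;
};

class PipelineStage {
public:
    explicit PipelineStage(const std::shared_ptr<Module> &model) : chunks_(model->BuildChunks()) {}

    std::size_t num_chunks() const { return chunks_.size(); }

private:
    std::vector<std::shared_ptr<Module>> chunks_; // one entry per local chunk
};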

Contributor Author

OK

return model_->Forward(inputs);
std::vector<std::shared_ptr<Tensor>> PipelineStage::ForwardOneChunk(const std::vector<std::shared_ptr<Tensor>> &inputs,
int local_chunk_idx) {
return model_->ForwardChunk(local_chunk_idx, inputs);
Collaborator

Here, use local_chunk_idx to index directly into the chunks stored on the stage and call that chunk's Forward method.
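
A sketch of this suggestion with simplified stand-in types: ForwardOneChunk indexes the chunks stored on the stage and calls that chunk's Forward.

// Illustrative sketch; Tensor and Chunk are stand-ins for the framework types.
#include <memory>
#include <vector>

struct Tensor {};

struct Chunk {
    std::vector<std::shared_ptr<Tensor>> Forward(const std::vector<std::shared_ptr<Tensor>> &x) { return x; }
};

class PipelineStage {
public:
    std::vector<std::shared_ptr<Tensor>> ForwardOneChunk(const std::vector<std::shared_ptr<Tensor>> &inputs,
                                                         int local_chunk_idx) {
        // Index directly into the chunks owned by this stage.
        return chunks_[local_chunk_idx]->Forward(inputs);
    }

private:
    std::vector<std::shared_ptr<Chunk>> chunks_;
};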

Contributor Author

OK

std::vector<std::shared_ptr<infini_train::Tensor>>
GPT2::Forward(const std::vector<std::shared_ptr<infini_train::Tensor>> &x) {
int pp_rank = nn::parallel::pp_rank;
void GPT2::BuildChunks() {
Contributor

For Transformer models, BuildChunks can also be merged: gpt2/llama differ only in the pos_emb, so a single if check is enough.
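
A sketch of the suggested merge, under the assumption that the only model-specific difference is whether the first chunk carries a learned positional embedding; the names and the config struct are illustrative.

// Illustrative sketch; not the actual BuildChunks implementation.
#include <memory>
#include <vector>

struct Module {
    virtual ~Module() = default;
};

struct ChunkBuildConfig {
    int vpp_size = 1;
    bool has_learned_pos_emb = false; // the gpt2/llama difference mentioned in the review
};

std::vector<std::shared_ptr<Module>> BuildChunks(const ChunkBuildConfig &cfg) {
    std::vector<std::shared_ptr<Module>> chunks;
    for (int c = 0; c < cfg.vpp_size; ++c) {
        if (c == 0 && cfg.has_learned_pos_emb) {
            // Only the first chunk of a model with a learned positional embedding owns it.
        }
        // ... assemble the layers belonging to chunk c ...
        chunks.push_back(std::make_shared<Module>());
    }
    return chunks;
}

int main() {
    auto gpt2_chunks = BuildChunks({/*vpp_size=*/2, /*has_learned_pos_emb=*/true});
    auto llama_chunks = BuildChunks({/*vpp_size=*/2, /*has_learned_pos_emb=*/false});
    (void)gpt2_chunks;
    (void)llama_chunks;
    return 0;
}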

@JYMiracle305 JYMiracle305 force-pushed the add_1F1B branch 3 times, most recently from 0f5628b to aeb8ee0 Compare December 25, 2025 04:59
if (tp_world_size > 1) {
    auto tp_group = nn::parallel::ProcessGroupFactory::Instance()->Get(
        nn::parallel::GetTensorParallelProcessGroupName(device->rank().GlobalRank()));
    tp_rank = tp_group->GetGroupRank(device->rank().GlobalRank());
Collaborator

@kilinchange kilinchange Dec 25, 2025

In the multi-node case, the global rank needs to be used to get the communication group.

Contributor Author

OK

int tp_rank = 0;
if (tp_world_size > 1) {
    auto tp_group = nn::parallel::ProcessGroupFactory::Instance()->Get(
        nn::parallel::GetTensorParallelProcessGroupName(device->rank().thread_rank()));
Collaborator

GlobalRank. The same goes for the other places in this file: except in main.cc, where thread_rank is used when the device_id needs to be passed, GlobalRank must be passed everywhere a communication group is obtained.

Contributor Author

OK

auto [is_first_stage, is_last_stage, layer_chunks]
    = nn::parallel::PipelineParallel::GetStageInfo(n_layer, pp_size, vpp_size);
// ========== layer to chunk ==========
std::unordered_map<int, bool> owned_layers;
Collaborator

A map doesn't seem necessary here; a vector is enough, and lookups are faster:
std::vector owned_layers(n_layer, false)
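
A minimal sketch of the suggested replacement: a flat boolean vector indexed by layer id instead of an unordered_map (the loop bounds are illustrative).

// Illustrative sketch of the vector-based ownership lookup.
#include <vector>

int main() {
    const int n_layer = 12; // illustrative value
    std::vector<bool> owned_layers(n_layer, false);
    // Mark the layers owned by this stage's chunks, e.g. layers [3, 6).
    for (int l = 3; l < 6; ++l) { owned_layers[l] = true; }
    return 0;
}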

Contributor Author

OK

@JYMiracle305 JYMiracle305 force-pushed the add_1F1B branch 2 times, most recently from f8b086c to c22da40 Compare December 26, 2025 03:21