Upstream refactoring plan: contribute ant-pretrain changes to AI-Hypercomputer/maxtext #3481

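## Goal

Refactor the changes ant-pretrain has made relative to upstream MaxText (about 241 commits, ~15,000 lines of added code) into 7 separate PRs, split by functional module, and push them back to upstream AI-Hypercomputer/maxtext after stripping all private dependencies.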

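## Background

- **Fork base**: `d4ed2261` (upstream AI-Hypercomputer/maxtext, 2026-01-16)
- **Upstream has since done a large directory restructuring**: `src/MaxText/` → `src/maxtext/` (lowercase), with models moved into a `models/` subdirectory
- **Upstream already has some of these features**: MLA (a superset implementation), MTP (basic version), basic router bias support, and some config fields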

## PR dependency graph

```
PR1: Config extensions (types.py + base.yml + common_types.py)
 ├── PR2: MoE alignment and fixes (moe.py, GA, train.py, metric_logger.py)
 ├── PR3: Megatron MMap data pipeline (new files + grain integration)
 └── PR4: GLA linear attention (attention_gla.py, gla_pallas.py)
      └── PR6: AL Model / Ling2 / Decoder integration (depends on PR4)
           ├── PR7: MTP improvements (modifies existing upstream files)
           └── PR8: Checkpoint conversion extensions
```

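- PR5 (MLA) has been cancelled: the upstream version is a strict superset of ours
- PR2 through PR4 are independent of each other and can be submitted in parallel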
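
The submission order implied by the graph can be derived mechanically. The following is a minimal sketch using Python's standard `graphlib`; the `DEPENDS_ON` mapping is transcribed from the graph above, and the function name `submission_waves` is made up for illustration.

```python
from graphlib import TopologicalSorter

# Each PR mapped to the PRs it depends on, as read from the dependency graph.
DEPENDS_ON = {
    "PR1": set(),
    "PR2": {"PR1"},
    "PR3": {"PR1"},
    "PR4": {"PR1"},
    "PR6": {"PR4"},
    "PR7": {"PR6"},
    "PR8": {"PR6"},
}


def submission_waves(depends_on):
    """Group PRs into waves; all PRs in one wave can be submitted in parallel."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


WAVES = submission_waves(DEPENDS_ON)
for number, wave in enumerate(WAVES, start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

Under this reading, PR6 only needs PR4 to be done, so it does not have to wait for PR2 or PR3 to be reviewed.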

## PR summaries

### PR1: Config extensions (base PR)
- Adds ~16 config fields (MoE z-loss, linear attention, data pipeline, training mode)
- Adds `AL_MODEL` / `LING2` values to the `DecoderBlockType` enum
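
As a rough illustration of the enum change, here is a self-contained stand-in. It is not the upstream definition: only one existing member is shown, and the string values `"al_model"` and `"ling2"` as well as the helper `parse_decoder_block` are assumptions made for this sketch.

```python
import enum


class DecoderBlockType(enum.Enum):
    """Stand-in for the real enum; most existing members are omitted."""

    DEEPSEEK = "deepseek"
    # Assumed string values for the two members this PR would add.
    AL_MODEL = "al_model"
    LING2 = "ling2"


def parse_decoder_block(name: str) -> DecoderBlockType:
    """Resolve a config string to an enum member (hypothetical helper)."""
    try:
        return DecoderBlockType(name.lower())
    except ValueError:
        valid = ", ".join(member.value for member in DecoderBlockType)
        raise ValueError(
            f"unknown decoder block {name!r}; expected one of: {valid}"
        ) from None


print(parse_decoder_block("ling2"))
```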

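### PR2: MoE alignment and fixes
- Z-loss computation (ST-MoE-style logsumexp)
- Router stats diagnostic metrics
- Two-step split of the expert count (to support accumulation across GA, i.e. gradient accumulation, micro-batches)
- FP32 precision fixes (router gate matmul, scale weights)
- Bug fix for GA aux metrics normalization
- MoE collection across multiple block types (DEEPSEEK/AL_MODEL/LING2/Mixtral/...)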
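
To make two of the items above concrete, here is a minimal pure-Python sketch. The z-loss follows the ST-MoE definition (mean over tokens of the squared logsumexp of the router logits). The two-step expert count is an assumption about what the split looks like: raw counts are summed across micro-batches first and normalized once at the end. All function names are hypothetical, and the real code would operate on JAX arrays rather than lists.

```python
import math


def logsumexp(row):
    """Numerically stable log(sum(exp(x))) over one token's router logits."""
    peak = max(row)
    return peak + math.log(sum(math.exp(x - peak) for x in row))


def router_z_loss(router_logits):
    """ST-MoE-style z-loss: mean over tokens of logsumexp(logits) squared."""
    return sum(logsumexp(row) ** 2 for row in router_logits) / len(router_logits)


def count_tokens_per_expert(expert_ids, num_experts):
    """Step 1 (per micro-batch): raw token counts, no normalization yet."""
    counts = [0] * num_experts
    for expert in expert_ids:
        counts[expert] += 1
    return counts


def expert_load_fractions(microbatch_counts):
    """Step 2 (after accumulation): sum raw counts, then normalize once."""
    totals = [sum(column) for column in zip(*microbatch_counts)]
    grand_total = sum(totals)
    return [total / grand_total for total in totals]


# Two micro-batches of different sizes, routed to 2 experts.
counts = [count_tokens_per_expert(ids, 2) for ids in ([0, 0, 1], [1])]
print(expert_load_fractions(counts))
print(router_z_loss([[0.0, 0.0]]))
```

Normalizing inside each micro-batch and then averaging would weight the one-token micro-batch as heavily as the three-token one, which is one reason to keep the counting and the normalization as separate steps.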

### PR3: Megatron MMap data pipeline
- `MMapIndexedDataset`: reader for the Megatron .bin + .idx format
- `MegatronNpyDataSource`: random access via prebuilt .npy indices
- `MegatronBlendedDataSource`: blend-then-shard dataset mixing
- `GenerateDocSegmentIds` / `MegatronSplitInputsTargets`: decouples segmentation
- ~3,000 lines of test code
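
The decoupling of segmentation from the input/target split can be pictured with the sketch below. It assumes documents in a packed sequence are delimited by an end-of-document token and that targets are the inputs shifted by one position; the function names, the `eod_id` parameter, and the 1-based segment numbering are illustrative choices, not the actual transforms.

```python
def generate_doc_segment_ids(tokens, eod_id):
    """Assign a 1-based segment id per token; the id increments after each EOD."""
    segment = 1
    segment_ids = []
    for token in tokens:
        segment_ids.append(segment)  # the EOD token still belongs to its document
        if token == eod_id:
            segment += 1
    return segment_ids


def split_inputs_targets(tokens):
    """Next-token prediction split: targets are inputs shifted left by one."""
    return tokens[:-1], tokens[1:]


packed = [5, 6, 0, 7, 8, 0, 9]  # token id 0 plays the role of EOD here
inputs, targets = split_inputs_targets(packed)
print(generate_doc_segment_ids(inputs, eod_id=0))
print(inputs, targets)
```

Because the two steps are separate functions, the segment ids can be computed on either the packed sequence or the already-split inputs without touching the split logic.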

### PR4: GLA linear attention
- `BailingMoeV2LinearAttention`: Gated Linear Attention (Lightning Attention-2)
- `gla_pallas.py`: Pallas TPU kernel wrapper (pallas-kernel as an optional dependency)
- `GroupRMSNorm`: grouped RMS normalization
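
For readers unfamiliar with grouped RMS normalization, the following sketch shows the idea on a single feature vector. It assumes the feature dimension is split into contiguous, equally sized groups, each normalized by its own root mean square, with an optional per-feature scale; the signature and the default `eps` are illustrative and not taken from the actual `GroupRMSNorm` module.

```python
import math


def group_rms_norm(x, num_groups, weight=None, eps=1e-6):
    """Normalize each contiguous group of features by that group's RMS."""
    if len(x) % num_groups != 0:
        raise ValueError("feature size must be divisible by num_groups")
    group_size = len(x) // num_groups
    out = []
    for group in range(num_groups):
        chunk = x[group * group_size:(group + 1) * group_size]
        mean_square = sum(value * value for value in chunk) / group_size
        rms = math.sqrt(mean_square + eps)
        out.extend(value / rms for value in chunk)
    if weight is not None:
        out = [value * scale for value, scale in zip(out, weight)]  # learned scale
    return out


# Each group is normalized independently, so both halves end up near 1.0.
print(group_rms_norm([1.0, 1.0, 2.0, 2.0], num_groups=2))
```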

### PR6: AL Model / Ling2 / Decoder integration
- `ALModel`: hybrid GLA/MLA attention + MoE
- `Ling2`: Ling2 decoder architecture
- Decoder registration and scan integration

### PR7: MTP improvements
- Independent per-layer loss normalization (Megatron-LM style)
- `roll_and_mask_by_segment`: document-boundary-aware MTP rolling
- Improvements to `final_layernorm` and `self.sow()`
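
A document-boundary-aware roll could behave roughly as sketched below. The function name comes from the plan, but the behavior shown is an assumption: values are shifted left by `shift` positions, and a position is masked (set to `pad`) when the shifted-in value falls off the end of the sequence or belongs to a different document segment.

```python
def roll_and_mask_by_segment(values, segment_ids, shift=1, pad=0):
    """Shift values left by `shift`; mask anything pulled across a segment edge."""
    length = len(values)
    rolled = []
    for position in range(length):
        source = position + shift
        same_document = (
            source < length and segment_ids[source] == segment_ids[position]
        )
        rolled.append(values[source] if same_document else pad)
    return rolled


tokens = [10, 11, 12, 20, 21]
segments = [1, 1, 1, 2, 2]
# Position 2 would otherwise receive token 20 from the next document.
print(roll_and_mask_by_segment(tokens, segments))
```

A plain roll-and-mask only zeroes the tail of the sequence; with packed sequences that lets an MTP head be trained to predict the first token of an unrelated document, which is what the segment check avoids.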

### PR8: Checkpoint conversion extensions
- AL Model / NextN parameter mappings

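## Global strip list

| Item to strip | Handling |
|--------|---------|
| Argus dump code | Remove all references |
| Megatron-LM submodule | Not included |
| pallas-kernel submodule | Converted to an optional pip dependency |
| Reference model code (bailing_moe_*) | Not included |
| Private CI/CD | Not included |
| Large assets (tokenizer.json) | Not included |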
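
Turning a submodule into an optional pip dependency usually means guarding the import and falling back when the package is absent. The sketch below shows one common pattern using only the standard library. The importable module name of pallas-kernel is not stated in this plan, so it is passed in as a parameter, and `load_optional` and `pick_gla_backend` are hypothetical helpers.

```python
import importlib
import importlib.util


def load_optional(module_name):
    """Return the imported top-level module, or None if it is not installed."""
    if importlib.util.find_spec(module_name) is None:
        return None
    return importlib.import_module(module_name)


def pick_gla_backend(kernel_module_name):
    """Choose the kernel path when the optional package is present."""
    kernels = load_optional(kernel_module_name)
    return "pallas" if kernels is not None else "reference"


# With a module name that is not installed, the reference path is selected.
print(pick_gla_backend("no_such_module_for_this_sketch"))
```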

## Detailed documents

- Design doc: `docs/plans/2026-03-23-upstream-refactoring-design.md`
- Implementation plan (v2): `docs/plans/2026-03-23-upstream-refactoring-impl.md`
