Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
245 changes: 245 additions & 0 deletions M5_DESIGN.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,245 @@
# M5 设计说明

`PTO2_CLUSTER_MODE=5`:在硬件被迫跨 cluster 部署时(如 3+3 cluster),让 PTO2 runtime 的性能恢复到接近"全 4 线程同 cluster"基线。

## 背景

A3 / 未来同代芯片 AICPU 拓扑为 2 个 cluster,cluster 内 cache 一致性远比跨 cluster 廉价。生产配置 `aicpu_thread_num=4`(1 orch + 3 sched)。

**当前 A3 (CANN 9.0.0)** 自然分布是 4+2,gate 选 4 同 cluster ⇒ orch 和 sched 全在一边,没有跨 cluster 开销。这是 M0 baseline。

**未来受限拓扑**(如 cluster 内只 3 核)必然要跨 cluster。先前实验摸清了几种方案:

| Mode | 拓扑 | C1 Orch vs M0 | 备注 |
|---|---|---|---|
| M2 | 1 orch + 3 sched 跨 cluster | +19% | 最差 |
| M3 | 1 orch + 1 sched 同 cluster,2 sched 另 cluster;sched_idx 随机 | +11% | 中间 |
| M4 | 同 M3 拓扑;wiring 线程 (sched_idx 0) 强制跟 orch 同 cluster | +8% | 不够 |
| **M5** | M4 + 软件层约束 SM 写者 + spin cleanup | **+1.5%** | 目标解 |

M4 仅靠 placement 把 wiring queue 留在 orch cluster,但**剩 2 个 sched 仍跨 cluster 写 `last_task_alive`**,orch 每次 alloc 读 SM 都跨 cluster 拉。

## 设计目标

跨 cluster 拓扑下,让 SM 的关键 cache line `last_task_alive` **只被 orch cluster 内的线程写**,消除远端 sched 写带来的跨 cluster invalidate。

## 三个组件

M5 同时做 3 件事,缺一不可。

### 1. sync_to_sm 对非本地 sched 短路

`sync_to_sm()` 是 `advance_ring_pointers()` 末尾把本地 `last_task_alive` 写回 SM 的步骤。M5 加入短路:

```cpp
void sync_to_sm() {
if (g_only_thread0_advances) return; // M5: remote sched 不写 SM
ring->fc.last_task_alive.store(last_task_alive, release);
}
```

`g_only_thread0_advances` 是 thread_local,在 `aicpu_executor::run()` 入口按 `runtime->cluster_mode == 5 && thread_idx != 0` 决定。

效果:

- sched 0 (跟 orch 同 cluster) 正常写 SM
- 远端 2 个 sched 完成 task 时仍然推进**本地 `RingSchedState::last_task_alive`**(这是 RingSchedState 内的 int32_t,所有 sched 共享),但**不把它写回 SM 的 `ring->fc.last_task_alive`**

副作用:SM 滞后于本地。需要 #2 让 sched 0 兜底,需要 #3 让 orch 容忍滞后。

### 2. sched 0 conditional publish

只把"非 sched-0 把 SM 写关掉"不够 —— SM 永远滞后会让 orch 误以为没有 task 完成。需要 sched 0 兜底 publish。

```cpp
// scheduler_dispatch.cpp 的 dispatch 主循环开头:
static thread_local int32_t last_published[PTO2_MAX_RING_DEPTH] = {0,0,0,0};
if (thread_idx == 0 && runtime->cluster_mode == 5) {
for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) {
auto &rss = sched_->ring_sched_states[r];
int32_t cur = rss.last_task_alive;
if (cur != last_published[r]) {
rss.ring->fc.last_task_alive.store(cur, release);
last_published[r] = cur;
}
}
}
```

关键:**conditional**(只在变化时写),避免 unconditional 写带来的 sched 0 自身开销。如果远端 sched 推进了 local,sched 0 看到 `cur != last_published`,publish 一次;否则 0 store。

### 3. new_entry() spin + 主动 cleanup(关键)

只做 #1 #2 会触发原本被掩盖的 bug:**orch 的 tensormap entry pool 复用速度跟不上 entry 释放速度**。

#### 为什么会卡

orch 用 SM 的 `last_task_alive` 触发 tensormap cleanup:

```cpp
sync_tensormap(task_id, sm_last_task_alive) {
cleanup_retired(ring, last_cleanup, sm_last_task_alive);
last_cleanup = sm_last_task_alive;
}
```

在 M5 下 SM 滞后于真实进度 → `cleanup_retired` 清理范围保守 → free_entry_list 没及时补充 → orch 下一次 `new_entry()` 找不到空闲 entry → 走 `always_assert(next_entry_idx < pool_size)` 直接挂。

类似地,alloc / heap reclaim 路径都依赖 SM `last_task_alive`,但 alloc 本身**已经会 spin**(while loop),所以"看到旧值就 spin 等"天然安全。只有 tensormap 这条路是 fire-and-forget。

#### 解法 A:new_entry spin

```cpp
PTO2TensorMapEntry *new_entry() {
while (true) {
if (free_num > 0) return free_entry_list[--free_num];
if (next_entry_idx < pool_size) return &entry_pool[next_entry_idx++];
// 池耗尽:重读 SM 触发新一轮 cleanup
int32_t before = free_num;
try_cleanup_all_rings_from_sm();
if (free_num == before) {
__asm__ volatile("yield"); // 让 sched 0 跑
}
}
}
```

`try_cleanup_all_rings_from_sm()` 重读 SM `last_task_alive` 跑 cleanup,让 orch 主动消化最新进度。

#### 解法 B:cleanup_retired 范围 cap

仅靠 A 还不够。`cleanup_retired` 在范围 ≥ `task_window_size` 时**会 wrap**:

```cpp
for (int32_t local_id = old; local_id < new; ++local_id) {
int32_t slot = local_id & (task_window_size - 1);
// 当 (new - old) >= window,多个 local_id 哈希到同一 slot
// task_entry_heads[slot] 在前一次循环已被清空 → 但被当成新 local 的入口
// 同一组 entry 被 free 两次 → free_entry 的 always_assert 挂
}
```

M5 下 SM 一次可能跳很大(远端 sched 已经 advance 但 sched 0 还没 publish 时,sched 0 一觉醒来 publish 跨度可能 >> window)。Cap 后 cleanup 每次只清 `task_window_size - 1`,多次调用消化:

```cpp
int32_t max_step = task_window_sizes[r] > 0 ? task_window_sizes[r] - 1 : 0;
if (max_step > 0 && target - last_cleanup[r] > max_step) {
target = last_cleanup[r] + max_step;
}
cleanup_retired(r, last_cleanup[r], target);
last_cleanup[r] = target;
```

cap 同时打到 `sync_tensormap` 和新的 `try_cleanup_all_rings_from_sm`,因为 M5 下 sync_tensormap 也可能拿到大跳的 SM 值。

## 不变量

- **本地 `RingSchedState::last_task_alive`** 仍然由所有 sched 在 `advance_ring_pointers` 末尾推进(thread-safe via `advance_lock` try-lock)
- **SM `ring->fc.last_task_alive`** 在 cluster_mode=5 时只由 sched 0 写
- SM 永远 ≤ 本地(因为 sched 0 是从本地拷贝过去的)
- orch 永远拿到 SM 的旧值,按"保守估计"行事;spin 等到 sched 0 推

orch 的三条 SM 消费路径(窗口检查 / heap reclaim / tensormap cleanup)现在统一**容忍旧值 + spin 等待**:

| 路径 | 旧值行为 | M5 改动 |
|---|---|---|
| alloc 窗口 | 觉得窗口紧 → spin | 无改动(本来就 spin) |
| alloc heap | 觉得 heap 不够 → spin | 无改动(本来就 spin) |
| tensormap entry | 之前:assert;现在:spin + 主动 cleanup | new_entry + try_cleanup_all_rings_from_sm |
| tensormap cleanup 范围 | 之前:可能 wrap → double-free;现在:cap 限 1 个 window | sync_tensormap + try_cleanup_all_rings 都加 cap |

## 测量结果

paged_attention_unroll on A3,100 rounds trimmed avg:

| Mode | C1 Orch | C1 Total | C2 Orch | C2 Total |
|---|---|---|---|---|
| M0 (4 同 cluster) | **854** | 1163 | 499 | 626 |
| M4 (1+1+2 placement only) | 919 (+8%) | 1185 (+2%) | 494 | 654 (+4.5%) |
| **M5** | **867 (+1.5%)** | 1180 (+1.5%) | 489 | 649 (+3.6%) |

M5 把 C1 Orch 的 8% regression 收到 1.5% —— 受限拓扑下基本追平 M0。

## 失败的早期版本(学习记录)

### M5 v1:non-sched-0 跳过 advance,置 `pending_advance` flag,sched 0 poll

死。non-sched-0 完全不动本地 last_task_alive,sched 0 自己 idle 时没人推 ring pointer,orch alloc 卡死。

### M5 v2:sync_to_sm 短路 + sched 0 unconditional publish

正确运行但 orch profile TOTAL 反而比 M4 略差。sched 0 每 iter 4 个 atomic store 的开销盖过了 cross-cluster line 留本地的收益。

### M5 v3:sync_to_sm 短路 + sched 0 conditional publish(首次)

assert 挂。SM 滞后时 cleanup 没跟上,pool 用尽 → always_assert。这次失败暴露了**v2 之所以"能跑"只是因为 SM 永远不滞后**。

### M5 v4:v3 + new_entry spin(但 cleanup 没 cap)

assert 挂在 `free_entry` 而不是 `new_entry`。原因:SM 跳跃太大,cleanup_retired 范围跨 `task_window_size` → 同一 entry 被 free 两次。

### M5 v6(终版)

v4 + cleanup_retired 范围 cap。所有路径都正确。

## 适用条件

启用 `PTO2_CLUSTER_MODE=5` 的前提:

1. AICPU 拓扑允许 2+2 分布(每 cluster ≥ 2 个 active 线程)
2. CANN 9.0+(其它版本 affinity gate 行为可能不同)
3. `aicpu_thread_num=4`(其它配置的 thread_idx/role 分配未验证)

在 A3 当前 CANN 上(4+2 ideal),M5 实际上**没必要** —— M0 default 已经把 4 个全放同 cluster,没跨 cluster 开销。M5 是为**未来受限拓扑**准备的解。

## 已知限制

- **sched 0 的 publish 是 lazy 的**:依赖 sched 0 iter 的频率。如果 sched 0 因为各种原因 stall(如长时间没 ready task),SM 会大幅滞后,orch 的 spin 会变长。实测在 paged_attention 上没观察到,但 workload 不同可能暴露。
- **C2 Total 仍比 M0 高 3.6%**:剩余开销主要在 sched 侧,远端 2 个 sched 仍然在 cluster B 上跑完成处理 + 跟 sched 0 同步。这部分是 M5 无法解的本质开销(必须有线程在远端)。
- **`try_cleanup_all_rings_from_sm` 跑全部 4 个 ring**:实际 paged_attention 只用 1 ring,3 次无效循环。可以加 ring 数缓存优化,但影响小。
- **`g_only_thread0_advances` 是 thread_local**:每个 sched 线程都要在 `run()` 入口设置。当前 `aicpu_executor::run()` 已经做了。

## 后续工作

1. 把 `try_cleanup_all_rings_from_sm` 改成 lazy(仅当 pool 真用尽才扫所有 ring)
2. sched 0 publish 可以再加 epoch counter,避免长时间没事时永远不 publish 的极端情况
3. 在真正 3+3 受限硬件上验证 —— A3 现在的 4+2 拓扑只能模拟

## 相关文件

```text
src/a2a3/runtime/tensormap_and_ringbuffer/
├── runtime/
│ ├── scheduler/
│ │ ├── pto_scheduler.h # sync_to_sm 短路;g_only_thread0_advances 声明
│ │ ├── pto_scheduler.cpp # g_only_thread0_advances 定义
│ │ └── scheduler_dispatch.cpp # sched 0 conditional publish
│ ├── pto_tensormap.h # new_entry spin
│ └── shared/
│ └── pto_tensormap.cpp # try_cleanup_all_rings_from_sm + cap
├── aicpu/
│ └── aicpu_executor.cpp # g_only_thread0_advances 赋值
├── runtime/
│ └── runtime.h # cluster_mode 字段
└── host/
└── runtime_maker.cpp # PTO2_CLUSTER_MODE env 读取
Comment on lines +187 to +225
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update the documented topology and file references.

This section still anchors M5 to aicpu_thread_num=4 and src/a2a3/..., but the supplied implementation in this PR is under src/a5/... and adds a dedicated wiring-thread role. Leaving the old paths/layout here will send debugging and repro work to the wrong code.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@M5_DESIGN.md` around lines 187 - 225, The documentation still pins M5 to
aicpu_thread_num=4 and old src/a2a3 paths; update the M5_DESIGN.md text to
reflect the new implementation under src/a5 and the added wiring-thread role:
replace the aicpu_thread_num=4 claim with the new thread-role layout (mention
the dedicated wiring-thread), change all referenced file names/locations to the
new src/a5 counterparts, and adjust examples/notes that point to former symbols
so they reference the actual implemented symbols (PTO2_CLUSTER_MODE=5,
wiring-thread role, try_cleanup_all_rings_from_sm, g_only_thread0_advances,
pto_scheduler/pto_tensormap/scheduler_dispatch equivalents) so debugging and
repro lead to the correct code.

```

## 复现

```bash
source .venv/bin/activate
pip install --no-build-isolation .

# Baseline (M0)
task-submit --device auto --run \
"cd $(pwd) && source .venv/bin/activate && \
unset PTO2_CLUSTER_MODE && \
bash tools/benchmark_rounds.sh -p a2a3 -d \$TASK_DEVICE -n 100"

# M5
task-submit --device auto --run \
"cd $(pwd) && source .venv/bin/activate && \
export PTO2_CLUSTER_MODE=5 && \
bash tools/benchmark_rounds.sh -p a2a3 -d \$TASK_DEVICE -n 100"
```
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@
//
// Optimizations:
// - qi TLOAD hoisted before the loop (constant across all iterations)
// - Double-buffered L1 B tiles: prefetch next kj during current TMATMUL+TSTORE
//
// Supports two tile configurations via runtime dispatch:
// Case1: (16, 128) @ (128, 128).T -> (16, 128)
Expand All @@ -27,7 +28,7 @@
// Template: M=q_tile, K=head_dim, N=block_size

#include <cstdint>
// NOLINTBEGIN(clang-diagnostic-error,bugprone-reserved-identifier,bugprone-easily-swappable-parameters,modernize-avoid-c-arrays,modernize-use-auto)
// NOLINTBEGIN(clang-diagnostic-error,bugprone-reserved-identifier,bugprone-easily-swappable-parameters,modernize-use-auto)
#include <pto/pto-inst.hpp>

#include "tensor.h"
Expand All @@ -42,7 +43,7 @@ using namespace pto;
#endif

#ifndef __aicore__
#define __aicore__ [aicore]
#define __aicore__ [aicore] // NOLINT(whitespace/braces)
#endif

template <int M, int K, int N>
Expand All @@ -61,10 +62,14 @@ static __aicore__ void qk_matmul_n_impl(
using RightTile = TileRight<bfloat16_t, K, N, K, N>;
using AccTile = TileAcc<float, M, N, M, N>;

// Double-buffered L1 B tiles for kj prefetching
constexpr int kBBytes = K * N * static_cast<int>(sizeof(bfloat16_t));
TileMatA aMatTile;
TileMatB bMatTile;
TileMatB bMatTile_A;
TileMatB bMatTile_B;
TASSIGN(aMatTile, 0x0);
TASSIGN(bMatTile, 0x20000);
TASSIGN(bMatTile_A, 0x20000);
TASSIGN(bMatTile_B, 0x20000 + kBBytes);

LeftTile aTile;
RightTile bTile;
Expand All @@ -77,19 +82,34 @@ static __aicore__ void qk_matmul_n_impl(
GlobalA qiGlobal(qi_base);
TLOAD(aMatTile, qiGlobal);

// Pre-load first kj into buffer A
GlobalB kjGlobal_0(key_base + bt[bt_offset + 0] * N * K);
TLOAD(bMatTile_A, kjGlobal_0);

for (uint64_t i = 0; i < n_blocks; i++) {
GlobalB kjGlobal(key_base + bt[bt_offset + i] * N * K);
GlobalOut sijGlobal(sij_base + i * M * N);

// Load only B each iteration (qi already in L1 from hoist)
TLOAD(bMatTile, kjGlobal);

// Wait for current kj TLOAD to complete
set_flag(PIPE_MTE2, PIPE_MTE1, EVENT_ID0);
wait_flag(PIPE_MTE2, PIPE_MTE1, EVENT_ID0);

// TMOV qi from L1→L0A (re-copy since TMATMUL consumed L0A) and kj from L1→L0B
// TMOV qi L1→L0A and kj L1→L0B from current buffer
TMOV(aTile, aMatTile);
TMOV(bTile, bMatTile);
if (i % 2 == 0) {
TMOV(bTile, bMatTile_A);
} else {
TMOV(bTile, bMatTile_B);
}

// Prefetch next kj into alternate L1 buffer (overlaps with MTE1→M→FIX)
if (i + 1 < n_blocks) {
GlobalB kjGlobal_next(key_base + bt[bt_offset + i + 1] * N * K);
if (i % 2 == 0) {
TLOAD(bMatTile_B, kjGlobal_next);
} else {
TLOAD(bMatTile_A, kjGlobal_next);
}
}

set_flag(PIPE_MTE1, PIPE_M, EVENT_ID0);
wait_flag(PIPE_MTE1, PIPE_M, EVENT_ID0);
Expand All @@ -102,10 +122,13 @@ static __aicore__ void qk_matmul_n_impl(
TSTORE(sijGlobal, cTile);

if (i + 1 < n_blocks) {
// Drain all pipes before next iteration:
// - FIX/MTE3: ensures TSTORE data path (L0C→UB→GM) fully completes
// - MTE2: prefetch TLOAD likely already done (ran during TMATMUL+TSTORE)
// The prefetch TLOAD overlaps with compute, so barrier cost is minimal.
pipe_barrier(PIPE_ALL);
}
}

pipe_sync();
}

Expand All @@ -131,4 +154,4 @@ extern "C" __aicore__ void kernel_entry(__gm__ int64_t *args) {
}
}

// NOLINTEND(clang-diagnostic-error,bugprone-reserved-identifier,bugprone-easily-swappable-parameters,modernize-avoid-c-arrays,modernize-use-auto)
// NOLINTEND(clang-diagnostic-error,bugprone-reserved-identifier,bugprone-easily-swappable-parameters,modernize-use-auto)
Loading
Loading