[WIP] Add: SPMD paged attention example with dual-vector softmax #499
chenshengxin2026 wants to merge 1 commit into hw-native-sys:main from
Conversation
Implements a complete paged attention kernel using SPMD parallelism under the tensormap_and_ringbuffer runtime. The pipeline consists of QK matmul, softmax prepare, PV matmul, and online update stages, with dual AIV lanes each processing an 8-row sub-tile during the softmax and accumulation phases.
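To make the "online update" stage concrete, here is a minimal NumPy sketch of the standard online-softmax recurrence that this kind of pipeline implements (this is the generic algorithm, not the PR's kernel code; shapes and block sizes are illustrative placeholders):

```python
import numpy as np

def online_softmax_update(m_prev, l_prev, o_prev, s_blk, v_blk):
    """One online-softmax step for a new KV block.

    m_prev: running row max, l_prev: running softmax denominator,
    o_prev: running unnormalized output accumulator,
    s_blk:  QK^T scores for this block, v_blk: this block's V tile.
    """
    m_new = np.maximum(m_prev, s_blk.max(axis=-1))   # updated row max
    scale = np.exp(m_prev - m_new)                   # rescale factor for old state
    p = np.exp(s_blk - m_new[:, None])               # block probabilities (unnormalized)
    l_new = scale * l_prev + p.sum(axis=-1)          # updated denominator
    o_new = scale[:, None] * o_prev + p @ v_blk      # updated accumulator
    return m_new, l_new, o_new

# Process KV blocks one at a time and compare with full softmax attention.
rng = np.random.default_rng(0)
q = rng.standard_normal((8, 16))                     # one 8-row sub-tile, head dim 16
kv_blocks = [(rng.standard_normal((4, 16)), rng.standard_normal((4, 16)))
             for _ in range(3)]

m = np.full(8, -np.inf)
l = np.zeros(8)
o = np.zeros((8, 16))
for k, v in kv_blocks:
    m, l, o = online_softmax_update(m, l, o, q @ k.T, v)
out = o / l[:, None]                                 # final normalization

# Reference: materialize all scores, take a full softmax, then multiply by V.
k_all = np.concatenate([k for k, _ in kv_blocks])
v_all = np.concatenate([v for _, v in kv_blocks])
s = q @ k_all.T
p_all = np.exp(s - s.max(-1, keepdims=True))
ref = (p_all / p_all.sum(-1, keepdims=True)) @ v_all
assert np.allclose(out, ref)
```

The point of the recurrence is that each new KV block only requires rescaling the previously accumulated max, denominator, and output, so the full score matrix never has to be materialized.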
Code Review
This pull request introduces a new SPMD Paged Attention implementation, including Python golden test cases, AICore kernels for QK and PV matrix multiplications, and AIV kernels for softmax preparation and online updates. It also includes orchestration logic to manage these tasks across multiple KV block iterations, utilizing dual-vector subvector partitioning for AIV tasks. The review comments point out a potential readability issue in the AICore matmul kernels where GlobalTensor definitions use a 5-dimensional shape for what appears to be 2D data, requesting clarification on this design choice.
using GlobalA = GlobalTensor<bfloat16_t, Shape<1, 1, 1, TM, TK>, Stride<TM * TK, TM * TK, TM * TK, TK, 1>>;
using GlobalB = GlobalTensor<bfloat16_t, Shape<1, 1, 1, TK, TN>, Stride<TK * TN, TK * TN, TK * TN, TN, 1>>;
using GlobalOut = GlobalTensor<float, Shape<1, 1, 1, TM, TN>, Stride<TM * TN, TM * TN, TM * TN, TN, 1>>;
The GlobalTensor definitions use Shape<1, 1, 1, TM, TK> and Stride<TM * TK, TM * TK, TM * TK, TK, 1>. While this might be an API requirement, it implies a 5-dimensional tensor where the first three dimensions are always 1. This can reduce readability and potentially obscure the true dimensionality of the data. If the underlying data is truly 2D, consider if the API allows for a more direct 2D tensor definition, or add a comment explaining why these extra dimensions are necessary.
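To illustrate the reviewer's point in framework-neutral terms, here is a NumPy sketch showing that three leading size-1 dimensions never change which elements are addressed, regardless of the strides assigned to them (TM and TK are placeholder values; `GlobalTensor` itself is part of the pto library and is not modeled here):

```python
import numpy as np

TM, TK = 8, 16
flat = np.arange(TM * TK, dtype=np.float32)

# Shape<1, 1, 1, TM, TK> with Stride<TM*TK, TM*TK, TM*TK, TK, 1>:
# the three leading dims have extent 1, so their strides are never used
# to advance through memory.
a5 = np.lib.stride_tricks.as_strided(
    flat, shape=(1, 1, 1, TM, TK),
    strides=tuple(s * flat.itemsize for s in (TM * TK, TM * TK, TM * TK, TK, 1)))

# A plain 2D row-major view addresses the identical elements.
a2 = flat.reshape(TM, TK)
assert np.array_equal(a5[0, 0, 0], a2)
```

If the 5D form is a hard API requirement (e.g. a fixed-rank hardware descriptor), a one-line comment in the kernel saying so would resolve the review concern.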
References
- In hardware kernel code, using a hardcoded literal for a memory offset is acceptable if its derivation is clearly explained in an accompanying comment. The comment suggests adding an explanation for the 5D tensor definition if it's a necessary hardware-specific layout, aligning with this rule.
qk_matmul_spmd(__gm__ bfloat16_t *qi_addr, __gm__ bfloat16_t *kj_addr, __gm__ float *sij_addr) {
    using GlobalA = GlobalTensor<bfloat16_t, Shape<1, 1, 1, TM, TK>, Stride<TM * TK, TM * TK, TM * TK, TK, 1>>;
    using GlobalB =
        GlobalTensor<bfloat16_t, Shape<1, 1, 1, TK, TN>, Stride<TK * TN, TK * TN, TK * TN, 1, TK>, Layout::DN>;
Similar to aic_pv_matmul.cpp, the GlobalTensor definitions here use Shape<1, 1, 1, TM, TK> and Stride<TM * TK, TM * TK, TM * TK, TK, 1>. If the tensors are fundamentally 2D, this 5D representation can make the code harder to understand. Please clarify the reason for this structure or simplify it if the pto library allows for direct 2D tensor declarations.
References
- In hardware kernel code, using a hardcoded literal for a memory offset is acceptable if its derivation is clearly explained in an accompanying comment. The comment suggests clarifying the reason for the 5D tensor definition if it's a necessary hardware-specific layout, aligning with this rule.
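One detail worth noting when reading the quoted `GlobalB` declaration: its inner strides are `1, TK` rather than `TN, 1`, i.e. columns are contiguous rather than rows, which is what the `Layout::DN` tag appears to mark. A NumPy sketch of the stride interpretation (TK and TN are placeholder values; this does not model the pto API itself):

```python
import numpy as np

TK, TN = 16, 8
flat = np.arange(TK * TN, dtype=np.float32)

# Inner strides (TN, 1): the usual row-major layout, rows contiguous.
row_major = np.lib.stride_tricks.as_strided(
    flat, shape=(TK, TN),
    strides=(TN * flat.itemsize, 1 * flat.itemsize))

# Inner strides (1, TK): columns contiguous, i.e. the buffer holds the
# matrix transposed relative to its logical (TK, TN) shape.
col_major = np.lib.stride_tricks.as_strided(
    flat, shape=(TK, TN),
    strides=(1 * flat.itemsize, TK * flat.itemsize))

assert np.array_equal(row_major, flat.reshape(TK, TN))
assert np.array_equal(col_major, flat.reshape(TN, TK).T)
```

Documenting which operand is stored transposed (and why, e.g. to match the matmul engine's expected layout) would address the readability concern alongside the 5D-shape question.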
Summary
examples/a2a3/tensormap_and_ringbuffer/
Test plan
python examples/scripts/run_example.py -k examples/a2a3/tensormap_and_ringbuffer/spmd_paged_attention/kernels -g examples/a2a3/tensormap_and_ringbuffer/spmd_paged_attention/golden.py -p sim