
[WIP] Add: SPMD paged attention example with dual-vector softmax #499

Open

chenshengxin2026 wants to merge 1 commit into hw-native-sys:main from chenshengxin2026:add-spmd-paged-attention-example

Conversation

@chenshengxin2026
Contributor

Summary

  • Add a complete SPMD paged attention example under examples/a2a3/tensormap_and_ringbuffer/
  • Pipeline: QK matmul → softmax prepare → PV matmul → online update, with dual AIV lanes processing 8-row sub-tiles
  • Includes golden tests with 4 test cases (varying batch sizes and context lengths, bfloat16)

Test plan

  • Run simulation test: python examples/scripts/run_example.py -k examples/a2a3/tensormap_and_ringbuffer/spmd_paged_attention/kernels -g examples/a2a3/tensormap_and_ringbuffer/spmd_paged_attention/golden.py -p sim
  • Verify golden output correctness (RTOL=1e-2, ATOL=1e-2)

Implements a complete paged attention kernel using SPMD parallelism
under the tensormap_and_ringbuffer runtime. The pipeline consists of
QK matmul, softmax prepare, PV matmul, and online update stages with
dual AIV lanes processing 8-row sub-tiles each for the softmax and
accumulation phases.
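The online-update stage described above can be sketched in NumPy. This is a hedged reference for the softmax-rescaling math only, in the style of the PR's Python golden tests: the function names, block sizes, and scaling are illustrative and are not taken from the kernel code.

```python
import numpy as np

def online_attention_update(acc, m_prev, l_prev, s_block, v_block, scale=1.0):
    """One online-softmax update over a single KV block (illustrative).

    acc     : (M, D) running output accumulator
    m_prev  : (M,)   running row maxima of the scores
    l_prev  : (M,)   running softmax denominators
    s_block : (M, N) raw QK^T scores for this block
    v_block : (N, D) value tile for this block
    """
    s = s_block * scale
    m_new = np.maximum(m_prev, s.max(axis=1))
    # rescale the previous accumulator and denominator to the new maximum
    alpha = np.exp(m_prev - m_new)
    p = np.exp(s - m_new[:, None])
    l_new = alpha * l_prev + p.sum(axis=1)
    acc_new = alpha[:, None] * acc + p @ v_block
    return acc_new, m_new, l_new

def paged_attention(q, k_blocks, v_blocks):
    """Iterate over paged KV blocks, then normalize once at the end."""
    M, D = q.shape
    acc = np.zeros((M, D))
    m = np.full(M, -np.inf)
    l = np.zeros(M)
    scale = 1.0 / np.sqrt(D)
    for kb, vb in zip(k_blocks, v_blocks):
        acc, m, l = online_attention_update(acc, m, l, q @ kb.T, vb, scale)
    return acc / l[:, None]
```

Because each block only rescales the running accumulator by `exp(m_prev - m_new)`, the result is identical to computing the full softmax over all blocks at once, which is what makes the per-block SPMD pipeline possible.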

@gemini-code-assist bot left a comment


Code Review

This pull request introduces a new SPMD Paged Attention implementation, including Python golden test cases, AICore kernels for QK and PV matrix multiplications, and AIV kernels for softmax preparation and online updates. It also includes orchestration logic to manage these tasks across multiple KV block iterations, utilizing dual-vector subvector partitioning for AIV tasks. The review comments point out a potential readability issue in the AICore matmul kernels where GlobalTensor definitions use a 5-dimensional shape for what appears to be 2D data, requesting clarification on this design choice.

Comment on lines +52 to +54
using GlobalA = GlobalTensor<bfloat16_t, Shape<1, 1, 1, TM, TK>, Stride<TM * TK, TM * TK, TM * TK, TK, 1>>;
using GlobalB = GlobalTensor<bfloat16_t, Shape<1, 1, 1, TK, TN>, Stride<TK * TN, TK * TN, TK * TN, TN, 1>>;
using GlobalOut = GlobalTensor<float, Shape<1, 1, 1, TM, TN>, Stride<TM * TN, TM * TN, TM * TN, TN, 1>>;


medium

The GlobalTensor definitions use Shape<1, 1, 1, TM, TK> and Stride<TM * TK, TM * TK, TM * TK, TK, 1>. While this might be an API requirement, it implies a 5-dimensional tensor where the first three dimensions are always 1. This can reduce readability and potentially obscure the true dimensionality of the data. If the underlying data is truly 2D, consider if the API allows for a more direct 2D tensor definition, or add a comment explaining why these extra dimensions are necessary.

References
  1. In hardware kernel code, using a hardcoded literal for a memory offset is acceptable if its derivation is clearly explained in an accompanying comment. The comment suggests adding an explanation for the 5D tensor definition if it's a necessary hardware-specific layout, aligning with this rule.
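The shape/stride equivalence the comment questions can be demonstrated in NumPy (this is an illustrative sketch, not the pto API): leading singleton dimensions whose strides never combine with a non-zero index contribute nothing to the address calculation, so the 5D view addresses exactly the same elements as a plain 2D row-major tile.

```python
import numpy as np

TM, TK = 4, 8
buf = np.arange(TM * TK, dtype=np.float32)

# plain 2D row-major view: shape (TM, TK), element strides (TK, 1)
t2d = buf.reshape(TM, TK)

# the 5D view from the review comment: Shape<1, 1, 1, TM, TK> with
# element strides (TM*TK, TM*TK, TM*TK, TK, 1)
itemsize = buf.itemsize
t5d = np.lib.stride_tricks.as_strided(
    buf,
    shape=(1, 1, 1, TM, TK),
    strides=tuple(s * itemsize for s in (TM * TK, TM * TK, TM * TK, TK, 1)),
)

# the singleton dimensions only ever take index 0, so the strides chosen
# for them are irrelevant and both views see the same data
assert np.array_equal(t5d[0, 0, 0], t2d)
```

If the API requires a fixed rank (e.g. batch/head/block axes that happen to be 1 in this example), a comment stating that is enough to resolve the readability concern.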

Comment on lines +52 to +55
qk_matmul_spmd(__gm__ bfloat16_t *qi_addr, __gm__ bfloat16_t *kj_addr, __gm__ float *sij_addr) {
using GlobalA = GlobalTensor<bfloat16_t, Shape<1, 1, 1, TM, TK>, Stride<TM * TK, TM * TK, TM * TK, TK, 1>>;
using GlobalB =
GlobalTensor<bfloat16_t, Shape<1, 1, 1, TK, TN>, Stride<TK * TN, TK * TN, TK * TN, 1, TK>, Layout::DN>;


medium

Similar to aic_pv_matmul.cpp, the GlobalTensor definitions here use Shape<1, 1, 1, TM, TK> and Stride<TM * TK, TM * TK, TM * TK, TK, 1>. If the tensors are fundamentally 2D, this 5D representation can make the code harder to understand. Please clarify the reason for this structure or simplify it if the pto library allows for direct 2D tensor declarations.

References
  1. In hardware kernel code, using a hardcoded literal for a memory offset is acceptable if its derivation is clearly explained in an accompanying comment. The comment suggests clarifying the reason for the 5D tensor definition if it's a necessary hardware-specific layout, aligning with this rule.
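One detail worth noting in this second snippet: GlobalB's innermost strides are (1, TK) rather than (TK, 1), together with Layout::DN, which suggests a column-major tile (i.e. K consumed transposed for QK^T). A NumPy sketch of that stride pattern (illustrative only, not the pto API):

```python
import numpy as np

TK, TN = 8, 4
buf = np.arange(TK * TN, dtype=np.float32)

# column-major (DN-style) view: logical shape (TK, TN) with element
# strides (1, TK) -- walking down a column is contiguous in memory
b_dn = np.lib.stride_tricks.as_strided(
    buf,
    shape=(TK, TN),
    strides=(1 * buf.itemsize, TK * buf.itemsize),
)

# the same memory read as a row-major (TN, TK) buffer, then transposed
b_row_major = buf.reshape(TN, TK)
assert np.array_equal(b_dn, b_row_major.T)
```

So the 5D-vs-2D question applies here too, but the stride order itself appears intentional; a brief comment in the kernel distinguishing the layout choice from the rank choice would address both.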

