
[WIP] Add: SPMD paged attention example with dual-vector softmax #499

Open

chenshengxin2026 wants to merge 1 commit into hw-native-sys:main from chenshengxin2026:add-spmd-paged-attention-example

Conversation

@chenshengxin2026
Contributor

Summary

  • Add a complete SPMD paged attention example under examples/a2a3/tensormap_and_ringbuffer/
  • Pipeline: QK matmul → softmax prepare → PV matmul → online update, with dual AIV lanes processing 8-row sub-tiles
  • Includes golden tests with 4 test cases (varying batch sizes and context lengths, bfloat16)

Test plan

  • Run simulation test: python examples/scripts/run_example.py -k examples/a2a3/tensormap_and_ringbuffer/spmd_paged_attention/kernels -g examples/a2a3/tensormap_and_ringbuffer/spmd_paged_attention/golden.py -p sim
  • Verify golden output correctness (RTOL=1e-2, ATOL=1e-2)

Implements a complete paged attention kernel using SPMD parallelism
under the tensormap_and_ringbuffer runtime. The pipeline consists of
QK matmul, softmax prepare, PV matmul, and online update stages with
dual AIV lanes processing 8-row sub-tiles each for the softmax and
accumulation phases.
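The online-update stage described above can be sketched in NumPy. This is a hedged reference for the softmax-rescaling math only, in the style of the PR's Python golden tests: the function names, block sizes, and scaling are illustrative and are not taken from the kernel code.

```python
import numpy as np

def online_attention_update(acc, m_prev, l_prev, s_block, v_block, scale=1.0):
    """One online-softmax update over a single KV block (illustrative).

    acc     : (M, D) running output accumulator
    m_prev  : (M,)   running row maxima of the scores
    l_prev  : (M,)   running softmax denominators
    s_block : (M, N) raw QK^T scores for this block
    v_block : (N, D) value tile for this block
    """
    s = s_block * scale
    m_new = np.maximum(m_prev, s.max(axis=1))
    # rescale the previous accumulator and denominator to the new maximum
    alpha = np.exp(m_prev - m_new)
    p = np.exp(s - m_new[:, None])
    l_new = alpha * l_prev + p.sum(axis=1)
    acc_new = alpha[:, None] * acc + p @ v_block
    return acc_new, m_new, l_new

def paged_attention(q, k_blocks, v_blocks):
    """Iterate over paged KV blocks, then normalize once at the end."""
    M, D = q.shape
    acc = np.zeros((M, D))
    m = np.full(M, -np.inf)
    l = np.zeros(M)
    scale = 1.0 / np.sqrt(D)
    for kb, vb in zip(k_blocks, v_blocks):
        acc, m, l = online_attention_update(acc, m, l, q @ kb.T, vb, scale)
    return acc / l[:, None]
```

Because each block only rescales the running accumulator by `exp(m_prev - m_new)`, the result is identical to computing the full softmax over all blocks at once, which is what makes the per-block SPMD pipeline possible.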

@gemini-code-assist bot left a comment


Code Review

This pull request introduces a new SPMD Paged Attention implementation, including Python golden test cases, AICore kernels for QK and PV matrix multiplications, and AIV kernels for softmax preparation and online updates. It also includes orchestration logic to manage these tasks across multiple KV block iterations, utilizing dual-vector subvector partitioning for AIV tasks. The review comments point out a potential readability issue in the AICore matmul kernels where GlobalTensor definitions use a 5-dimensional shape for what appears to be 2D data, requesting clarification on this design choice.

Comment on lines +52 to +54
using GlobalA = GlobalTensor<bfloat16_t, Shape<1, 1, 1, TM, TK>, Stride<TM * TK, TM * TK, TM * TK, TK, 1>>;
using GlobalB = GlobalTensor<bfloat16_t, Shape<1, 1, 1, TK, TN>, Stride<TK * TN, TK * TN, TK * TN, TN, 1>>;
using GlobalOut = GlobalTensor<float, Shape<1, 1, 1, TM, TN>, Stride<TM * TN, TM * TN, TM * TN, TN, 1>>;


medium

The GlobalTensor definitions use Shape<1, 1, 1, TM, TK> and Stride<TM * TK, TM * TK, TM * TK, TK, 1>. While this might be an API requirement, it implies a 5-dimensional tensor where the first three dimensions are always 1. This can reduce readability and potentially obscure the true dimensionality of the data. If the underlying data is truly 2D, consider if the API allows for a more direct 2D tensor definition, or add a comment explaining why these extra dimensions are necessary.

References
  1. In hardware kernel code, using a hardcoded literal for a memory offset is acceptable if its derivation is clearly explained in an accompanying comment. The comment suggests adding an explanation for the 5D tensor definition if it's a necessary hardware-specific layout, aligning with this rule.
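The shape/stride equivalence the comment questions can be demonstrated in NumPy (this is an illustrative sketch, not the pto API): leading singleton dimensions whose strides never combine with a non-zero index contribute nothing to the address calculation, so the 5D view addresses exactly the same elements as a plain 2D row-major tile.

```python
import numpy as np

TM, TK = 4, 8
buf = np.arange(TM * TK, dtype=np.float32)

# plain 2D row-major view: shape (TM, TK), element strides (TK, 1)
t2d = buf.reshape(TM, TK)

# the 5D view from the review comment: Shape<1, 1, 1, TM, TK> with
# element strides (TM*TK, TM*TK, TM*TK, TK, 1)
itemsize = buf.itemsize
t5d = np.lib.stride_tricks.as_strided(
    buf,
    shape=(1, 1, 1, TM, TK),
    strides=tuple(s * itemsize for s in (TM * TK, TM * TK, TM * TK, TK, 1)),
)

# the singleton dimensions only ever take index 0, so the strides chosen
# for them are irrelevant and both views see the same data
assert np.array_equal(t5d[0, 0, 0], t2d)
```

If the API requires a fixed rank (e.g. batch/head/block axes that happen to be 1 in this example), a comment stating that is enough to resolve the readability concern.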

Comment on lines +52 to +55
qk_matmul_spmd(__gm__ bfloat16_t *qi_addr, __gm__ bfloat16_t *kj_addr, __gm__ float *sij_addr) {
using GlobalA = GlobalTensor<bfloat16_t, Shape<1, 1, 1, TM, TK>, Stride<TM * TK, TM * TK, TM * TK, TK, 1>>;
using GlobalB =
GlobalTensor<bfloat16_t, Shape<1, 1, 1, TK, TN>, Stride<TK * TN, TK * TN, TK * TN, 1, TK>, Layout::DN>;


medium

Similar to aic_pv_matmul.cpp, the GlobalTensor definitions here use Shape<1, 1, 1, TM, TK> and Stride<TM * TK, TM * TK, TM * TK, TK, 1>. If the tensors are fundamentally 2D, this 5D representation can make the code harder to understand. Please clarify the reason for this structure or simplify it if the pto library allows for direct 2D tensor declarations.

References
  1. In hardware kernel code, using a hardcoded literal for a memory offset is acceptable if its derivation is clearly explained in an accompanying comment. The comment suggests clarifying the reason for the 5D tensor definition if it's a necessary hardware-specific layout, aligning with this rule.
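One detail worth noting in this second snippet: GlobalB's innermost strides are (1, TK) rather than (TK, 1), together with Layout::DN, which suggests a column-major tile (i.e. K consumed transposed for QK^T). A NumPy sketch of that stride pattern (illustrative only, not the pto API):

```python
import numpy as np

TK, TN = 8, 4
buf = np.arange(TK * TN, dtype=np.float32)

# column-major (DN-style) view: logical shape (TK, TN) with element
# strides (1, TK) -- walking down a column is contiguous in memory
b_dn = np.lib.stride_tricks.as_strided(
    buf,
    shape=(TK, TN),
    strides=(1 * buf.itemsize, TK * buf.itemsize),
)

# the same memory read as a row-major (TN, TK) buffer, then transposed
b_row_major = buf.reshape(TN, TK)
assert np.array_equal(b_dn, b_row_major.T)
```

So the 5D-vs-2D question applies here too, but the stride order itself appears intentional; a brief comment in the kernel distinguishing the layout choice from the rank choice would address both.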

