Draft: Port hw-native flash attention example to PTODSL #449

Draft
jimmychou0 wants to merge 2 commits into mouliangyu:feature-vpto-backend from jimmychou0:zjm/hw-native-fa-ptodsl

Conversation

@jimmychou0
Collaborator

Summary

  • Port the hw-native Python flash-attention example idea to the PTODSL API in this PTOAS repo
  • Document the old pto-dsl -> current ptodsl API mapping and usage flow
  • Add compile and PTOAS frontend verification coverage

Scope

  • Compile/frontend only
  • No A5/A3 runtime launch or backend lowering changes
  • Keeps the existing QK -> online softmax -> PV -> output blend dataflow
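
The QK -> online softmax -> PV -> output blend dataflow named in the Scope list can be sketched in plain Python. This is a reference-style illustration only, not the PTODSL kernel from this PR: the function names `reference_attention` and `blocked_attention` are hypothetical, the `1/sqrt(head)` scaling is an assumption, and query blocking (the `--block-q` dimension) is left out for brevity.

```python
import math


def matmul(a, b):
    """Plain list-of-lists matrix product a @ b."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def reference_attention(q, k, v):
    """Unblocked softmax(Q K^T * scale) V, used only as a check."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for row in matmul(q, transpose(k)):
        scaled = [s * scale for s in row]
        row_max = max(scaled)
        p = [math.exp(s - row_max) for s in scaled]
        total = sum(p)
        out.append([sum(pj * vj[d] for pj, vj in zip(p, v)) / total
                    for d in range(len(v[0]))])
    return out


def blocked_attention(q, k, v, block_kv):
    """Process K/V in blocks, keeping a running max, running sum, and output."""
    scale = 1.0 / math.sqrt(len(q[0]))  # assumed scaling, not taken from the PR
    n_q, dim_v = len(q), len(v[0])
    running_max = [-math.inf] * n_q
    running_sum = [0.0] * n_q
    out = [[0.0] * dim_v for _ in range(n_q)]

    for start in range(0, len(k), block_kv):
        k_blk = k[start:start + block_kv]
        v_blk = v[start:start + block_kv]
        scores = matmul(q, transpose(k_blk))            # QK
        for i in range(n_q):
            row = [s * scale for s in scores[i]]
            new_max = max(running_max[i], max(row))     # online softmax
            alpha = math.exp(running_max[i] - new_max)  # rescales old state; 0.0 on the first block
            p = [math.exp(s - new_max) for s in row]
            running_sum[i] = alpha * running_sum[i] + sum(p)
            pv = [sum(pj * vj[d] for pj, vj in zip(p, v_blk))  # PV
                  for d in range(dim_v)]
            out[i] = [alpha * o + x for o, x in zip(out[i], pv)]  # output blend
            running_max[i] = new_max

    # Final normalization by the running sum.
    return [[o / running_sum[i] for o in out[i]] for i in range(n_q)]
```

Calling `blocked_attention(q, k, v, block_kv=2)` on small inputs and comparing against `reference_attention(q, k, v)` is a quick way to confirm that the blocked update reproduces the unblocked result up to floating-point error.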

Validation

  • Windows: python -m py_compile for the updated example/tests
  • Windows: git diff --check
  • dev-481211: python3 ptodsl/tests/test_flash_attention_demo_compile.py
  • dev-481211: python3 ptodsl/tests/test_flash_attention_frontend_verify.py
  • dev-481211: python3 ptodsl/examples/flash_attention_sketch.py --block-q 64 --block-kv 128 -o /tmp/hwfa_ptodsl.mlir && ptoas /tmp/hwfa_ptodsl.mlir --emit-pto-ir -o /tmp/hwfa_ptodsl.pto

@jimmychou0 force-pushed the zjm/hw-native-fa-ptodsl branch 10 times, most recently from 23a5955 to 8a19a6e on May 29, 2026 02:34
if CAUSAL:
    raise ValueError("causal masking is not part of the hw-native source port yet")

c0 = pto.const(0)
Collaborator

PTODSL now accepts literals directly, so there is no need to build constants with pto.const.


qk_slot_ptr = gm_slot_buffer
pv_slot_ptr = pto.addptr(gm_slot_buffer, gm_pv_off_f32)
p_slot_ptr = pto.addptr(gm_slot_buffer_fp16, gm_p_off_f32 * 2)
Collaborator

make_tensor_view accepts an offset argument, so there is no need to construct the pointers here.

Collaborator

Correction: use partition_view + offset to construct the slice instead, which fits the PTO style better.
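
The `gm_p_off_f32 * 2` factor in the snippet above is what one would expect if the offset is counted in f32 elements and the fp16 pointer aliases the same buffer. Neither assumption is confirmed in this thread. Under those assumptions, the arithmetic can be checked with plain byte views. The helper name is hypothetical, and `uint16` stands in for f16 because only the 2-byte element width matters here.

```python
import struct

F32_BYTES = 4
F16_BYTES = 2


def f16_index_from_f32_index(f32_index):
    """Convert an element offset counted in f32 units to f16 units (hypothetical helper)."""
    return f32_index * (F32_BYTES // F16_BYTES)


# One backing buffer viewed with two element widths.
buf = bytearray(64)
as_f32 = memoryview(buf).cast("f")  # 4-byte elements
as_u16 = memoryview(buf).cast("H")  # 2-byte elements, stand-in for f16

f32_off = 5
u16_off = f16_index_from_f32_index(f32_off)

# Writing through the 2-byte view at the scaled index lands on the same
# byte address as element f32_off of the 4-byte view.
as_u16[u16_off] = 0x3C00  # bit pattern of 1.0 in IEEE half precision
byte_off = f32_off * F32_BYTES
```

A slice-style view that carries its own element type and offset avoids this manual scaling, which is one reason the reviewer's suggestion may read more cleanly.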

blayout="RowMajor",
slayout="ColMajor",
)
k_right = pto.alloc_tile(shape=[HEAD, CUBE_S1], dtype=pto.f16, memory_space=pto.MemorySpace.RIGHT)
Collaborator

MemorySpace can be given as either a string or an Enum; the string form is more concise.
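
As a generic illustration of an API that takes either form, the sketch below normalizes a string or an Enum member to a single Enum value. It is not PTODSL's implementation: the `MemorySpace` class here is a hypothetical stand-in, `LEFT` is an assumed member name (only `RIGHT` appears in this PR), and `normalize_memory_space` is an invented helper.

```python
from enum import Enum


class MemorySpace(Enum):
    """Hypothetical stand-in, not the real pto.MemorySpace."""
    LEFT = "LEFT"    # assumed member, for illustration only
    RIGHT = "RIGHT"


def normalize_memory_space(value):
    """Accept a MemorySpace member or its name as a string."""
    if isinstance(value, MemorySpace):
        return value
    if isinstance(value, str):
        try:
            return MemorySpace[value.upper()]
        except KeyError:
            raise ValueError(f"unknown memory space: {value!r}") from None
    raise TypeError(f"expected str or MemorySpace, got {type(value).__name__}")
```

With a helper like this, a call site could pass `"RIGHT"` instead of the longer Enum spelling and still be validated against the known members.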

)
for static_row_slice in range(row_slice_count):
    with pto.if_(row_slice == pto.const(static_row_slice)) as br:
        with br.then_:
Collaborator

The current branch syntax is still too painful. We will think about whether it can be improved later.

}


def emit_flash_attention_mlir(
Collaborator

The function names could be renamed consistently. They do not need to be called emit_xxx; names that simply describe the computation are enough.



@pto.simd
def finalize_and_store_output(o_tile: pto.Tile, running_sum: pto.Tile):
Collaborator

This does not actually need to be wrapped in pto.simd.


def emit_softmax(tile_id, ring_id, is_init):
    slot_id = tile_id % cSLOT_NUM
    with pto.for_(0, row_slice_count, step=1) as row_slice:
Collaborator

This softmax is computed separately for each row, which I do not think is very efficient. We will replace it with another version later.

@jimmychou0 force-pushed the zjm/hw-native-fa-ptodsl branch 7 times, most recently from 1ff1dd8 to 75bd970 on May 30, 2026 02:27
@jimmychou0 force-pushed the zjm/hw-native-fa-ptodsl branch from 75bd970 to 74d0d1c on May 30, 2026 08:28

2 participants