First, a shout-out to the FlyDSL team for their work.
We are looking at ROCprof ATT source mapping for a FlashAttention kernel. Debug
info is present in the capture, but the line attribution is still too coarse to
navigate the ping-pong / double-buffer schedule from ROCprof Compute Viewer.
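References
- code.json: https://github.com/jhinpan/flydsl-kernel-profiling/blob/21f15497bb4a00c81fcb4283e0442977d98a4f71/examples/flash_attn_func/att_viewer/big/ui_output_agent_38430_dispatch_44/code.json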
What we confirmed
- The trace was captured from a fresh FLYDSL_RUNTIME_CACHE_DIR.
- FLYDSL_DEBUG_ENABLE_DEBUG_INFO=1 was set before discovery and ATT capture.
- The captured code object has non-empty DWARF line tables.
- code.json maps 2069 / 2070 ISA rows to Python source.
So this does not look like the earlier "missing debug info because a no-debug
HSACO was cached first" failure mode.
Observed behavior
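The Python source clearly expresses the schedule we want to inspect:
- https://github.com/jhinpan/flydsl-kernel-profiling/blob/21f15497bb4a00c81fcb4283e0442977d98a4f71/examples/flash_attn_func/att_viewer/big/ui_output_agent_38430_dispatch_44/source_0_flash_attn_func.py#L643-L698
- https://github.com/jhinpan/flydsl-kernel-profiling/blob/21f15497bb4a00c81fcb4283e0442977d98a4f71/examples/flash_attn_func/att_viewer/big/ui_output_agent_38430_dispatch_44/source_0_flash_attn_func.py#L970-L980
- https://github.com/jhinpan/flydsl-kernel-profiling/blob/21f15497bb4a00c81fcb4283e0442977d98a4f71/examples/flash_attn_func/att_viewer/big/ui_output_agent_38430_dispatch_44/source_0_flash_attn_func.py#L1067-L1112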
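However, many hot instructions in code.json are attributed only to broad
wrapper or kernel-body locations, for example:
- flash_attn_func.py:257, the kernel body/decorator area: https://github.com/jhinpan/flydsl-kernel-profiling/blob/21f15497bb4a00c81fcb4283e0442977d98a4f71/examples/flash_attn_func/att_viewer/big/ui_output_agent_38430_dispatch_44/source_0_flash_attn_func.py#L257
- flash_attn_func.py:283, the _mfma helper: https://github.com/jhinpan/flydsl-kernel-profiling/blob/21f15497bb4a00c81fcb4283e0442977d98a4f71/examples/flash_attn_func/att_viewer/big/ui_output_agent_38430_dispatch_44/source_0_flash_attn_func.py#L282-L283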
That makes it difficult to jump from ROCprof back to the actual scheduling
sites around coop_dma_k, _next_k_buf_id, coop_dma_v, wait/barrier
placement, LDS reads, and GEMM2 V pre-read.
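One way to quantify how coarse the attribution is would be to tally ISA rows per source location. The sketch below is only illustrative: it assumes the per-row "file:line" strings have already been extracted from code.json (we are not asserting anything about the code.json schema here), and attribution_histogram and the toy input are hypothetical, not real capture data.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def attribution_histogram(locations: Iterable[str], top: int = 5) -> List[Tuple[str, int]]:
    """Return the `top` most frequent source locations with their row counts.

    `locations` holds one "file:line" string per ISA row; empty strings
    (rows with no attribution) are skipped.
    """
    counts = Counter(loc for loc in locations if loc)
    return counts.most_common(top)


# Toy input for illustration only; these are not rows from the real capture.
toy_rows = (
    ["flash_attn_func.py:257"] * 3
    + ["flash_attn_func.py:283"] * 2
    + ["flash_attn_func.py:650", ""]
)
histogram = attribution_histogram(toy_rows, top=2)
```

If a handful of wrapper or kernel-body lines dominate such a histogram, that matches what we see when navigating from the viewer.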
Questions
- For helpers that eventually emit DMA-to-LDS loads, s_waitcnt, barriers,
LDS reads, or MFMA calls, should the public wrapper capture the user call
site once and pass loc down to the leaf ODS / inline-asm emitters?
- If a loc is already provided, should nested helper wrappers avoid calling
_caller_location() again, so they do not remap the op to helper-internal
lines?
- For multi-op helpers, is it better to attribute all emitted ops to the outer
user call site, or would FlyDSL prefer a more structured way to annotate
sub-ops within the helper?
- Are there specific wrappers in the DMA / LDS / MFMA path where maintainers
would welcome targeted source-location propagation patches?
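To make the first three questions concrete, here is a minimal sketch of the propagation pattern we have in mind. It is plain Python and does not use any FlyDSL API: Loc, caller_location, emit_leaf, coop_dma, and wait_and_barrier are all hypothetical stand-ins. caller_location only plays the role of _caller_location(), whose real signature we are not assuming, and the optional tag field is just one possible way to label sub-ops.

```python
import sys
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Loc:
    """Hypothetical stand-in for a file/line source location."""
    filename: str
    line: int
    tag: str = ""  # optional sub-op label, e.g. "wait" or "barrier"


def caller_location(depth: int) -> Loc:
    """Stand-in for _caller_location(): the frame `depth` levels above this one.

    Uses sys._getframe, which is CPython-specific.
    """
    frame = sys._getframe(depth)
    return Loc(frame.f_code.co_filename, frame.f_lineno)


# Records (op_name, Loc) pairs in place of emitting real IR.
EMITTED: List[Tuple[str, Loc]] = []


def emit_leaf(op_name: str, loc: Loc) -> None:
    """Hypothetical leaf emitter: always uses the loc it is given."""
    EMITTED.append((op_name, loc))


def wait_and_barrier(loc: Optional[Loc] = None) -> None:
    """Nested helper: captures a location only if none was passed in."""
    if loc is None:
        loc = caller_location(depth=2)  # frame 0: caller_location, 1: this helper, 2: its caller
    emit_leaf("s_waitcnt", replace(loc, tag="wait"))
    emit_leaf("s_barrier", replace(loc, tag="barrier"))


def coop_dma(loc: Optional[Loc] = None) -> None:
    """Public wrapper: captures the user call site once and passes it down."""
    if loc is None:
        loc = caller_location(depth=2)
    emit_leaf("dma_to_lds", replace(loc, tag="dma"))
    wait_and_barrier(loc=loc)  # no second capture, so no remap to this line
```

With this shape, all three ops emitted by one coop_dma() call share the user's file and line, and the tag keeps them distinguishable. Whether a tag like this could be represented in the emitted debug info is an open question on our side.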
We are not asking to add @traced_op everywhere blindly. We would like to
improve ROCprof source navigation for double-buffer scheduling while avoiding
misleading mappings or behavior changes from eager argument unwrapping. Many thanks!