Improve source-location granularity for ROCprof ATT traces

First, a shout-out to the FlyDSL team for their work.

We are looking at ROCprof ATT source mapping for a FlashAttention kernel. Debug
info is present in the capture, but the line attribution is still too coarse to
navigate the ping-pong / double-buffer schedule from ROCprof Compute Viewer.

## References

- Trace bundle: https://github.com/jhinpan/flydsl-kernel-profiling/tree/21f15497bb4a00c81fcb4283e0442977d98a4f71/examples/flash_attn_func
- Analysis report: https://github.com/jhinpan/flydsl-kernel-profiling/blob/21f15497bb4a00c81fcb4283e0442977d98a4f71/examples/flash_attn_func/REPORT.md
- Big ATT `code.json`: https://github.com/jhinpan/flydsl-kernel-profiling/blob/21f15497bb4a00c81fcb4283e0442977d98a4f71/examples/flash_attn_func/att_viewer/big/ui_output_agent_38430_dispatch_44/code.json
- Captured source file: https://github.com/jhinpan/flydsl-kernel-profiling/blob/21f15497bb4a00c81fcb4283e0442977d98a4f71/examples/flash_attn_func/att_viewer/big/ui_output_agent_38430_dispatch_44/source_0_flash_attn_func.py
- Related source-map change we noticed: https://github.com/ROCm/FlyDSL/commit/9f29c0de2ecd6c83061fe79bdc9b39168fae8593

## What we confirmed

- The trace was captured from a fresh `FLYDSL_RUNTIME_CACHE_DIR`.
- `FLYDSL_DEBUG_ENABLE_DEBUG_INFO=1` was set before discovery and ATT capture.
- The captured code object has non-empty DWARF line tables.
- `code.json` maps 2069 / 2070 ISA rows to Python source.

So this does not look like the earlier "missing debug info because a no-debug
HSACO was cached first" failure mode.
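For anyone reproducing the capture, the setup above can be sketched roughly as follows. The two environment variable names are the ones listed above; the helper name `fresh_debug_env` and the idea of launching the capture through `subprocess` are our own illustration, not part of FlyDSL or ROCprof.

```python
import os
import tempfile


def fresh_debug_env(base_env=None):
    """Return a copy of the environment with an empty cache dir and debug info on.

    Hypothetical helper: it only sets the two variables named in this issue.
    """
    env = dict(os.environ if base_env is None else base_env)
    # A brand-new directory, so no previously cached no-debug HSACO can be reused.
    env["FLYDSL_RUNTIME_CACHE_DIR"] = tempfile.mkdtemp(prefix="flydsl_cache_")
    # Must be set before both kernel discovery and the ATT capture run.
    env["FLYDSL_DEBUG_ENABLE_DEBUG_INFO"] = "1"
    return env
```

The same `env` would then be passed to both the discovery run and the capture run (for example via `subprocess.run(cmd, env=env)`), so that neither step compiles or caches a code object without debug info.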

## Observed behavior

The Python source clearly expresses the schedule we want to inspect:

- K alternates between current/next LDS buffers:
  https://github.com/jhinpan/flydsl-kernel-profiling/blob/21f15497bb4a00c81fcb4283e0442977d98a4f71/examples/flash_attn_func/att_viewer/big/ui_output_agent_38430_dispatch_44/source_0_flash_attn_func.py#L643-L698
- V is waited on / made visible before GEMM2:
  https://github.com/jhinpan/flydsl-kernel-profiling/blob/21f15497bb4a00c81fcb4283e0442977d98a4f71/examples/flash_attn_func/att_viewer/big/ui_output_agent_38430_dispatch_44/source_0_flash_attn_func.py#L970-L980
- GEMM2 pre-reads V packs before issuing MFMA:
  https://github.com/jhinpan/flydsl-kernel-profiling/blob/21f15497bb4a00c81fcb4283e0442977d98a4f71/examples/flash_attn_func/att_viewer/big/ui_output_agent_38430_dispatch_44/source_0_flash_attn_func.py#L1067-L1112

However, many hot instructions in `code.json` are attributed only to broad
wrapper or kernel-body locations, for example:

- `flash_attn_func.py:257`, the kernel body/decorator area:
  https://github.com/jhinpan/flydsl-kernel-profiling/blob/21f15497bb4a00c81fcb4283e0442977d98a4f71/examples/flash_attn_func/att_viewer/big/ui_output_agent_38430_dispatch_44/source_0_flash_attn_func.py#L257
- `flash_attn_func.py:283`, the `_mfma` helper:
  https://github.com/jhinpan/flydsl-kernel-profiling/blob/21f15497bb4a00c81fcb4283e0442977d98a4f71/examples/flash_attn_func/att_viewer/big/ui_output_agent_38430_dispatch_44/source_0_flash_attn_func.py#L282-L283

That makes it difficult to jump from ROCprof back to the actual scheduling
sites around `coop_dma_k`, `_next_k_buf_id`, `coop_dma_v`, wait/barrier
placement, LDS reads, and GEMM2 V pre-read.
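To make "too coarse" measurable, a small sketch like the one below can help. It assumes the per-instruction source locations have already been extracted from `code.json` into a list of `file:line` strings, with `None` for unmapped rows; the JSON layout itself is not reproduced here, and the function names are hypothetical.

```python
from collections import Counter


def coverage(locations):
    """Return (mapped_rows, total_rows); None marks an unmapped ISA row."""
    mapped = sum(1 for loc in locations if loc is not None)
    return mapped, len(locations)


def attribution_histogram(locations):
    """Count ISA rows per 'file:line' string, most frequent first."""
    counts = Counter(loc for loc in locations if loc is not None)
    return counts.most_common()


def top_share(locations, n=2):
    """Fraction of mapped rows that land on the n most frequent source lines."""
    hist = attribution_histogram(locations)
    mapped = sum(count for _, count in hist)
    if mapped == 0:
        return 0.0
    return sum(count for _, count in hist[:n]) / mapped
```

High coverage combined with a high `top_share` is the pattern described above: nearly every row has a source line, but most rows point at the same few wrapper or kernel-body lines.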

## Questions

1. For helpers that eventually emit DMA-to-LDS loads, `s_waitcnt`, barriers,
   LDS reads, or MFMA calls, should the public wrapper capture the user call
   site once and pass `loc` down to the leaf ODS / inline-asm emitters?
2. If a `loc` is already provided, should nested helper wrappers avoid calling
   `_caller_location()` again, so they do not remap the op to helper-internal
   lines?
3. For multi-op helpers, is it better to attribute all emitted ops to the outer
   user call site, or would FlyDSL prefer a more structured way to annotate
   sub-ops within the helper?
4. Are there specific wrappers in the DMA / LDS / MFMA path where maintainers
   would welcome targeted source-location propagation patches?
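To make questions 1 and 2 concrete, here is a minimal, self-contained sketch of the pattern we have in mind. Everything in it is hypothetical: `Loc`, `caller_location`, `emit_leaf`, `lds_read`, `mfma`, and `gemm2_step` are stand-ins written for this issue, not FlyDSL APIs, and `EMITTED` stands in for the real op builder.

```python
import sys
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Loc:
    filename: str
    line: int


def caller_location() -> Loc:
    """Location of whoever called the helper that invoked this function."""
    # Frame 0 is caller_location, 1 is the helper, 2 is the helper's caller.
    frame = sys._getframe(2)
    return Loc(frame.f_code.co_filename, frame.f_lineno)


EMITTED = []  # (op_name, Loc) pairs; stand-in for real op emission


def emit_leaf(op_name: str, loc: Loc) -> None:
    EMITTED.append((op_name, loc))


def lds_read(loc: Optional[Loc] = None) -> None:
    # Nested helper: capture only when no location was handed down.
    if loc is None:
        loc = caller_location()
    emit_leaf("ds_read", loc)


def mfma(loc: Optional[Loc] = None) -> None:
    if loc is None:
        loc = caller_location()
    emit_leaf("mfma", loc)


def gemm2_step(loc: Optional[Loc] = None) -> None:
    # Public wrapper: capture the user call site once, then pass it down
    # so the leaf ops are not remapped to lines inside this helper.
    if loc is None:
        loc = caller_location()
    lds_read(loc=loc)
    mfma(loc=loc)
```

With this shape, a single `gemm2_step()` call in user code yields two ops that both carry the user's line, while a direct `lds_read()` call still resolves to its own call site. Question 3 is about whether this "everything maps to the outer call" behavior is what FlyDSL wants for multi-op helpers.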

We are not asking to add `@traced_op` everywhere blindly. We would like to
improve ROCprof source navigation for double-buffer scheduling while avoiding
misleading mappings or behavior changes from eager argument unwrapping. Many
thanks!
