
Improve source-location granularity for ROCprof ATT traces #587

@jhinpan

First, a shout-out to the FlyDSL team for their work.

We are looking at ROCprof ATT source mapping for a FlashAttention kernel. Debug
info is present in the capture, but the line attribution is still too coarse to
navigate the ping-pong / double-buffer schedule from ROCprof Compute Viewer.

References

What we confirmed

  • The trace was captured from a fresh FLYDSL_RUNTIME_CACHE_DIR.
  • FLYDSL_DEBUG_ENABLE_DEBUG_INFO=1 was set before discovery and ATT capture.
  • The captured code object has non-empty DWARF line tables.
  • code.json maps 2069 / 2070 ISA rows to Python source.

So this does not look like the earlier "missing debug info because a no-debug
HSACO was cached first" failure mode.
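
For anyone reproducing the setup above, here is a minimal sketch of building the environment before any FlyDSL import or capture runs. Only the two environment variable names come from this issue; the helper name `fresh_debug_env` and the temp-directory prefix are illustrative, and the actual capture command is deliberately left out.

```python
import os
import tempfile
from typing import Dict, Optional


def fresh_debug_env(base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with a new, empty cache dir and debug info on.

    Hypothetical helper: it only prepares variables for a child process and
    does not import or call FlyDSL itself.
    """
    env = dict(os.environ if base is None else base)
    # A brand-new directory guarantees no previously cached no-debug HSACO is reused.
    env["FLYDSL_RUNTIME_CACHE_DIR"] = tempfile.mkdtemp(prefix="flydsl-cache-")
    # Must be set before discovery and ATT capture start.
    env["FLYDSL_DEBUG_ENABLE_DEBUG_INFO"] = "1"
    return env


if __name__ == "__main__":
    env = fresh_debug_env()
    # Pass `env` to the process that runs discovery and ATT capture,
    # e.g. subprocess.run(<capture command>, env=env).
    print(env["FLYDSL_RUNTIME_CACHE_DIR"])
```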

Observed behavior

The Python source clearly expresses the schedule we want to inspect:

However, many hot instructions in code.json are attributed only to broad
wrapper or kernel-body locations, for example:

That makes it difficult to jump from ROCprof back to the actual scheduling
sites around coop_dma_k, _next_k_buf_id, coop_dma_v, wait/barrier
placement, LDS reads, and GEMM2 V pre-read.
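
One way to quantify how coarse the attribution is would be to count how many ISA rows share each source location. The sketch below assumes the rows have already been extracted into `(instruction, "file:line")` pairs. It does not assume anything about the real code.json layout, and the sample rows are placeholders, not data from our trace.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def location_histogram(rows: Iterable[Tuple[str, str]]) -> List[Tuple[str, int]]:
    """Count ISA rows per source location, most heavily shared locations first."""
    counts = Counter(location for _, location in rows)
    return counts.most_common()


if __name__ == "__main__":
    # Placeholder rows for illustration only.
    sample_rows = [
        ("op_a", "example_kernel.py:10"),
        ("op_b", "example_kernel.py:10"),
        ("op_c", "example_kernel.py:10"),
        ("op_d", "example_kernel.py:42"),
    ]
    for location, count in location_histogram(sample_rows):
        print(f"{count:5d}  {location}")
```

A few locations absorbing most of the rows would be the signature of the coarse mapping described above.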

Questions

  1. For helpers that eventually emit DMA-to-LDS loads, s_waitcnt, barriers,
    LDS reads, or MFMA calls, should the public wrapper capture the user call
    site once and pass loc down to the leaf ODS / inline-asm emitters?
  2. If a loc is already provided, should nested helper wrappers avoid calling
    _caller_location() again, so they do not remap the op to helper-internal
    lines?
  3. For multi-op helpers, is it better to attribute all emitted ops to the outer
    user call site, or would FlyDSL prefer a more structured way to annotate
    sub-ops within the helper?
  4. Are there specific wrappers in the DMA / LDS / MFMA path where maintainers
    would welcome targeted source-location propagation patches?
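
To make questions 1 and 2 concrete, here is a minimal, self-contained sketch of the pattern we have in mind: the public wrapper captures the user call site once, and nested helpers reuse a provided `loc` instead of capturing again. Everything in it (`Loc`, `caller_location`, `emit_op`, and the `demo_*` helpers) is a hypothetical stand-in written for this issue, not FlyDSL's actual API or its `_caller_location()` implementation.

```python
import sys
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Loc:
    """Hypothetical stand-in for a source location attached to an emitted op."""
    filename: str
    line: int


# Records (op_name, loc) pairs in place of a real IR builder.
EMITTED: List[Tuple[str, Loc]] = []


def caller_location() -> Loc:
    """Return the location of whoever called the function that called us."""
    # Frame 0 is this function, frame 1 is the wrapper, frame 2 is the wrapper's caller.
    frame = sys._getframe(2)
    return Loc(frame.f_code.co_filename, frame.f_lineno)


def emit_op(name: str, loc: Loc) -> None:
    """Leaf emitter: it never captures a location, it only uses the one passed in."""
    EMITTED.append((name, loc))


def demo_wait_and_barrier(loc: Optional[Loc] = None) -> None:
    """Nested helper: captures a location only when none was provided."""
    if loc is None:
        loc = caller_location()
    emit_op("s_waitcnt", loc)
    emit_op("s_barrier", loc)


def demo_coop_dma(loc: Optional[Loc] = None) -> None:
    """Public wrapper: captures the user call site once and passes it down."""
    if loc is None:
        loc = caller_location()
    emit_op("dma_to_lds", loc)
    # Forwarding loc keeps the nested ops from being remapped to this helper's lines.
    demo_wait_and_barrier(loc=loc)


def user_kernel() -> int:
    """Stand-in for user code; returns the line number of the demo_coop_dma call."""
    call_line = sys._getframe(0).f_lineno + 1
    demo_coop_dma()
    return call_line


if __name__ == "__main__":
    user_kernel()
    for name, loc in EMITTED:
        print(f"{name:12s} -> line {loc.line}")
```

In this sketch all three ops end up attributed to the single `demo_coop_dma()` line in `user_kernel`. Question 3 is whether that is the desired outcome for multi-op helpers, or whether sub-ops should carry something more structured.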

We are not asking to add @traced_op everywhere blindly. We would like to
improve ROCprof source navigation for double-buffer scheduling while avoiding
misleading mappings or behavior changes from eager argument unwrapping. Many thanks!
