Propose adding L0 (intra-core) swimlane profiling on a5 onboard (hardware) via a self-managed path: AICPU directly programs the AICore perfmon hardware registers and points the writeback target buffer at our own per-core GM buffer; drain reuses the existing PMU/L2 ProfilerBase buffer→host pipeline.
Scope: a5 onboard only. Sim path is out of scope for this issue.
Motivation / Use Case
A separately prototyped alternative built on the host driver's biu_perf / msprof channel pipeline (HDC consumer ring) was investigated and shown to have structural limits that make it unsuitable as the on-tree L0 path:
Channel-count cap: biu_perf channels only cover 6 physical cores ({0, 9, 17, 18, 27, 35}). Tasks scheduled to other cores silently produce zero L0 data.
Driver-paced delivery: device→host fill cadence is HDC batch-flush, not streaming. Symptoms: "first marker grabs all, subsequent markers return 0"; same scene-test shows ~10× variance in record count run-to-run.
~60s prof_stop teardown (18 channels × ~3-4s each) on every run.
Host-side task-window matching required: channel data is not synced to our task lifecycle.
Knobs don't help: prof_start_para.real_time is coarse-grain only; sample_period requires a software sample_func (mark_stamp is hardware-driven); halProfDataFlush returns DRV_ERROR_NOT_SUPPORT for biu_perf channels.
By bypassing the driver/msprof pipeline and programming perfmon ourselves, the data rhythm becomes our rhythm: every covered AICore can be enabled (no 6-core cap), the buffer fills as the kernel runs, drain happens on our task boundary, and the slow prof_stop handshake disappears.
Proposed API / Behavior
Mirror the existing PMU collector architecture (pmu_collector_aicore.h, on tree), but replace the per-task software counter readout with a hardware-DMA writeback path.
AICPU init — via the existing write_reg(reg_base, …) facility. 0xB000 is within the 3MB per-core AICore MMIO window already mapped by halResMap(RES_AICORE) (same window PMU at 0x4200 uses), so no new mapping is needed:
| Register | Offset | Action |
|---|---|---|
| perf_mon_base_addr_l | 0xB00C | low 32 bits of per-core GM buffer device address |
| perf_mon_base_addr_h | 0xB010 | high 16 bits |
| perf_mon_buf_len | TBD | per-core buffer length |
| perf_mon_samp_crt_clr | 0xB024 bit0 | write 1 to clear |
| perf_mon_samp_wrt | 0xB028 | write 0 to clear |
| perf_mon_global_en | 0xB000 bit0 | write 1 to enable (last) |
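The init sequence in the register list above could be sketched roughly as follows. This is only an illustration under stated assumptions: `RegWriter` is a hypothetical stand-in for the on-tree `write_reg(reg_base, …)` facility (whose real signature is not shown in this issue), `l0_perfmon_init` is a made-up name, whole-register writes are assumed to be acceptable for the bit0 fields, and the `perf_mon_buf_len` offset is left as a caller-supplied parameter because it is still TBD.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Offsets taken from the register list in this proposal.
constexpr uint32_t kPerfMonGlobalEn   = 0xB000;  // bit0: write 1 to enable
constexpr uint32_t kPerfMonBaseAddrL  = 0xB00C;  // low 32 bits of buffer address
constexpr uint32_t kPerfMonBaseAddrH  = 0xB010;  // high 16 bits of buffer address
constexpr uint32_t kPerfMonSampCrtClr = 0xB024;  // bit0: write 1 to clear
constexpr uint32_t kPerfMonSampWrt    = 0xB028;  // write 0 to clear

// Hypothetical stand-in for write_reg(reg_base, ...): any callable that
// writes `value` at `offset` inside one core's MMIO window.
using RegWriter = std::function<void(uint32_t offset, uint32_t value)>;

// Programs one core's perfmon block in the order listed in the proposal.
// Returns false, without touching any register, if the device address does
// not fit in the 32 + 16 address bits the two base registers provide.
inline bool l0_perfmon_init(const RegWriter& write_reg,
                            uint64_t gm_buf_dev_addr,
                            uint32_t gm_buf_len,
                            uint32_t buf_len_offset /* TBD in the proposal */) {
    if ((gm_buf_dev_addr >> 48) != 0) {
        return false;
    }
    write_reg(kPerfMonBaseAddrL,
              static_cast<uint32_t>(gm_buf_dev_addr & 0xFFFFFFFFu));
    write_reg(kPerfMonBaseAddrH,
              static_cast<uint32_t>((gm_buf_dev_addr >> 32) & 0xFFFFu));
    write_reg(buf_len_offset, gm_buf_len);
    // ASSUMPTION: whole-register writes are fine for the bit0 fields; a
    // read-modify-write may be needed if other bits turn out to matter.
    write_reg(kPerfMonSampCrtClr, 1u);
    write_reg(kPerfMonSampWrt, 0u);
    write_reg(kPerfMonGlobalEn, 1u);  // enable last
    return true;
}
```

Passing the writer as a callable keeps the sequence testable against a fake register file before it is pointed at real MMIO.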
Runtime drain — on AICPU, on task boundary / COND FIN (sibling to L2/PMU drain):
Read perf_mon_wptr_o (0xB01C) / perf_mon_samp_wrt (0xB028) to learn bytes written.
Push the populated buffer slice through the existing ProfilerBase mgmt_thread → collector_thread pipeline (same path PMU/L2 already use). No additional thread; no prof_channel_read call.
Reset counters and continue.
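A minimal sketch of the three drain steps above, under assumptions that the prototype still has to confirm: the buffer is treated as linear (not a ring), `perf_mon_wptr_o` is assumed to report a byte offset relative to the programmed base address, and "reset counters" is assumed to mean repeating the two clear writes from init. `RegReader`, `RegWriter`, `SliceSink` and `l0_drain_linear` are hypothetical names, with `SliceSink` standing in for the hand-off to the existing ProfilerBase pipeline. Cache invalidation before the read is not modeled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

constexpr uint32_t kPerfMonWptrO      = 0xB01C;  // hardware write pointer
constexpr uint32_t kPerfMonSampCrtClr = 0xB024;  // bit0: write 1 to clear
constexpr uint32_t kPerfMonSampWrt    = 0xB028;  // write 0 to clear

// Hypothetical stand-ins for the on-tree register and pipeline facilities.
using RegReader = std::function<uint32_t(uint32_t offset)>;
using RegWriter = std::function<void(uint32_t offset, uint32_t value)>;
using SliceSink = std::function<void(const uint8_t* data, size_t len)>;

// Drains one core's buffer at a task boundary and returns the number of
// bytes handed to the sink.
inline size_t l0_drain_linear(const RegReader& read_reg,
                              const RegWriter& write_reg,
                              const uint8_t* buf, size_t buf_len,
                              const SliceSink& push_slice) {
    // ASSUMPTION: wptr_o is a byte offset from base_addr in a linear buffer.
    // Clamp to the buffer length so a surprising value cannot over-read.
    const size_t written = std::min<size_t>(read_reg(kPerfMonWptrO), buf_len);
    if (written > 0) {
        push_slice(buf, written);
    }
    // ASSUMPTION: the reset model matches the init-time clear sequence.
    write_reg(kPerfMonSampCrtClr, 1u);
    write_reg(kPerfMonSampWrt, 0u);
    return written;
}
```

If the hardware turns out to use a ring, only the computation of `written` and the slice hand-off would need to change; the surrounding call shape could stay the same.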
Decode — host side:
Start with msprof's biu_perf 4-byte chunk format (per biu_perf_bean.py / biu_perf_chip6_parser.py) as the starting hypothesis.
Empirically validate against raw bytes captured from our own buffer. If the format matches, implement the decoder against our own record structs (do not re-export the msprof interface). If not, redefine the layout from observed bytes.
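For the empirical validation step, a first pass might only split the captured bytes into 4-byte chunks without interpreting any fields, so the raw words can be compared against what the msprof parsers expect. The sketch below assumes little-endian chunk order, which is itself part of the hypothesis to check; `ChunkView` and `split_into_4byte_chunks` are illustrative names, not existing interfaces.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct ChunkView {
    std::vector<uint32_t> chunks;  // one entry per complete 4-byte chunk
    size_t trailing_bytes = 0;     // leftover bytes that do not fill a chunk
};

// Splits a raw capture into 4-byte words. No field layout is assumed here;
// only the chunk size and (ASSUMPTION) little-endian byte order.
inline ChunkView split_into_4byte_chunks(const uint8_t* data, size_t len) {
    ChunkView out;
    out.chunks.reserve(len / 4);
    for (size_t i = 0; i + 4 <= len; i += 4) {
        const uint32_t word = static_cast<uint32_t>(data[i]) |
                              (static_cast<uint32_t>(data[i + 1]) << 8) |
                              (static_cast<uint32_t>(data[i + 2]) << 16) |
                              (static_cast<uint32_t>(data[i + 3]) << 24);
        out.chunks.push_back(word);
    }
    out.trailing_bytes = len % 4;
    return out;
}
```

A nonzero `trailing_bytes` on a complete drain would be an early hint that the 4-byte hypothesis, or the bytes-written computation, needs revisiting.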
Teardown: clear global_en, drain remainder, free per-core buffers. No prof_stop handshake → ~60s saved per run.
Alternatives Considered
Drive L0 via biu_perf channels (the prototyped alternative) — see Motivation. Doesn't meet the on-tree need due to channel cap, HDC jitter, and ~60s teardown.
Wait for a streaming driver API — out of our control. Driver investigation (below) confirms the open-sourced driver tree only transports opaque bytes; any streaming-API change would have to come from TS firmware (closed).
Use the existing PMU collector for pipe-utilization — PMU samples 10 counters per task: a useful but different signal from mark_stamp instruction trace. PMU stays; this issue is about the L0 trace path.
Additional Context
Driver investigation. A targeted search of the open-sourced CANN driver tree (host HAL + TS agent in src/sdk_driver/ts_agent/) confirms it is not TS firmware. The perfmon register map (0xB000-0xB028), the register programming sequence, and any binary record/chunk decoder are not in the open-sourced driver tree — they live in TS firmware. The driver itself never parses AICore trace bytes; prof_buff.c / prof_hdc.c do raw memcpy_s only — no struct casts, no bitfield extraction, no sentinels. ABI structs (ts_ai_core_profile_config_t, tsPCTrace_task_t) are header-only and unused by any .c.
One useful datum the driver does give us. Device-side SQE addresses are SMMU-translated virtual addresses tagged with streamid + substreamid (PASID), with ADDR_UNIFIED/ADDR_INDEPENDENT modes. This makes it likely that perf_mon_base_addr is also a device VA under SMMU (not a raw physical address), which is encouraging for pointing it at a GM buffer. Residual risk: the BIU/DFX engine may have its own streamid; if so, our GM buffer needs a mapping under that stream's context.
Open feasibility questions to resolve before / during prototyping — will be answered empirically by capturing raw bytes from our own buffer:
base_addr address type (physical vs SMMU-VA; same stream/PASID as kernel writes?)
On-wire chunk format vs msprof's 4-byte chunk hypothesis
Ring vs linear; wptr_o wrap semantics; samp_wrt reset model
Cache coherency / invalidate model on AICPU read (sibling to rmb() model in L2/PMU)
Ownership conflict — must ensure driver instr-profiling is fully off so TS firmware does not race us on base_addr
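To make the ring-vs-linear question concrete, the sketch below shows what the drain would have to compute if the buffer turns out to be a ring. It is purely a hypothesis aid: it assumes both pointers are byte offsets within the buffer and that equal pointers mean "empty", neither of which is confirmed. `Span` and `ring_readable_spans` are illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Span = std::pair<size_t, size_t>;  // (byte offset, byte length)

// Under the ring hypothesis, returns the one or two contiguous spans that
// hold unread data between our read offset and the hardware write offset.
// ASSUMPTION: rptr == wptr means empty; a full ring is indistinguishable
// here and would need extra state (for example a sample counter).
inline std::vector<Span> ring_readable_spans(size_t rptr, size_t wptr,
                                             size_t buf_len) {
    std::vector<Span> spans;
    if (buf_len == 0 || rptr >= buf_len || wptr >= buf_len || rptr == wptr) {
        return spans;
    }
    if (wptr > rptr) {
        spans.emplace_back(rptr, wptr - rptr);     // no wrap: single span
    } else {
        spans.emplace_back(rptr, buf_len - rptr);  // tail of the buffer
        if (wptr > 0) {
            spans.emplace_back(size_t{0}, wptr);   // wrapped head
        }
    }
    return spans;
}
```

Under the linear hypothesis the result would always be a single span starting at offset 0, so comparing the two models against captured pointer values should help settle the question.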
On-tree references (upstream). Existing infrastructure this proposal builds on:
Related: #510, #641