[Feature] A5 onboard: self-managed perfmon path for L0 swimlane #905

@ChaoZheng109

Summary

Propose adding L0 (intra-core) swimlane profiling on a5 onboard (hardware) via a self-managed path: AICPU directly programs the AICore perfmon hardware registers and points the writeback target buffer at our own per-core GM buffer; drain reuses the existing PMU/L2 ProfilerBase buffer→host pipeline.

Scope: a5 onboard only. Sim path is out of scope for this issue.

Motivation / Use Case

A separately prototyped alternative built on the host driver's biu_perf / msprof channel pipeline (HDC consumer ring) was investigated and shown to have structural limits that make it unsuitable as the on-tree L0 path:

  • Channel-count cap: biu_perf channels only cover 6 physical cores ({0, 9, 17, 18, 27, 35}). Tasks scheduled to other cores silently produce zero L0 data.
  • Driver-paced delivery: device→host fill cadence is HDC batch-flush, not streaming. Symptoms: "first marker grabs all, subsequent markers return 0"; the same scene test shows ~10× run-to-run variance in record count.
  • ~60s prof_stop teardown (18 channels × ~3-4s each) on every run.
  • Host-side task-window matching required: channel data is not synced to our task lifecycle.
  • Knobs don't help: prof_start_para.real_time is coarse-grain only; sample_period requires a software sample_func (mark_stamp is hardware-driven); halProfDataFlush returns DRV_ERROR_NOT_SUPPORT for biu_perf channels.

By bypassing the driver/msprof pipeline and programming perfmon ourselves, the data rhythm becomes our rhythm: every covered AICore can be enabled (no 6-core cap), the buffer fills as the kernel runs, drain happens on our task boundary, and the slow prof_stop handshake disappears.

Proposed API / Behavior

Mirror the existing PMU collector architecture (pmu_collector_aicore.h, on tree), but replace the per-task software counter readout with a hardware-DMA writeback path.

AICPU init — via the existing write_reg(reg_base, …) facility. 0xB000 is within the 3MB per-core AICore MMIO window already mapped by halResMap(RES_AICORE) (same window PMU at 0x4200 uses), so no new mapping is needed:

| Register | Offset | Action |
|---|---|---|
| perf_mon_base_addr_l | 0xB00C | low 32 bits of per-core GM buffer device address |
| perf_mon_base_addr_h | 0xB010 | high 16 bits |
| perf_mon_buf_len | TBD | per-core buffer length |
| perf_mon_samp_crt_clr | 0xB024 | bit0: write 1 to clear |
| perf_mon_samp_wrt | 0xB028 | write 0 to clear |
| perf_mon_global_en | 0xB000 | bit0: write 1 to enable (last) |
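
A minimal sketch of the init sequence, to make the write order concrete. Assumptions, none confirmed by the tree: `write_reg32` is a hypothetical stand-in for the existing `write_reg(reg_base, …)` (its real signature is not shown here), registers are 32 bits wide, writing the whole word is acceptable for the bit0 fields, and `buf_len_off` is supplied by the caller because the perf_mon_buf_len offset is still TBD.

```c
#include <stdint.h>

/* Offsets taken from the register list in this issue. */
#define PERF_MON_GLOBAL_EN    0xB000u
#define PERF_MON_BASE_ADDR_L  0xB00Cu
#define PERF_MON_BASE_ADDR_H  0xB010u
#define PERF_MON_SAMP_CRT_CLR 0xB024u
#define PERF_MON_SAMP_WRT     0xB028u

/* Hypothetical stand-in for the tree's write_reg(); reg_base is the
 * per-core AICore MMIO window returned by the existing mapping. */
static void write_reg32(volatile uint8_t *reg_base, uint32_t off, uint32_t val)
{
    *(volatile uint32_t *)(reg_base + off) = val;
}

/* Program one core. buf_len_off is an assumption: the offset of
 * perf_mon_buf_len is TBD, so it is passed in rather than guessed. */
static void l0_perfmon_init(volatile uint8_t *reg_base, uint64_t gm_buf_addr,
                            uint32_t buf_len, uint32_t buf_len_off)
{
    /* Split the device address into low 32 bits and high 16 bits. */
    write_reg32(reg_base, PERF_MON_BASE_ADDR_L, (uint32_t)(gm_buf_addr & 0xFFFFFFFFu));
    write_reg32(reg_base, PERF_MON_BASE_ADDR_H, (uint32_t)((gm_buf_addr >> 32) & 0xFFFFu));
    write_reg32(reg_base, buf_len_off, buf_len);

    /* Clear sample state before enabling. */
    write_reg32(reg_base, PERF_MON_SAMP_CRT_CLR, 1u);
    write_reg32(reg_base, PERF_MON_SAMP_WRT, 0u);

    /* Enable last, so the hardware never runs with a stale target. */
    write_reg32(reg_base, PERF_MON_GLOBAL_EN, 1u);
}
```

If the bit0 registers turn out to carry other live fields, the two whole-word writes would need to become read-modify-write.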

Runtime drain — on AICPU, on task boundary / COND FIN (sibling to L2/PMU drain):

  • Read perf_mon_wptr_o (0xB01C) / perf_mon_samp_wrt (0xB028) to learn bytes written.
  • Push the populated buffer slice through the existing ProfilerBase mgmt_thread / collector_thread pipeline (same path PMU/L2 already use). No additional thread; no prof_channel_read call.
  • Reset counters and continue.
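
The drain steps above can be sketched as follows. This is illustrative only and leans on answers the open questions have not produced yet: it assumes a linear (non-ring) buffer, assumes perf_mon_wptr_o reads back as a byte offset from the buffer base, and assumes any cache invalidate has already happened before the slice is read. `read_reg32`, `write_reg32`, `l0_push_fn` and `l0_count_sink` are hypothetical names; the callback stands in for the hand-off to the ProfilerBase pipeline.

```c
#include <stddef.h>
#include <stdint.h>

#define PERF_MON_WPTR_O       0xB01Cu
#define PERF_MON_SAMP_CRT_CLR 0xB024u
#define PERF_MON_SAMP_WRT     0xB028u

/* Hypothetical register accessors over the per-core MMIO window. */
static uint32_t read_reg32(volatile uint8_t *reg_base, uint32_t off)
{
    return *(volatile uint32_t *)(reg_base + off);
}

static void write_reg32(volatile uint8_t *reg_base, uint32_t off, uint32_t val)
{
    *(volatile uint32_t *)(reg_base + off) = val;
}

/* Hypothetical sink: stands in for pushing a slice into ProfilerBase. */
typedef void (*l0_push_fn)(const uint8_t *data, size_t len, void *ctx);

/* Example sink that only counts the bytes it is handed. */
static void l0_count_sink(const uint8_t *data, size_t len, void *ctx)
{
    (void)data;
    *(size_t *)ctx += len;
}

/* Drain one core on a task boundary; returns the bytes forwarded. */
static size_t l0_perfmon_drain(volatile uint8_t *reg_base, const uint8_t *gm_buf,
                               size_t buf_len, l0_push_fn push, void *ctx)
{
    /* Assumption: wptr_o is a byte offset and the buffer is linear. */
    size_t written = read_reg32(reg_base, PERF_MON_WPTR_O);
    if (written > buf_len)
        written = buf_len;          /* never forward more than the buffer holds */

    if (written > 0)
        push(gm_buf, written, ctx);

    /* Reset sample state; whether this also rewinds wptr_o is unverified. */
    write_reg32(reg_base, PERF_MON_SAMP_CRT_CLR, 1u);
    write_reg32(reg_base, PERF_MON_SAMP_WRT, 0u);
    return written;
}
```

A call such as `l0_perfmon_drain(reg_base, buf, buf_len, l0_count_sink, &total)` would run once per core at COND FIN. If the buffer turns out to be a ring, the clamp would be replaced by wrap-aware two-slice logic.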

Decode — host side:

  • Take msprof's biu_perf 4-byte chunk format (per biu_perf_bean.py / biu_perf_chip6_parser.py) as the starting hypothesis.
  • Empirically validate against raw bytes captured from our own buffer. If the format matches, implement the decoder against our own record structs (do not re-export the msprof interface). If not, redefine the layout from observed bytes.
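
As a first step for that validation, the captured bytes can be cut into 4-byte chunks before any field-level interpretation. The sketch below does only that and deliberately does not model the chunk's internal fields, since the layout is exactly what remains to be confirmed. The function name is hypothetical, and little-endian byte order within a chunk is an assumption to be checked against the raw capture.

```c
#include <stddef.h>
#include <stdint.h>

/* Split a raw capture into 32-bit chunks.
 * Assumption: chunks are 4 bytes, stored little-endian.
 * Trailing bytes that do not fill a whole chunk are ignored.
 * Returns the number of chunks written to out (at most out_cap). */
static size_t l0_split_chunks(const uint8_t *raw, size_t len,
                              uint32_t *out, size_t out_cap)
{
    size_t n = len / 4;
    if (n > out_cap)
        n = out_cap;

    for (size_t i = 0; i < n; i++) {
        const uint8_t *p = raw + 4 * i;
        /* Assemble byte-by-byte so the result does not depend on host endianness. */
        out[i] = (uint32_t)p[0]
               | ((uint32_t)p[1] << 8)
               | ((uint32_t)p[2] << 16)
               | ((uint32_t)p[3] << 24);
    }
    return n;
}
```

Dumping these words in hex next to known mark_stamp call sites should show quickly whether the 4-byte hypothesis holds or whether the layout needs to be redefined from the observed bytes.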

Teardown: clear global_en, drain remainder, free per-core buffers. No prof_stop handshake → ~60s saved per run.

Alternatives Considered

  • Drive L0 via biu_perf channels (the prototyped alternative) — see Motivation. Doesn't meet the on-tree need due to channel cap, HDC jitter, and ~60s teardown.
  • Add tiered profiling dials ([Feature] Introduce tiered profiling levels to reduce swimlane collection overhead and measurement distortion #510) — addresses overhead by lowering collection density, but does not remove the 6-core channel cap, the HDC batch jitter, or the ~60s teardown. Complementary, not a substitute.
  • Wait for a streaming driver API — out of our control. Driver investigation (below) confirms the open-sourced driver tree only transports opaque bytes; any streaming-API change would have to come from TS firmware (closed).
  • Use the existing PMU collector for pipe-utilization — PMU samples 10 counters per task: a useful but different signal from mark_stamp instruction trace. PMU stays; this issue is about the L0 trace path.

Additional Context

Driver investigation. A targeted search of the open-sourced CANN driver tree (host HAL + TS agent in src/sdk_driver/ts_agent/) confirms that the tree does not contain TS firmware. The perfmon register map (0xB000-0xB028), the register programming sequence, and any binary record/chunk decoder are not in the open-sourced driver tree; they live in TS firmware. The driver itself never parses AICore trace bytes: prof_buff.c / prof_hdc.c do raw memcpy_s only, with no struct casts, no bitfield extraction, and no sentinels. ABI structs (ts_ai_core_profile_config_t, tsPCTrace_task_t) are header-only and unused by any .c file.

One useful datum the driver does give us. Device-side SQE addresses are SMMU-translated virtual addresses tagged with streamid + substreamid (PASID), with ADDR_UNIFIED/ADDR_INDEPENDENT modes. This makes it likely that perf_mon_base_addr is also a device VA under SMMU (not a raw physical address), which is encouraging for pointing it at a GM buffer. Residual risk: the BIU/DFX engine may have its own streamid; if so, our GM buffer needs a mapping under that stream's context.

Open feasibility questions to resolve before / during prototyping — will be answered empirically by capturing raw bytes from our own buffer:

  1. base_addr address type (physical vs SMMU-VA; same stream/PASID as kernel writes?)
  2. On-wire chunk format vs msprof's 4-byte chunk hypothesis
  3. Ring vs linear; wptr_o wrap semantics; samp_wrt reset model
  4. Cache coherency / invalidate model on AICPU read (sibling to rmb() model in L2/PMU)
  5. Ownership conflict — must ensure driver instr-profiling is fully off so TS firmware does not race us on base_addr

On-tree references (upstream). Existing infrastructure this proposal builds on:

Related: #510, #641

Labels: enhancement