[Feature] A5 onboard: self-managed perfmon path for L0 swimlane

### Summary

This issue proposes adding L0 (intra-core) swimlane profiling on **a5 onboard (hardware)** via a self-managed path: AICPU programs the AICore perfmon hardware registers directly and points the writeback target at our own per-core GM buffer. Drain reuses the existing PMU/L2 `ProfilerBase` buffer→host pipeline.

Scope: a5 **onboard only**. Sim path is out of scope for this issue.

### Motivation / Use Case

A separately prototyped alternative built on the host driver's biu_perf / msprof channel pipeline (HDC consumer ring) was investigated and shown to have structural limits that make it unsuitable as the on-tree L0 path:

- **Channel-count cap**: biu_perf channels only cover **6 physical cores** (`{0, 9, 17, 18, 27, 35}`). Tasks scheduled to other cores silently produce zero L0 data.
- **Driver-paced delivery**: device→host fill cadence is HDC batch-flush, not streaming. Symptoms: "first marker grabs all, subsequent markers return 0"; the same scene test shows **~10× variance** in record count from run to run.
- **~60s `prof_stop` teardown** (18 channels × ~3-4s each) on every run.
- **Host-side task-window matching required**: channel data is not synced to our task lifecycle.
- **Knobs don't help**: `prof_start_para.real_time` is coarse-grain only; `sample_period` requires a software `sample_func` (mark_stamp is hardware-driven); `halProfDataFlush` returns `DRV_ERROR_NOT_SUPPORT` for biu_perf channels.

By bypassing the driver/msprof pipeline and programming perfmon ourselves, the data rhythm becomes our rhythm: every covered AICore can be enabled (no 6-core cap), the buffer fills as the kernel runs, drain happens on our task boundary, and the slow `prof_stop` handshake disappears.

### Proposed API / Behavior

Mirror the existing PMU collector architecture (`pmu_collector_aicore.h`, on tree), but replace the per-task software counter readout with a hardware-DMA writeback path.

**AICPU init** — via the existing `write_reg(reg_base, …)` facility. `0xB000` is within the 3MB per-core AICore MMIO window already mapped by `halResMap(RES_AICORE)` (same window PMU at `0x4200` uses), so no new mapping is needed:

| Register | Offset | Action |
| --- | --- | --- |
| `perf_mon_base_addr_l` | `0xB00C` | low 32 bits of per-core GM buffer device address |
| `perf_mon_base_addr_h` | `0xB010` | high 16 bits |
| `perf_mon_buf_len` | TBD | per-core buffer length |
| `perf_mon_samp_crt_clr` | `0xB024` bit0 | write 1 to clear |
| `perf_mon_samp_wrt` | `0xB028` | write 0 to clear |
| `perf_mon_global_en` | `0xB000` bit0 | write 1 to enable (last) |
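
The write order in the table can be sketched off-device as follows. This is a minimal sketch under stated assumptions, not the on-tree implementation: `RegWriteLog` is a hypothetical stand-in for the existing `write_reg(reg_base, …)` facility (its real signature is not reproduced here), `perfmon_init_core` is an illustrative name, and writing the full word `0x1` to the bit0 fields assumes the remaining bits may be zero. `perf_mon_buf_len` is left out because its offset is still TBD.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Offsets taken from the register table above.
constexpr uint32_t kPerfMonGlobalEn   = 0xB000;
constexpr uint32_t kPerfMonBaseAddrL  = 0xB00C;
constexpr uint32_t kPerfMonBaseAddrH  = 0xB010;
constexpr uint32_t kPerfMonSampCrtClr = 0xB024;
constexpr uint32_t kPerfMonSampWrt    = 0xB028;

// Hypothetical stand-in for the on-tree register writer: it only records
// (offset, value) pairs so the programming order can be inspected.
struct RegWriteLog {
    std::vector<std::pair<uint32_t, uint32_t>> writes;
    void write_reg(uint32_t offset, uint32_t value) {
        writes.emplace_back(offset, value);
    }
};

// Programs one core's perfmon block in the order listed in the table.
// perf_mon_buf_len is intentionally not written: its offset is TBD.
inline void perfmon_init_core(RegWriteLog& regs, uint64_t gm_buf_dev_addr) {
    // base_addr_h carries 16 bits, so the address must fit in 48 bits.
    assert((gm_buf_dev_addr >> 48) == 0);
    regs.write_reg(kPerfMonBaseAddrL,
                   static_cast<uint32_t>(gm_buf_dev_addr & 0xFFFFFFFFu));
    regs.write_reg(kPerfMonBaseAddrH,
                   static_cast<uint32_t>((gm_buf_dev_addr >> 32) & 0xFFFFu));
    // ASSUMPTION: whole-word writes; other bits of these registers are zero.
    regs.write_reg(kPerfMonSampCrtClr, 0x1u);  // bit0 = 1 clears
    regs.write_reg(kPerfMonSampWrt, 0x0u);     // write 0 to clear
    regs.write_reg(kPerfMonGlobalEn, 0x1u);    // bit0 = 1, enable last
}
```

Keeping the enable write last means the hardware never starts writing back before the base address and counters are in a known state. Whether a read-modify-write is needed instead of a whole-word write depends on the other bits in these registers, which this issue does not document.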

**Runtime drain** — on AICPU, on task boundary / COND FIN (sibling to L2/PMU drain):

- Read `perf_mon_wptr_o` (`0xB01C`) / `perf_mon_samp_wrt` (`0xB028`) to learn how many bytes were written.
- Push the populated buffer slice through the existing `ProfilerBase` `mgmt_thread` → `collector_thread` pipeline (same path PMU/L2 already use). No additional thread; no `prof_channel_read` call.
- Reset counters and continue.
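
The slice bookkeeping for the drain steps above could look roughly like this. It is a sketch only, and it bakes in two assumptions that are still open (see feasibility question 3 below): the buffer is treated as linear rather than a ring, and the value read from `perf_mon_wptr_o` is treated as a byte offset from the buffer start. `CoreDrainState`, `Slice`, and `next_slice` are hypothetical names, and the hand-off to `ProfilerBase` is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical per-core drain bookkeeping kept on the AICPU side.
struct CoreDrainState {
    const uint8_t* buf;  // start of this core's GM buffer
    size_t buf_len;      // buffer capacity in bytes
    size_t read_off;     // bytes already handed to the host pipeline
};

// A contiguous, not-yet-drained region of the buffer.
struct Slice {
    const uint8_t* data;
    size_t len;
};

// Returns the bytes written since the previous drain.
// ASSUMPTION: linear buffer; wptr_bytes is a byte offset from buf.
inline Slice next_slice(CoreDrainState& st, size_t wptr_bytes) {
    assert(st.read_off <= st.buf_len);
    // Never trust the hardware pointer beyond the buffer we own.
    if (wptr_bytes > st.buf_len) {
        wptr_bytes = st.buf_len;
    }
    // Nothing new since the last drain.
    if (wptr_bytes <= st.read_off) {
        return Slice{st.buf + st.read_off, 0};
    }
    Slice s{st.buf + st.read_off, wptr_bytes - st.read_off};
    st.read_off = wptr_bytes;
    return s;
}
```

If the hardware turns out to use a ring, the same interface would need to return up to two slices per drain (tail, then wrapped head), so the wrap semantics should be settled before this shape is committed to.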

**Decode** — host side:

- Take msprof's biu_perf 4-byte chunk format (per `biu_perf_bean.py` / `biu_perf_chip6_parser.py`) as the **starting hypothesis**.
- Empirically validate against raw bytes captured from our own buffer. If the format matches, implement the decoder against our own record structs (do not re-export the msprof interface). If not, redefine the layout from observed bytes.
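
As a first step for the empirical validation, the raw capture can be cut into 4-byte chunks before any field-level interpretation. The sketch below does only that. It assumes little-endian byte order within a chunk, which is a guess to be checked against captured bytes; `ChunkSplit` and `split_chunks` are hypothetical names, and no msprof field layout is implied.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Result of cutting a raw capture into 4-byte chunks.
struct ChunkSplit {
    std::vector<uint32_t> chunks;  // one entry per complete 4-byte chunk
    size_t trailing_bytes;         // leftover bytes that do not fill a chunk
};

// Splits raw perfmon bytes into 32-bit chunks.
// ASSUMPTION: little-endian byte order inside each chunk.
inline ChunkSplit split_chunks(const uint8_t* raw, size_t len) {
    assert(raw != nullptr || len == 0);
    ChunkSplit out{{}, len % 4};
    out.chunks.reserve(len / 4);
    for (size_t i = 0; i + 4 <= len; i += 4) {
        // Widen each byte before shifting to avoid signed-int overflow.
        uint32_t word = static_cast<uint32_t>(raw[i]) |
                        (static_cast<uint32_t>(raw[i + 1]) << 8) |
                        (static_cast<uint32_t>(raw[i + 2]) << 16) |
                        (static_cast<uint32_t>(raw[i + 3]) << 24);
        out.chunks.push_back(word);
    }
    return out;
}
```

A non-zero `trailing_bytes` on a drained slice would be an early hint that either the 4-byte hypothesis or the bytes-written calculation is off.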

**Teardown**: clear `global_en`, drain remainder, free per-core buffers. No `prof_stop` handshake → ~60s saved per run.

### Alternatives Considered

- **Drive L0 via biu_perf channels (the prototyped alternative)** — see Motivation. Doesn't meet the on-tree need due to channel cap, HDC jitter, and ~60s teardown.
- **Add tiered profiling dials (#510)** — addresses overhead by lowering collection density, but does not remove the 6-core channel cap, the HDC batch jitter, or the ~60s teardown. Complementary, not a substitute.
- **Wait for a streaming driver API** — out of our control. Driver investigation (below) confirms the open-sourced driver tree only transports opaque bytes; any streaming-API change would have to come from TS firmware (closed).
- **Use the existing PMU collector for pipe-utilization** — PMU samples 10 counters per task: a useful but different signal from `mark_stamp` instruction trace. PMU stays; this issue is about the L0 *trace* path.

### Additional Context

**Driver investigation.** A targeted search of the open-sourced CANN driver tree (host HAL + TS *agent* in `src/sdk_driver/ts_agent/`) confirms that this tree is **not** TS firmware. The perfmon register map (`0xB000-0xB028`), the register programming sequence, and any binary record/chunk decoder are **not in the open-sourced driver tree**; they live in TS firmware. The driver itself never parses AICore trace bytes: `prof_buff.c` / `prof_hdc.c` do raw `memcpy_s` only, with no struct casts, no bitfield extraction, and no sentinels. ABI structs (`ts_ai_core_profile_config_t`, `tsPCTrace_task_t`) are header-only and unused by any `.c` file.

**One useful datum the driver does give us.** Device-side SQE addresses are **SMMU-translated virtual addresses** tagged with `streamid` + `substreamid` (PASID), with `ADDR_UNIFIED`/`ADDR_INDEPENDENT` modes. This makes it likely that `perf_mon_base_addr` is also a device VA under SMMU (not raw physical), which is encouraging for pointing it at a GM buffer. Residual risk: the BIU/DFX engine may have its own streamid; if so, our GM buffer needs a mapping under that stream's context.

**Open feasibility questions** to resolve before or during prototyping; these will be answered empirically by capturing raw bytes from our own buffer:

1. `base_addr` address type (physical vs SMMU-VA; same stream/PASID as kernel writes?)
2. On-wire chunk format vs msprof's 4-byte chunk hypothesis
3. Ring vs linear; `wptr_o` wrap semantics; `samp_wrt` reset model
4. Cache coherency / invalidate model on AICPU read (sibling to `rmb()` model in L2/PMU)
5. Ownership conflict — must ensure driver instr-profiling is fully off so TS firmware does not race us on `base_addr`

**On-tree references (upstream).** Existing infrastructure this proposal builds on:

- AICPU PMU collector pattern to mirror: [pmu_collector_aicore.h](src/a5/platform/include/aicore/pmu_collector_aicore.h) / [pmu_collector_aicpu.cpp](src/a5/platform/src/aicpu/pmu_collector_aicpu.cpp)
- Register access facility already used by PMU: [platform_regs.h](src/a5/platform/include/aicpu/platform_regs.h)

Related: #510, #641
