Merged
2 changes: 1 addition & 1 deletion conftest.py
@@ -572,7 +572,7 @@ def sort_key(item):
items.sort(key=sort_key)

# L3 perf collection is not supported yet: a single L3 case forks N chip-processes
# that all write l2_perf_records_<ts>.json to the same directory with
# that all write l2_swimlane_records_<ts>.json to the same directory with
# second-precision timestamps, so they trample each other. Block the
# combination up front; waiting for a proper device-id-in-filename fix.
if config.getoption("--enable-l2-swimlane", default=0):
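
The collision described in the comment above can be illustrated with a minimal sketch. The `records_filename` helper, the timestamp format, and the `dev<N>` naming are assumptions made for illustration only; the actual filename scheme and the eventual device-id fix are not shown in this diff.

```python
from datetime import datetime
from typing import Optional


def records_filename(now: datetime, device_id: Optional[int] = None) -> str:
    """Hypothetical filename builder; not the runtime's actual code."""
    ts = now.strftime("%Y%m%d_%H%M%S")  # second precision, as the comment describes
    if device_id is None:
        return f"l2_swimlane_records_{ts}.json"
    # Assumed shape of a device-id-in-filename fix.
    return f"l2_swimlane_records_dev{device_id}_{ts}.json"


# Four chip-processes forked by one L3 case, all writing within the same second.
now = datetime(2026, 4, 16, 15, 13, 1)
without_id = {records_filename(now) for _ in range(4)}
with_id = {records_filename(now, device_id=d) for d in range(4)}

print(len(without_id))  # every process resolves to the same path
print(len(with_id))     # one distinct path per device
```

With second-precision timestamps alone, all four processes target one file, which is why the combination is blocked up front.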
10 changes: 5 additions & 5 deletions docs/dfx/dep_gen.md
@@ -6,7 +6,7 @@ The swimlane profiler's per-task `fanout[]` array is the obvious place to
read "which tasks did task X feed into?" — but it is **structurally
incomplete on real hardware**.

Each producer task carries its own `L2PerfRecord.fanout[RUNTIME_MAX_FANOUT]`,
Each producer task carries its own `L2SwimlaneAicpuTaskRecord.fanout[RUNTIME_MAX_FANOUT]`,
populated by the AICPU scheduler at the moment it wires a downstream
consumer. If a producer has already finished and transitioned to
`PTO2_TASK_COMPLETED` by the time a later submit wants to register a
@@ -84,7 +84,7 @@ The `--enable-l2-swimlane` flag is independent but recommended in pair
because:

- `deps.json` is the dep_gen artifact.
- `l2_perf_records.json` (from swimlane) is the timing artifact;
- `l2_swimlane_records.json` (from swimlane) is the timing artifact;
`merged_swimlane.json` (the Perfetto trace) uses `deps.json` for
dependency arrows when both files exist.
- The "fanout ⊆ deps" validation gate fires only when both files are
@@ -262,7 +262,7 @@ Node visual encoding (legend top-right of the rendered HTML):
| Gray dashed note | alloc — task from `alloc_tensors` (got a task_id, references downstream via `owner_task_id`, but never dispatched a kernel so has no perf record) |

Labels read as `(ring, local) · func_name · core_type-implicit-via-shape`.
When a colocated `l2_perf_records.json` is present the func_id is enriched
When a colocated `l2_swimlane_records.json` is present the func_id is enriched
with the kernel name via the sibling `name_map_<case>.json` (written by
SceneTest's `_dump_name_map`).

@@ -288,11 +288,11 @@ sources / args / slices, so the raw `edges[]` count is a superset of the
underlying task-pair count.

`deps.json` (projected) is a **superset** of the fanout edges in
`l2_perf_records.json`:
`l2_swimlane_records.json`:

| Edge source | Captures | Drops on race? |
| ----------- | -------- | -------------- |
| `task.fanout[]` (L2PerfRecord) | Successors known at producer-retire time | **Yes** — sealed when producer retires |
| `task.fanout[]` (L2SwimlaneAicpuTaskRecord) | Successors known at producer-retire time | **Yes** — sealed when producer retires |
| `deps.json` (this feature) | Every consumer → producer reachable via tensormap / explicit_deps | No — replay sees every submit |
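
The superset relationship in the table can be checked with a short sketch. It assumes both artifacts have already been loaded and normalized into `(producer_task_id, consumer_task_id)` pairs; the JSON schemas of the two files are not shown in this diff, so the loading step is omitted and the edge values below are hypothetical.

```python
from typing import Iterable, List, Tuple

Edge = Tuple[int, int]  # (producer_task_id, consumer_task_id), an assumed normalization


def missing_fanout_edges(fanout_edges: Iterable[Edge], deps_edges: Iterable[Edge]) -> List[Edge]:
    """Return fanout edges that are absent from deps; empty when fanout is a subset of deps."""
    return sorted(set(fanout_edges) - set(deps_edges))


# Hypothetical data: task 0 retired before consumer 3 was submitted,
# so the (0, 3) edge appears only in deps.
fanout = [(0, 1), (0, 2)]
deps = [(0, 1), (0, 2), (0, 3)]

violations = missing_fanout_edges(fanout, deps)
print(violations)  # an empty list means the "fanout ⊆ deps" gate would pass
```

A non-empty result would mean the swimlane record saw an edge that the dep_gen replay did not, which is the condition the validation gate is meant to catch.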

`tests/st/a2a3/tensormap_and_ringbuffer/dep_gen_capture/test_dep_gen_capture.py`
168 changes: 84 additions & 84 deletions docs/dfx/l2-swimlane-profiling.md

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/dfx/pmu-profiling.md
@@ -301,7 +301,7 @@ shared-memory layout, an `init()` that allocates and pre-fills the free
queues, an `on_buffer_collected()` callback that appends records to the
CSV, and `reconcile_counters()` / `finalize()`. The mgmt/poll threading,
buffer pooling, and `Module` trait pattern are shared with TensorDump
and L2Perf — see [profiling-framework.md](../profiling-framework.md) for
and L2Swimlane — see [profiling-framework.md](../profiling-framework.md) for
the framework reference.

### 5.3 a5 — same framework, host-shadow transport (DAV_3510, 10 counters)
2 changes: 1 addition & 1 deletion docs/dfx/tensor-dump.md
@@ -432,7 +432,7 @@ allocates and pre-fills free queues, an `on_buffer_collected`
callback that gathers payload bytes into the in-memory record
list, plus `reconcile_counters` / `export_dump_files` /
`finalize`. The mgmt/poll threading, buffer pooling, and `Module`
trait pattern are shared with PMU and L2Perf — see
trait pattern are shared with PMU and L2Swimlane — see
[profiling-framework.md](../profiling-framework.md) for the
framework reference.

12 changes: 6 additions & 6 deletions docs/hardware/cache-coherency.md
@@ -80,19 +80,19 @@ Two separate concerns, often conflated:
stale value from a previous round). The AICPU side must emit
`rmb()` between the COND check and the slot reads.

Concretely, the L2 perf staging-slot read in
`src/{a2a3,a5}/platform/src/aicpu/l2_perf_collector_aicpu.cpp` does
Concretely, the L2 swimlane staging-slot read in
`src/{a2a3,a5}/platform/src/aicpu/l2_swimlane_collector_aicpu.cpp` does
**not** call `cache_invalidate_range` on the slot, but it **does** call
`rmb()` before reading `slot->task_id` and the timing fields. All of
those fields are AICore writes covered by the AICore-side `dcci` in
`l2_perf_aicore_record_task`. The same pattern applies to the PMU
`l2_swimlane_aicore_record_task`. The same pattern applies to the PMU
staging slot
(`src/{a2a3,a5}/platform/src/aicpu/pmu_collector_aicpu.cpp`).

### Historical pitfall

PR #540 (2026-04-15) added `cache_invalidate_range(slot, 64)` on the
AICPU side of the L2 perf staging slot, mirroring the
AICPU side of the L2 swimlane staging slot, mirroring the
host-DMA-protocol pattern from PR #204. The two situations are
**not** the same: host DMA bypasses the AICPU cache; AICore stores
plus `dcci` do not. The cache invalidate was redundant — but the
@@ -171,11 +171,11 @@ forever once they ship.

- `src/{a2a3,a5}/platform/onboard/aicpu/cache_ops.cpp` — `cache_invalidate_range` implementation (`dc civac` / `dsb sy` / `isb`).
- `src/{a2a3,a5}/platform/sim/aicpu/cache_ops.cpp` — sim no-op.
- AICore-side `dcci` usage lives in the L2 perf / PMU AICore collectors and any kernel that publishes to a GM slot AICPU reads.
- AICore-side `dcci` usage lives in the L2 swimlane / PMU AICore collectors and any kernel that publishes to a GM slot AICPU reads.

## Related docs

- [PMU staging-slot ordering](../dfx/pmu-profiling.md) —
detailed AICore-side `dcci` + barrier order for staging-slot writes.
- [L2 swimlane profiling](../dfx/l2-swimlane-profiling.md) —
the consumer of the rules above on the L2 perf path.
the consumer of the rules above on the L2 swimlane path.
30 changes: 15 additions & 15 deletions docs/profiling-framework.md
@@ -1,6 +1,6 @@
# Profiling Framework

Shared host-side infrastructure that the PMU, L2Perf, and TensorDump
Shared host-side infrastructure that the PMU, L2Swimlane, and TensorDump
collectors are built on. Each architecture maintains its own copy of the
framework headers under `src/<arch>/platform/include/host/profiling_common/`
([a2a3](../src/a2a3/platform/include/host/profiling_common/),
@@ -25,7 +25,7 @@ Each profiling subsystem on a2a3 needs the same plumbing on the host:
- A collector thread that drains the host-side hand-off queue and copies
records out of each ready buffer.
- A pool of pre-registered device buffers (allocated up-front, refilled on
demand) keyed by "kind" — PMU has 1 kind, TensorDump has 1, L2Perf has 2
demand) keyed by "kind" — PMU has 1 kind, TensorDump has 1, L2Swimlane has 2
(perf records + phase markers).
- A dev↔host pointer map so the management thread can resolve a device
pointer popped off a ready queue to the host-mapped pointer the collector
@@ -40,7 +40,7 @@ a small per-subsystem trait.

```text
┌──────────────────────────────────────────┐
│ PmuCollector / L2PerfCollector / │ Derived (CRTP)
│ PmuCollector / L2SwimlaneCollector / │ Derived (CRTP)
│ TensorDumpCollector │ ─ on_buffer_collected
└─────────────┬────────────────────────────┘ ─ kIdleTimeoutSec / kSubsystemName
│ public ProfilerBase<Derived, Module>
@@ -58,7 +58,7 @@ a small per-subsystem trait.
│ Module trait wires layout into algorithms
┌───────────────┴────────────────┐
│ PmuModule / L2PerfModule / │ Pure static trait (no state)
│ PmuModule / L2SwimlaneModule / │ Pure static trait (no state)
│ DumpModule │ ─ DataHeader / ReadyEntry / FreeQueue
└────────────────────────────────┘ ─ kBufferKinds / kReadyQueueSize
─ resolve_entry / for_each_instance
@@ -129,7 +129,7 @@ is where the unified algorithms live:

### 3.3 `Module` — trait layer

A stateless `struct` per subsystem (`PmuModule`, `L2PerfModule`,
A stateless `struct` per subsystem (`PmuModule`, `L2SwimlaneModule`,
`DumpModule`) that tells the generic algorithms what the shared-memory
layout looks like. The contract lives in the docblock at the top of
[`profiler_base.h`](../src/a2a3/platform/include/host/profiling_common/profiler_base.h);
@@ -138,7 +138,7 @@ the required members are:
| Member | Purpose |
| ------ | ------- |
| `using DataHeader / ReadyEntry / ReadyBufferInfo / FreeQueue` | Layout types |
| `kBufferKinds` (PMU=1, Dump=1, L2Perf=2) | Number of per-kind recycled pools |
| `kBufferKinds` (PMU=1, Dump=1, L2Swimlane=2) | Number of per-kind recycled pools |
| `kReadyQueueSize`, `kSlotCount` | AICPU ready queue / free queue depth |
| `kSubsystemName` | Tag used in framework log lines |
| `header_from_shm(void*) → DataHeader*` | Cast shared-memory base to header |
@@ -149,7 +149,7 @@ the required members are:

The Module structs are defined alongside their collectors in
[pmu_collector.h](../src/a2a3/platform/include/host/pmu_collector.h),
[l2_perf_collector.h](../src/a2a3/platform/include/host/l2_perf_collector.h),
[l2_swimlane_collector.h](../src/a2a3/platform/include/host/l2_swimlane_collector.h),
and [tensor_dump_collector.h](../src/a2a3/platform/include/host/tensor_dump_collector.h)
— each is a few dozen lines of static methods over the subsystem's own
`DataHeader` / ringbuffer types.
@@ -168,7 +168,7 @@ and only has to provide:
the collector loop. Use the subsystem's `PLATFORM_*_TIMEOUT_SECONDS`
constant.
- `static constexpr const char* kSubsystemName` — appears in the idle
timeout log line (e.g. `"PMU"`, `"L2Perf"`, `"TensorDump"`).
timeout log line (e.g. `"PMU"`, `"L2Swimlane"`, `"TensorDump"`).
- `init(...)` and `finalize(...)` — domain-specific setup/teardown.
`init` must call `set_memory_context()` on the success path so
`start(tf)` is not a no-op. `finalize` must release framework-owned
@@ -297,7 +297,7 @@ Existing collectors are the canonical examples:
— single kind, per-core instances. See [pmu-profiling.md](dfx/pmu-profiling.md).
- [`TensorDumpCollector`](../src/a2a3/platform/include/host/tensor_dump_collector.h)
— single kind, per-AICPU-thread instances. See [tensor-dump.md](dfx/tensor-dump.md).
- [`L2PerfCollector`](../src/a2a3/platform/include/host/l2_perf_collector.h)
- [`L2SwimlaneCollector`](../src/a2a3/platform/include/host/l2_swimlane_collector.h)
— two kinds (perf records + phase markers), per-core / per-thread
instances; the canonical multi-kind example. See
[l2-swimlane-profiling.md](dfx/l2-swimlane-profiling.md).
@@ -332,8 +332,8 @@ changes capture that:
**not** called from the mgmt loop — it would race with AICPU writes
to device-only fields (`current_buf_ptr`, `total/dropped/mismatch`
counters, `queue_tails`, `free_queue.head`,
`AicpuPhaseHeader::magic`, `core_to_thread[]`), rolling them back
to whatever the host shadow had at the start of the tick. Per-buffer payloads (`L2PerfBuffer` / `PmuBuffer` /
`L2SwimlaneAicpuPhaseHeader::magic`, `core_to_thread[]`), rolling them back
to whatever the host shadow had at the start of the tick. Per-buffer payloads (`L2SwimlaneAicpuTaskBuffer` / `PmuBuffer` /
`DumpMetaBuffer`) are still pulled on demand inside
`ProfilerAlgorithms::process_entry` after resolving the host pointer
for a popped ready entry. The bulk `mirror_shm_to_device` is kept
@@ -363,7 +363,7 @@ per-core ring/reg addresses travel through `KernelArgs`:
| `KernelArgs` field | Producer | Consumer |
| ------------------ | -------- | -------- |
| `enable_profiling_flag` (bitmask) | host (DeviceRunner) | AICPU `kernel.cpp` → `set_l2_swimlane_enabled` / `set_pmu_enabled` / `set_dump_tensor_enabled`; AICore `KERNEL_ENTRY` → `set_aicore_profiling_flag` |
| `aicore_l2_perf_ring_addrs` (table) | host (`L2PerfCollector::initialize`) | AICore `KERNEL_ENTRY` indexes `table[block_idx]` → `set_aicore_l2_perf_ring` |
| `aicore_l2_swimlane_ring_addrs` (table) | host (`L2SwimlaneCollector::initialize`) | AICore `KERNEL_ENTRY` indexes `table[block_idx]` → `set_aicore_l2_swimlane_ring` |
| `aicore_pmu_ring_addrs` (table) | host (`PmuCollector::init`) | AICore `KERNEL_ENTRY` → `set_aicore_pmu_ring` |
| `regs` (per-physical-core register-base table) | host (already required for AICPU MMIO) | AICore `KERNEL_ENTRY` resolves `regs[get_physical_core_id()]` → `set_aicore_pmu_reg_base`; AICore `aicore_execute` caches the value at Phase-3 |

@@ -376,16 +376,16 @@ state surface, never the runtime protocol.

### 8.2 Stable AICore staging ring (decouples AICore write from AICPU buffer rotation)

L2Perf and PMU on a5 both use the "AICore writes, AICPU commits" model.
L2Swimlane and PMU on a5 both use the "AICore writes, AICPU commits" model.
The AICore-side write target is a per-core
[`L2PerfAicoreRing`](../src/a5/platform/include/common/l2_perf_profiling.h) /
[`L2SwimlaneAicoreRing`](../src/a5/platform/include/common/l2_swimlane_profiling.h) /
[`PmuAicoreRing`](../src/a5/platform/include/common/pmu_profiling.h) of
`PLATFORM_{L2,PMU}_AICORE_RING_SIZE` (= 2, dual-issue) slots, allocated
once by the host and addressed by
`BufferState::aicore_ring_ptr` (AICPU-visible) and the per-core
`aicore_*_ring_addrs[block_idx]` (AICore-visible). The address is
never reassigned, so AICore's write target is stable across AICPU's
rotating `L2PerfBuffer` / `PmuBuffer` flips — flipping is now
rotating `L2SwimlaneAicpuTaskBuffer` / `PmuBuffer` flips — flipping is now
fully internal to `*_complete_record` and never crosses into Handshake.

Everything else — Module concept contract, alloc policy
10 changes: 5 additions & 5 deletions docs/profiling-name-map.md
@@ -2,7 +2,7 @@

## Problem

Profiling data (`l2_perf_records.json`) identifies tasks by numeric IDs
Profiling data (`l2_swimlane_records.json`) identifies tasks by numeric IDs
(e.g., `func_id: 0`). Without a mapping, swimlane visualizations show
opaque labels like `func_0_a(t0)` instead of human-readable names like
`QK(t0)`.
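
The effect of a name mapping on labels can be sketched as follows. The `task_label` helper and the flat `{func_id: name}` dictionary are hypothetical simplifications; the real name-map file structure is described later in the document and is not reproduced here, and the fallback label below does not replicate the exact suffix of the `func_0_a(t0)` form.

```python
from typing import Dict


def task_label(func_id: int, task_idx: int, name_map: Dict[int, str]) -> str:
    """Build a swimlane label, falling back to an opaque id when no name is known."""
    # Fallback format is an approximation, not the converter's exact output.
    name = name_map.get(func_id, f"func_{func_id}")
    return f"{name}(t{task_idx})"


name_map = {0: "QK"}  # hypothetical mapping from func_id to kernel name

print(task_label(0, 0, name_map))  # named label
print(task_label(1, 0, name_map))  # opaque fallback for an unmapped func_id
```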
@@ -45,7 +45,7 @@ Every level uses the same structure:
### L2 (Orchestration + Incores)

`callable_id` = incore `func_id` (the integer assigned in the CALLABLE
spec). These are the same IDs that appear in L2 perf data.
spec). These are the same IDs that appear in L2 swimlane data.

```json
{
@@ -147,10 +147,10 @@ takes precedence over `-k` (kernel_config.py):
# Automatic (via SceneTest profiling)
pytest tests/st/... --platform a5onboard --enable-l2-swimlane

# Manual (paths land alongside l2_perf_records.json inside the same
# Manual (paths land alongside l2_swimlane_records.json inside the same
# <output_prefix> directory)
python -m simpler_setup.tools.swimlane_converter \
outputs/<case>_<ts>/l2_perf_records.json \
outputs/<case>_<ts>/l2_swimlane_records.json \
--func-names outputs/<case>_<ts>/name_map_TestPA_basic.json

python -m simpler_setup.tools.deps_to_graph \
@@ -169,7 +169,7 @@ cannot collide.

```text
outputs/TestPA_basic_20260416_151301/
l2_perf_records.json # perf data (runtime)
l2_swimlane_records.json # perf data (runtime)
name_map_TestPA_basic.json # name mapping (SceneTest)
merged_swimlane.json # Perfetto trace (converter)
```
2 changes: 1 addition & 1 deletion docs/sim_multi_device_isolation.md
@@ -24,7 +24,7 @@ Communication uses a 4096-byte shared-memory mailbox per chip — the same layou

## Why Not Fix the Globals

The global state in `host_runtime.so` spans multiple files (`cpu_sim_context.cpp`, `platform_aicpu_affinity.cpp`, `l2_perf_collector_aicpu.cpp`, `device_log.cpp`) and is deeply embedded in the AICPU/AICore thread model. Fixing each one individually is fragile. Process isolation solves all of them at once with zero platform code changes.
The global state in `host_runtime.so` spans multiple files (`cpu_sim_context.cpp`, `platform_aicpu_affinity.cpp`, `l2_swimlane_collector_aicpu.cpp`, `device_log.cpp`) and is deeply embedded in the AICPU/AICore thread model. Fixing each one individually is fragile. Process isolation solves all of them at once with zero platform code changes.

## Files

6 changes: 3 additions & 3 deletions docs/testing.md
@@ -104,7 +104,7 @@ python test_xxx.py -p a2a3sim --log-level debug # verbose C++ l
| `--case SEL` | | (all) | Case selector, repeatable: `Foo`, `ClassA::Foo`, `ClassA::` |
| `--manual` | | `exclude` | `exclude`/`include`/`only` for manual cases |
| `--skip-golden` | | false | Skip golden comparison (for benchmarking) |
| `--enable-l2-swimlane [PERF_LEVEL]` | | `0` | Enable L2 swimlane collection on first round only. The flag takes an integer perf_level 0–4 (bare = 4); see [docs/dfx/l2-swimlane-profiling.md](dfx/l2-swimlane-profiling.md#31-enable-l2-swimlane) for the level table. Each test case gets its own `outputs/<case>_<ts>/` directory under which `l2_perf_records.json` lands; parallel runs never collide. |
| `--enable-l2-swimlane [PERF_LEVEL]` | | `0` | Enable L2 swimlane collection on first round only. The flag takes an integer perf_level 0–4 (bare = 4); see [docs/dfx/l2-swimlane-profiling.md](dfx/l2-swimlane-profiling.md#31-enable-l2-swimlane) for the level table. Each test case gets its own `outputs/<case>_<ts>/` directory under which `l2_swimlane_records.json` lands; parallel runs never collide. |
| `--dump-tensor` | | false | Dump per-task tensor I/O during runtime execution |
| `--enable-pmu [EVENT_TYPE]` | | `0` | Enable a2a3 PMU CSV collection. Bare flag selects `PIPE_UTILIZATION` (`2`); pass an event type such as `4` for `MEMORY`. |
| `--exitfirst` | `-x` | false | Stop on first failing test (fail-fast, primarily for CI) |
@@ -318,13 +318,13 @@ A single file can declare both L2 and L3 classes; they're grouped by `(runtime,

Each test case sets its own `CallConfig.output_prefix` (chosen by `scene_test.py::_build_output_prefix` as `outputs/<ClassName>_<case>_<YYYYMMDD_HHMMSS>/`). The C++ runtime writes all diagnostic artifacts under that prefix with fixed filenames:

- `outputs/<case>_<ts>/l2_perf_records.json` — swimlane (`--enable-l2-swimlane`)
- `outputs/<case>_<ts>/l2_swimlane_records.json` — swimlane (`--enable-l2-swimlane`)
- `outputs/<case>_<ts>/tensor_dump/` — tensor dump (`--dump-tensor`)
- `outputs/<case>_<ts>/pmu.csv` — PMU counters (`--enable-pmu`)

Because each case gets its own directory, parallel runs (xdist workers, L3 case fanout, L2 device fanout) can never collide on filename — there is no per-file timestamp, no env-var scoping, and no post-run flatten step. `CallConfig::validate()` throws if any diagnostic flag is enabled but `output_prefix` is empty; `scene_test.py::run_class_cases` always fills it from the case label.
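
The per-case prefix and the empty-prefix check described above can be sketched in Python. Both functions are illustrative stand-ins: `build_output_prefix` only mirrors the documented `outputs/<ClassName>_<case>_<YYYYMMDD_HHMMSS>/` pattern, and `validate` approximates the behaviour attributed to `CallConfig::validate()`; neither is the project's actual implementation, and the flag names are assumptions.

```python
from datetime import datetime


def build_output_prefix(class_name: str, case: str, now: datetime) -> str:
    """Stand-in for the documented prefix pattern (not the real helper)."""
    return f"outputs/{class_name}_{case}_{now.strftime('%Y%m%d_%H%M%S')}/"


def validate(output_prefix: str, **diagnostic_flags: bool) -> None:
    """Reject any enabled diagnostic flag when no output prefix is set."""
    enabled = [name for name, on in diagnostic_flags.items() if on]
    if enabled and not output_prefix:
        raise ValueError(f"output_prefix is empty but {enabled} are enabled")


prefix = build_output_prefix("TestPA", "basic", datetime(2026, 4, 16, 15, 13, 1))
print(prefix)

validate(prefix, enable_l2_swimlane=True, dump_tensor=False)  # passes
try:
    validate("", enable_l2_swimlane=True)
except ValueError as exc:
    print(exc)
```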

Standalone invocations of CLIs (`python -m simpler_setup.tools.swimlane_converter`, etc.) auto-detect the latest `outputs/*/l2_perf_records.json` (sorted by mtime); pass `--input <path>` to override.
Standalone invocations of CLIs (`python -m simpler_setup.tools.swimlane_converter`, etc.) auto-detect the latest `outputs/*/l2_swimlane_records.json` (sorted by mtime); pass `--input <path>` to override.
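
The auto-detection rule (newest `outputs/*/l2_swimlane_records.json` by mtime) can be approximated with `pathlib`. The `latest_records` function is a hypothetical sketch of that rule, not the converter's own code.

```python
from pathlib import Path
from typing import Optional


def latest_records(root: Path = Path("outputs")) -> Optional[Path]:
    """Return the most recently modified records file under root, or None if there is none."""
    candidates = list(root.glob("*/l2_swimlane_records.json"))
    if not candidates:
        return None
    # Newest modification time wins, matching the "sorted by mtime" rule.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

Calling `latest_records()` from the repository root would pick the newest case directory; an explicit `--input <path>` bypasses this lookup entirely.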

### Dispatcher skip conditions (normal pytest runs)
