hw-native-sys · ChaoWao · May 31, 2026 · May 30, 2026
diff --git a/conftest.py b/conftest.py
@@ -572,7 +572,7 @@ def sort_key(item):
     items.sort(key=sort_key)
 
     # L3 perf collection is not supported yet: a single L3 case forks N chip-processes
-    # that all write l2_perf_records_<ts>.json to the same directory with
+    # that all write l2_swimlane_records_<ts>.json to the same directory with
     # second-precision timestamps, so they trample each other. Block the
     # combination up front; waiting for a proper device-id-in-filename fix.
     if config.getoption("--enable-l2-swimlane", default=0):

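Review note: the collision described in the `conftest.py` comment can be reproduced with a small sketch. Only the `l2_swimlane_records_<ts>.json` pattern comes from the comment. The timestamp format, the `dev<N>` suffix, and both helper names are assumptions for illustration and are not the runtime's actual naming code.

```python
from datetime import datetime


def records_filename(now: datetime) -> str:
    """Hypothetical naming: second-precision timestamp, no device id."""
    return f"l2_swimlane_records_{now.strftime('%Y%m%d_%H%M%S')}.json"


def records_filename_with_device(now: datetime, device_id: int) -> str:
    """Hypothetical fix: add the device id so per-chip files stay distinct."""
    stamp = now.strftime("%Y%m%d_%H%M%S")
    return f"l2_swimlane_records_dev{device_id}_{stamp}.json"


# Four chip processes forked by one L3 case, all writing within the same second.
same_second = datetime(2026, 5, 30, 12, 0, 0)
colliding = {records_filename(same_second) for _ in range(4)}
distinct = {records_filename_with_device(same_second, dev) for dev in range(4)}

print(len(colliding))  # 1: every process targets the same path
print(len(distinct))   # 4: one path per device
```

With one shared name per second, the last writer wins, which is why the option is blocked for L3 cases until a device id reaches the filename.
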
diff --git a/docs/dfx/dep_gen.md b/docs/dfx/dep_gen.md
@@ -6,7 +6,7 @@ The swimlane profiler's per-task `fanout[]` array is the obvious place to
 read "which tasks did task X feed into?" — but it is **structurally
 incomplete on real hardware**.
 
-Each producer task carries its own `L2PerfRecord.fanout[RUNTIME_MAX_FANOUT]`,
+Each producer task carries its own `L2SwimlaneAicpuTaskRecord.fanout[RUNTIME_MAX_FANOUT]`,
 populated by the AICPU scheduler at the moment it wires a downstream
 consumer. If a producer has already finished and transitioned to
 `PTO2_TASK_COMPLETED` by the time a later submit wants to register a
@@ -84,7 +84,7 @@ The `--enable-l2-swimlane` flag is independent but recommended in pair
 because:
 
 - `deps.json` is the dep_gen artifact.
-- `l2_perf_records.json` (from swimlane) is the timing artifact;
+- `l2_swimlane_records.json` (from swimlane) is the timing artifact;
   `merged_swimlane.json` (the Perfetto trace) uses `deps.json` for
   dependency arrows when both files exist.
 - The "fanout ⊆ deps" validation gate fires only when both files are
@@ -262,7 +262,7 @@ Node visual encoding (legend top-right of the rendered HTML):
 | Gray dashed note | alloc — task from `alloc_tensors` (got a task_id, references downstream via `owner_task_id`, but never dispatched a kernel so has no perf record) |
 
 Labels read as `(ring, local) · func_name · core_type-implicit-via-shape`.
-When a colocated `l2_perf_records.json` is present the func_id is enriched
+When a colocated `l2_swimlane_records.json` is present the func_id is enriched
 with the kernel name via the sibling `name_map_<case>.json` (written by
 SceneTest's `_dump_name_map`).
 
@@ -288,11 +288,11 @@ sources / args / slices, so the raw `edges[]` count is a superset of the
 underlying task-pair count.
 
 `deps.json` (projected) is a **superset** of the fanout edges in
-`l2_perf_records.json`:
+`l2_swimlane_records.json`:
 
 | Edge source | Captures | Drops on race? |
 | ----------- | -------- | -------------- |
-| `task.fanout[]` (L2PerfRecord) | Successors known at producer-retire time | **Yes** — sealed when producer retires |
+| `task.fanout[]` (L2SwimlaneAicpuTaskRecord) | Successors known at producer-retire time | **Yes** — sealed when producer retires |
 | `deps.json` (this feature) | Every consumer → producer reachable via tensormap / explicit_deps | No — replay sees every submit |
 
 `tests/st/a2a3/tensormap_and_ringbuffer/dep_gen_capture/test_dep_gen_capture.py`

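Review note: the "fanout ⊆ deps" gate mentioned in `dep_gen.md` amounts to a set-difference check. The sketch below assumes edges are already loaded as `(producer, consumer)` task-id pairs. The real JSON schemas of `deps.json` and `l2_swimlane_records.json` are not shown in this diff, so the edge layout and the function name are hypothetical.

```python
from typing import Iterable, Set, Tuple

Edge = Tuple[int, int]  # (producer_task_id, consumer_task_id); layout is assumed


def missing_from_deps(fanout_edges: Iterable[Edge], deps_edges: Iterable[Edge]) -> Set[Edge]:
    """Return fanout edges that deps lacks; an empty set means fanout is a subset."""
    return set(fanout_edges) - set(deps_edges)


# Task 1 retired before consumer 3 was submitted, so fanout lost the 1 -> 3 edge.
fanout = [(0, 1), (0, 2)]
deps = [(0, 1), (0, 2), (1, 3)]

gate_violations = missing_from_deps(fanout, deps)  # empty: the gate passes
replay_only = missing_from_deps(deps, fanout)      # edges only the replay captured
print(gate_violations, replay_only)
```

The asymmetry is the point of the table above: an edge may exist only in `deps.json`, but an edge that exists only in `fanout[]` indicates a capture bug.
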
diff --git a/docs/dfx/l2-swimlane-profiling.md b/docs/dfx/l2-swimlane-profiling.md
diff --git a/docs/dfx/pmu-profiling.md b/docs/dfx/pmu-profiling.md
@@ -301,7 +301,7 @@ shared-memory layout, an `init()` that allocates and pre-fills the free
 queues, an `on_buffer_collected()` callback that appends records to the
 CSV, and `reconcile_counters()` / `finalize()`. The mgmt/poll threading,
 buffer pooling, and `Module` trait pattern are shared with TensorDump
-and L2Perf — see [profiling-framework.md](../profiling-framework.md) for
+and L2Swimlane — see [profiling-framework.md](../profiling-framework.md) for
 the framework reference.
 
 ### 5.3 a5 — same framework, host-shadow transport (DAV_3510, 10 counters)

diff --git a/docs/dfx/tensor-dump.md b/docs/dfx/tensor-dump.md
@@ -432,7 +432,7 @@ allocates and pre-fills free queues, an `on_buffer_collected`
 callback that gathers payload bytes into the in-memory record
 list, plus `reconcile_counters` / `export_dump_files` /
 `finalize`. The mgmt/poll threading, buffer pooling, and `Module`
-trait pattern are shared with PMU and L2Perf — see
+trait pattern are shared with PMU and L2Swimlane — see
 [profiling-framework.md](../profiling-framework.md) for the
 framework reference.
 

diff --git a/docs/hardware/cache-coherency.md b/docs/hardware/cache-coherency.md
@@ -80,19 +80,19 @@ Two separate concerns, often conflated:
   stale value from a previous round). The AICPU side must emit
   `rmb()` between the COND check and the slot reads.
 
-Concretely, the L2 perf staging-slot read in
-`src/{a2a3,a5}/platform/src/aicpu/l2_perf_collector_aicpu.cpp` does
+Concretely, the L2 swimlane staging-slot read in
+`src/{a2a3,a5}/platform/src/aicpu/l2_swimlane_collector_aicpu.cpp` does
 **not** call `cache_invalidate_range` on the slot, but it **does** call
 `rmb()` before reading `slot->task_id` and the timing fields. All of
 those fields are AICore writes covered by the AICore-side `dcci` in
-`l2_perf_aicore_record_task`. The same pattern applies to the PMU
+`l2_swimlane_aicore_record_task`. The same pattern applies to the PMU
 staging slot
 (`src/{a2a3,a5}/platform/src/aicpu/pmu_collector_aicpu.cpp`).
 
 ### Historical pitfall
 
 PR #540 (2026-04-15) added `cache_invalidate_range(slot, 64)` on the
-AICPU side of the L2 perf staging slot, mirroring the
+AICPU side of the L2 swimlane staging slot, mirroring the
 host-DMA-protocol pattern from PR #204. The two situations are
 **not** the same: host DMA bypasses the AICPU cache; AICore stores
 plus `dcci` do not. The cache invalidate was redundant — but the
@@ -171,11 +171,11 @@ forever once they ship.
 
 - `src/{a2a3,a5}/platform/onboard/aicpu/cache_ops.cpp` — `cache_invalidate_range` implementation (`dc civac` / `dsb sy` / `isb`).
 - `src/{a2a3,a5}/platform/sim/aicpu/cache_ops.cpp` — sim no-op.
-- AICore-side `dcci` usage lives in the L2 perf / PMU AICore collectors and any kernel that publishes to a GM slot AICPU reads.
+- AICore-side `dcci` usage lives in the L2 swimlane / PMU AICore collectors and any kernel that publishes to a GM slot AICPU reads.
 
 ## Related docs
 
 - [PMU staging-slot ordering](../dfx/pmu-profiling.md) —
   detailed AICore-side `dcci` + barrier order for staging-slot writes.
 - [L2 swimlane profiling](../dfx/l2-swimlane-profiling.md) —
-  the consumer of the rules above on the L2 perf path.
+  the consumer of the rules above on the L2 swimlane path.
diff --git a/docs/profiling-framework.md b/docs/profiling-framework.md
@@ -1,6 +1,6 @@
 # Profiling Framework
 
-Shared host-side infrastructure that the PMU, L2Perf, and TensorDump
+Shared host-side infrastructure that the PMU, L2Swimlane, and TensorDump
 collectors are built on. Each architecture maintains its own copy of the
 framework headers under `src/<arch>/platform/include/host/profiling_common/`
 ([a2a3](../src/a2a3/platform/include/host/profiling_common/),
@@ -25,7 +25,7 @@ Each profiling subsystem on a2a3 needs the same plumbing on the host:
 - A collector thread that drains the host-side hand-off queue and copies
   records out of each ready buffer.
 - A pool of pre-registered device buffers (allocated up-front, refilled on
-  demand) keyed by "kind" — PMU has 1 kind, TensorDump has 1, L2Perf has 2
+  demand) keyed by "kind" — PMU has 1 kind, TensorDump has 1, L2Swimlane has 2
   (perf records + phase markers).
 - A dev↔host pointer map so the management thread can resolve a device
   pointer popped off a ready queue to the host-mapped pointer the collector
@@ -40,7 +40,7 @@ a small per-subsystem trait.
 
 ```text
                 ┌──────────────────────────────────────────┐
-                │  PmuCollector / L2PerfCollector /        │  Derived (CRTP)
+                │  PmuCollector / L2SwimlaneCollector /    │  Derived (CRTP)
                 │  TensorDumpCollector                     │  ─ on_buffer_collected
                 └─────────────┬────────────────────────────┘  ─ kIdleTimeoutSec / kSubsystemName
                               │ public ProfilerBase<Derived, Module>
@@ -58,7 +58,7 @@ a small per-subsystem trait.
                               ▲
                               │ Module trait wires layout into algorithms
               ┌───────────────┴────────────────┐
-              │  PmuModule / L2PerfModule /    │  Pure static trait (no state)
+              │  PmuModule / L2SwimlaneModule /│  Pure static trait (no state)
               │  DumpModule                    │  ─ DataHeader / ReadyEntry / FreeQueue
               └────────────────────────────────┘  ─ kBufferKinds / kReadyQueueSize
                                                   ─ resolve_entry / for_each_instance
@@ -129,7 +129,7 @@ is where the unified algorithms live:
 
 ### 3.3 `Module` — trait layer
 
-A stateless `struct` per subsystem (`PmuModule`, `L2PerfModule`,
+A stateless `struct` per subsystem (`PmuModule`, `L2SwimlaneModule`,
 `DumpModule`) that tells the generic algorithms what the shared-memory
 layout looks like. The contract lives in the docblock at the top of
 [`profiler_base.h`](../src/a2a3/platform/include/host/profiling_common/profiler_base.h);
@@ -138,7 +138,7 @@ the required members are:
 | Member | Purpose |
 | ------ | ------- |
 | `using DataHeader / ReadyEntry / ReadyBufferInfo / FreeQueue` | Layout types |
-| `kBufferKinds` (PMU=1, Dump=1, L2Perf=2) | Number of per-kind recycled pools |
+| `kBufferKinds` (PMU=1, Dump=1, L2Swimlane=2) | Number of per-kind recycled pools |
 | `kReadyQueueSize`, `kSlotCount` | AICPU ready queue / free queue depth |
 | `kSubsystemName` | Tag used in framework log lines |
 | `header_from_shm(void*) → DataHeader*` | Cast shared-memory base to header |
@@ -149,7 +149,7 @@ the required members are:
 
 The Module structs are defined alongside their collectors in
 [pmu_collector.h](../src/a2a3/platform/include/host/pmu_collector.h),
-[l2_perf_collector.h](../src/a2a3/platform/include/host/l2_perf_collector.h),
+[l2_swimlane_collector.h](../src/a2a3/platform/include/host/l2_swimlane_collector.h),
 and [tensor_dump_collector.h](../src/a2a3/platform/include/host/tensor_dump_collector.h)
 — each is a few dozen lines of static methods over the subsystem's own
 `DataHeader` / ringbuffer types.
@@ -168,7 +168,7 @@ and only has to provide:
   the collector loop. Use the subsystem's `PLATFORM_*_TIMEOUT_SECONDS`
   constant.
 - `static constexpr const char* kSubsystemName` — appears in the idle
-  timeout log line (e.g. `"PMU"`, `"L2Perf"`, `"TensorDump"`).
+  timeout log line (e.g. `"PMU"`, `"L2Swimlane"`, `"TensorDump"`).
 - `init(...)` and `finalize(...)` — domain-specific setup/teardown.
   `init` must call `set_memory_context()` on the success path so
   `start(tf)` is not a no-op. `finalize` must release framework-owned
@@ -297,7 +297,7 @@ Existing collectors are the canonical examples:
   — single kind, per-core instances. See [pmu-profiling.md](dfx/pmu-profiling.md).
 - [`TensorDumpCollector`](../src/a2a3/platform/include/host/tensor_dump_collector.h)
   — single kind, per-AICPU-thread instances. See [tensor-dump.md](dfx/tensor-dump.md).
-- [`L2PerfCollector`](../src/a2a3/platform/include/host/l2_perf_collector.h)
+- [`L2SwimlaneCollector`](../src/a2a3/platform/include/host/l2_swimlane_collector.h)
   — two kinds (perf records + phase markers), per-core / per-thread
   instances; the canonical multi-kind example. See
   [l2-swimlane-profiling.md](dfx/l2-swimlane-profiling.md).
@@ -332,8 +332,8 @@ changes capture that:
    **not** called from the mgmt loop — it would race with AICPU writes
    to device-only fields (`current_buf_ptr`, `total/dropped/mismatch`
    counters, `queue_tails`, `free_queue.head`,
-   `AicpuPhaseHeader::magic`, `core_to_thread[]`), rolling them back
-   to whatever the host shadow had at the start of the tick. Per-buffer payloads (`L2PerfBuffer` / `PmuBuffer` /
+   `L2SwimlaneAicpuPhaseHeader::magic`, `core_to_thread[]`), rolling them back
+   to whatever the host shadow had at the start of the tick. Per-buffer payloads (`L2SwimlaneAicpuTaskBuffer` / `PmuBuffer` /
    `DumpMetaBuffer`) are still pulled on demand inside
    `ProfilerAlgorithms::process_entry` after resolving the host pointer
    for a popped ready entry. The bulk `mirror_shm_to_device` is kept
@@ -363,7 +363,7 @@ per-core ring/reg addresses travel through `KernelArgs`:
 | `KernelArgs` field | Producer | Consumer |
 | ------------------ | -------- | -------- |
 | `enable_profiling_flag` (bitmask) | host (DeviceRunner) | AICPU `kernel.cpp` → `set_l2_swimlane_enabled` / `set_pmu_enabled` / `set_dump_tensor_enabled`; AICore `KERNEL_ENTRY` → `set_aicore_profiling_flag` |
-| `aicore_l2_perf_ring_addrs` (table) | host (`L2PerfCollector::initialize`) | AICore `KERNEL_ENTRY` indexes `table[block_idx]` → `set_aicore_l2_perf_ring` |
+| `aicore_l2_swimlane_ring_addrs` (table) | host (`L2SwimlaneCollector::initialize`) | AICore `KERNEL_ENTRY` indexes `table[block_idx]` → `set_aicore_l2_swimlane_ring` |
 | `aicore_pmu_ring_addrs` (table) | host (`PmuCollector::init`) | AICore `KERNEL_ENTRY` → `set_aicore_pmu_ring` |
 | `regs` (per-physical-core register-base table) | host (already required for AICPU MMIO) | AICore `KERNEL_ENTRY` resolves `regs[get_physical_core_id()]` → `set_aicore_pmu_reg_base`; AICore `aicore_execute` caches the value at Phase-3 |
 
@@ -376,16 +376,16 @@ state surface, never the runtime protocol.
 
 ### 8.2 Stable AICore staging ring (decouples AICore write from AICPU buffer rotation)
 
-L2Perf and PMU on a5 both use the "AICore writes, AICPU commits" model.
+L2Swimlane and PMU on a5 both use the "AICore writes, AICPU commits" model.
 The AICore-side write target is a per-core
-[`L2PerfAicoreRing`](../src/a5/platform/include/common/l2_perf_profiling.h) /
+[`L2SwimlaneAicoreRing`](../src/a5/platform/include/common/l2_swimlane_profiling.h) /
 [`PmuAicoreRing`](../src/a5/platform/include/common/pmu_profiling.h) of
 `PLATFORM_{L2,PMU}_AICORE_RING_SIZE` (= 2, dual-issue) slots, allocated
 once by the host and addressed by
 `BufferState::aicore_ring_ptr` (AICPU-visible) and the per-core
 `aicore_*_ring_addrs[block_idx]` (AICore-visible). The address is
 never reassigned, so AICore's write target is stable across AICPU's
-rotating `L2PerfBuffer` / `PmuBuffer` flips — flipping is now
+rotating `L2SwimlaneAicpuTaskBuffer` / `PmuBuffer` flips — flipping is now
 fully internal to `*_complete_record` and never crosses into Handshake.
 
 Everything else — Module concept contract, alloc policy

diff --git a/docs/profiling-name-map.md b/docs/profiling-name-map.md
@@ -2,7 +2,7 @@
 
 ## Problem
 
-Profiling data (`l2_perf_records.json`) identifies tasks by numeric IDs
+Profiling data (`l2_swimlane_records.json`) identifies tasks by numeric IDs
 (e.g., `func_id: 0`).  Without a mapping, swimlane visualizations show
 opaque labels like `func_0_a(t0)` instead of human-readable names like
 `QK(t0)`.
@@ -45,7 +45,7 @@ Every level uses the same structure:
 ### L2 (Orchestration + Incores)
 
 `callable_id` = incore `func_id` (the integer assigned in the CALLABLE
-spec).  These are the same IDs that appear in L2 perf data.
+spec).  These are the same IDs that appear in L2 swimlane data.
 
 ```json
 {
@@ -147,10 +147,10 @@ takes precedence over `-k` (kernel_config.py):
 # Automatic (via SceneTest profiling)
 pytest tests/st/... --platform a5onboard --enable-l2-swimlane
 
-# Manual (paths land alongside l2_perf_records.json inside the same
+# Manual (paths land alongside l2_swimlane_records.json inside the same
 # <output_prefix> directory)
 python -m simpler_setup.tools.swimlane_converter \
-    outputs/<case>_<ts>/l2_perf_records.json \
+    outputs/<case>_<ts>/l2_swimlane_records.json \
     --func-names outputs/<case>_<ts>/name_map_TestPA_basic.json
 
 python -m simpler_setup.tools.deps_to_graph \
@@ -169,7 +169,7 @@ cannot collide.
 
 ```text
 outputs/TestPA_basic_20260416_151301/
-  l2_perf_records.json         # perf data (runtime)
+  l2_swimlane_records.json     # perf data (runtime)
   name_map_TestPA_basic.json   # name mapping (SceneTest)
   merged_swimlane.json         # Perfetto trace (converter)
 ```
diff --git a/docs/sim_multi_device_isolation.md b/docs/sim_multi_device_isolation.md
@@ -24,7 +24,7 @@ Communication uses a 4096-byte shared-memory mailbox per chip — the same layou
 
 ## Why Not Fix the Globals
 
-The global state in `host_runtime.so` spans multiple files (`cpu_sim_context.cpp`, `platform_aicpu_affinity.cpp`, `l2_perf_collector_aicpu.cpp`, `device_log.cpp`) and is deeply embedded in the AICPU/AICore thread model. Fixing each one individually is fragile. Process isolation solves all of them at once with zero platform code changes.
+The global state in `host_runtime.so` spans multiple files (`cpu_sim_context.cpp`, `platform_aicpu_affinity.cpp`, `l2_swimlane_collector_aicpu.cpp`, `device_log.cpp`) and is deeply embedded in the AICPU/AICore thread model. Fixing each one individually is fragile. Process isolation solves all of them at once with zero platform code changes.
 
 ## Files
 

diff --git a/docs/testing.md b/docs/testing.md
@@ -104,7 +104,7 @@ python test_xxx.py -p a2a3sim --log-level debug                  # verbose C++ l
 | `--case SEL` | | (all) | Case selector, repeatable: `Foo`, `ClassA::Foo`, `ClassA::` |
 | `--manual` | | `exclude` | `exclude`/`include`/`only` for manual cases |
 | `--skip-golden` | | false | Skip golden comparison (for benchmarking) |
-| `--enable-l2-swimlane [PERF_LEVEL]` | | `0` | Enable L2 swimlane collection on first round only. The flag takes an integer perf_level 0–4 (bare = 4); see [docs/dfx/l2-swimlane-profiling.md](dfx/l2-swimlane-profiling.md#31-enable-l2-swimlane) for the level table. Each test case gets its own `outputs/<case>_<ts>/` directory under which `l2_perf_records.json` lands; parallel runs never collide. |
+| `--enable-l2-swimlane [PERF_LEVEL]` | | `0` | Enable L2 swimlane collection on first round only. The flag takes an integer perf_level 0–4 (bare = 4); see [docs/dfx/l2-swimlane-profiling.md](dfx/l2-swimlane-profiling.md#31-enable-l2-swimlane) for the level table. Each test case gets its own `outputs/<case>_<ts>/` directory under which `l2_swimlane_records.json` lands; parallel runs never collide. |
 | `--dump-tensor` | | false | Dump per-task tensor I/O during runtime execution |
 | `--enable-pmu [EVENT_TYPE]` | | `0` | Enable a2a3 PMU CSV collection. Bare flag selects `PIPE_UTILIZATION` (`2`); pass an event type such as `4` for `MEMORY`. |
 | `--exitfirst` | `-x` | false | Stop on first failing test (fail-fast, primarily for CI) |
@@ -318,13 +318,13 @@ A single file can declare both L2 and L3 classes; they're grouped by `(runtime,
 
 Each test case sets its own `CallConfig.output_prefix` (chosen by `scene_test.py::_build_output_prefix` as `outputs/<ClassName>_<case>_<YYYYMMDD_HHMMSS>/`). The C++ runtime writes all diagnostic artifacts under that prefix with fixed filenames:
 
-- `outputs/<case>_<ts>/l2_perf_records.json` — swimlane (`--enable-l2-swimlane`)
+- `outputs/<case>_<ts>/l2_swimlane_records.json` — swimlane (`--enable-l2-swimlane`)
 - `outputs/<case>_<ts>/tensor_dump/` — tensor dump (`--dump-tensor`)
 - `outputs/<case>_<ts>/pmu.csv` — PMU counters (`--enable-pmu`)
 
 Because each case gets its own directory, parallel runs (xdist workers, L3 case fanout, L2 device fanout) can never collide on filename — there is no per-file timestamp, no env-var scoping, and no post-run flatten step. `CallConfig::validate()` throws if any diagnostic flag is enabled but `output_prefix` is empty; `scene_test.py::run_class_cases` always fills it from the case label.
 
-Standalone invocations of CLIs (`python -m simpler_setup.tools.swimlane_converter`, etc.) auto-detect the latest `outputs/*/l2_perf_records.json` (sorted by mtime); pass `--input <path>` to override.
+Standalone invocations of CLIs (`python -m simpler_setup.tools.swimlane_converter`, etc.) auto-detect the latest `outputs/*/l2_swimlane_records.json` (sorted by mtime); pass `--input <path>` to override.
 
 ### Dispatcher skip conditions (normal pytest runs)