Refactor: AICPU fills AICore rotation table (move from host)#921
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis PR shifts responsibility for populating an AICore rotation-table from the host to AICPU. The host now allocates the buffer and passes its device address to AICPU via a new setter function. AICPU fills the table with per-core rotation addresses during device initialization, eliminating host-side address translation. ChangesRotation table population coordination
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request decouples the host from the AICore-side shared-memory layout by moving the population of the AICore rotation-table device pointer from the host to the AICPU. The host now only allocates the table and passes its pointer, while the AICPU directly populates it with device-internal addresses during initialization. The review feedback recommends defensively clamping the worker_count parameter read from shared memory to prevent potential out-of-bounds writes or stack overflows, along with logging any clamping events for observability.
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/a2a3/platform/src/aicpu/l2_perf_collector_aicpu.cpp (1)
144-164:⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy liftDon't populate the rotation table after AICore can already consume it.
AICore reads
aicore_ring_addr[block_idx]in kernel entry, but this code only fills the table duringl2_perf_aicpu_init(). In sim,src/a2a3/platform/sim/host/device_runner.cppstill starts AICore workers without waiting for that init to finish, andsrc/a2a3/platform/src/host/l2_perf_collector.cppnow leaves the table bytes uninitialized. That means AICore can cache a garbage/nullAicoreRotation*for the whole run. Either keep the pre-launch host fill, or add an explicit init-complete barrier before AICore reads the table.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/platform/src/aicpu/l2_perf_collector_aicpu.cpp` around lines 144 - 164, The rotation table (rotation_table / g_platform_aicore_rotation_table) is being filled only in l2_perf_aicpu_init() after AICore may already read aicore_ring_addr[block_idx], causing AICore to see uninitialized pointers; fix by ensuring the table is populated before AICore starts: either restore the pre-launch host-side fill in the host path that launches workers (the code in l2_perf_collector.cpp that used to populate per-core rotation entries) or add an explicit init-complete handshake (e.g., a g_platform_aicore_rotation_table_ready atomic/flag set after filling in l2_perf_aicpu_init() and checked/waited on by AICore startup code that reads aicore_ring_addr in the kernel entry); update get_aicore_buffer_state / l2_perf_aicpu_init to set the flag after writing rotation_table and ensure device_runner/AICore startup waits for that flag before reading aicore_ring_addr.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@src/a2a3/platform/src/aicpu/l2_perf_collector_aicpu.cpp`:
- Around line 144-164: The rotation table (rotation_table /
g_platform_aicore_rotation_table) is being filled only in l2_perf_aicpu_init()
after AICore may already read aicore_ring_addr[block_idx], causing AICore to see
uninitialized pointers; fix by ensuring the table is populated before AICore
starts: either restore the pre-launch host-side fill in the host path that
launches workers (the code in l2_perf_collector.cpp that used to populate
per-core rotation entries) or add an explicit init-complete handshake (e.g., a
g_platform_aicore_rotation_table_ready atomic/flag set after filling in
l2_perf_aicpu_init() and checked/waited on by AICore startup code that reads
aicore_ring_addr in the kernel entry); update get_aicore_buffer_state /
l2_perf_aicpu_init to set the flag after writing rotation_table and ensure
device_runner/AICore startup waits for that flag before reading
aicore_ring_addr.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: d1e03ec4-9cc5-4d34-9873-0a4641e08deb
📒 Files selected for processing (6)
src/a2a3/platform/include/aicpu/l2_perf_collector_aicpu.hsrc/a2a3/platform/onboard/aicpu/kernel.cppsrc/a2a3/platform/sim/host/device_runner.cppsrc/a2a3/platform/sim/host/device_runner.hsrc/a2a3/platform/src/aicpu/l2_perf_collector_aicpu.cppsrc/a2a3/platform/src/host/l2_perf_collector.cpp
1a7650e to
c378e9d
Compare
Before: host computed each per-core `&L2PerfAicoreBufferState::rotation` device address (host-to-device translation against `perf_dev_ptr`) and wrote the resulting uint64_t into the `aicore_ring_addr` table inside `L2PerfCollector::initialize`. This coupled host to the AICore-side shared-memory layout — it had to reach into `get_aicore_buffer_state` and translate addresses for a table whose entries are pure device- internal pointers (only AICore reads them). After: AICPU fills the table inside `l2_perf_aicpu_init` reading `&ac_state->rotation` directly (no translation needed — it's already on the device). Host's job shrinks to "alloc the table bytes and hand the device pointer to AICPU via `KernelArgs::aicore_ring_addr`". To make this safe, AICore now defers reading the table until the first task is dispatched (AICPU writes the dispatch register only after init has completed, so by the time AICore reaches into the table, the entries are populated): - AICore kernel entry stashes the *slot pointer* (`&table[block_idx]`) via the new `set_aicore_rotation_slot`, NOT the dereferenced rotation address. Old `set_aicore_rotation` is removed. - `get_aicore_rotation()` lazily dereferences the slot on first call and caches the result. - aicore_executor moves the `get_aicore_rotation()` call from pre-loop init into the per-task branch (still amortized to 0 after first task since the getter caches). Without this deferral, AICore's kernel-entry deref would race with AICPU's `l2_perf_aicpu_init` table fill — observed as a hardware-only `aclrtSynchronizeStreamWithTimeout (AICPU) failed: 507018` because AICore stashed a garbage rotation pointer at entry and crashed when later dcci'ing it. The plumbing follows the existing `set_platform_l2_perf_base` pattern: - New `set_platform_aicore_rotation_table` / `get_*` setters on AICPU - On onboard: AICPU `kernel.cpp` calls the setter from `k_args` - On sim: host dlsyms the setter and calls it before triggering AICPU a2a3 only — a5 uses an independent dev alloc for each AICore ring (`state->aicore_ring_ptr`), so the host already has the dev address and the translation issue doesn't exist there. Tested: a2a3 + a2a3sim build clean; tests/st/.../l2_swimlane pass on both a2a3sim and a2a3 device 1 (test_l2_swimlane + test_l2_swimlane_mixed).
A holistic pass over the L2 perf type names in src/{a2a3,a5}/platform/.
Renames consolidate the misleading parts and drop two fields that have
been dead since hw-native-sys#878 / hw-native-sys#921. No on-device behaviour change beyond the
field removals.
== Step 1: buffer name symmetry ==
L2PerfBuffer -> L2PerfAicpuBuffer
PhaseBuffer -> AicpuPhaseBuffer
The producer is now in the name: AICPU-written buffers carry `Aicpu`,
mirroring the existing `L2PerfAicoreBuffer` for the AICore-written one.
On a2a3, AicpuPhaseBuffer also moves from a hand-written struct to the
existing `TypedBuffer<R, N>` template -- matches L2PerfAicpuBuffer and
L2PerfAicoreBuffer.
== Step 4: pool struct rename ==
AicpuBufferState -> L2PerfAicpuPool
L2PerfAicoreBufferState -> L2PerfAicorePool
PhaseBufferState -> AicpuPhasePool (alias of L2PerfAicpuPool)
Replaces the misleading `BufferState` suffix (it was a *pool* state
holding free_queue + counters, not a buffer) and gives the Phase usage
its own explicit type name even though it shares layout with
L2PerfAicpuPool.
== Step 2: BufferKind enum ==
ReadyQueueEntry::is_phase (uint32_t) -> kind (BufferKind, uint32_t underlying)
enum class BufferKind : uint32_t {
L2PerfAicpu = 0,
Phase = 1,
L2PerfAicore = 2,
};
The field was already a 3-value discriminator (despite the bool-shaped
name); enqueue sites and the host's switch site now read in terms of
the named enum instead of magic numbers. uint32_t underlying preserves
the byte layout -- no ABI break.
== Step 3: drop dead fields (a2a3 only) ==
L2PerfAicpuPool::aicore_ring_ptr -- removed
L2PerfAicpuPool::mismatch_record_count -- removed
Both fields have been dead since:
- hw-native-sys#878 moved AICore writes to a per-core L2PerfAicoreBuffer pool with
its own state (L2PerfAicorePool), so `aicore_ring_ptr` no longer
points anywhere live.
- hw-native-sys#921 moved the rotation-table fill from host to AICPU, removing the
last code path that even read `aicore_ring_ptr` (host's flush join
now indexes the AICore pool directly).
- `mismatch_record_count` was last written before hw-native-sys#878 (legacy ring/
task_id invariant violation) and is no longer written by anyone;
the host reconcile() loop's reading-and-warning on a counter that
can never advance is removed.
Pad is adjusted to keep `sizeof(L2PerfAicpuPool) == 192` (static_assert
remains green). Host reconcile() loop simplifies to `total/dropped` only.
a5 keeps both fields -- its AICore pipeline still uses the legacy
staging-ring design where the host's per-core `aicore_ring_ptr` and the
mismatch counter are both live.
== Step 5: deferred ==
Collapsing AicpuPhaseHeader into the root L2PerfDataHeader is a
shared-memory layout change and is left for a focused follow-up PR.
== Test plan ==
- All four platform variants (a2a3, a2a3sim, a5, a5sim) build clean
- tests/st/.../l2_swimlane pass on a2a3sim
- Same ST passes on a2a3 device 1
- a5sim smoke (spmd_basic) passes
A holistic pass over the L2 perf type names in src/{a2a3,a5}/platform/.
Renames consolidate the misleading parts and drop two fields that have
been dead since hw-native-sys#878 / hw-native-sys#921. No on-device behaviour change beyond the
field removals.
== Step 1: buffer name symmetry ==
L2PerfBuffer -> L2PerfAicpuBuffer
PhaseBuffer -> AicpuPhaseBuffer
The producer is now in the name: AICPU-written buffers carry `Aicpu`,
mirroring the existing `L2PerfAicoreBuffer` for the AICore-written one.
On a2a3, AicpuPhaseBuffer also moves from a hand-written struct to the
existing `TypedBuffer<R, N>` template -- matches L2PerfAicpuBuffer and
L2PerfAicoreBuffer.
== Step 4: pool struct rename ==
AicpuBufferState -> L2PerfAicpuPool
L2PerfAicoreBufferState -> L2PerfAicorePool
PhaseBufferState -> AicpuPhasePool (alias of L2PerfAicpuPool)
Replaces the misleading `BufferState` suffix (it was a *pool* state
holding free_queue + counters, not a buffer) and gives the Phase usage
its own explicit type name even though it shares layout with
L2PerfAicpuPool.
== Step 2: BufferKind enum ==
ReadyQueueEntry::is_phase (uint32_t) -> kind (BufferKind, uint32_t underlying)
enum class BufferKind : uint32_t {
L2PerfAicpu = 0,
Phase = 1,
L2PerfAicore = 2,
};
The field was already a 3-value discriminator (despite the bool-shaped
name); enqueue sites and the host's switch site now read in terms of
the named enum instead of magic numbers. uint32_t underlying preserves
the byte layout -- no ABI break.
== Step 3: drop dead fields (a2a3 only) ==
L2PerfAicpuPool::aicore_ring_ptr -- removed
L2PerfAicpuPool::mismatch_record_count -- removed
Both fields have been dead since:
- hw-native-sys#878 moved AICore writes to a per-core L2PerfAicoreBuffer pool with
its own state (L2PerfAicorePool), so `aicore_ring_ptr` no longer
points anywhere live.
- hw-native-sys#921 moved the rotation-table fill from host to AICPU, removing the
last code path that even read `aicore_ring_ptr` (host's flush join
now indexes the AICore pool directly).
- `mismatch_record_count` was last written before hw-native-sys#878 (legacy ring/
task_id invariant violation) and is no longer written by anyone;
the host reconcile() loop's reading-and-warning on a counter that
can never advance is removed.
Pad is adjusted to keep `sizeof(L2PerfAicpuPool) == 192` (static_assert
remains green). Host reconcile() loop simplifies to `total/dropped` only.
a5 keeps both fields -- its AICore pipeline still uses the legacy
staging-ring design where the host's per-core `aicore_ring_ptr` and the
mismatch counter are both live.
== Step 5: deferred ==
Collapsing AicpuPhaseHeader into the root L2PerfDataHeader is a
shared-memory layout change and is left for a focused follow-up PR.
== Test plan ==
- All four platform variants (a2a3, a2a3sim, a5, a5sim) build clean
- tests/st/.../l2_swimlane pass on a2a3sim
- Same ST passes on a2a3 device 1
- a5sim smoke (spmd_basic) passes
A holistic naming pass over the L2 swimlane subsystem:
1. Type names get a consistent `L2Swimlane{Writer}{Kind}{Layer}` shape.
2. Pool struct rename (BufferState → Pool) — captures the actual
semantic (a pool of buffers, not "a buffer's state").
3. ReadyQueueEntry::is_phase (uint32_t magic) becomes
L2SwimlaneBufferKind enum class (uint32_t underlying — no ABI break).
4. a2a3 only: drop L2SwimlaneAicpuTaskPool::aicore_ring_ptr and
mismatch_record_count, both dead since hw-native-sys#878 / hw-native-sys#921.
5. File names l2_perf_*.{h,cpp} renamed to l2_swimlane_*.{h,cpp}.
== Renaming scheme ==
Records / buffers / pools follow `L2Swimlane{Writer}{Kind}{Layer}`:
L2PerfRecord -> L2SwimlaneAicpuTaskRecord
L2PerfAicoreRecord -> L2SwimlaneAicoreTaskRecord
AicpuPhaseRecord -> L2SwimlaneAicpuPhaseRecord
L2PerfBuffer -> L2SwimlaneAicpuTaskBuffer
L2PerfAicoreBuffer -> L2SwimlaneAicoreTaskBuffer
PhaseBuffer -> L2SwimlaneAicpuPhaseBuffer
L2PerfBufferState -> L2SwimlaneAicpuTaskPool
L2PerfAicoreBufferState -> L2SwimlaneAicoreTaskPool
PhaseBufferState -> L2SwimlaneAicpuPhasePool
L2PerfDataHeader -> L2SwimlaneDataHeader
L2PerfLevel -> L2SwimlaneLevel
L2PerfFreeQueue -> L2SwimlaneFreeQueue
L2PerfModule -> L2SwimlaneModule
L2PerfCollector -> L2SwimlaneCollector
AicpuPhaseId -> L2SwimlaneAicpuPhaseId
AicpuPhaseHeader -> L2SwimlaneAicpuPhaseHeader
AicoreRotation -> L2SwimlaneAicoreRotation
AicoreLocalState -> L2SwimlaneAicoreLocalState
L2PerfAicoreRing (a5) -> L2SwimlaneAicoreRing
Enums (uint32_t underlying — wire format preserved):
enum class L2SwimlaneBufferKind {
AicpuTask = 0,
AicpuPhase = 1,
AicoreTask = 2,
};
enum class ProfBufferType { AICPU_TASK = 0, AICPU_PHASE = 1, AICORE_TASK = 2 };
Functions / globals / statics: l2_perf_* -> l2_swimlane_*,
complete_record -> complete_task, flush_buffers -> flush.
KernelArgs cross-platform fields:
l2_perf_data_base -> l2_swimlane_base
aicore_ring_addr -> l2_swimlane_aicore_rotation_table
Files:
include/{common,host,aicpu,aicore}/l2_perf_*.{h,cpp}
-> include/{common,host,aicpu,aicore}/l2_swimlane_*.{h,cpp}
Output artifact: l2_perf_records.json -> l2_swimlane_records.json
== Dead-field drop (a2a3 only) ==
L2SwimlaneAicpuTaskPool::aicore_ring_ptr — removed
L2SwimlaneAicpuTaskPool::mismatch_record_count — removed
Both fields have been dead since:
- hw-native-sys#878 moved AICore writes to a per-core L2SwimlaneAicoreTaskBuffer
pool with its own state (L2SwimlaneAicoreTaskPool).
- hw-native-sys#921 moved the rotation-table fill from host to AICPU.
- mismatch_record_count was last written before hw-native-sys#878.
Pad adjusted to keep sizeof(L2SwimlaneAicpuTaskPool) == 192 (static_assert
remains green). Host reconcile() loop simplifies to total/dropped only.
a5 keeps both fields — its AICore pipeline still uses the legacy
staging-ring design where they are live.
== Behaviour ==
Pure rename plus the dead-field drop. No struct layout change beyond
removing 12 bytes that nobody read or wrote. uint32_t-underlying enums
preserve wire format.
== Test plan ==
- All four platform variants (a2a3, a2a3sim, a5, a5sim) build clean
- tests/st/.../l2_swimlane pass on a2a3sim
- Same ST passes on a2a3 device 1
- a5sim smoke (spmd_basic) passes
A holistic naming pass over the L2 swimlane subsystem:
1. Type names get a consistent `L2Swimlane{Writer}{Kind}{Layer}` shape.
2. Pool struct rename (BufferState → Pool) — captures the actual
semantic (a pool of buffers, not "a buffer's state").
3. ReadyQueueEntry::is_phase (uint32_t magic) becomes
L2SwimlaneBufferKind enum class (uint32_t underlying — no ABI break).
4. a2a3 only: drop L2SwimlaneAicpuTaskPool::aicore_ring_ptr and
mismatch_record_count, both dead since hw-native-sys#878 / hw-native-sys#921.
5. File names l2_perf_*.{h,cpp} renamed to l2_swimlane_*.{h,cpp}.
== Renaming scheme ==
Records / buffers / pools follow `L2Swimlane{Writer}{Kind}{Layer}`:
L2PerfRecord -> L2SwimlaneAicpuTaskRecord
L2PerfAicoreRecord -> L2SwimlaneAicoreTaskRecord
AicpuPhaseRecord -> L2SwimlaneAicpuPhaseRecord
L2PerfBuffer -> L2SwimlaneAicpuTaskBuffer
L2PerfAicoreBuffer -> L2SwimlaneAicoreTaskBuffer
PhaseBuffer -> L2SwimlaneAicpuPhaseBuffer
L2PerfBufferState -> L2SwimlaneAicpuTaskPool
L2PerfAicoreBufferState -> L2SwimlaneAicoreTaskPool
PhaseBufferState -> L2SwimlaneAicpuPhasePool
L2PerfDataHeader -> L2SwimlaneDataHeader
L2PerfLevel -> L2SwimlaneLevel
L2PerfFreeQueue -> L2SwimlaneFreeQueue
L2PerfModule -> L2SwimlaneModule
L2PerfCollector -> L2SwimlaneCollector
AicpuPhaseId -> L2SwimlaneAicpuPhaseId
AicpuPhaseHeader -> L2SwimlaneAicpuPhaseHeader
AicoreRotation -> L2SwimlaneAicoreRotation
AicoreLocalState -> L2SwimlaneAicoreLocalState
L2PerfAicoreRing (a5) -> L2SwimlaneAicoreRing
Enums (uint32_t underlying — wire format preserved):
enum class L2SwimlaneBufferKind {
AicpuTask = 0,
AicpuPhase = 1,
AicoreTask = 2,
};
enum class ProfBufferType { AICPU_TASK = 0, AICPU_PHASE = 1, AICORE_TASK = 2 };
Functions / globals / statics: l2_perf_* -> l2_swimlane_*,
complete_record -> complete_task, flush_buffers -> flush.
KernelArgs cross-platform fields:
l2_perf_data_base -> l2_swimlane_base
aicore_ring_addr -> l2_swimlane_aicore_rotation_table
Files:
include/{common,host,aicpu,aicore}/l2_perf_*.{h,cpp}
-> include/{common,host,aicpu,aicore}/l2_swimlane_*.{h,cpp}
Output artifact: l2_perf_records.json -> l2_swimlane_records.json
== Dead-field drop (a2a3 only) ==
L2SwimlaneAicpuTaskPool::aicore_ring_ptr — removed
L2SwimlaneAicpuTaskPool::mismatch_record_count — removed
Both fields have been dead since:
- hw-native-sys#878 moved AICore writes to a per-core L2SwimlaneAicoreTaskBuffer
pool with its own state (L2SwimlaneAicoreTaskPool).
- hw-native-sys#921 moved the rotation-table fill from host to AICPU.
- mismatch_record_count was last written before hw-native-sys#878.
Pad adjusted to keep sizeof(L2SwimlaneAicpuTaskPool) == 192 (static_assert
remains green). Host reconcile() loop simplifies to total/dropped only.
a5 keeps both fields — its AICore pipeline still uses the legacy
staging-ring design where they are live.
== Behaviour ==
Pure rename plus the dead-field drop. No struct layout change beyond
removing 12 bytes that nobody read or wrote. uint32_t-underlying enums
preserve wire format.
== Test plan ==
- All four platform variants (a2a3, a2a3sim, a5, a5sim) build clean
- tests/st/.../l2_swimlane pass on a2a3sim
- Same ST passes on a2a3 device 1
- a5sim smoke (spmd_basic) passes
A holistic naming pass over the L2 swimlane subsystem:
1. Type names get a consistent `L2Swimlane{Writer}{Kind}{Layer}` shape.
2. Pool struct rename (BufferState → Pool) — captures the actual
semantic (a pool of buffers, not "a buffer's state").
3. ReadyQueueEntry::is_phase (uint32_t magic) becomes
L2SwimlaneBufferKind enum class (uint32_t underlying — no ABI break).
4. a2a3 only: drop L2SwimlaneAicpuTaskPool::aicore_ring_ptr and
mismatch_record_count, both dead since hw-native-sys#878 / hw-native-sys#921.
5. File names l2_perf_*.{h,cpp} renamed to l2_swimlane_*.{h,cpp}.
== Renaming scheme ==
Records / buffers / pools follow `L2Swimlane{Writer}{Kind}{Layer}`:
L2PerfRecord -> L2SwimlaneAicpuTaskRecord
L2PerfAicoreRecord -> L2SwimlaneAicoreTaskRecord
AicpuPhaseRecord -> L2SwimlaneAicpuPhaseRecord
L2PerfBuffer -> L2SwimlaneAicpuTaskBuffer
L2PerfAicoreBuffer -> L2SwimlaneAicoreTaskBuffer
PhaseBuffer -> L2SwimlaneAicpuPhaseBuffer
L2PerfBufferState -> L2SwimlaneAicpuTaskPool
L2PerfAicoreBufferState -> L2SwimlaneAicoreTaskPool
PhaseBufferState -> L2SwimlaneAicpuPhasePool
L2PerfDataHeader -> L2SwimlaneDataHeader
L2PerfLevel -> L2SwimlaneLevel
L2PerfFreeQueue -> L2SwimlaneFreeQueue
L2PerfModule -> L2SwimlaneModule
L2PerfCollector -> L2SwimlaneCollector
AicpuPhaseId -> L2SwimlaneAicpuPhaseId
AicpuPhaseHeader -> L2SwimlaneAicpuPhaseHeader
AicoreRotation -> L2SwimlaneAicoreRotation
AicoreLocalState -> L2SwimlaneAicoreLocalState
L2PerfAicoreRing (a5) -> L2SwimlaneAicoreRing
Enums (uint32_t underlying — wire format preserved):
enum class L2SwimlaneBufferKind {
AicpuTask = 0,
AicpuPhase = 1,
AicoreTask = 2,
};
enum class ProfBufferType { AICPU_TASK = 0, AICPU_PHASE = 1, AICORE_TASK = 2 };
Functions / globals / statics: l2_perf_* -> l2_swimlane_*,
complete_record -> complete_task, flush_buffers -> flush.
KernelArgs cross-platform fields:
l2_perf_data_base -> l2_swimlane_base
aicore_ring_addr -> l2_swimlane_aicore_rotation_table
Files:
include/{common,host,aicpu,aicore}/l2_perf_*.{h,cpp}
-> include/{common,host,aicpu,aicore}/l2_swimlane_*.{h,cpp}
Output artifact: l2_perf_records.json -> l2_swimlane_records.json
== Dead-field drop (a2a3 only) ==
L2SwimlaneAicpuTaskPool::aicore_ring_ptr — removed
L2SwimlaneAicpuTaskPool::mismatch_record_count — removed
Both fields have been dead since:
- hw-native-sys#878 moved AICore writes to a per-core L2SwimlaneAicoreTaskBuffer
pool with its own state (L2SwimlaneAicoreTaskPool).
- hw-native-sys#921 moved the rotation-table fill from host to AICPU.
- mismatch_record_count was last written before hw-native-sys#878.
Pad adjusted to keep sizeof(L2SwimlaneAicpuTaskPool) == 192 (static_assert
remains green). Host reconcile() loop simplifies to total/dropped only.
a5 keeps both fields — its AICore pipeline still uses the legacy
staging-ring design where they are live.
== Behaviour ==
Pure rename plus the dead-field drop. No struct layout change beyond
removing 12 bytes that nobody read or wrote. uint32_t-underlying enums
preserve wire format.
== Test plan ==
- All four platform variants (a2a3, a2a3sim, a5, a5sim) build clean
- tests/st/.../l2_swimlane pass on a2a3sim
- Same ST passes on a2a3 device 1
- a5sim smoke (spmd_basic) passes
A holistic naming pass over the L2 swimlane subsystem:
1. Type names get a consistent `L2Swimlane{Writer}{Kind}{Layer}` shape.
2. Pool struct rename (BufferState → Pool) — captures the actual
semantic (a pool of buffers, not "a buffer's state").
3. ReadyQueueEntry::is_phase (uint32_t magic) becomes
L2SwimlaneBufferKind enum class (uint32_t underlying — no ABI break).
4. a2a3 only: drop L2SwimlaneAicpuTaskPool::aicore_ring_ptr and
mismatch_record_count, both dead since hw-native-sys#878 / hw-native-sys#921.
5. File names l2_perf_*.{h,cpp} renamed to l2_swimlane_*.{h,cpp}.
== Renaming scheme ==
Records / buffers / pools follow `L2Swimlane{Writer}{Kind}{Layer}`:
L2PerfRecord -> L2SwimlaneAicpuTaskRecord
L2PerfAicoreRecord -> L2SwimlaneAicoreTaskRecord
AicpuPhaseRecord -> L2SwimlaneAicpuPhaseRecord
L2PerfBuffer -> L2SwimlaneAicpuTaskBuffer
L2PerfAicoreBuffer -> L2SwimlaneAicoreTaskBuffer
PhaseBuffer -> L2SwimlaneAicpuPhaseBuffer
L2PerfBufferState -> L2SwimlaneAicpuTaskPool
L2PerfAicoreBufferState -> L2SwimlaneAicoreTaskPool
PhaseBufferState -> L2SwimlaneAicpuPhasePool
L2PerfDataHeader -> L2SwimlaneDataHeader
L2PerfLevel -> L2SwimlaneLevel
L2PerfFreeQueue -> L2SwimlaneFreeQueue
L2PerfModule -> L2SwimlaneModule
L2PerfCollector -> L2SwimlaneCollector
AicpuPhaseId -> L2SwimlaneAicpuPhaseId
AicpuPhaseHeader -> L2SwimlaneAicpuPhaseHeader
AicoreRotation -> L2SwimlaneAicoreRotation
AicoreLocalState -> L2SwimlaneAicoreLocalState
L2PerfAicoreRing (a5) -> L2SwimlaneAicoreRing
Enums (uint32_t underlying — wire format preserved):
enum class L2SwimlaneBufferKind {
AicpuTask = 0,
AicpuPhase = 1,
AicoreTask = 2,
};
enum class ProfBufferType { AICPU_TASK = 0, AICPU_PHASE = 1, AICORE_TASK = 2 };
Functions / globals / statics: l2_perf_* -> l2_swimlane_*,
complete_record -> complete_task, flush_buffers -> flush.
KernelArgs cross-platform fields:
l2_perf_data_base -> l2_swimlane_base
aicore_ring_addr -> l2_swimlane_aicore_rotation_table
Files:
include/{common,host,aicpu,aicore}/l2_perf_*.{h,cpp}
-> include/{common,host,aicpu,aicore}/l2_swimlane_*.{h,cpp}
Output artifact: l2_perf_records.json -> l2_swimlane_records.json
== Dead-field drop (a2a3 only) ==
L2SwimlaneAicpuTaskPool::aicore_ring_ptr — removed
L2SwimlaneAicpuTaskPool::mismatch_record_count — removed
Both fields have been dead since:
- hw-native-sys#878 moved AICore writes to a per-core L2SwimlaneAicoreTaskBuffer
pool with its own state (L2SwimlaneAicoreTaskPool).
- hw-native-sys#921 moved the rotation-table fill from host to AICPU.
- mismatch_record_count was last written before hw-native-sys#878.
Pad adjusted to keep sizeof(L2SwimlaneAicpuTaskPool) == 192 (static_assert
remains green). Host reconcile() loop simplifies to total/dropped only.
a5 keeps both fields — its AICore pipeline still uses the legacy
staging-ring design where they are live.
== Behaviour ==
Pure rename plus the dead-field drop. No struct layout change beyond
removing 12 bytes that nobody read or wrote. uint32_t-underlying enums
preserve wire format.
== Test plan ==
- All four platform variants (a2a3, a2a3sim, a5, a5sim) build clean
- tests/st/.../l2_swimlane pass on a2a3sim
- Same ST passes on a2a3 device 1
- a5sim smoke (spmd_basic) passes
A holistic naming pass over the L2 swimlane subsystem:
1. Type names get a consistent `L2Swimlane{Writer}{Kind}{Layer}` shape.
2. Pool struct rename (BufferState → Pool) — captures the actual
semantic (a pool of buffers, not "a buffer's state").
3. ReadyQueueEntry::is_phase (uint32_t magic) becomes
L2SwimlaneBufferKind enum class (uint32_t underlying — no ABI break).
4. a2a3 only: drop L2SwimlaneAicpuTaskPool::aicore_ring_ptr and
mismatch_record_count, both dead since hw-native-sys#878 / hw-native-sys#921.
5. File names l2_perf_*.{h,cpp} renamed to l2_swimlane_*.{h,cpp}.
== Renaming scheme ==
Records / buffers / pools follow `L2Swimlane{Writer}{Kind}{Layer}`:
L2PerfRecord -> L2SwimlaneAicpuTaskRecord
L2PerfAicoreRecord -> L2SwimlaneAicoreTaskRecord
AicpuPhaseRecord -> L2SwimlaneAicpuPhaseRecord
L2PerfBuffer -> L2SwimlaneAicpuTaskBuffer
L2PerfAicoreBuffer -> L2SwimlaneAicoreTaskBuffer
PhaseBuffer -> L2SwimlaneAicpuPhaseBuffer
L2PerfBufferState -> L2SwimlaneAicpuTaskPool
L2PerfAicoreBufferState -> L2SwimlaneAicoreTaskPool
PhaseBufferState -> L2SwimlaneAicpuPhasePool
L2PerfDataHeader -> L2SwimlaneDataHeader
L2PerfLevel -> L2SwimlaneLevel
L2PerfFreeQueue -> L2SwimlaneFreeQueue
L2PerfModule -> L2SwimlaneModule
L2PerfCollector -> L2SwimlaneCollector
AicpuPhaseId -> L2SwimlaneAicpuPhaseId
AicpuPhaseHeader -> L2SwimlaneAicpuPhaseHeader
AicoreRotation -> L2SwimlaneAicoreRotation
AicoreLocalState -> L2SwimlaneAicoreLocalState
L2PerfAicoreRing (a5) -> L2SwimlaneAicoreRing
Enums (uint32_t underlying — wire format preserved):
enum class L2SwimlaneBufferKind {
AicpuTask = 0,
AicpuPhase = 1,
AicoreTask = 2,
};
enum class ProfBufferType { AICPU_TASK = 0, AICPU_PHASE = 1, AICORE_TASK = 2 };
Functions / globals / statics: l2_perf_* -> l2_swimlane_*,
complete_record -> complete_task, flush_buffers -> flush.
KernelArgs cross-platform fields:
l2_perf_data_base -> l2_swimlane_base
aicore_ring_addr -> l2_swimlane_aicore_rotation_table
Files:
include/{common,host,aicpu,aicore}/l2_perf_*.{h,cpp}
-> include/{common,host,aicpu,aicore}/l2_swimlane_*.{h,cpp}
Output artifact: l2_perf_records.json -> l2_swimlane_records.json
== Dead-field drop (a2a3 only) ==
L2SwimlaneAicpuTaskPool::aicore_ring_ptr — removed
L2SwimlaneAicpuTaskPool::mismatch_record_count — removed
Both fields have been dead since:
- hw-native-sys#878 moved AICore writes to a per-core L2SwimlaneAicoreTaskBuffer
pool with its own state (L2SwimlaneAicoreTaskPool).
- hw-native-sys#921 moved the rotation-table fill from host to AICPU.
- mismatch_record_count was last written before hw-native-sys#878.
Pad adjusted to keep sizeof(L2SwimlaneAicpuTaskPool) == 192 (static_assert
remains green). Host reconcile() loop simplifies to total/dropped only.
a5 keeps both fields — its AICore pipeline still uses the legacy
staging-ring design where they are live.
== Behaviour ==
Pure rename plus the dead-field drop. No struct layout change beyond
removing 12 bytes that nobody read or wrote. uint32_t-underlying enums
preserve wire format.
== Test plan ==
- All four platform variants (a2a3, a2a3sim, a5, a5sim) build clean
- tests/st/.../l2_swimlane pass on a2a3sim
- Same ST passes on a2a3 device 1
- a5sim smoke (spmd_basic) passes
A holistic naming pass over the L2 swimlane subsystem:
1. Type names get a consistent `L2Swimlane{Writer}{Kind}{Layer}` shape.
2. Pool struct rename (BufferState → Pool) — captures the actual
semantic (a pool of buffers, not "a buffer's state").
3. ReadyQueueEntry::is_phase (uint32_t magic) becomes
L2SwimlaneBufferKind enum class (uint32_t underlying — no ABI break).
4. a2a3 only: drop L2SwimlaneAicpuTaskPool::aicore_ring_ptr and
mismatch_record_count, both dead since #878 / #921.
5. File names l2_perf_*.{h,cpp} renamed to l2_swimlane_*.{h,cpp}.
== Renaming scheme ==
Records / buffers / pools follow `L2Swimlane{Writer}{Kind}{Layer}`:
L2PerfRecord -> L2SwimlaneAicpuTaskRecord
L2PerfAicoreRecord -> L2SwimlaneAicoreTaskRecord
AicpuPhaseRecord -> L2SwimlaneAicpuPhaseRecord
L2PerfBuffer -> L2SwimlaneAicpuTaskBuffer
L2PerfAicoreBuffer -> L2SwimlaneAicoreTaskBuffer
PhaseBuffer -> L2SwimlaneAicpuPhaseBuffer
L2PerfBufferState -> L2SwimlaneAicpuTaskPool
L2PerfAicoreBufferState -> L2SwimlaneAicoreTaskPool
PhaseBufferState -> L2SwimlaneAicpuPhasePool
L2PerfDataHeader -> L2SwimlaneDataHeader
L2PerfLevel -> L2SwimlaneLevel
L2PerfFreeQueue -> L2SwimlaneFreeQueue
L2PerfModule -> L2SwimlaneModule
L2PerfCollector -> L2SwimlaneCollector
AicpuPhaseId -> L2SwimlaneAicpuPhaseId
AicpuPhaseHeader -> L2SwimlaneAicpuPhaseHeader
AicoreRotation -> L2SwimlaneAicoreRotation
AicoreLocalState -> L2SwimlaneAicoreLocalState
L2PerfAicoreRing (a5) -> L2SwimlaneAicoreRing
Enums (uint32_t underlying — wire format preserved):
enum class L2SwimlaneBufferKind {
AicpuTask = 0,
AicpuPhase = 1,
AicoreTask = 2,
};
enum class ProfBufferType { AICPU_TASK = 0, AICPU_PHASE = 1, AICORE_TASK = 2 };
Functions / globals / statics: l2_perf_* -> l2_swimlane_*,
complete_record -> complete_task, flush_buffers -> flush.
KernelArgs cross-platform fields:
l2_perf_data_base -> l2_swimlane_base
aicore_ring_addr -> l2_swimlane_aicore_rotation_table
Files:
include/{common,host,aicpu,aicore}/l2_perf_*.{h,cpp}
-> include/{common,host,aicpu,aicore}/l2_swimlane_*.{h,cpp}
Output artifact: l2_perf_records.json -> l2_swimlane_records.json
== Dead-field drop (a2a3 only) ==
L2SwimlaneAicpuTaskPool::aicore_ring_ptr — removed
L2SwimlaneAicpuTaskPool::mismatch_record_count — removed
Both fields have been dead since:
- #878 moved AICore writes to a per-core L2SwimlaneAicoreTaskBuffer
pool with its own state (L2SwimlaneAicoreTaskPool).
- #921 moved the rotation-table fill from host to AICPU.
- mismatch_record_count was last written before #878.
Pad adjusted to keep sizeof(L2SwimlaneAicpuTaskPool) == 192 (static_assert
remains green). Host reconcile() loop simplifies to total/dropped only.
a5 keeps both fields — its AICore pipeline still uses the legacy
staging-ring design where they are live.
== Behaviour ==
Pure rename plus the dead-field drop. No struct layout change beyond
removing 12 bytes that nobody read or wrote. uint32_t-underlying enums
preserve wire format.
== Test plan ==
- All four platform variants (a2a3, a2a3sim, a5, a5sim) build clean
- tests/st/.../l2_swimlane pass on a2a3sim
- Same ST passes on a2a3 device 1
- a5sim smoke (spmd_basic) passes
Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>
Summary
Move the AICore rotation-table filling step from host into AICPU.
The table is a flat `uint64_t[num_aicore]` whose entries are per-core
`&L2PerfAicoreBufferState::rotation` device addresses. AICore reads
`aicore_ring_addr[block_idx]` from KernelArgs to find its rotation
channel.
Before
Host did the fill (a2a3 `l2_perf_collector.cpp::initialize`):
```cpp
auto host_to_dev = [&](void *host_addr) -> uint64_t {
uintptr_t offset = (uintptr_t)host_addr - (uintptr_t)perf_host_ptr;
return (uint64_t)perf_dev_ptr + offset;
};
for (int i = 0; i < num_aicore; i++) {
L2PerfAicoreBufferState *ac_state = get_aicore_buffer_state(...);
rotation_table[i] = host_to_dev(&ac_state->rotation);
}
```
This coupled host to the AICore-side shared-memory layout: had to call
`get_aicore_buffer_state` and translate host pointers to device
pointers, for a table whose entries are pure device-internal addresses
(only AICore reads them).
After
Host allocates the table bytes and hands the device pointer to AICPU
via the existing `KernelArgs::aicore_ring_addr`. AICPU fills the
table inside `l2_perf_aicpu_init` reading
`&ac_state->rotation` directly — no translation needed since AICPU
already runs on the device.
Plumbing follows the existing `set_platform_l2_perf_base` pattern:
Scope
a2a3 only. a5 uses an independent dev alloc for each AICore ring
(`state->aicore_ring_ptr`), so the host already has the dev address at
alloc time and the translation issue doesn't exist there.
Why
This is part of a holistic L2 perf refactor (see PR #916 for the
`L2PerfBufferState` → `AicpuBufferState` rename). The principle: host
should know how many bytes to allocate and where to hand them to AICPU;
it should not have to know that `AicoreRotation` is a field of
`L2PerfAicoreBufferState` at offset Y. After this PR, host's only
contact with the rotation channel is "alloc the address table + put the
table ptr in KernelArgs". The rotation protocol itself stays purely
between AICPU (writer) and AICore (reader).
Test plan
`test_l2_swimlane_mixed`) pass on `a2a3sim`
Independent of #916.