Refactor: unify L2 swimlane pools around L2SwimlaneActiveHead cache line by hw-native-sys-bot · Pull Request #939 · hw-native-sys/simpler

hw-native-sys-bot · 2026-05-31T06:36:21Z

Summary

Three pool types (AICPU task, AICPU phase, AICore task) previously had ad-hoc layouts. AICPU/Phase pool was 192B (free_queue + scattered counters + pad); AICore pool was 256B (rotation + free_queue + counters + pad).

This PR extracts the common producer-side cache line as L2SwimlaneActiveHead (64B: current_buf_ptr + current_buf_seq + total/dropped record counts). Every pool is now exactly head + free_queue = 192B. The standalone L2SwimlaneAicoreRotation struct is deleted — its old fields merge into the head's current_buf_ptr + current_buf_seq (semantics identical: AICore reads seq to detect a rotation, then reads ptr to pick up the new buffer).

Savings: 64B per AICore core (256→192). Removes the duplicated "buf_seq / generation" counter pair that was bumped twice per rotation.
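The layout described above can be sketched as follows. This is only an illustration: the field names and the 64B / 128B / 192B / 256B figures come from the description, while the field widths, the padding, and the `*Sketch` type names are assumptions and may not match the real header.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed field widths; only the names and the 64B total are from the PR text.
struct alignas(64) ActiveHeadSketch {
    uint64_t current_buf_ptr;       // address of the buffer currently being filled
    uint32_t current_buf_seq;       // bumped once per rotation
    uint32_t total_record_count;
    uint32_t dropped_record_count;
    uint8_t pad[44];                // 20 bytes of fields + 44 = one 64B cache line
};

// Opaque stand-in for the 128B free queue; its real fields are not shown here.
struct alignas(64) FreeQueueSketch {
    uint8_t opaque[128];
};

// Every pool is head + free_queue.
struct PoolSketch {
    ActiveHeadSketch head;
    FreeQueueSketch free_queue;
};

static_assert(sizeof(ActiveHeadSketch) == 64, "head must be one cache line");
static_assert(sizeof(PoolSketch) == 192, "pool must be 192 bytes");
static_assert(offsetof(PoolSketch, head) == 0, "head must stay first");
static_assert(offsetof(PoolSketch, free_queue) == 64, "free_queue must follow head");

constexpr std::size_t kOldAicorePoolBytes = 256;  // pre-refactor AICore pool size

// Bytes saved across num_cores AICore pools (256 -> 192 each).
constexpr std::size_t aicore_bytes_saved(std::size_t num_cores) {
    return num_cores * (kOldAicorePoolBytes - sizeof(PoolSketch));
}
```

For example, `aicore_bytes_saved(8)` evaluates to 512 under these assumptions.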

Not a pure rename: aicore_rotate publish ordering strengthened

The new publish sequence in aicore_rotate adds an explicit wmb() between the ptr write and the seq write:

// New (3 wmb)
wmb();
ac_state->head.current_buf_ptr = new_buf_ptr;
wmb();   // ← new: ordering fence between the two volatile stores
ac_state->head.current_buf_seq = seq + 1;
wmb();

The old code had two adjacent volatile stores (rotation.current_buf_ptr then rotation.generation) under a single trailing wmb(). On ARM/aarch64's weak memory model this leaves room for AICore's dcci+read to observe the new seq with a stale ptr. The added fence closes that window.

The init-prime path uses the same publish pattern now (ptr → wmb → seq → wmb), for symmetry and future-proofing against changes that let AICore poll before the first dispatch.
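A portable analogue of the ptr → fence → seq ordering, written with standard C++ atomics rather than the platform's `wmb()` and `dcci` primitives. This is a sketch, not the PR's code: `HeadSketch`, `publish_rotation`, and `poll_rotation` are hypothetical names, it models only the fence between the two stores, and the acquire fence on the reader side stands in for whatever AICore actually does after invalidating the line.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

struct HeadSketch {
    std::atomic<uint64_t> current_buf_ptr{0};
    std::atomic<uint32_t> current_buf_seq{0};
};

// Writer (single producer): store the pointer, fence, then bump the sequence.
inline void publish_rotation(HeadSketch& head, uint64_t new_buf_ptr) {
    const uint32_t seq = head.current_buf_seq.load(std::memory_order_relaxed);
    head.current_buf_ptr.store(new_buf_ptr, std::memory_order_relaxed);
    // Plays the role of the added wmb(): the ptr store is ordered before the seq store.
    std::atomic_thread_fence(std::memory_order_release);
    head.current_buf_seq.store(seq + 1, std::memory_order_relaxed);
}

// Reader: check seq first; only on a change read ptr.
// Returns true when a rotation was observed and buf_ptr was refreshed.
inline bool poll_rotation(const HeadSketch& head, uint32_t& cached_seq, uint64_t& buf_ptr) {
    const uint32_t seq = head.current_buf_seq.load(std::memory_order_relaxed);
    if (seq == cached_seq) {
        return false;  // no rotation since the last poll
    }
    // Pairs with the writer's release fence: a reader that saw the new seq
    // also sees the ptr written before it.
    std::atomic_thread_fence(std::memory_order_acquire);
    buf_ptr = head.current_buf_ptr.load(std::memory_order_relaxed);
    cached_seq = seq;
    return true;
}
```

Without the fence between the two writer stores, nothing in this model stops the seq store from becoming visible first, which is the stale-ptr window described above.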

Pre-existing bug fix: drop the pre-emptive `dropped += BUFFER_SIZE`

When aicore_rotate failed (free queue empty OR ready queue full) the old code pre-emptively bumped dropped_record_count += PLATFORM_AICORE_BUFFER_SIZE. That over-counted whenever:

- The OLD buffer still made it to the host via the flush retry path → collected += BUFFER_SIZE and dropped += BUFFER_SIZE for the same records, breaking the collected + dropped == total reconcile invariant.
- The run ended before AICore actually overflowed the slot guard for a full BUFFER_SIZE more tasks → the pre-emptive drop was larger than the real loss.

Fix: remove the pre-emptive bump in both failure branches. The slot guard on AICore silently drops further records (which AICPU has no precise count of), and reconcile reports the gap as silent_loss = total - collected - dropped. The ready-queue-full branch additionally now lets the run-end flush retry the enqueue (the host may have drained by then) rather than pre-counting the records as lost.
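The accounting can be illustrated with a small sketch. `reconcile_sketch` and `ReconcileResult` are hypothetical names, not the collector's actual `reconcile_counters()`; only the formula `silent_loss = total - collected - dropped` is taken from the text, and the buffer size of 1024 used below is an assumed value.

```cpp
#include <cassert>
#include <cstdint>

struct ReconcileResult {
    bool consistent;       // false when collected + dropped exceeds total
    uint64_t silent_loss;  // records neither collected nor counted as dropped
};

// Compare device-side totals against what the host collected.
constexpr ReconcileResult reconcile_sketch(uint64_t total, uint64_t collected, uint64_t dropped) {
    const uint64_t accounted = collected + dropped;
    if (accounted > total) {
        return {false, 0};  // double-counting: the invariant is broken
    }
    return {true, total - accounted};
}
```

With an assumed buffer size of 1024: if rotation fails on a full buffer and the flush retry later delivers it, the old pre-emptive bump gives `reconcile_sketch(1024, 1024, 1024)`, which is inconsistent. With the bump removed, `reconcile_sketch(1024, 1024, 0)` is consistent with zero silent loss, and records the slot guard discards afterwards show up only as silent loss.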

A regression test that forces rotation failures will be tracked in a separate follow-up issue; the trigger is hard to set up in the existing 5-task vector example.

Hot-path verification

- AICore still issues dcci(head, SINGLE_CACHE_LINE). The head fits in one cache line via alignas(64), so per-task cost is unchanged.
- False-sharing audit: head is single-writer (AICPU) for all three pools; readers are AICore (dcci, AicoreTask pool only) or host at drain time. AICore never reads counter fields, so the invalidation it pays per task is harmless even though counters cohabit the line.
- offsetof static_asserts were added on both pool types to lock head@0 / free_queue@64; silent layout drift would corrupt the AICore-readable head address that AICPU init publishes into the rotation table.

A perf measurement (paged_attention_unroll --enable-l2-swimlane 4 vs main) will be tracked in a separate follow-up issue to validate that sharing one cache line between head and counters doesn't regress AICore dcci throughput.

Renames (atomic with the layout change — no compat shim)

| Old | New |
| --- | --- |
| `set_l2_swimlane_aicore_rotation_slot` | `set_l2_swimlane_aicore_head_slot` |
| `get_l2_swimlane_aicore_rotation` | `get_l2_swimlane_aicore_head` |
| `L2SwimlaneAicoreRotation` | `L2SwimlaneActiveHead` |
| `L2SwimlaneAicoreLocalState::cached_generation` | `cached_buf_seq` |

KernelArgs::l2_swimlane_aicore_rotation_table field name preserved for ABI stability; its comment now notes that slots hold L2SwimlaneActiveHead* addresses.

Init shift (semantically equivalent)

- AicoreTask pool: head.current_buf_seq = 0 at init (was: current_buf_seq=0 + rotation.generation=1 separately).
- AICore local state: cached_buf_seq default is UINT32_MAX so the first record_task observes a mismatch and loads the buffer.
- The two aicore_executor.cpp sites that aggregate-initialize the local state explicitly pass UINT32_MAX so the in-class default isn't shadowed.
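The sentinel and the aggregate-initialization pitfall can be shown with a minimal sketch. `LocalStateSketch` and `needs_reload` are hypothetical; the real `L2SwimlaneAicoreLocalState` may have other fields, and only `cached_buf_seq` with its UINT32_MAX default is taken from the text.

```cpp
#include <cassert>
#include <cstdint>

struct LocalStateSketch {
    uint64_t buf_ptr = 0;                  // assumed field, for illustration only
    uint32_t cached_buf_seq = UINT32_MAX;  // sentinel: differs from the initial head seq of 0
};

// A mismatch between the cached and published seq means "reload the buffer pointer".
inline bool needs_reload(const LocalStateSketch& local, uint32_t head_seq) {
    return local.cached_buf_seq != head_seq;
}

// Default member initializers apply: the first task sees a mismatch.
inline LocalStateSketch make_default() { return LocalStateSketch{}; }

// Aggregate initialization that names every field overrides the default;
// passing 0 here would make the first task skip the reload.
inline LocalStateSketch make_shadowed() { return LocalStateSketch{0, 0}; }

// Passing the sentinel explicitly keeps the first-task mismatch.
inline LocalStateSketch make_explicit() { return LocalStateSketch{0, UINT32_MAX}; }
```

With the head's seq at its init value of 0, `needs_reload(make_default(), 0)` and `needs_reload(make_explicit(), 0)` are true, while `needs_reload(make_shadowed(), 0)` is false.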

Doc consistency sweep

- docs/dfx/l2-swimlane-profiling.md: replace L2SwimlaneAicoreRotation / generation references with L2SwimlaneActiveHead / current_buf_seq.
- aicore_profiling_state.h / l2_swimlane_collector_aicpu.h / l2_swimlane_collector_aicore.h: same.
- l2_swimlane_collector_aicore.h::record_task docstring: clarify that head is lazy-resolved on the first task, not cached at kernel entry.
- File-level diagram in l2_swimlane_profiling.h updated to show the new layout.

Test plan

- All 8 sim DFX tests pass: pytest tests/st/.../dfx --platform a2a3sim --enable-l2-swimlane
- Both l2_swimlane sim tests pass: pytest tests/st/.../dfx/l2_swimlane --platform a2a3sim --enable-l2-swimlane
- pre-commit run clean on all touched files
- CI green (onboard + sim, a2a3 + a5)
- Follow-up: regression test for aicore_rotate failure paths (issue forthcoming)
- Follow-up: perf measurement on paged_attention_unroll to validate head+counter cache-line sharing (issue forthcoming)

coderabbitai · 2026-05-31T06:36:32Z

Walkthrough

This PR refactors L2 swimlane profiling infrastructure from a per-core "rotation" channel model to an "active head" model. A new unified L2SwimlaneActiveHead cache-line structure centralizes buffer metadata, sequence tracking, and record counts. Pool layouts are restructured, kernel wiring is updated to stash and lazily dereference head pointers, host state management is rewritten to use nested head.* fields, and executors are adapted to the new public API.

Changes

L2 Swimlane Profiling Rotation to Active Head Migration

Layer / File(s)	Summary
Shared-memory pool schema refactoring `src/a2a3/platform/include/common/l2_swimlane_profiling.h`	Introduces `L2SwimlaneActiveHead` as a unified 64B structure for active-buffer metadata and refactors both AICPU and AICore pools to `ActiveHead + L2SwimlaneFreeQueue` (192B), removing the prior AICore-specific rotation channel.
Public API contract and header declarations `src/a2a3/platform/include/aicore/aicore_profiling_state.h`, `src/a2a3/platform/include/aicore/l2_swimlane_collector_aicore.h`, `src/a2a3/platform/include/aicpu/l2_swimlane_collector_aicpu.h`, `src/a2a3/platform/include/host/l2_swimlane_collector.h`	Updates AICore profiling state header and L2 swimlane collector headers to declare `set_l2_swimlane_aicore_head_slot` / `get_l2_swimlane_aicore_head` functions and adjust local state caching to sequence-based tracking.
Host AICPU/AICore buffer initialization and rotation `src/a2a3/platform/src/aicpu/l2_swimlane_collector_aicpu.cpp` (init/rotate sections)	Refactors `l2_swimlane_aicpu_init` to populate per-core head-table and initialize buffers from head state; rewrites `aicore_rotate` to read/write buffer pointers and sequence numbers from active head with memory-barrier ordering.
Host flush and phase buffer management `src/a2a3/platform/src/aicpu/l2_swimlane_collector_aicpu.cpp` (flush/phase sections)	Updates `l2_swimlane_aicpu_flush` to track active buffers via head state and migrates all phase profiling paths to use `head.current_buf_ptr/seq/total_record_count/dropped_record_count`.
Host collector initialization and state access `src/a2a3/platform/src/host/l2_swimlane_collector.cpp`	Initializes `head.current_buf_ptr/seq` in AICPU and phase pools; refactors `reconcile_counters()` and `finalize()` to access device-side counters and buffer pointers through the nested `head` structure.
Kernel profiling wiring: onboard and simulation `src/a2a3/platform/onboard/aicore/kernel.cpp`, `src/a2a3/platform/sim/aicore/kernel.cpp`	Implements lazy-resolution of L2 swimlane active-head pointer in both kernels; replaces rotation-slot TLS with head-slot TLS and updates kernel-entry initialization to publish head-slot address (or null) before executor dispatch.
Executor integration with head-based profiling `src/a2a3/runtime/host_build_graph/aicore/aicore_executor.cpp`, `src/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp`	Updates both AICore executors to declare and lazily resolve `L2SwimlaneActiveHead*` pointers, initialize local state with `UINT32_MAX` sentinel, and pass head pointers to `l2_swimlane_aicore_record_task`.
Validation and scheduler initialization `src/a2a3/platform/include/host/l2_swimlane_collector.h`, `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp`	Adds entry-kind validation to prevent invalid buffer kinds from AICore handling; updates documentation comment; refactors scheduler cold-path to conditionally enable swimlanes and parameterize phase-thread count by orchestration mode.

Sequence Diagram

No sequence diagrams are generated for this diff: the changes are a cohesive architecture refactoring spanning multiple layers with sufficient complexity that would benefit from the layer-by-layer review walkthrough rather than a single high-level flow diagram.

🎯 4 (Complex) | ⏱️ ~75 minutes

🐰 Hop along the head, not the rotation!
One cache-line unified, sequences now speak clear,
Buffers dance with purpose—the architecture's premiere!
✨🔄

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically describes the main refactoring goal: unifying L2 swimlane pools around a new L2SwimlaneActiveHead cache line structure.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The PR description is comprehensive, detailed, and directly related to the changeset. It clearly explains the refactoring of L2 swimlane pools, the extraction of L2SwimlaneActiveHead, memory savings, API renames, and includes testing details and follow-up items.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/a2a3/platform/src/aicpu/l2_swimlane_collector_aicpu.cpp (1)

303-315: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't count the current AICore buffer as dropped on rotation failure.

Lines 309-310 and Lines 331-332 add PLATFORM_AICORE_BUFFER_SIZE to dropped_record_count, but these branches leave head.current_buf_ptr and head.current_buf_seq unchanged. That same full buffer is still recoverable in l2_swimlane_aicpu_flush() via the live = total_record_count - current_buf_seq * BUFFER_SIZE path, so a later successful flush will double-account the records and can make reconcile report accounted > total_device.

Suggested direction

-        ac_state->head.dropped_record_count =
-            ac_state->head.dropped_record_count + static_cast<uint32_t>(PLATFORM_AICORE_BUFFER_SIZE);
+        // Keep the full buffer recoverable; only account overflow attempts once
+        // we know how many writes actually happened past BUFFER_SIZE.

// In the flush path, derive actual overflow after a failed rotation:
uint32_t live = ac_state->head.total_record_count -
                ac_state->head.current_buf_seq * static_cast<uint32_t>(PLATFORM_AICORE_BUFFER_SIZE);
uint32_t overflow =
    (live > static_cast<uint32_t>(PLATFORM_AICORE_BUFFER_SIZE))
        ? (live - static_cast<uint32_t>(PLATFORM_AICORE_BUFFER_SIZE))
        : 0;
ac_state->head.dropped_record_count += overflow;
uint32_t ac_mark = std::min(live, static_cast<uint32_t>(PLATFORM_AICORE_BUFFER_SIZE));

Also applies to: 326-333
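As a worked illustration of the suggested derivation, the live/overflow split can be wrapped in a small function. `split_live` and `LiveSplit` are hypothetical names, and the buffer size is passed as a parameter here rather than read from PLATFORM_AICORE_BUFFER_SIZE.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

struct LiveSplit {
    uint32_t mark;      // records in the current buffer that can still be flushed
    uint32_t overflow;  // write attempts past the end of the buffer
};

// total: records attempted so far; seq: completed rotations; buffer_size: records per buffer.
inline LiveSplit split_live(uint32_t total, uint32_t seq, uint32_t buffer_size) {
    const uint32_t live = total - seq * buffer_size;
    const uint32_t overflow = (live > buffer_size) ? (live - buffer_size) : 0;
    return {std::min(live, buffer_size), overflow};
}
```

With an assumed buffer size of 1024, `split_live(2500, 1, 1024)` gives 1476 live records, so 1024 are flushable and 452 count as overflow; `split_live(1500, 1, 1024)` gives 476 flushable records and no overflow.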

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/platform/src/aicpu/l2_swimlane_collector_aicpu.cpp` around lines 303
- 315, The current rotation-failure branches in the AICPU collector (where
ac_state->head.dropped_record_count is increased by PLATFORM_AICORE_BUFFER_SIZE)
incorrectly count the entire buffer as dropped while
head.current_buf_ptr/head.current_buf_seq remain unchanged; instead, in the
rotation-failure path compute the actual overflow using the same logic used in
l2_swimlane_aicpu_flush: derive live = ac_state->head.total_record_count -
ac_state->head.current_buf_seq * PLATFORM_AICORE_BUFFER_SIZE, compute overflow =
max(0, live - PLATFORM_AICORE_BUFFER_SIZE), add only that overflow to
ac_state->head.dropped_record_count, and set the effective mark (ac_mark) to
min(live, PLATFORM_AICORE_BUFFER_SIZE); apply this fix to both places modifying
dropped_record_count (the branch around head==tail and the other similar
branch).

🧹 Nitpick comments (5)

src/a2a3/platform/include/common/l2_swimlane_profiling.h (2)

271-304: ⚡ Quick win

Add offset assertions for the pool ABI.

sizeof(...) == 192 will not catch a future reorder of head and free_queue. This refactor depends on head staying at offset 0 and free_queue at 64 for &pool.head publication and the shared-memory address helpers, so please lock those offsets down with offsetof static_asserts.

Suggested guardrails

+#include <cstddef>
 #include <cstdint>
 #include <vector>
@@
 static_assert(sizeof(L2SwimlaneAicpuTaskPool) == 192, "L2SwimlaneAicpuTaskPool must be 192 bytes");
+static_assert(offsetof(L2SwimlaneAicpuTaskPool, head) == 0, "L2SwimlaneAicpuTaskPool::head must stay first");
+static_assert(
+    offsetof(L2SwimlaneAicpuTaskPool, free_queue) == 64,
+    "L2SwimlaneAicpuTaskPool::free_queue must stay in the second cache line"
+);
@@
 static_assert(sizeof(L2SwimlaneAicoreTaskPool) == 192, "L2SwimlaneAicoreTaskPool must be 192 bytes");
+static_assert(offsetof(L2SwimlaneAicoreTaskPool, head) == 0, "L2SwimlaneAicoreTaskPool::head must stay first");
+static_assert(
+    offsetof(L2SwimlaneAicoreTaskPool, free_queue) == 64,
+    "L2SwimlaneAicoreTaskPool::free_queue must stay in the second cache line"
+);

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/platform/include/common/l2_swimlane_profiling.h` around lines 271 -
304, Add explicit offsetof static_asserts to lock field offsets for both
L2SwimlaneAicpuTaskPool and L2SwimlaneAicoreTaskPool: assert that
offsetof(L2SwimlaneAicpuTaskPool, head) == 0 and offsetof(..., free_queue) ==
64, and likewise for L2SwimlaneAicoreTaskPool, so the ABI guarantees that
&pool.head is at base and free_queue is at byte 64; place these asserts next to
the existing sizeof static_asserts and reference the struct names
L2SwimlaneAicpuTaskPool and L2SwimlaneAicoreTaskPool to ensure future reorders
fail compile-time.

217-298: ⚡ Quick win

Update the header-level layout diagram to match the new schema.

The new ActiveHead comments are clear, but the file-level memory map at Lines 18-52 still describes the pre-refactor per-pool fields and omits the AICore pool region. In this header that diagram is effectively the shared-memory ABI map, so it is worth keeping it in sync in the same change.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/platform/include/common/l2_swimlane_profiling.h` around lines 217 -
298, Update the file-level shared-memory layout diagram to reflect the refactor:
replace the old per-pool field listing with the new unified cache-line head
(L2SwimlaneActiveHead) plus free-queue layout and include the AICore pool region
so the ABI map covers all pool kinds (L2SwimlaneAicpuTaskPool /
L2SwimlaneAicpuPhasePool / L2SwimlaneAicoreTaskPool). Make the diagram show
ActiveHead (64B) followed by L2SwimlaneFreeQueue (128B) per-pool, note the
fields present in L2SwimlaneActiveHead (current_buf_ptr, current_buf_seq,
total_record_count, dropped_record_count), and indicate the overall per-pool
size (192B) to match the static_asserts for L2SwimlaneAicpuTaskPool.

src/a2a3/platform/include/aicore/l2_swimlane_collector_aicore.h (1)

73-75: ⚡ Quick win

Fix the head lifecycle comment.

This says head is cached at kernel entry, but aicore_profiling_state.h now documents that only the slot pointer is stashed there and the head itself is resolved lazily after first dispatch. Keeping the two public headers aligned matters here because callers are timing-sensitive.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/platform/include/aicore/l2_swimlane_collector_aicore.h` around lines
73 - 75, The comment for parameter head is inaccurate: it states the full
L2SwimlaneActiveHead is cached at kernel entry, but upstream
aicore_profiling_state.h documents that only the slot pointer is stashed and the
actual head is resolved lazily on first dispatch; update the param doc for head
in l2_swimlane_collector_aicore.h to state that
KernelArgs::l2_swimlane_aicore_rotation_table[block_idx] stores only the slot
pointer and that the L2SwimlaneActiveHead is resolved lazily (not fully cached)
to match the behavior of the lazy resolution in aicore_profiling_state.h so
callers know the timing implications.

src/a2a3/platform/include/aicpu/l2_swimlane_collector_aicpu.h (1)

70-72: ⚡ Quick win

Finish the terminology rename in this header.

This paragraph uses L2SwimlaneActiveHead, but the rotation-table comment immediately above still says AICPU publishes &L2SwimlaneAicoreTaskPool::rotation. Please update that block too so this public header describes a single contract.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/platform/include/aicpu/l2_swimlane_collector_aicpu.h` around lines
70 - 72, The header comment is inconsistent: one paragraph references
L2SwimlaneActiveHead while the rotation-table comment still mentions
L2SwimlaneAicoreTaskPool::rotation; update the rotation-table comment so both
describe the same public contract name (use L2SwimlaneActiveHead consistently),
e.g., change any mentions of L2SwimlaneAicoreTaskPool::rotation to
L2SwimlaneActiveHead and ensure the description of who publishes/consumes that
channel matches the paragraph that says the per-core rotation channel is primed
by popping from L2SwimlaneAicoreTaskPool::free_queue and writing its address
into L2SwimlaneActiveHead.

src/a2a3/platform/src/host/l2_swimlane_collector.cpp (1)

244-250: ⚡ Quick win

Update the comment to the renamed head field.

This block now describes L2SwimlaneActiveHead addresses, but it still says AICPU has direct access to &ac_state->rotation. That field was removed in this refactor, so the comment now points readers at the wrong object.

Small doc fix

-    // direct access to `&ac_state->rotation` device addresses, no
+    // direct access to `&ac_state->head` device addresses, no

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/platform/src/host/l2_swimlane_collector.cpp` around lines 244 - 250,
The comment refers to the removed field &ac_state->rotation; update it to
reference the renamed head field so it correctly describes where AICPU writes
L2SwimlaneActiveHead device addresses. Specifically, in the block describing the
standalone uint64_t[num_aicore] table and
KernelArgs::l2_swimlane_aicore_rotation_table, replace the reference to
&ac_state->rotation with the actual renamed member on ac_state (use the exact
identifier introduced by the refactor), and keep the rest of the explanation
intact (AICPU fills entries in l2_swimlane_aicpu_init and AICore reads
rotation_table[block_idx] at kernel entry). Ensure you mention
L2SwimlaneActiveHead, l2_swimlane_aicpu_init, and
KernelArgs::l2_swimlane_aicore_rotation_table so readers can find the related
code.

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between cee40dd and d057f2c.

📒 Files selected for processing (12)

src/a2a3/platform/include/aicore/aicore_profiling_state.h
src/a2a3/platform/include/aicore/l2_swimlane_collector_aicore.h
src/a2a3/platform/include/aicpu/l2_swimlane_collector_aicpu.h
src/a2a3/platform/include/common/l2_swimlane_profiling.h
src/a2a3/platform/include/host/l2_swimlane_collector.h
src/a2a3/platform/onboard/aicore/kernel.cpp
src/a2a3/platform/sim/aicore/kernel.cpp
src/a2a3/platform/src/aicpu/l2_swimlane_collector_aicpu.cpp
src/a2a3/platform/src/host/l2_swimlane_collector.cpp
src/a2a3/runtime/host_build_graph/aicore/aicore_executor.cpp
src/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp

The standalone phase header was a vestige of when phase profiling was an add-on bolted onto the shared-memory layout. Phase metadata is now co-equal with task pool metadata, so the dedicated cache line + magic gate + indirection are pure overhead.

Move the three live phase-header fields directly into the root header:
- num_sched_threads → num_phase_threads (renamed for clarity; it counts phase pools, which equals sched_thread_num or aicpu_thread_num depending on PTO2_ORCH_TO_SCHED)
- num_cores → num_phase_cores (disambiguate from the root header's pre-existing num_cores; they have different semantics)
- core_to_thread[PLATFORM_MAX_CORES]: verbatim

Dropped:
- magic (L2_SWIMLANE_AICPU_PHASE_MAGIC): redundant; host now gates on `num_phase_threads > 0` (zero-init means phase init never ran).
- records_per_thread: AICPU wrote it as PLATFORM_PHASE_RECORDS_PER_THREAD but no caller ever read it, so it was a dead field.

Shared-memory layout after the merge:
[L2SwimlaneDataHeader (now includes phase metadata)]
[L2SwimlaneAicpuTaskPool × num_cores]
[L2SwimlaneAicoreTaskPool × num_cores]
[L2SwimlaneAicpuPhasePool × num_phase_threads] ← was preceded by header

`get_phase_header()` is deleted; `get_phase_buffer_states()` skips straight from the AicoreTaskPool array to the phase pools.

AICPU collector keeps a separate `s_phase_initialized` bool so gated paths can check init-ran without re-reading the device-shared header on the hot path. Replaces the old `s_l2_swimlane_aicpu_phase_header == nullptr` check.

Built atop hw-native-sys#939 (ActiveHead refactor).

Test plan:
- sim swimlane ST (test_l2_swimlane + test_l2_swimlane_mixed): pass
- sim DFX tests (scope_stats / tensor_dump / pmu / dep_gen): pass
- pre-commit clean
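The post-merge layout implies simple offset arithmetic for locating the phase pools. The sketch below is an assumption-laden illustration, not the real `get_phase_buffer_states()`: it takes the header size as a parameter because the size of L2SwimlaneDataHeader is not stated here, assumes the regions are packed back to back with no extra padding, and uses the 192B pool size from #939. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kPoolBytes = 192;  // per-pool size after the ActiveHead refactor

// Byte offset of the first phase pool: header, then the AICPU task pools,
// then the AICore task pools, each array having num_cores entries.
constexpr std::size_t phase_pools_offset(std::size_t header_bytes, std::size_t num_cores) {
    return header_bytes + 2 * num_cores * kPoolBytes;
}

// Byte offset of the phase pool belonging to thread_idx.
constexpr std::size_t phase_pool_offset(std::size_t header_bytes, std::size_t num_cores,
                                        std::size_t thread_idx) {
    return phase_pools_offset(header_bytes, num_cores) + thread_idx * kPoolBytes;
}

// Host-side gate described above: zero-init means phase init never ran.
constexpr bool phase_enabled(std::size_t num_phase_threads) {
    return num_phase_threads > 0;
}
```

For example, with an assumed 256B header and 4 cores, `phase_pools_offset(256, 4)` is 1792 and the second phase pool starts at `phase_pool_offset(256, 4, 1)`, which is 1984.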

Three pool types (AICPU task, AICPU phase, AICore task) previously defined their own ad-hoc layouts:
- AICPU/Phase pool: free_queue (128) + scattered counters + pad → 192B
- AICore pool: rotation (64) + free_queue (128) + counters + pad → 256B

Extract the common producer-side cache line as `L2SwimlaneActiveHead` (64B: current_buf_ptr + current_buf_seq + total/dropped record counts). Every pool is now exactly `head + free_queue = 192B`. AICore pool drops the standalone `L2SwimlaneAicoreRotation` struct entirely; its old fields (`current_buf_ptr` + `generation`) merge into the head's `current_buf_ptr` + `current_buf_seq` (semantics identical: AICore reads seq to detect a rotation, then reads ptr to pick up the new buffer).

Saves 64B per AICore core (256→192) and removes the duplicated "buf seq / generation" counter pair that was bumped twice per rotation.

Hot-path verification: AICore still `dcci(head, SINGLE_CACHE_LINE)`; the head sits in a single cache line by alignment, so per-task cost is unchanged.

False-sharing audit: head is single-writer (AICPU) for all three pools; readers are either AICore (dcci, AicoreTask pool only) or host at drain time. AICore never reads the counter fields, so the invalidation it pays per task is harmless even though counters cohabit the line.

Renames (atomic with the layout change, so no compat shim):
- set_l2_swimlane_aicore_rotation_slot → set_l2_swimlane_aicore_head_slot
- get_l2_swimlane_aicore_rotation → get_l2_swimlane_aicore_head
- L2SwimlaneAicoreRotation → L2SwimlaneActiveHead
- L2SwimlaneAicoreLocalState::cached_generation → cached_buf_seq

KernelArgs::l2_swimlane_aicore_rotation_table field name is preserved for ABI stability; its comment now notes that slots hold `L2SwimlaneActiveHead*` addresses.

Init shift: AicoreTask pool's head.current_buf_seq starts at 0 (was: current_buf_seq=0 + rotation.generation=1). AICore local state's cached_buf_seq starts at UINT32_MAX so the first record_task call observes a mismatch and loads the buffer. The two aicore_executor sites that aggregate-initialize the local state explicitly pass UINT32_MAX so the in-class default isn't shadowed.

Test plan:
- pytest tests/st/.../dfx --platform a2a3sim --enable-l2-swimlane → all 8 pass
- pytest tests/st/.../dfx/l2_swimlane --platform a2a3sim --enable-l2-swimlane → 2 pass
- pre-commit clean on all touched files
- CI green expected (onboard + sim, a2a3 + a5)

After #939 (pool unification), #941 (PhaseHeader merge), and #942 (split sched/orch phase records), several comments and doc sections still referenced the pre-split a2a3 layout. Audit and update:

a2a3 code/comments:
- platform_config.h: PROF_BUFFERS_PER_THREAD doc references both SchedPhaseBuffer and OrchPhaseBuffer (was: single PhaseBuffer); PROF_READYQUEUE_SIZE comment now says "four kinds"; formula bumped by 2x on the per-thread term to cover both sched and orch pool enqueues (matches host alloc, which iterates both pool arrays).
- l2_swimlane_profiling.h header layout diagram: name the two split phase-thread counts.
- l2_swimlane_collector_aicpu.cpp: cross-launch reset comment now references s_sched_phase_pools / s_orch_phase_pools (was: single s_aicpu_phase_pools) and record_sched_phase / record_orch_phase.
- scheduler_dispatch.cpp / aicpu_executor.cpp: comments reference the split record types.

src/common/ shared comments (now mixed-arch):
- profiler_base.h / buffer_pool_manager.h: qualify L2SwimlaneAicpuPhaseHeader::magic example as "on a5" since the struct no longer exists on a2a3.

docs/dfx/l2-swimlane-profiling.md:
- §5.1: layout block + record list now distinguish a2a3 split shape (SchedPhaseRecord 40B + OrchPhaseRecord 32B, two pool arrays) from a5's still-unified shape (pending port).
- §5.2: a2a3 buffer-kind list updated to all four kinds (was: two); ASCII data-flow diagram redrawn to show split phase records; kBufferKinds = 4 in the L2SwimlaneModule trait description.
- §5.3 (a5): num_phase_threads / core_to_thread[] reference corrected to live in L2SwimlaneAicpuPhaseHeader on a5 (was wrongly attributed to L2SwimlaneDataHeader).
- §5.4: comparison table separates task record (identical) from phase record (diverged); ready-queue and kBufferKinds rows call out the a2a3=4 vs a5=2 split.
- §6: overhead description differentiates a2a3's per-emit SchedPhase + per-submit OrchPhase from a5's unified PhaseRecord (was: "4 phases × 40B per iteration", which described a removed shape).
- §8 FAQ: "phase records empty" entry gates a2a3 on num_{sched,orch}_phase_threads, a5 on PhaseHeader::magic.

No semantic code changes except the READYQUEUE_SIZE formula bump (adds ~8KB to the header; necessary correctness fix given the second phase pool).

Test plan:
- pre-commit clean
- onboard l2_swimlane STs (--enable-l2-swimlane --enable-dep-gen): 2 passed
- onboard paged_attention_unroll level 4: 1 passed

Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>

coderabbitai Bot reviewed May 31, 2026

hw-native-sys-bot mentioned this pull request May 31, 2026

Refactor: merge L2SwimlaneAicpuPhaseHeader into L2SwimlaneDataHeader #941

Merged

5 tasks

ChaoWao force-pushed the refactor/swimlane-cache-line-blocks branch from d057f2c to 0864889 Compare May 31, 2026 07:18

ChaoWao force-pushed the refactor/swimlane-cache-line-blocks branch from 0864889 to 7a2345a Compare May 31, 2026 07:29

hw-native-sys-bot mentioned this pull request May 31, 2026

Refactor: split L2 swimlane phase records into sched + orch types #942

Merged

5 tasks

ChaoWao force-pushed the refactor/swimlane-cache-line-blocks branch from 7a2345a to f681174 Compare May 31, 2026 08:15

hw-native-sys-bot mentioned this pull request May 31, 2026

L2 swimlane: regression test for aicore_rotate failure paths + perf check on cache-line sharing #943

Open

ChaoWao approved these changes May 31, 2026

ChaoWao merged commit dc83d5a into hw-native-sys:main May 31, 2026
15 checks passed

ChaoWao deleted the refactor/swimlane-cache-line-blocks branch May 31, 2026 08:53

hw-native-sys-bot mentioned this pull request May 31, 2026

Doc: sync L2 swimlane refs to post-split layout #946

Merged

3 tasks

ChaoWao merged 1 commit into hw-native-sys:main from hw-native-sys-bot:refactor/swimlane-cache-line-blocks
