
Refactor: unify L2 swimlane pools around L2SwimlaneActiveHead cache line #939

Merged: ChaoWao merged 1 commit into hw-native-sys:main from hw-native-sys-bot:refactor/swimlane-cache-line-blocks · May 31, 2026

@hw-native-sys-bot hw-native-sys-bot commented May 31, 2026

Summary

Three pool types (AICPU task, AICPU phase, AICore task) previously had ad-hoc layouts. AICPU/Phase pool was 192B (free_queue + scattered counters + pad); AICore pool was 256B (rotation + free_queue + counters + pad).

This PR extracts the common producer-side cache line as L2SwimlaneActiveHead (64B: current_buf_ptr + current_buf_seq + total/dropped record counts). Every pool is now exactly head + free_queue = 192B. The standalone L2SwimlaneAicoreRotation struct is deleted — its old fields merge into the head's current_buf_ptr + current_buf_seq (semantics identical: AICore reads seq to detect a rotation, then reads ptr to pick up the new buffer).

Savings: 64B per AICore core (256→192). Removes the duplicated "buf_seq / generation" counter pair that was bumped twice per rotation.
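
As a rough illustration of the unified layout, the sketch below models a 64B head followed by a 128B free queue and locks sizes and offsets at compile time. The struct names, field widths, and field order are assumptions made for this example (the description above only names the fields and the 64B/192B sizes); the real definitions live in l2_swimlane_profiling.h.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for L2SwimlaneActiveHead: one producer-side cache line.
// Field widths and order are assumptions; only the field names come from the PR.
struct alignas(64) ActiveHeadSketch {
    volatile uint64_t current_buf_ptr;       // device address of the active buffer
    volatile uint32_t current_buf_seq;       // bumped once per rotation
    volatile uint32_t total_record_count;    // records the producer attempted
    volatile uint32_t dropped_record_count;  // records known to be lost
};  // alignas(64) pads the struct out to a full 64 bytes

// Opaque placeholder for the 128B free queue; its contents are not modeled here.
struct alignas(64) FreeQueueSketch {
    uint8_t opaque[128];
};

// Every pool kind shares the same shape: head first, free queue second.
struct PoolSketch {
    ActiveHeadSketch head;
    FreeQueueSketch free_queue;
};

static_assert(sizeof(ActiveHeadSketch) == 64, "head must be one cache line");
static_assert(sizeof(PoolSketch) == 192, "pool must be head + free_queue");
static_assert(offsetof(PoolSketch, head) == 0, "head must stay first");
static_assert(offsetof(PoolSketch, free_queue) == 64, "free_queue must follow head");

// Bytes occupied by an array of pools, e.g. one pool per core.
inline std::size_t pool_array_bytes(std::size_t num_pools) {
    assert(num_pools > 0);
    return num_pools * sizeof(PoolSketch);
}
```

Under these assumptions, an array of four pools occupies 768 bytes, and any reorder of head and free_queue fails at compile time rather than silently shifting the published head address.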

Not a pure rename: aicore_rotate publish ordering strengthened

The new publish sequence in aicore_rotate adds an explicit wmb() between the ptr write and the seq write:

// New (3 wmb)
wmb();
ac_state->head.current_buf_ptr = new_buf_ptr;
wmb();   // ← new: ordering fence between the two volatile stores
ac_state->head.current_buf_seq = seq + 1;
wmb();

The old code had two adjacent volatile stores (rotation.current_buf_ptr then rotation.generation) under a single trailing wmb(). On ARM/aarch64's weak memory model this leaves room for AICore's dcci+read to observe the new seq with a stale ptr. The added fence closes that window.

The init-prime path uses the same publish pattern now (ptr → wmb → seq → wmb), for symmetry and future-proofing against changes that let AICore poll before the first dispatch.
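
The ordering argument can be sketched in portable C++ with std::atomic fences standing in for wmb(). This is only an analogue under stated assumptions: HeadSketch, publish_rotation, and poll_rotation are hypothetical names, and the real code uses volatile stores, wmb(), and dcci on the device rather than the C++ memory model.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical two-field view of the head; counters are omitted.
struct HeadSketch {
    std::atomic<uint64_t> current_buf_ptr{0};
    std::atomic<uint32_t> current_buf_seq{0};
};

// Producer side (AICPU in the PR): publish ptr first, then seq.
inline void publish_rotation(HeadSketch& head, uint64_t new_buf_ptr) {
    assert(new_buf_ptr != 0);
    uint32_t seq = head.current_buf_seq.load(std::memory_order_relaxed);
    // Fence 1: earlier writes (buffer setup) are ordered before the ptr store.
    std::atomic_thread_fence(std::memory_order_release);
    head.current_buf_ptr.store(new_buf_ptr, std::memory_order_relaxed);
    // Fence 2 (the one the PR adds): ptr is ordered before seq.
    std::atomic_thread_fence(std::memory_order_release);
    head.current_buf_seq.store(seq + 1, std::memory_order_relaxed);
    // Fence 3: mirrors the trailing wmb(); it has no later store to order here.
    std::atomic_thread_fence(std::memory_order_release);
}

// Consumer side (AICore in the PR): read seq, and only on a change read ptr.
inline bool poll_rotation(const HeadSketch& head, uint32_t& cached_seq, uint64_t& buf_ptr) {
    uint32_t seq = head.current_buf_seq.load(std::memory_order_relaxed);
    if (seq == cached_seq) {
        return false;  // no rotation since the last poll
    }
    // Pairs with fence 2: a reader that sees the new seq also sees the new ptr.
    std::atomic_thread_fence(std::memory_order_acquire);
    buf_ptr = head.current_buf_ptr.load(std::memory_order_relaxed);
    cached_seq = seq;
    return true;
}
```

Without the middle fence, nothing orders the two stores relative to each other, which is the window the paragraph above describes.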

Pre-existing bug fix: drop the pre-emptive dropped += BUFFER_SIZE

When aicore_rotate failed (free queue empty OR ready queue full) the old code pre-emptively bumped dropped_record_count += PLATFORM_AICORE_BUFFER_SIZE. That over-counted whenever:

  • The OLD buffer still made it to the host via the flush retry path → collected += BUFFER_SIZE and dropped += BUFFER_SIZE for the same records, breaking the collected + dropped == total reconcile invariant.
  • The run ended before AICore actually overflowed the slot guard for a full BUFFER_SIZE more tasks → the pre-emptive drop was larger than the real loss.

Fix: remove the pre-emptive bump in both failure branches. The slot guard on AICore silently drops further records (which AICPU has no precise count of), and reconcile reports the gap as silent_loss = total - collected - dropped. The ready-queue-full branch additionally now lets the run-end flush retry the enqueue (the host may have drained by then) rather than pre-counting the records as lost.
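
A small sketch of the reconcile arithmetic shows why the pre-emptive bump broke the invariant. CountersSketch and both helpers are hypothetical names for this example, and the numbers in the note below assume a buffer size of 256 records, which is not taken from the PR.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical snapshot of the three counters the reconcile step compares.
struct CountersSketch {
    uint32_t total;      // records the device attempted to write
    uint32_t collected;  // records the host actually received
    uint32_t dropped;    // records AICPU explicitly counted as lost
};

// silent_loss = total - collected - dropped, computed as signed so that
// over-accounting shows up as a negative value instead of wrapping.
inline int64_t silent_loss(const CountersSketch& c) {
    return static_cast<int64_t>(c.total) - static_cast<int64_t>(c.collected) -
           static_cast<int64_t>(c.dropped);
}

// True when collected + dropped exceeds total, i.e. records were counted twice.
inline bool over_accounted(const CountersSketch& c) {
    return silent_loss(c) < 0;
}
```

With the old behaviour, a run of 1000 records that were all recovered by the flush retry but also pre-counted as 256 dropped gives a silent loss of -256. With the bump removed the same run reconciles to 0, and a run where the slot guard discarded 40 records (total 1040, collected 1000) reports a gap of 40.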

A regression test that exercises forced rotation failures will be tracked in a separate follow-up issue; the trigger is hard to set up in the existing 5-task vector example.

Hot-path verification

  • AICore still issues dcci(head, SINGLE_CACHE_LINE); alignas(64) keeps head within one cache line, so per-task cost is unchanged.
  • False-sharing audit: head is single-writer (AICPU) for all three pools; readers are AICore (dcci, AicoreTask pool only) or host at drain time. AICore never reads counter fields, so the invalidation it pays per task is harmless even though counters cohabit the line.
  • offsetof static_asserts added on both pool types to lock head@0 / free_queue@64 — silent layout drift would corrupt the AICore-readable head address the AICPU init publishes into the rotation table.

A perf measurement (paged_attention_unroll --enable-l2-swimlane 4 vs main) will be tracked in a separate follow-up issue to confirm that sharing one cache line between head and counters does not regress AICore dcci throughput.

Renames (atomic with the layout change — no compat shim)

| Old | New |
| --- | --- |
| set_l2_swimlane_aicore_rotation_slot | set_l2_swimlane_aicore_head_slot |
| get_l2_swimlane_aicore_rotation | get_l2_swimlane_aicore_head |
| L2SwimlaneAicoreRotation | L2SwimlaneActiveHead |
| L2SwimlaneAicoreLocalState::cached_generation | cached_buf_seq |

KernelArgs::l2_swimlane_aicore_rotation_table field name preserved for ABI stability; its comment now notes that slots hold L2SwimlaneActiveHead* addresses.

Init shift (semantically equivalent)

  • AicoreTask pool: head.current_buf_seq = 0 at init (was: current_buf_seq=0 + rotation.generation=1 separately).
  • AICore local state: cached_buf_seq default is UINT32_MAX so the first record_task observes a mismatch and loads the buffer.
  • The two aicore_executor.cpp sites that aggregate-initialize the local state explicitly pass UINT32_MAX so the in-class default isn't shadowed.
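
The sentinel and the aggregate-initialization hazard can be illustrated as follows. LocalStateSketch and needs_reload are hypothetical; the real local state has other fields, and this sketch only assumes the cached sequence is a uint32_t compared for inequality against the head's sequence.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical reduced form of the AICore local state.
struct LocalStateSketch {
    void* buf = nullptr;                   // currently loaded buffer, if any
    uint32_t cached_buf_seq = UINT32_MAX;  // sentinel: never equals the initial seq 0
};

// The consumer reloads the buffer whenever its cached seq differs from the head's.
inline bool needs_reload(const LocalStateSketch& state, uint32_t head_seq) {
    return state.cached_buf_seq != head_seq;
}

// Helper mirroring an aggregate-initializing call site that passes every field.
inline LocalStateSketch make_state(uint32_t initial_seq) {
    return LocalStateSketch{nullptr, initial_seq};
}
```

A default-constructed state sees a mismatch against the initial head seq of 0 and loads the buffer. A call site that aggregate-initializes with an explicit 0 overrides the in-class default and would skip that first load, which is why the executor sites pass UINT32_MAX explicitly.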

Doc consistency sweep

  • docs/dfx/l2-swimlane-profiling.md — replace L2SwimlaneAicoreRotation / generation references with L2SwimlaneActiveHead / current_buf_seq.
  • aicore_profiling_state.h / l2_swimlane_collector_aicpu.h / l2_swimlane_collector_aicore.h — same.
  • l2_swimlane_collector_aicore.h::record_task docstring — clarify that head is lazy-resolved on the first task, not cached at kernel entry.
  • File-level diagram in l2_swimlane_profiling.h updated to show the new layout.

Test plan

  • All 8 sim DFX tests pass: pytest tests/st/.../dfx --platform a2a3sim --enable-l2-swimlane
  • Both l2_swimlane sim tests pass: pytest tests/st/.../dfx/l2_swimlane --platform a2a3sim --enable-l2-swimlane
  • pre-commit run clean on all touched files
  • CI green (onboard + sim, a2a3 + a5)
  • Follow-up: regression test for aicore_rotate failure paths (issue forthcoming)
  • Follow-up: perf measurement on paged_attention_unroll to validate head+counter cache-line sharing (issue forthcoming)

coderabbitai Bot commented May 31, 2026

Walkthrough

This PR refactors L2 swimlane profiling infrastructure from a per-core "rotation" channel model to an "active head" model. A new unified L2SwimlaneActiveHead cache-line structure centralizes buffer metadata, sequence tracking, and record counts. Pool layouts are restructured, kernel wiring is updated to stash and lazily dereference head pointers, host state management is rewritten to use nested head.* fields, and executors are adapted to the new public API.

Changes

L2 Swimlane Profiling Rotation to Active Head Migration

| Layer | File(s) | Summary |
| --- | --- | --- |
| Shared-memory pool schema refactoring | src/a2a3/platform/include/common/l2_swimlane_profiling.h | Introduces L2SwimlaneActiveHead as a unified 64B structure for active-buffer metadata and refactors both AICPU and AICore pools to ActiveHead + L2SwimlaneFreeQueue (192B), removing the prior AICore-specific rotation channel. |
| Public API contract and header declarations | src/a2a3/platform/include/aicore/aicore_profiling_state.h, src/a2a3/platform/include/aicore/l2_swimlane_collector_aicore.h, src/a2a3/platform/include/aicpu/l2_swimlane_collector_aicpu.h, src/a2a3/platform/include/host/l2_swimlane_collector.h | Updates AICore profiling state header and L2 swimlane collector headers to declare set_l2_swimlane_aicore_head_slot / get_l2_swimlane_aicore_head functions and adjust local state caching to sequence-based tracking. |
| Host AICPU/AICore buffer initialization and rotation | src/a2a3/platform/src/aicpu/l2_swimlane_collector_aicpu.cpp (init/rotate sections) | Refactors l2_swimlane_aicpu_init to populate per-core head-table and initialize buffers from head state; rewrites aicore_rotate to read/write buffer pointers and sequence numbers from active head with memory-barrier ordering. |
| Host flush and phase buffer management | src/a2a3/platform/src/aicpu/l2_swimlane_collector_aicpu.cpp (flush/phase sections) | Updates l2_swimlane_aicpu_flush to track active buffers via head state and migrates all phase profiling paths to use head.current_buf_ptr/seq/total_record_count/dropped_record_count. |
| Host collector initialization and state access | src/a2a3/platform/src/host/l2_swimlane_collector.cpp | Initializes head.current_buf_ptr/seq in AICPU and phase pools; refactors reconcile_counters() and finalize() to access device-side counters and buffer pointers through the nested head structure. |
| Kernel profiling wiring: onboard and simulation | src/a2a3/platform/onboard/aicore/kernel.cpp, src/a2a3/platform/sim/aicore/kernel.cpp | Implements lazy-resolution of L2 swimlane active-head pointer in both kernels; replaces rotation-slot TLS with head-slot TLS and updates kernel-entry initialization to publish head-slot address (or null) before executor dispatch. |
| Executor integration with head-based profiling | src/a2a3/runtime/host_build_graph/aicore/aicore_executor.cpp, src/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp | Updates both AICore executors to declare and lazily resolve L2SwimlaneActiveHead* pointers, initialize local state with UINT32_MAX sentinel, and pass head pointers to l2_swimlane_aicore_record_task. |
| Validation and scheduler initialization | src/a2a3/platform/include/host/l2_swimlane_collector.h, src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp | Adds entry-kind validation to prevent invalid buffer kinds from AICore handling; updates documentation comment; refactors scheduler cold-path to conditionally enable swimlanes and parameterize phase-thread count by orchestration mode. |
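
The lazy head resolution mentioned in the kernel-wiring and executor rows can be sketched as below. This is an assumption-laden illustration: HeadStub and HeadSlotSketch are hypothetical, the real code uses TLS and free functions (set_l2_swimlane_aicore_head_slot / get_l2_swimlane_aicore_head), and treating a zero slot value as "not yet populated" is a guess made for this example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical placeholder for the head structure; contents are irrelevant here.
struct HeadStub {
    uint32_t current_buf_seq;
};

// Stores only the slot address at kernel entry and dereferences it on demand.
class HeadSlotSketch {
public:
    // Called at kernel entry: remember where the head address will appear.
    void set_slot(const volatile uint64_t* slot) {
        slot_ = slot;
        head_ = nullptr;
    }

    // Called on the task path: resolve the head the first time it is available.
    HeadStub* resolve() {
        if (head_ != nullptr) {
            return head_;  // already resolved, no further slot reads
        }
        if (slot_ == nullptr) {
            return nullptr;  // profiling disabled: a null slot was published
        }
        uint64_t addr = *slot_;
        if (addr == 0) {
            return nullptr;  // assumed meaning: slot not populated yet
        }
        head_ = reinterpret_cast<HeadStub*>(static_cast<uintptr_t>(addr));
        return head_;
    }

private:
    const volatile uint64_t* slot_ = nullptr;
    HeadStub* head_ = nullptr;
};
```

Deferring the dereference means the kernel does not depend on the slot being filled before kernel entry; the first recorded task pays one extra load and later tasks reuse the resolved pointer.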

Sequence Diagram

No sequence diagram is generated for this diff: the change is a cohesive architectural refactoring across multiple layers, and the layer-by-layer walkthrough above describes it better than a single high-level flow diagram would.

🎯 4 (Complex) | ⏱️ ~75 minutes

🐰 Hop along the head, not the rotation!
One cache-line unified, sequences now speak clear,
Buffers dance with purpose—the architecture's premiere!
✨🔄

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically describes the main refactoring goal: unifying L2 swimlane pools around a new L2SwimlaneActiveHead cache line structure. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The PR description is comprehensive, detailed, and directly related to the changeset. It clearly explains the refactoring of L2 swimlane pools, the extraction of L2SwimlaneActiveHead, memory savings, API renames, and includes testing details and follow-up items. |

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/a2a3/platform/src/aicpu/l2_swimlane_collector_aicpu.cpp (1)

303-315: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't count the current AICore buffer as dropped on rotation failure.

Lines 309-310 and Lines 331-332 add PLATFORM_AICORE_BUFFER_SIZE to dropped_record_count, but these branches leave head.current_buf_ptr and head.current_buf_seq unchanged. That same full buffer is still recoverable in l2_swimlane_aicpu_flush() via the live = total_record_count - current_buf_seq * BUFFER_SIZE path, so a later successful flush will double-account the records and can make reconcile report accounted > total_device.

Suggested direction
-        ac_state->head.dropped_record_count =
-            ac_state->head.dropped_record_count + static_cast<uint32_t>(PLATFORM_AICORE_BUFFER_SIZE);
+        // Keep the full buffer recoverable; only account overflow attempts once
+        // we know how many writes actually happened past BUFFER_SIZE.
// In the flush path, derive actual overflow after a failed rotation:
uint32_t live = ac_state->head.total_record_count -
                ac_state->head.current_buf_seq * static_cast<uint32_t>(PLATFORM_AICORE_BUFFER_SIZE);
uint32_t overflow =
    (live > static_cast<uint32_t>(PLATFORM_AICORE_BUFFER_SIZE))
        ? (live - static_cast<uint32_t>(PLATFORM_AICORE_BUFFER_SIZE))
        : 0;
ac_state->head.dropped_record_count += overflow;
uint32_t ac_mark = std::min(live, static_cast<uint32_t>(PLATFORM_AICORE_BUFFER_SIZE));

Also applies to: 326-333
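
The suggested overflow derivation can be packaged as a small pure function for checking the arithmetic. LiveSplit and split_live are hypothetical names, and the buffer size is passed as a parameter here rather than taken from PLATFORM_AICORE_BUFFER_SIZE; the numbers in the note below assume a buffer size of 256.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Result of splitting the live record count into recoverable and overflowed parts.
struct LiveSplit {
    uint32_t mark;      // records still recoverable from the current buffer
    uint32_t overflow;  // writes attempted past the end of the buffer
};

// live = total - seq * buffer_size, as in the suggestion above; anything beyond
// one buffer's worth is overflow, the rest is the flush mark.
inline LiveSplit split_live(uint32_t total, uint32_t seq, uint32_t buffer_size) {
    assert(buffer_size > 0);
    uint32_t live = total - seq * buffer_size;
    uint32_t overflow = (live > buffer_size) ? (live - buffer_size) : 0;
    return LiveSplit{std::min(live, buffer_size), overflow};
}
```

With two completed rotations, a total of 700 leaves 188 live records and no overflow, while a total of 800 leaves a full buffer of 256 recoverable records plus 32 overflowed writes.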

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/platform/src/aicpu/l2_swimlane_collector_aicpu.cpp` around lines 303
- 315, The current rotation-failure branches in the AICPU collector (where
ac_state->head.dropped_record_count is increased by PLATFORM_AICORE_BUFFER_SIZE)
incorrectly count the entire buffer as dropped while
head.current_buf_ptr/head.current_buf_seq remain unchanged; instead, in the
rotation-failure path compute the actual overflow using the same logic used in
l2_swimlane_aicpu_flush: derive live = ac_state->head.total_record_count -
ac_state->head.current_buf_seq * PLATFORM_AICORE_BUFFER_SIZE, compute overflow =
max(0, live - PLATFORM_AICORE_BUFFER_SIZE), add only that overflow to
ac_state->head.dropped_record_count, and set the effective mark (ac_mark) to
min(live, PLATFORM_AICORE_BUFFER_SIZE); apply this fix to both places modifying
dropped_record_count (the branch around head==tail and the other similar
branch).
🧹 Nitpick comments (5)
src/a2a3/platform/include/common/l2_swimlane_profiling.h (2)

271-304: ⚡ Quick win

Add offset assertions for the pool ABI.

sizeof(...) == 192 will not catch a future reorder of head and free_queue. This refactor depends on head staying at offset 0 and free_queue at 64 for &pool.head publication and the shared-memory address helpers, so please lock those offsets down with offsetof static_asserts.

Suggested guardrails
+#include <cstddef>
 #include <cstdint>
 #include <vector>
@@
 static_assert(sizeof(L2SwimlaneAicpuTaskPool) == 192, "L2SwimlaneAicpuTaskPool must be 192 bytes");
+static_assert(offsetof(L2SwimlaneAicpuTaskPool, head) == 0, "L2SwimlaneAicpuTaskPool::head must stay first");
+static_assert(
+    offsetof(L2SwimlaneAicpuTaskPool, free_queue) == 64,
+    "L2SwimlaneAicpuTaskPool::free_queue must stay in the second cache line"
+);
@@
 static_assert(sizeof(L2SwimlaneAicoreTaskPool) == 192, "L2SwimlaneAicoreTaskPool must be 192 bytes");
+static_assert(offsetof(L2SwimlaneAicoreTaskPool, head) == 0, "L2SwimlaneAicoreTaskPool::head must stay first");
+static_assert(
+    offsetof(L2SwimlaneAicoreTaskPool, free_queue) == 64,
+    "L2SwimlaneAicoreTaskPool::free_queue must stay in the second cache line"
+);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/platform/include/common/l2_swimlane_profiling.h` around lines 271 -
304, Add explicit offsetof static_asserts to lock field offsets for both
L2SwimlaneAicpuTaskPool and L2SwimlaneAicoreTaskPool: assert that
offsetof(L2SwimlaneAicpuTaskPool, head) == 0 and offsetof(..., free_queue) ==
64, and likewise for L2SwimlaneAicoreTaskPool, so the ABI guarantees that
&pool.head is at base and free_queue is at byte 64; place these asserts next to
the existing sizeof static_asserts and reference the struct names
L2SwimlaneAicpuTaskPool and L2SwimlaneAicoreTaskPool to ensure future reorders
fail compile-time.

217-298: ⚡ Quick win

Update the header-level layout diagram to match the new schema.

The new ActiveHead comments are clear, but the file-level memory map at Lines 18-52 still describes the pre-refactor per-pool fields and omits the AICore pool region. In this header that diagram is effectively the shared-memory ABI map, so it is worth keeping it in sync in the same change.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/platform/include/common/l2_swimlane_profiling.h` around lines 217 -
298, Update the file-level shared-memory layout diagram to reflect the refactor:
replace the old per-pool field listing with the new unified cache-line head
(L2SwimlaneActiveHead) plus free-queue layout and include the AICore pool region
so the ABI map covers all pool kinds (L2SwimlaneAicpuTaskPool /
L2SwimlaneAicpuPhasePool / L2SwimlaneAicoreTaskPool). Make the diagram show
ActiveHead (64B) followed by L2SwimlaneFreeQueue (128B) per-pool, note the
fields present in L2SwimlaneActiveHead (current_buf_ptr, current_buf_seq,
total_record_count, dropped_record_count), and indicate the overall per-pool
size (192B) to match the static_asserts for L2SwimlaneAicpuTaskPool.
src/a2a3/platform/include/aicore/l2_swimlane_collector_aicore.h (1)

73-75: ⚡ Quick win

Fix the head lifecycle comment.

This says head is cached at kernel entry, but aicore_profiling_state.h now documents that only the slot pointer is stashed there and the head itself is resolved lazily after first dispatch. Keeping the two public headers aligned matters here because callers are timing-sensitive.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/platform/include/aicore/l2_swimlane_collector_aicore.h` around lines
73 - 75, The comment for parameter head is inaccurate: it states the full
L2SwimlaneActiveHead is cached at kernel entry, but upstream
aicore_profiling_state.h documents that only the slot pointer is stashed and the
actual head is resolved lazily on first dispatch; update the param doc for head
in l2_swimlane_collector_aicore.h to state that
KernelArgs::l2_swimlane_aicore_rotation_table[block_idx] stores only the slot
pointer and that the L2SwimlaneActiveHead is resolved lazily (not fully cached)
to match the behavior of the lazy resolution in aicore_profiling_state.h so
callers know the timing implications.
src/a2a3/platform/include/aicpu/l2_swimlane_collector_aicpu.h (1)

70-72: ⚡ Quick win

Finish the terminology rename in this header.

This paragraph uses L2SwimlaneActiveHead, but the rotation-table comment immediately above still says AICPU publishes &L2SwimlaneAicoreTaskPool::rotation. Please update that block too so this public header describes a single contract.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/platform/include/aicpu/l2_swimlane_collector_aicpu.h` around lines
70 - 72, The header comment is inconsistent: one paragraph references
L2SwimlaneActiveHead while the rotation-table comment still mentions
L2SwimlaneAicoreTaskPool::rotation; update the rotation-table comment so both
describe the same public contract name (use L2SwimlaneActiveHead consistently),
e.g., change any mentions of L2SwimlaneAicoreTaskPool::rotation to
L2SwimlaneActiveHead and ensure the description of who publishes/consumes that
channel matches the paragraph that says the per-core rotation channel is primed
by popping from L2SwimlaneAicoreTaskPool::free_queue and writing its address
into L2SwimlaneActiveHead.
src/a2a3/platform/src/host/l2_swimlane_collector.cpp (1)

244-250: ⚡ Quick win

Update the comment to the renamed head field.

This block now describes L2SwimlaneActiveHead addresses, but it still says AICPU has direct access to &ac_state->rotation. That field was removed in this refactor, so the comment now points readers at the wrong object.

Small doc fix
-    // direct access to `&ac_state->rotation` device addresses, no
+    // direct access to `&ac_state->head` device addresses, no
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/platform/src/host/l2_swimlane_collector.cpp` around lines 244 - 250,
The comment refers to the removed field &ac_state->rotation; update it to
reference the renamed head field so it correctly describes where AICPU writes
L2SwimlaneActiveHead device addresses. Specifically, in the block describing the
standalone uint64_t[num_aicore] table and
KernelArgs::l2_swimlane_aicore_rotation_table, replace the reference to
&ac_state->rotation with the actual renamed member on ac_state (use the exact
identifier introduced by the refactor), and keep the rest of the explanation
intact (AICPU fills entries in l2_swimlane_aicpu_init and AICore reads
rotation_table[block_idx] at kernel entry). Ensure you mention
L2SwimlaneActiveHead, l2_swimlane_aicpu_init, and
KernelArgs::l2_swimlane_aicore_rotation_table so readers can find the related
code.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c8db322d-6f52-48d1-9872-8f3a3f2e8b54

📥 Commits

Reviewing files that changed from the base of the PR and between cee40dd and d057f2c.

📒 Files selected for processing (12)
  • src/a2a3/platform/include/aicore/aicore_profiling_state.h
  • src/a2a3/platform/include/aicore/l2_swimlane_collector_aicore.h
  • src/a2a3/platform/include/aicpu/l2_swimlane_collector_aicpu.h
  • src/a2a3/platform/include/common/l2_swimlane_profiling.h
  • src/a2a3/platform/include/host/l2_swimlane_collector.h
  • src/a2a3/platform/onboard/aicore/kernel.cpp
  • src/a2a3/platform/sim/aicore/kernel.cpp
  • src/a2a3/platform/src/aicpu/l2_swimlane_collector_aicpu.cpp
  • src/a2a3/platform/src/host/l2_swimlane_collector.cpp
  • src/a2a3/runtime/host_build_graph/aicore/aicore_executor.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp

@ChaoWao ChaoWao force-pushed the refactor/swimlane-cache-line-blocks branch from d057f2c to 0864889 Compare May 31, 2026 07:18
ChaoWao added a commit to hw-native-sys-bot/simpler that referenced this pull request May 31, 2026
The standalone phase header was a vestige of when phase profiling was
an add-on bolted onto the shared-memory layout. Phase metadata is now
co-equal with task pool metadata, so the dedicated cache line + magic
gate + indirection are pure overhead.

Move the three live phase-header fields directly into the root header:
  - num_sched_threads → num_phase_threads (renamed for clarity; it
    counts phase pools, which equals sched_thread_num or
    aicpu_thread_num depending on PTO2_ORCH_TO_SCHED)
  - num_cores → num_phase_cores (disambiguate from the root header's
    pre-existing num_cores — they have different semantics)
  - core_to_thread[PLATFORM_MAX_CORES] — verbatim

Dropped:
  - magic (L2_SWIMLANE_AICPU_PHASE_MAGIC): redundant — host now gates
    on `num_phase_threads > 0` (zero-init means phase init never ran).
  - records_per_thread: AICPU wrote it as PLATFORM_PHASE_RECORDS_PER_THREAD
    but no caller ever read it — dead field.

Shared-memory layout after the merge:
  [L2SwimlaneDataHeader (now includes phase metadata)]
  [L2SwimlaneAicpuTaskPool × num_cores]
  [L2SwimlaneAicoreTaskPool × num_cores]
  [L2SwimlaneAicpuPhasePool × num_phase_threads]   ← was preceded by header

`get_phase_header()` is deleted; `get_phase_buffer_states()` skips
straight from the AicoreTaskPool array to the phase pools.
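
A minimal sketch of the resulting offset arithmetic, assuming every pool is 192B as established by the ActiveHead refactor. The function names and the header_bytes parameter are hypothetical; the actual size of L2SwimlaneDataHeader is not stated here, so it is passed in rather than hard-coded.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Size of one pool (head + free_queue), per the ActiveHead refactor.
constexpr std::size_t kPoolBytes = 192;

// Byte offset of the first phase pool: root header, then the AICPU task pools,
// then the AICore task pools, with no intermediate phase header.
inline std::size_t phase_pools_offset(std::size_t header_bytes, std::size_t num_cores) {
    assert(header_bytes % 64 == 0);  // assumed: header is cache-line sized
    return header_bytes + 2 * num_cores * kPoolBytes;
}

// Host-side gate that replaces the magic check: zero-init means phase init never ran.
inline bool phase_initialized(uint32_t num_phase_threads) {
    return num_phase_threads > 0;
}
```

For example, with an assumed 256-byte header and 4 cores, the phase pools would start at byte 1792.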

AICPU collector keeps a separate `s_phase_initialized` bool so gated
paths can check init-ran without re-reading the device-shared header
on the hot path. Replaces the old `s_l2_swimlane_aicpu_phase_header
== nullptr` check.

Built atop hw-native-sys#939 (ActiveHead refactor).

Test plan:
  - sim swimlane ST (test_l2_swimlane + test_l2_swimlane_mixed): pass
  - sim DFX tests (scope_stats / tensor_dump / pmu / dep_gen): pass
  - pre-commit clean
@ChaoWao ChaoWao force-pushed the refactor/swimlane-cache-line-blocks branch from 0864889 to 7a2345a Compare May 31, 2026 07:29
@ChaoWao ChaoWao force-pushed the refactor/swimlane-cache-line-blocks branch from 7a2345a to f681174 Compare May 31, 2026 08:15
Three pool types — AICPU task, AICPU phase, AICore task — previously
defined their own ad-hoc layouts:
  - AICPU/Phase pool: free_queue (128) + scattered counters + pad → 192B
  - AICore pool:      rotation (64) + free_queue (128) + counters + pad → 256B

Extract the common producer-side cache line as `L2SwimlaneActiveHead`
(64B: current_buf_ptr + current_buf_seq + total/dropped record counts).
Every pool is now exactly `head + free_queue = 192B`. AICore pool drops
the standalone `L2SwimlaneAicoreRotation` struct entirely — its old
fields (`current_buf_ptr` + `generation`) merge into the head's
`current_buf_ptr` + `current_buf_seq` (semantics identical: AICore reads
seq to detect a rotation, then reads ptr to pick up the new buffer).

Saves 64B per AICore core (256→192) and removes the duplicated
"buf seq / generation" counter pair that was bumped twice per rotation.
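
The size claims above can be pinned down with compile-time checks. The sketch below is an assumption-heavy illustration: the field types, the counter names `total_record_count` / `dropped_record_count`, the padding, and the opaque `FreeQueueSketch` are all hypothetical. Only the 64B / 128B / 192B sizes and the `current_buf_ptr` / `current_buf_seq` names come from this commit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical layout of the shared producer-side cache line (64B).
struct alignas(64) L2SwimlaneActiveHead {
    volatile uint64_t current_buf_ptr;       // address of the active buffer
    volatile uint32_t current_buf_seq;       // bumped once per rotation
    volatile uint32_t total_record_count;    // assumed name
    volatile uint32_t dropped_record_count;  // assumed name
    uint8_t pad[64 - 8 - 4 - 4 - 4];         // fill the rest of the line
};

// Opaque stand-in for the 128B free queue; its internals are not shown here.
struct alignas(64) FreeQueueSketch {
    uint8_t bytes[128];
};

// Every pool is head + free_queue.
struct PoolSketch {
    L2SwimlaneActiveHead head;
    FreeQueueSketch free_queue;
};

static_assert(sizeof(L2SwimlaneActiveHead) == 64, "head must be one cache line");
static_assert(sizeof(PoolSketch) == 192, "pool must be head + free_queue");
static_assert(offsetof(PoolSketch, free_queue) == 64, "queue follows the head");
```

Keeping checks like these next to the struct definitions means a later field addition that spills the head past one cache line fails at compile time rather than showing up as a hot-path regression.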

Hot-path verification: AICore still `dcci(head, SINGLE_CACHE_LINE)` —
the head sits in a single cache line by alignment, so per-task cost is
unchanged. False-sharing audit: head is single-writer (AICPU) for all
three pools; readers are either AICore (dcci, AicoreTask pool only) or
host at drain time. AICore never reads the counter fields, so the
invalidation it pays per task is harmless even though counters cohabit
the line.

Renames (atomic with the layout change, so no compat shim):
  set_l2_swimlane_aicore_rotation_slot → set_l2_swimlane_aicore_head_slot
  get_l2_swimlane_aicore_rotation      → get_l2_swimlane_aicore_head
  L2SwimlaneAicoreRotation             → L2SwimlaneActiveHead
  L2SwimlaneAicoreLocalState::cached_generation → cached_buf_seq

KernelArgs::l2_swimlane_aicore_rotation_table field name is preserved
for ABI stability; its comment now notes that slots hold
`L2SwimlaneActiveHead*` addresses.

Init shift: AicoreTask pool's head.current_buf_seq starts at 0 (was:
current_buf_seq=0 + rotation.generation=1). AICore local state's
cached_buf_seq starts at UINT32_MAX so the first record_task call
observes a mismatch and loads the buffer. The two aicore_executor
sites that aggregate-initialize the local state explicitly pass
UINT32_MAX so the in-class default isn't shadowed.
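
The effect of the UINT32_MAX sentinel can be seen in a reduced model of the reader side. This is a sketch under stated assumptions: the struct and function names are hypothetical, and the cache invalidation (`dcci`) and memory fences that the real AICore path needs are deliberately omitted.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical reduced view of the head as the reader sees it.
struct HeadSketch {
    uint64_t current_buf_ptr;
    uint32_t current_buf_seq;
};

// Hypothetical reduced AICore-local state.
struct LocalStateSketch {
    uint32_t cached_buf_seq = UINT32_MAX;  // sentinel: never equals seq 0
    uint64_t buf_ptr = 0;
};

// Reads seq first to detect a rotation, then ptr to pick up the new buffer.
// Returns true when the local state was refreshed.
inline bool refresh_if_rotated(const HeadSketch& head, LocalStateSketch& local) {
    const uint32_t seq = head.current_buf_seq;
    if (seq == local.cached_buf_seq) return false;  // same buffer as before
    local.buf_ptr = head.current_buf_ptr;
    local.cached_buf_seq = seq;
    return true;
}
```

Because the head's seq starts at 0 and the cached value starts at UINT32_MAX, the very first call sees a mismatch and loads the initial buffer without needing a separate "first time" flag.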

Test plan:
- pytest tests/st/.../dfx --platform a2a3sim --enable-l2-swimlane → all 8 pass
- pytest tests/st/.../dfx/l2_swimlane --platform a2a3sim --enable-l2-swimlane → 2 pass
- pre-commit clean on all touched files
- CI green expected (onboard + sim, a2a3 + a5)
ChaoWao added a commit to hw-native-sys-bot/simpler that referenced this pull request May 31, 2026
The standalone phase header was a vestige of when phase profiling was
an add-on bolted onto the shared-memory layout. Phase metadata is now
co-equal with task pool metadata, so the dedicated cache line + magic
gate + indirection are pure overhead.

Move the three live phase-header fields directly into the root header:
  - num_sched_threads → num_phase_threads (renamed for clarity; it
    counts phase pools, which equals sched_thread_num or
    aicpu_thread_num depending on PTO2_ORCH_TO_SCHED)
  - num_cores → num_phase_cores (disambiguate from the root header's
    pre-existing num_cores — they have different semantics)
  - core_to_thread[PLATFORM_MAX_CORES] — verbatim

Dropped:
  - magic (L2_SWIMLANE_AICPU_PHASE_MAGIC): redundant — host now gates
    on `num_phase_threads > 0` (zero-init means phase init never ran).
  - records_per_thread: AICPU wrote it as PLATFORM_PHASE_RECORDS_PER_THREAD
    but no caller ever read it — dead field.

Shared-memory layout after the merge:
  [L2SwimlaneDataHeader (now includes phase metadata)]
  [L2SwimlaneAicpuTaskPool × num_cores]
  [L2SwimlaneAicoreTaskPool × num_cores]
  [L2SwimlaneAicpuPhasePool × num_phase_threads]   ← was preceded by header

`get_phase_header()` is deleted; `get_phase_buffer_states()` skips
straight from the AicoreTaskPool array to the phase pools.

AICPU collector keeps a separate `s_phase_initialized` bool so gated
paths can check init-ran without re-reading the device-shared header
on the hot path. Replaces the old `s_l2_swimlane_aicpu_phase_header
== nullptr` check.

Built atop hw-native-sys#939 (ActiveHead refactor).

Test plan:
  - sim swimlane ST (test_l2_swimlane + test_l2_swimlane_mixed): pass
  - sim DFX tests (scope_stats / tensor_dump / pmu / dep_gen): pass
  - pre-commit clean
@ChaoWao merged commit dc83d5a into hw-native-sys:main on May 31, 2026 (15 checks passed)
@ChaoWao deleted the refactor/swimlane-cache-line-blocks branch on May 31, 2026 08:53
ChaoWao added a commit that referenced this pull request May 31, 2026
…941)

The standalone phase header was a vestige of when phase profiling was
an add-on bolted onto the shared-memory layout. Phase metadata is now
co-equal with task pool metadata, so the dedicated cache line + magic
gate + indirection are pure overhead.

Move the three live phase-header fields directly into the root header:
  - num_sched_threads → num_phase_threads (renamed for clarity; it
    counts phase pools, which equals sched_thread_num or
    aicpu_thread_num depending on PTO2_ORCH_TO_SCHED)
  - num_cores → num_phase_cores (disambiguate from the root header's
    pre-existing num_cores — they have different semantics)
  - core_to_thread[PLATFORM_MAX_CORES] — verbatim

Dropped:
  - magic (L2_SWIMLANE_AICPU_PHASE_MAGIC): redundant — host now gates
    on `num_phase_threads > 0` (zero-init means phase init never ran).
  - records_per_thread: AICPU wrote it as PLATFORM_PHASE_RECORDS_PER_THREAD
    but no caller ever read it — dead field.

Shared-memory layout after the merge:
  [L2SwimlaneDataHeader (now includes phase metadata)]
  [L2SwimlaneAicpuTaskPool × num_cores]
  [L2SwimlaneAicoreTaskPool × num_cores]
  [L2SwimlaneAicpuPhasePool × num_phase_threads]   ← was preceded by header

`get_phase_header()` is deleted; `get_phase_buffer_states()` skips
straight from the AicoreTaskPool array to the phase pools.

AICPU collector keeps a separate `s_phase_initialized` bool so gated
paths can check init-ran without re-reading the device-shared header
on the hot path. Replaces the old `s_l2_swimlane_aicpu_phase_header
== nullptr` check.

Built atop #939 (ActiveHead refactor).

Test plan:
  - sim swimlane ST (test_l2_swimlane + test_l2_swimlane_mixed): pass
  - sim DFX tests (scope_stats / tensor_dump / pmu / dep_gen): pass
  - pre-commit clean

Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>
ChaoWao added a commit that referenced this pull request May 31, 2026
After #939 (pool unification), #941 (PhaseHeader merge), and #942
(split sched/orch phase records), several comments and doc sections
still referenced the pre-split a2a3 layout. Audit and update:

a2a3 code/comments:
- platform_config.h: PROF_BUFFERS_PER_THREAD doc references both
  SchedPhaseBuffer and OrchPhaseBuffer (was: single PhaseBuffer);
  PROF_READYQUEUE_SIZE comment now says "four kinds"; formula bumped
  by 2x on the per-thread term to cover both sched and orch pool
  enqueues (matches host alloc which iterates both pool arrays).
- l2_swimlane_profiling.h header layout diagram: name the two split
  phase-thread counts.
- l2_swimlane_collector_aicpu.cpp: cross-launch reset comment now
  references s_sched_phase_pools / s_orch_phase_pools (was: single
  s_aicpu_phase_pools) and record_sched_phase / record_orch_phase.
- scheduler_dispatch.cpp / aicpu_executor.cpp: comments reference
  the split record types.

src/common/ shared comments (now mixed-arch):
- profiler_base.h / buffer_pool_manager.h: qualify
  L2SwimlaneAicpuPhaseHeader::magic example as "on a5" since the
  struct no longer exists on a2a3.

docs/dfx/l2-swimlane-profiling.md:
- §5.1: layout block + record list now distinguish a2a3 split shape
  (SchedPhaseRecord 40B + OrchPhaseRecord 32B, two pool arrays) from
  a5's still-unified shape (pending port).
- §5.2: a2a3 buffer-kind list updated to all four kinds (was: two);
  ASCII data-flow diagram redrawn to show split phase records;
  kBufferKinds = 4 in the L2SwimlaneModule trait description.
- §5.3 (a5): num_phase_threads / core_to_thread[] reference corrected
  to live in L2SwimlaneAicpuPhaseHeader on a5 (was wrongly attributed
  to L2SwimlaneDataHeader).
- §5.4: comparison table separates task record (identical) from
  phase record (diverged); ready-queue and kBufferKinds rows
  call out the a2a3=4 vs a5=2 split.
- §6: overhead description differentiates a2a3's per-emit
  SchedPhase + per-submit OrchPhase from a5's unified PhaseRecord
  (was: "4 phases × 40B per iteration", which described a removed
  shape).
- §8 FAQ: "phase records empty" entry gates a2a3 on
  num_{sched,orch}_phase_threads, a5 on PhaseHeader::magic.

No semantic code changes except the READYQUEUE_SIZE formula bump
(adds ~8KB to the header; necessary correctness fix given the second
phase pool).

Test plan:
- pre-commit clean
- onboard l2_swimlane STs (--enable-l2-swimlane --enable-dep-gen): 2 passed
- onboard paged_attention_unroll level 4: 1 passed

Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>