Refactor: unify L2 swimlane pools around L2SwimlaneActiveHead cache line#939
Conversation
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
📝 WalkthroughWalkthroughThis PR refactors L2 swimlane profiling infrastructure from a per-core "rotation" channel model to an "active head" model. A new unified ChangesL2 Swimlane Profiling Rotation to Active Head Migration
Sequence DiagramNo sequence diagrams are generated for this diff: the changes are a cohesive architecture refactoring spanning multiple layers with sufficient complexity that would benefit from the layer-by-layer review walkthrough rather than a single high-level flow diagram. 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/a2a3/platform/src/aicpu/l2_swimlane_collector_aicpu.cpp (1)
303-315:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDon't count the current AICore buffer as dropped on rotation failure.
Lines 309-310 and Lines 331-332 add
PLATFORM_AICORE_BUFFER_SIZEtodropped_record_count, but these branches leavehead.current_buf_ptrandhead.current_buf_sequnchanged. That same full buffer is still recoverable inl2_swimlane_aicpu_flush()via thelive = total_record_count - current_buf_seq * BUFFER_SIZEpath, so a later successful flush will double-account the records and can make reconcile reportaccounted > total_device.Suggested direction
- ac_state->head.dropped_record_count = - ac_state->head.dropped_record_count + static_cast<uint32_t>(PLATFORM_AICORE_BUFFER_SIZE); + // Keep the full buffer recoverable; only account overflow attempts once + // we know how many writes actually happened past BUFFER_SIZE.// In the flush path, derive actual overflow after a failed rotation: uint32_t live = ac_state->head.total_record_count - ac_state->head.current_buf_seq * static_cast<uint32_t>(PLATFORM_AICORE_BUFFER_SIZE); uint32_t overflow = (live > static_cast<uint32_t>(PLATFORM_AICORE_BUFFER_SIZE)) ? (live - static_cast<uint32_t>(PLATFORM_AICORE_BUFFER_SIZE)) : 0; ac_state->head.dropped_record_count += overflow; uint32_t ac_mark = std::min(live, static_cast<uint32_t>(PLATFORM_AICORE_BUFFER_SIZE));Also applies to: 326-333
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/platform/src/aicpu/l2_swimlane_collector_aicpu.cpp` around lines 303 - 315, The current rotation-failure branches in the AICPU collector (where ac_state->head.dropped_record_count is increased by PLATFORM_AICORE_BUFFER_SIZE) incorrectly count the entire buffer as dropped while head.current_buf_ptr/head.current_buf_seq remain unchanged; instead, in the rotation-failure path compute the actual overflow using the same logic used in l2_swimlane_aicpu_flush: derive live = ac_state->head.total_record_count - ac_state->head.current_buf_seq * PLATFORM_AICORE_BUFFER_SIZE, compute overflow = max(0, live - PLATFORM_AICORE_BUFFER_SIZE), add only that overflow to ac_state->head.dropped_record_count, and set the effective mark (ac_mark) to min(live, PLATFORM_AICORE_BUFFER_SIZE); apply this fix to both places modifying dropped_record_count (the branch around head==tail and the other similar branch).
🧹 Nitpick comments (5)
src/a2a3/platform/include/common/l2_swimlane_profiling.h (2)
271-304: ⚡ Quick winAdd offset assertions for the pool ABI.
sizeof(...) == 192will not catch a future reorder ofheadandfree_queue. This refactor depends onheadstaying at offset 0 andfree_queueat 64 for&pool.headpublication and the shared-memory address helpers, so please lock those offsets down withoffsetofstatic_asserts.Suggested guardrails
+#include <cstddef> `#include` <cstdint> `#include` <vector> @@ static_assert(sizeof(L2SwimlaneAicpuTaskPool) == 192, "L2SwimlaneAicpuTaskPool must be 192 bytes"); +static_assert(offsetof(L2SwimlaneAicpuTaskPool, head) == 0, "L2SwimlaneAicpuTaskPool::head must stay first"); +static_assert( + offsetof(L2SwimlaneAicpuTaskPool, free_queue) == 64, + "L2SwimlaneAicpuTaskPool::free_queue must stay in the second cache line" +); @@ static_assert(sizeof(L2SwimlaneAicoreTaskPool) == 192, "L2SwimlaneAicoreTaskPool must be 192 bytes"); +static_assert(offsetof(L2SwimlaneAicoreTaskPool, head) == 0, "L2SwimlaneAicoreTaskPool::head must stay first"); +static_assert( + offsetof(L2SwimlaneAicoreTaskPool, free_queue) == 64, + "L2SwimlaneAicoreTaskPool::free_queue must stay in the second cache line" +);🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/platform/include/common/l2_swimlane_profiling.h` around lines 271 - 304, Add explicit offsetof static_asserts to lock field offsets for both L2SwimlaneAicpuTaskPool and L2SwimlaneAicoreTaskPool: assert that offsetof(L2SwimlaneAicpuTaskPool, head) == 0 and offsetof(..., free_queue) == 64, and likewise for L2SwimlaneAicoreTaskPool, so the ABI guarantees that &pool.head is at base and free_queue is at byte 64; place these asserts next to the existing sizeof static_asserts and reference the struct names L2SwimlaneAicpuTaskPool and L2SwimlaneAicoreTaskPool to ensure future reorders fail compile-time.
217-298: ⚡ Quick winUpdate the header-level layout diagram to match the new schema.
The new
ActiveHeadcomments are clear, but the file-level memory map at Lines 18-52 still describes the pre-refactor per-pool fields and omits the AICore pool region. In this header that diagram is effectively the shared-memory ABI map, so it is worth keeping it in sync in the same change.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/platform/include/common/l2_swimlane_profiling.h` around lines 217 - 298, Update the file-level shared-memory layout diagram to reflect the refactor: replace the old per-pool field listing with the new unified cache-line head (L2SwimlaneActiveHead) plus free-queue layout and include the AICore pool region so the ABI map covers all pool kinds (L2SwimlaneAicpuTaskPool / L2SwimlaneAicpuPhasePool / L2SwimlaneAicoreTaskPool). Make the diagram show ActiveHead (64B) followed by L2SwimlaneFreeQueue (128B) per-pool, note the fields present in L2SwimlaneActiveHead (current_buf_ptr, current_buf_seq, total_record_count, dropped_record_count), and indicate the overall per-pool size (192B) to match the static_asserts for L2SwimlaneAicpuTaskPool.src/a2a3/platform/include/aicore/l2_swimlane_collector_aicore.h (1)
73-75: ⚡ Quick winFix the
headlifecycle comment.This says
headis cached at kernel entry, butaicore_profiling_state.hnow documents that only the slot pointer is stashed there and the head itself is resolved lazily after first dispatch. Keeping the two public headers aligned matters here because callers are timing-sensitive.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/platform/include/aicore/l2_swimlane_collector_aicore.h` around lines 73 - 75, The comment for parameter head is inaccurate: it states the full L2SwimlaneActiveHead is cached at kernel entry, but upstream aicore_profiling_state.h documents that only the slot pointer is stashed and the actual head is resolved lazily on first dispatch; update the param doc for head in l2_swimlane_collector_aicore.h to state that KernelArgs::l2_swimlane_aicore_rotation_table[block_idx] stores only the slot pointer and that the L2SwimlaneActiveHead is resolved lazily (not fully cached) to match the behavior of the lazy resolution in aicore_profiling_state.h so callers know the timing implications.src/a2a3/platform/include/aicpu/l2_swimlane_collector_aicpu.h (1)
70-72: ⚡ Quick winFinish the terminology rename in this header.
This paragraph uses
L2SwimlaneActiveHead, but the rotation-table comment immediately above still says AICPU publishes&L2SwimlaneAicoreTaskPool::rotation. Please update that block too so this public header describes a single contract.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/platform/include/aicpu/l2_swimlane_collector_aicpu.h` around lines 70 - 72, The header comment is inconsistent: one paragraph references L2SwimlaneActiveHead while the rotation-table comment still mentions L2SwimlaneAicoreTaskPool::rotation; update the rotation-table comment so both describe the same public contract name (use L2SwimlaneActiveHead consistently), e.g., change any mentions of L2SwimlaneAicoreTaskPool::rotation to L2SwimlaneActiveHead and ensure the description of who publishes/consumes that channel matches the paragraph that says the per-core rotation channel is primed by popping from L2SwimlaneAicoreTaskPool::free_queue and writing its address into L2SwimlaneActiveHead.src/a2a3/platform/src/host/l2_swimlane_collector.cpp (1)
244-250: ⚡ Quick winUpdate the comment to the renamed head field.
This block now describes
L2SwimlaneActiveHeadaddresses, but it still says AICPU has direct access to&ac_state->rotation. That field was removed in this refactor, so the comment now points readers at the wrong object.Small doc fix
- // direct access to `&ac_state->rotation` device addresses, no + // direct access to `&ac_state->head` device addresses, no🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/platform/src/host/l2_swimlane_collector.cpp` around lines 244 - 250, The comment refers to the removed field &ac_state->rotation; update it to reference the renamed head field so it correctly describes where AICPU writes L2SwimlaneActiveHead device addresses. Specifically, in the block describing the standalone uint64_t[num_aicore] table and KernelArgs::l2_swimlane_aicore_rotation_table, replace the reference to &ac_state->rotation with the actual renamed member on ac_state (use the exact identifier introduced by the refactor), and keep the rest of the explanation intact (AICPU fills entries in l2_swimlane_aicpu_init and AICore reads rotation_table[block_idx] at kernel entry). Ensure you mention L2SwimlaneActiveHead, l2_swimlane_aicpu_init, and KernelArgs::l2_swimlane_aicore_rotation_table so readers can find the related code.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@src/a2a3/platform/src/aicpu/l2_swimlane_collector_aicpu.cpp`:
- Around line 303-315: The current rotation-failure branches in the AICPU
collector (where ac_state->head.dropped_record_count is increased by
PLATFORM_AICORE_BUFFER_SIZE) incorrectly count the entire buffer as dropped
while head.current_buf_ptr/head.current_buf_seq remain unchanged; instead, in
the rotation-failure path compute the actual overflow using the same logic used
in l2_swimlane_aicpu_flush: derive live = ac_state->head.total_record_count -
ac_state->head.current_buf_seq * PLATFORM_AICORE_BUFFER_SIZE, compute overflow =
max(0, live - PLATFORM_AICORE_BUFFER_SIZE), add only that overflow to
ac_state->head.dropped_record_count, and set the effective mark (ac_mark) to
min(live, PLATFORM_AICORE_BUFFER_SIZE); apply this fix to both places modifying
dropped_record_count (the branch around head==tail and the other similar
branch).
---
Nitpick comments:
In `@src/a2a3/platform/include/aicore/l2_swimlane_collector_aicore.h`:
- Around line 73-75: The comment for parameter head is inaccurate: it states the
full L2SwimlaneActiveHead is cached at kernel entry, but upstream
aicore_profiling_state.h documents that only the slot pointer is stashed and the
actual head is resolved lazily on first dispatch; update the param doc for head
in l2_swimlane_collector_aicore.h to state that
KernelArgs::l2_swimlane_aicore_rotation_table[block_idx] stores only the slot
pointer and that the L2SwimlaneActiveHead is resolved lazily (not fully cached)
to match the behavior of the lazy resolution in aicore_profiling_state.h so
callers know the timing implications.
In `@src/a2a3/platform/include/aicpu/l2_swimlane_collector_aicpu.h`:
- Around line 70-72: The header comment is inconsistent: one paragraph
references L2SwimlaneActiveHead while the rotation-table comment still mentions
L2SwimlaneAicoreTaskPool::rotation; update the rotation-table comment so both
describe the same public contract name (use L2SwimlaneActiveHead consistently),
e.g., change any mentions of L2SwimlaneAicoreTaskPool::rotation to
L2SwimlaneActiveHead and ensure the description of who publishes/consumes that
channel matches the paragraph that says the per-core rotation channel is primed
by popping from L2SwimlaneAicoreTaskPool::free_queue and writing its address
into L2SwimlaneActiveHead.
In `@src/a2a3/platform/include/common/l2_swimlane_profiling.h`:
- Around line 271-304: Add explicit offsetof static_asserts to lock field
offsets for both L2SwimlaneAicpuTaskPool and L2SwimlaneAicoreTaskPool: assert
that offsetof(L2SwimlaneAicpuTaskPool, head) == 0 and offsetof(..., free_queue)
== 64, and likewise for L2SwimlaneAicoreTaskPool, so the ABI guarantees that
&pool.head is at base and free_queue is at byte 64; place these asserts next to
the existing sizeof static_asserts and reference the struct names
L2SwimlaneAicpuTaskPool and L2SwimlaneAicoreTaskPool to ensure future reorders
fail compile-time.
- Around line 217-298: Update the file-level shared-memory layout diagram to
reflect the refactor: replace the old per-pool field listing with the new
unified cache-line head (L2SwimlaneActiveHead) plus free-queue layout and
include the AICore pool region so the ABI map covers all pool kinds
(L2SwimlaneAicpuTaskPool / L2SwimlaneAicpuPhasePool / L2SwimlaneAicoreTaskPool).
Make the diagram show ActiveHead (64B) followed by L2SwimlaneFreeQueue (128B)
per-pool, note the fields present in L2SwimlaneActiveHead (current_buf_ptr,
current_buf_seq, total_record_count, dropped_record_count), and indicate the
overall per-pool size (192B) to match the static_asserts for
L2SwimlaneAicpuTaskPool.
In `@src/a2a3/platform/src/host/l2_swimlane_collector.cpp`:
- Around line 244-250: The comment refers to the removed field
&ac_state->rotation; update it to reference the renamed head field so it
correctly describes where AICPU writes L2SwimlaneActiveHead device addresses.
Specifically, in the block describing the standalone uint64_t[num_aicore] table
and KernelArgs::l2_swimlane_aicore_rotation_table, replace the reference to
&ac_state->rotation with the actual renamed member on ac_state (use the exact
identifier introduced by the refactor), and keep the rest of the explanation
intact (AICPU fills entries in l2_swimlane_aicpu_init and AICore reads
rotation_table[block_idx] at kernel entry). Ensure you mention
L2SwimlaneActiveHead, l2_swimlane_aicpu_init, and
KernelArgs::l2_swimlane_aicore_rotation_table so readers can find the related
code.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: c8db322d-6f52-48d1-9872-8f3a3f2e8b54
📒 Files selected for processing (12)
src/a2a3/platform/include/aicore/aicore_profiling_state.hsrc/a2a3/platform/include/aicore/l2_swimlane_collector_aicore.hsrc/a2a3/platform/include/aicpu/l2_swimlane_collector_aicpu.hsrc/a2a3/platform/include/common/l2_swimlane_profiling.hsrc/a2a3/platform/include/host/l2_swimlane_collector.hsrc/a2a3/platform/onboard/aicore/kernel.cppsrc/a2a3/platform/sim/aicore/kernel.cppsrc/a2a3/platform/src/aicpu/l2_swimlane_collector_aicpu.cppsrc/a2a3/platform/src/host/l2_swimlane_collector.cppsrc/a2a3/runtime/host_build_graph/aicore/aicore_executor.cppsrc/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cppsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp
d057f2c to
0864889
Compare
The standalone phase header was a vestige of when phase profiling was
an add-on bolted onto the shared-memory layout. Phase metadata is now
co-equal with task pool metadata, so the dedicated cache line + magic
gate + indirection are pure overhead.
Move the three live phase-header fields directly into the root header:
- num_sched_threads → num_phase_threads (renamed for clarity; it
counts phase pools, which equals sched_thread_num or
aicpu_thread_num depending on PTO2_ORCH_TO_SCHED)
- num_cores → num_phase_cores (disambiguate from the root header's
pre-existing num_cores — they have different semantics)
- core_to_thread[PLATFORM_MAX_CORES] — verbatim
Dropped:
- magic (L2_SWIMLANE_AICPU_PHASE_MAGIC): redundant — host now gates
on `num_phase_threads > 0` (zero-init means phase init never ran).
- records_per_thread: AICPU wrote it as PLATFORM_PHASE_RECORDS_PER_THREAD
but no caller ever read it — dead field.
Shared-memory layout after the merge:
[L2SwimlaneDataHeader (now includes phase metadata)]
[L2SwimlaneAicpuTaskPool × num_cores]
[L2SwimlaneAicoreTaskPool × num_cores]
[L2SwimlaneAicpuPhasePool × num_phase_threads] ← was preceded by header
`get_phase_header()` is deleted; `get_phase_buffer_states()` skips
straight from the AicoreTaskPool array to the phase pools.
AICPU collector keeps a separate `s_phase_initialized` bool so gated
paths can check init-ran without re-reading the device-shared header
on the hot path. Replaces the old `s_l2_swimlane_aicpu_phase_header
== nullptr` check.
Built atop hw-native-sys#939 (ActiveHead refactor).
Test plan:
- sim swimlane ST (test_l2_swimlane + test_l2_swimlane_mixed): pass
- sim DFX tests (scope_stats / tensor_dump / pmu / dep_gen): pass
- pre-commit clean
0864889 to
7a2345a
Compare
The standalone phase header was a vestige of when phase profiling was
an add-on bolted onto the shared-memory layout. Phase metadata is now
co-equal with task pool metadata, so the dedicated cache line + magic
gate + indirection are pure overhead.
Move the three live phase-header fields directly into the root header:
- num_sched_threads → num_phase_threads (renamed for clarity; it
counts phase pools, which equals sched_thread_num or
aicpu_thread_num depending on PTO2_ORCH_TO_SCHED)
- num_cores → num_phase_cores (disambiguate from the root header's
pre-existing num_cores — they have different semantics)
- core_to_thread[PLATFORM_MAX_CORES] — verbatim
Dropped:
- magic (L2_SWIMLANE_AICPU_PHASE_MAGIC): redundant — host now gates
on `num_phase_threads > 0` (zero-init means phase init never ran).
- records_per_thread: AICPU wrote it as PLATFORM_PHASE_RECORDS_PER_THREAD
but no caller ever read it — dead field.
Shared-memory layout after the merge:
[L2SwimlaneDataHeader (now includes phase metadata)]
[L2SwimlaneAicpuTaskPool × num_cores]
[L2SwimlaneAicoreTaskPool × num_cores]
[L2SwimlaneAicpuPhasePool × num_phase_threads] ← was preceded by header
`get_phase_header()` is deleted; `get_phase_buffer_states()` skips
straight from the AicoreTaskPool array to the phase pools.
AICPU collector keeps a separate `s_phase_initialized` bool so gated
paths can check init-ran without re-reading the device-shared header
on the hot path. Replaces the old `s_l2_swimlane_aicpu_phase_header
== nullptr` check.
Built atop hw-native-sys#939 (ActiveHead refactor).
Test plan:
- sim swimlane ST (test_l2_swimlane + test_l2_swimlane_mixed): pass
- sim DFX tests (scope_stats / tensor_dump / pmu / dep_gen): pass
- pre-commit clean
7a2345a to
f681174
Compare
Three pool types — AICPU task, AICPU phase, AICore task — previously defined their own ad-hoc layouts: - AICPU/Phase pool: free_queue (128) + scattered counters + pad → 192B - AICore pool: rotation (64) + free_queue (128) + counters + pad → 256B Extract the common producer-side cache line as `L2SwimlaneActiveHead` (64B: current_buf_ptr + current_buf_seq + total/dropped record counts). Every pool is now exactly `head + free_queue = 192B`. AICore pool drops the standalone `L2SwimlaneAicoreRotation` struct entirely — its old fields (`current_buf_ptr` + `generation`) merge into the head's `current_buf_ptr` + `current_buf_seq` (semantics identical: AICore reads seq to detect a rotation, then reads ptr to pick up the new buffer). Saves 64B per AICore core (256→192) and removes the duplicated "buf seq / generation" counter pair that was bumped twice per rotation. Hot-path verification: AICore still `dcci(head, SINGLE_CACHE_LINE)` — the head sits in a single cache line by alignment, so per-task cost is unchanged. False-sharing audit: head is single-writer (AICPU) for all three pools; readers are either AICore (dcci, AicoreTask pool only) or host at drain time. AICore never reads the counter fields, so the invalidation it pays per task is harmless even though counters cohabit the line. Renames (atomic with the layout change, so no compat shim): set_l2_swimlane_aicore_rotation_slot → set_l2_swimlane_aicore_head_slot get_l2_swimlane_aicore_rotation → get_l2_swimlane_aicore_head L2SwimlaneAicoreRotation → L2SwimlaneActiveHead L2SwimlaneAicoreLocalState::cached_generation → cached_buf_seq KernelArgs::l2_swimlane_aicore_rotation_table field name is preserved for ABI stability; its comment now notes that slots hold `L2SwimlaneActiveHead*` addresses. Init shift: AicoreTask pool's head.current_buf_seq starts at 0 (was: current_buf_seq=0 + rotation.generation=1). AICore local state's cached_buf_seq starts at UINT32_MAX so the first record_task call observes a mismatch and loads the buffer. The two aicore_executor sites that aggregate-initialize the local state explicitly pass UINT32_MAX so the in-class default isn't shadowed. Test plan: - pytest tests/st/.../dfx --platform a2a3sim --enable-l2-swimlane → all 8 pass - pytest tests/st/.../dfx/l2_swimlane --platform a2a3sim --enable-l2-swimlane → 2 pass - pre-commit clean on all touched files - CI green expected (onboard + sim, a2a3 + a5)
The standalone phase header was a vestige of when phase profiling was
an add-on bolted onto the shared-memory layout. Phase metadata is now
co-equal with task pool metadata, so the dedicated cache line + magic
gate + indirection are pure overhead.
Move the three live phase-header fields directly into the root header:
- num_sched_threads → num_phase_threads (renamed for clarity; it
counts phase pools, which equals sched_thread_num or
aicpu_thread_num depending on PTO2_ORCH_TO_SCHED)
- num_cores → num_phase_cores (disambiguate from the root header's
pre-existing num_cores — they have different semantics)
- core_to_thread[PLATFORM_MAX_CORES] — verbatim
Dropped:
- magic (L2_SWIMLANE_AICPU_PHASE_MAGIC): redundant — host now gates
on `num_phase_threads > 0` (zero-init means phase init never ran).
- records_per_thread: AICPU wrote it as PLATFORM_PHASE_RECORDS_PER_THREAD
but no caller ever read it — dead field.
Shared-memory layout after the merge:
[L2SwimlaneDataHeader (now includes phase metadata)]
[L2SwimlaneAicpuTaskPool × num_cores]
[L2SwimlaneAicoreTaskPool × num_cores]
[L2SwimlaneAicpuPhasePool × num_phase_threads] ← was preceded by header
`get_phase_header()` is deleted; `get_phase_buffer_states()` skips
straight from the AicoreTaskPool array to the phase pools.
AICPU collector keeps a separate `s_phase_initialized` bool so gated
paths can check init-ran without re-reading the device-shared header
on the hot path. Replaces the old `s_l2_swimlane_aicpu_phase_header
== nullptr` check.
Built atop hw-native-sys#939 (ActiveHead refactor).
Test plan:
- sim swimlane ST (test_l2_swimlane + test_l2_swimlane_mixed): pass
- sim DFX tests (scope_stats / tensor_dump / pmu / dep_gen): pass
- pre-commit clean
The standalone phase header was a vestige of when phase profiling was
an add-on bolted onto the shared-memory layout. Phase metadata is now
co-equal with task pool metadata, so the dedicated cache line + magic
gate + indirection are pure overhead.
Move the three live phase-header fields directly into the root header:
- num_sched_threads → num_phase_threads (renamed for clarity; it
counts phase pools, which equals sched_thread_num or
aicpu_thread_num depending on PTO2_ORCH_TO_SCHED)
- num_cores → num_phase_cores (disambiguate from the root header's
pre-existing num_cores — they have different semantics)
- core_to_thread[PLATFORM_MAX_CORES] — verbatim
Dropped:
- magic (L2_SWIMLANE_AICPU_PHASE_MAGIC): redundant — host now gates
on `num_phase_threads > 0` (zero-init means phase init never ran).
- records_per_thread: AICPU wrote it as PLATFORM_PHASE_RECORDS_PER_THREAD
but no caller ever read it — dead field.
Shared-memory layout after the merge:
[L2SwimlaneDataHeader (now includes phase metadata)]
[L2SwimlaneAicpuTaskPool × num_cores]
[L2SwimlaneAicoreTaskPool × num_cores]
[L2SwimlaneAicpuPhasePool × num_phase_threads] ← was preceded by header
`get_phase_header()` is deleted; `get_phase_buffer_states()` skips
straight from the AicoreTaskPool array to the phase pools.
AICPU collector keeps a separate `s_phase_initialized` bool so gated
paths can check init-ran without re-reading the device-shared header
on the hot path. Replaces the old `s_l2_swimlane_aicpu_phase_header
== nullptr` check.
Built atop hw-native-sys#939 (ActiveHead refactor).
Test plan:
- sim swimlane ST (test_l2_swimlane + test_l2_swimlane_mixed): pass
- sim DFX tests (scope_stats / tensor_dump / pmu / dep_gen): pass
- pre-commit clean
The standalone phase header was a vestige of when phase profiling was
an add-on bolted onto the shared-memory layout. Phase metadata is now
co-equal with task pool metadata, so the dedicated cache line + magic
gate + indirection are pure overhead.
Move the three live phase-header fields directly into the root header:
- num_sched_threads → num_phase_threads (renamed for clarity; it
counts phase pools, which equals sched_thread_num or
aicpu_thread_num depending on PTO2_ORCH_TO_SCHED)
- num_cores → num_phase_cores (disambiguate from the root header's
pre-existing num_cores — they have different semantics)
- core_to_thread[PLATFORM_MAX_CORES] — verbatim
Dropped:
- magic (L2_SWIMLANE_AICPU_PHASE_MAGIC): redundant — host now gates
on `num_phase_threads > 0` (zero-init means phase init never ran).
- records_per_thread: AICPU wrote it as PLATFORM_PHASE_RECORDS_PER_THREAD
but no caller ever read it — dead field.
Shared-memory layout after the merge:
[L2SwimlaneDataHeader (now includes phase metadata)]
[L2SwimlaneAicpuTaskPool × num_cores]
[L2SwimlaneAicoreTaskPool × num_cores]
[L2SwimlaneAicpuPhasePool × num_phase_threads] ← was preceded by header
`get_phase_header()` is deleted; `get_phase_buffer_states()` skips
straight from the AicoreTaskPool array to the phase pools.
AICPU collector keeps a separate `s_phase_initialized` bool so gated
paths can check init-ran without re-reading the device-shared header
on the hot path. Replaces the old `s_l2_swimlane_aicpu_phase_header
== nullptr` check.
Built atop hw-native-sys#939 (ActiveHead refactor).
Test plan:
- sim swimlane ST (test_l2_swimlane + test_l2_swimlane_mixed): pass
- sim DFX tests (scope_stats / tensor_dump / pmu / dep_gen): pass
- pre-commit clean
The standalone phase header was a vestige of when phase profiling was
an add-on bolted onto the shared-memory layout. Phase metadata is now
co-equal with task pool metadata, so the dedicated cache line + magic
gate + indirection are pure overhead.
Move the three live phase-header fields directly into the root header:
- num_sched_threads → num_phase_threads (renamed for clarity; it
counts phase pools, which equals sched_thread_num or
aicpu_thread_num depending on PTO2_ORCH_TO_SCHED)
- num_cores → num_phase_cores (disambiguate from the root header's
pre-existing num_cores — they have different semantics)
- core_to_thread[PLATFORM_MAX_CORES] — verbatim
Dropped:
- magic (L2_SWIMLANE_AICPU_PHASE_MAGIC): redundant — host now gates
on `num_phase_threads > 0` (zero-init means phase init never ran).
- records_per_thread: AICPU wrote it as PLATFORM_PHASE_RECORDS_PER_THREAD
but no caller ever read it — dead field.
Shared-memory layout after the merge:
[L2SwimlaneDataHeader (now includes phase metadata)]
[L2SwimlaneAicpuTaskPool × num_cores]
[L2SwimlaneAicoreTaskPool × num_cores]
[L2SwimlaneAicpuPhasePool × num_phase_threads] ← was preceded by header
`get_phase_header()` is deleted; `get_phase_buffer_states()` skips
straight from the AicoreTaskPool array to the phase pools.
AICPU collector keeps a separate `s_phase_initialized` bool so gated
paths can check init-ran without re-reading the device-shared header
on the hot path. Replaces the old `s_l2_swimlane_aicpu_phase_header
== nullptr` check.
Built atop hw-native-sys#939 (ActiveHead refactor).
Test plan:
- sim swimlane ST (test_l2_swimlane + test_l2_swimlane_mixed): pass
- sim DFX tests (scope_stats / tensor_dump / pmu / dep_gen): pass
- pre-commit clean
The standalone phase header was a vestige of when phase profiling was
an add-on bolted onto the shared-memory layout. Phase metadata is now
co-equal with task pool metadata, so the dedicated cache line + magic
gate + indirection are pure overhead.
Move the three live phase-header fields directly into the root header:
- num_sched_threads → num_phase_threads (renamed for clarity; it
counts phase pools, which equals sched_thread_num or
aicpu_thread_num depending on PTO2_ORCH_TO_SCHED)
- num_cores → num_phase_cores (disambiguate from the root header's
pre-existing num_cores — they have different semantics)
- core_to_thread[PLATFORM_MAX_CORES] — verbatim
Dropped:
- magic (L2_SWIMLANE_AICPU_PHASE_MAGIC): redundant — host now gates
on `num_phase_threads > 0` (zero-init means phase init never ran).
- records_per_thread: AICPU wrote it as PLATFORM_PHASE_RECORDS_PER_THREAD
but no caller ever read it — dead field.
Shared-memory layout after the merge:
[L2SwimlaneDataHeader (now includes phase metadata)]
[L2SwimlaneAicpuTaskPool × num_cores]
[L2SwimlaneAicoreTaskPool × num_cores]
[L2SwimlaneAicpuPhasePool × num_phase_threads] ← was preceded by header
`get_phase_header()` is deleted; `get_phase_buffer_states()` skips
straight from the AicoreTaskPool array to the phase pools.
AICPU collector keeps a separate `s_phase_initialized` bool so gated
paths can check init-ran without re-reading the device-shared header
on the hot path. Replaces the old `s_l2_swimlane_aicpu_phase_header
== nullptr` check.
Built atop hw-native-sys#939 (ActiveHead refactor).
Test plan:
- sim swimlane ST (test_l2_swimlane + test_l2_swimlane_mixed): pass
- sim DFX tests (scope_stats / tensor_dump / pmu / dep_gen): pass
- pre-commit clean
The standalone phase header was a vestige of when phase profiling was
an add-on bolted onto the shared-memory layout. Phase metadata is now
co-equal with task pool metadata, so the dedicated cache line + magic
gate + indirection are pure overhead.
Move the three live phase-header fields directly into the root header:
- num_sched_threads → num_phase_threads (renamed for clarity; it
counts phase pools, which equals sched_thread_num or
aicpu_thread_num depending on PTO2_ORCH_TO_SCHED)
- num_cores → num_phase_cores (disambiguate from the root header's
pre-existing num_cores — they have different semantics)
- core_to_thread[PLATFORM_MAX_CORES] — verbatim
Dropped:
- magic (L2_SWIMLANE_AICPU_PHASE_MAGIC): redundant — host now gates
on `num_phase_threads > 0` (zero-init means phase init never ran).
- records_per_thread: AICPU wrote it as PLATFORM_PHASE_RECORDS_PER_THREAD
but no caller ever read it — dead field.
Shared-memory layout after the merge:
[L2SwimlaneDataHeader (now includes phase metadata)]
[L2SwimlaneAicpuTaskPool × num_cores]
[L2SwimlaneAicoreTaskPool × num_cores]
[L2SwimlaneAicpuPhasePool × num_phase_threads] ← was preceded by header
`get_phase_header()` is deleted; `get_phase_buffer_states()` skips
straight from the AicoreTaskPool array to the phase pools.
AICPU collector keeps a separate `s_phase_initialized` bool so gated
paths can check init-ran without re-reading the device-shared header
on the hot path. Replaces the old `s_l2_swimlane_aicpu_phase_header
== nullptr` check.
Built atop hw-native-sys#939 (ActiveHead refactor).
Test plan:
- sim swimlane ST (test_l2_swimlane + test_l2_swimlane_mixed): pass
- sim DFX tests (scope_stats / tensor_dump / pmu / dep_gen): pass
- pre-commit clean
…941) The standalone phase header was a vestige of when phase profiling was an add-on bolted onto the shared-memory layout. Phase metadata is now co-equal with task pool metadata, so the dedicated cache line + magic gate + indirection are pure overhead. Move the three live phase-header fields directly into the root header: - num_sched_threads → num_phase_threads (renamed for clarity; it counts phase pools, which equals sched_thread_num or aicpu_thread_num depending on PTO2_ORCH_TO_SCHED) - num_cores → num_phase_cores (disambiguate from the root header's pre-existing num_cores — they have different semantics) - core_to_thread[PLATFORM_MAX_CORES] — verbatim Dropped: - magic (L2_SWIMLANE_AICPU_PHASE_MAGIC): redundant — host now gates on `num_phase_threads > 0` (zero-init means phase init never ran). - records_per_thread: AICPU wrote it as PLATFORM_PHASE_RECORDS_PER_THREAD but no caller ever read it — dead field. Shared-memory layout after the merge: [L2SwimlaneDataHeader (now includes phase metadata)] [L2SwimlaneAicpuTaskPool × num_cores] [L2SwimlaneAicoreTaskPool × num_cores] [L2SwimlaneAicpuPhasePool × num_phase_threads] ← was preceded by header `get_phase_header()` is deleted; `get_phase_buffer_states()` skips straight from the AicoreTaskPool array to the phase pools. AICPU collector keeps a separate `s_phase_initialized` bool so gated paths can check init-ran without re-reading the device-shared header on the hot path. Replaces the old `s_l2_swimlane_aicpu_phase_header == nullptr` check. Built atop #939 (ActiveHead refactor). Test plan: - sim swimlane ST (test_l2_swimlane + test_l2_swimlane_mixed): pass - sim DFX tests (scope_stats / tensor_dump / pmu / dep_gen): pass - pre-commit clean Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>
After #939 (pool unification), #941 (PhaseHeader merge), and #942 (split sched/orch phase records), several comments and doc sections still referenced the pre-split a2a3 layout. Audit and update: a2a3 code/comments: - platform_config.h: PROF_BUFFERS_PER_THREAD doc references both SchedPhaseBuffer and OrchPhaseBuffer (was: single PhaseBuffer); PROF_READYQUEUE_SIZE comment now says "four kinds"; formula bumped by 2x on the per-thread term to cover both sched and orch pool enqueues (matches host alloc which iterates both pool arrays). - l2_swimlane_profiling.h header layout diagram: name the two split phase-thread counts. - l2_swimlane_collector_aicpu.cpp: cross-launch reset comment now references s_sched_phase_pools / s_orch_phase_pools (was: single s_aicpu_phase_pools) and record_sched_phase / record_orch_phase. - scheduler_dispatch.cpp / aicpu_executor.cpp: comments reference the split record types. src/common/ shared comments (now mixed-arch): - profiler_base.h / buffer_pool_manager.h: qualify L2SwimlaneAicpuPhaseHeader::magic example as "on a5" since the struct no longer exists on a2a3. docs/dfx/l2-swimlane-profiling.md: - §5.1: layout block + record list now distinguish a2a3 split shape (SchedPhaseRecord 40B + OrchPhaseRecord 32B, two pool arrays) from a5's still-unified shape (pending port). - §5.2: a2a3 buffer-kind list updated to all four kinds (was: two); ASCII data-flow diagram redrawn to show split phase records; kBufferKinds = 4 in the L2SwimlaneModule trait description. - §5.3 (a5): num_phase_threads / core_to_thread[] reference corrected to live in L2SwimlaneAicpuPhaseHeader on a5 (was wrongly attributed to L2SwimlaneDataHeader). - §5.4: comparison table separates task record (identical) from phase record (diverged); ready-queue and kBufferKinds rows call out the a2a3=4 vs a5=2 split. - §6: overhead description differentiates a2a3's per-emit SchedPhase + per-submit OrchPhase from a5's unified PhaseRecord (was: "4 phases × 40B per iteration", which described a removed shape). - §8 FAQ: "phase records empty" entry gates a2a3 on num_{sched,orch}_phase_threads, a5 on PhaseHeader::magic. No semantic code changes except the READYQUEUE_SIZE formula bump (adds ~8KB to the header; necessary correctness fix given the second phase pool). Test plan: - pre-commit clean - onboard l2_swimlane STs (--enable-l2-swimlane --enable-dep-gen): 2 passed - onboard paged_attention_unroll level 4: 1 passed Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>
Summary
Three pool types (AICPU task, AICPU phase, AICore task) previously had ad-hoc layouts. AICPU/Phase pool was 192B (
free_queue + scattered counters + pad); AICore pool was 256B (rotation + free_queue + counters + pad).This PR extracts the common producer-side cache line as
L2SwimlaneActiveHead(64B:current_buf_ptr + current_buf_seq + total/dropped record counts). Every pool is now exactlyhead + free_queue = 192B. The standaloneL2SwimlaneAicoreRotationstruct is deleted — its old fields merge into the head'scurrent_buf_ptr+current_buf_seq(semantics identical: AICore reads seq to detect a rotation, then reads ptr to pick up the new buffer).Savings: 64B per AICore core (256→192). Removes the duplicated "buf_seq / generation" counter pair that was bumped twice per rotation.
Not a pure rename: aicore_rotate publish ordering strengthened
The new publish sequence in
aicore_rotateadds an explicitwmb()between the ptr write and the seq write:The old code had two adjacent volatile stores (
rotation.current_buf_ptrthenrotation.generation) under a single trailingwmb(). On ARM/aarch64's weak memory model this leaves room for AICore'sdcci+read to observe the new seq with a stale ptr. The added fence closes that window.The init-prime path uses the same publish pattern now (ptr → wmb → seq → wmb), for symmetry and future-proofing against changes that let AICore poll before the first dispatch.
Pre-existing bug fix: drop the pre-emptive
dropped += BUFFER_SIZEWhen
aicore_rotatefailed (free queue empty OR ready queue full) the old code pre-emptively bumpeddropped_record_count += PLATFORM_AICORE_BUFFER_SIZE. That over-counted whenever:collected += BUFFER_SIZEanddropped += BUFFER_SIZEfor the same records, breaking thecollected + dropped == totalreconcile invariant.Fix: remove the pre-emptive bump in both failure branches. The slot guard on AICore silently drops further records (which AICPU has no precise count of), and reconcile reports the gap as
silent_loss = total - collected - dropped. The ready-queue-full branch additionally now lets the run-end flush retry the enqueue (the host may have drained by then) rather than pre-counting the records as lost.A regression test exercising forced rotation failures is followed up in a separate issue (the trigger is hard to set up in the existing 5-task vector example).
Hot-path verification
dcci(head, SINGLE_CACHE_LINE)— head fits one cache line byalignas(64), so per-task cost is unchanged.offsetofstatic_asserts added on both pool types to lockhead@0/free_queue@64— silent layout drift would corrupt the AICore-readable head address the AICPU init publishes into the rotation table.A perf measurement (
paged_attention_unroll --enable-l2-swimlane 4vs main) is followed up in a separate issue to validate that the cache-line sharing of head + counters doesn't regress AICore dcci throughput.Renames (atomic with the layout change — no compat shim)
set_l2_swimlane_aicore_rotation_slotset_l2_swimlane_aicore_head_slotget_l2_swimlane_aicore_rotationget_l2_swimlane_aicore_headL2SwimlaneAicoreRotationL2SwimlaneActiveHeadL2SwimlaneAicoreLocalState::cached_generationcached_buf_seqKernelArgs::l2_swimlane_aicore_rotation_tablefield name preserved for ABI stability; its comment now notes that slots holdL2SwimlaneActiveHead*addresses.Init shift (semantically equivalent)
head.current_buf_seq = 0at init (was:current_buf_seq=0+rotation.generation=1separately).cached_buf_seqdefault isUINT32_MAXso the firstrecord_taskobserves a mismatch and loads the buffer.aicore_executor.cppsites that aggregate-initialize the local state explicitly passUINT32_MAXso the in-class default isn't shadowed.Doc consistency sweep
docs/dfx/l2-swimlane-profiling.md— replaceL2SwimlaneAicoreRotation/generationreferences withL2SwimlaneActiveHead/current_buf_seq.aicore_profiling_state.h/l2_swimlane_collector_aicpu.h/l2_swimlane_collector_aicore.h— same.l2_swimlane_collector_aicore.h::record_taskdocstring — clarify thatheadis lazy-resolved on the first task, not cached at kernel entry.l2_swimlane_profiling.hupdated to show the new layout.Test plan
pytest tests/st/.../dfx --platform a2a3sim --enable-l2-swimlanepytest tests/st/.../dfx/l2_swimlane --platform a2a3sim --enable-l2-swimlanepre-commit runclean on all touched filesaicore_rotatefailure paths (issue forthcoming)paged_attention_unrollto validate head+counter cache-line sharing (issue forthcoming)