Refactor: merge L2SwimlaneAicpuPhaseHeader into L2SwimlaneDataHeader by hw-native-sys-bot · Pull Request #941 · hw-native-sys/simpler

hw-native-sys-bot · 2026-05-31T07:15:29Z

Summary

The standalone L2SwimlaneAicpuPhaseHeader was a vestige of when phase profiling was an add-on bolted onto the shared-memory layout. Phase metadata is now co-equal with task pool metadata, so the dedicated cache line + magic gate + indirection are pure overhead.

This PR moves the three live phase-header fields directly into L2SwimlaneDataHeader:

| Old field (PhaseHeader) | New field (DataHeader) | Note |
|---|---|---|
| `num_sched_threads` | `num_phase_threads` | renamed for clarity; counts phase pools, which equals `sched_thread_num_` or `aicpu_thread_num_` depending on `PTO2_ORCH_TO_SCHED` |
| `num_cores` | `num_phase_cores` | disambiguated from root header's pre-existing `num_cores` (different semantics) |
| `core_to_thread[PLATFORM_MAX_CORES]` | `core_to_thread[PLATFORM_MAX_CORES]` | verbatim |

Dropped

magic (L2_SWIMLANE_AICPU_PHASE_MAGIC): redundant — host now gates on num_phase_threads > 0 (zero-init means phase init never ran).
records_per_thread: AICPU wrote it as PLATFORM_PHASE_RECORDS_PER_THREAD but no caller ever read it — dead field.
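
To make the field mapping and the new init-ran gate concrete, here is a minimal sketch of what the merged header could look like. The struct name, the field types, the field order, and the constant `kMaxCoresSketch` are all assumptions for illustration; only the three phase field names and the `num_phase_threads > 0` gate come from this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for PLATFORM_MAX_CORES; the real value lives in the
// platform config and is not stated in this PR.
constexpr std::size_t kMaxCoresSketch = 64;

// Sketch of the root header after the merge. Field types and ordering are
// assumptions; only the field names are taken from the PR description.
struct DataHeaderSketch {
    std::uint32_t num_cores;          // pre-existing root-header field
    std::uint32_t num_phase_threads;  // was PhaseHeader::num_sched_threads
    std::uint32_t num_phase_cores;    // was PhaseHeader::num_cores
    std::int8_t core_to_thread[kMaxCoresSketch];  // moved verbatim; -1 = unassigned
};

// Phase init as the AICPU side might perform it: publish the pool count and
// mark every core as unassigned.
void init_phase_fields(DataHeaderSketch &h, std::uint32_t phase_pools) {
    h.num_phase_threads = phase_pools;
    h.num_phase_cores = 0;
    std::memset(h.core_to_thread, -1, sizeof(h.core_to_thread));
}

// Host-side gate that replaces the magic check: a zero-initialized header
// means phase init never ran.
bool phase_init_ran(const DataHeaderSketch &h) {
    return h.num_phase_threads > 0;
}
```

The gate only works if the host zero-initializes the header before launch, which is the assumption the PR description states ("zero-init means phase init never ran").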

Shared-memory layout

[L2SwimlaneDataHeader (now includes phase metadata)]
[L2SwimlaneAicpuTaskPool × num_cores]
[L2SwimlaneAicoreTaskPool × num_cores]
[L2SwimlaneAicpuPhasePool × num_phase_threads]   ← was preceded by PhaseHeader

get_phase_header() is deleted. get_phase_buffer_states() skips straight from the AicoreTaskPool array to the phase pools.
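
The layout above implies a simple offset calculation for the phase pools once the intermediate header is gone. The following is a sketch only: the placeholder structs and their sizes are invented for illustration, and `phase_pool_offset`, `total_size`, and `phase_pool_at` are hypothetical names, not the accessors in `l2_swimlane_profiling.h`.

```cpp
#include <cassert>
#include <cstddef>

// Placeholder types with made-up sizes; the real structs are larger and
// defined in l2_swimlane_profiling.h.
struct HeaderSketch { alignas(64) unsigned char bytes[256]; };
struct AicpuTaskPoolSketch { alignas(64) unsigned char bytes[128]; };
struct AicoreTaskPoolSketch { alignas(64) unsigned char bytes[128]; };
struct AicpuPhasePoolSketch { alignas(64) unsigned char bytes[128]; };

// Byte offset of phase pool `t`: header, then both per-core task pool arrays,
// then the phase pools, with no separate phase header in between.
constexpr std::size_t phase_pool_offset(std::size_t num_cores, std::size_t t) {
    return sizeof(HeaderSketch)
         + num_cores * sizeof(AicpuTaskPoolSketch)
         + num_cores * sizeof(AicoreTaskPoolSketch)
         + t * sizeof(AicpuPhasePoolSketch);
}

// Total region size is the offset one past the last phase pool.
constexpr std::size_t total_size(std::size_t num_cores, std::size_t num_phase_threads) {
    return phase_pool_offset(num_cores, num_phase_threads);
}

// Pointer to phase pool `t` inside a mapped region starting at `base`.
AicpuPhasePoolSketch *phase_pool_at(void *base, std::size_t num_cores, std::size_t t) {
    auto *bytes = static_cast<unsigned char *>(base);
    return reinterpret_cast<AicpuPhasePoolSketch *>(bytes + phase_pool_offset(num_cores, t));
}
```

Before the merge, the equivalent calculation would have needed an extra `sizeof(PhaseHeader)` term between the AICore task pools and the first phase pool.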

AICPU collector init-ran gate

Phase-gated AICPU paths previously checked s_l2_swimlane_aicpu_phase_header == nullptr. After the merge they check a new s_phase_initialized static bool, so the hot path doesn't re-read the device-shared header just to test init-ran.
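
The gating pattern described here can be sketched as follows. This is an illustration under assumptions, not the collector's code: the `sketch` namespace, the one-field `Header`, and the `s_records_written` counter are hypothetical, and the reset in `init` reflects the concern raised in review rather than confirmed behavior.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {

// Stand-in for the device-shared header; only one field is modelled.
struct Header { std::uint32_t num_phase_threads; };

// File-local state, mirroring the statics named in the PR.
static Header *s_header = nullptr;
static bool s_phase_initialized = false;
static int s_records_written = 0;  // hypothetical counter, for the sketch only

// Base init: cache the header pointer and clear the phase flag so that a
// launch which skips init_phase() does not inherit stale state.
void init(Header *shared) {
    s_header = shared;
    s_phase_initialized = false;
}

// Phase init: publish the pool count into the shared header, then set the
// local flag.
void init_phase(std::uint32_t pools) {
    if (s_header == nullptr) return;
    s_header->num_phase_threads = pools;
    s_phase_initialized = true;
}

// Hot path: test a local bool instead of re-reading the shared header.
bool record_phase() {
    if (!s_phase_initialized) return false;
    ++s_records_written;
    return true;
}

}  // namespace sketch
```

The design choice is that the flag lives in AICPU-local storage, so the per-record check does not touch device-shared memory.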

Dependency on #939

This branch is stacked on top of refactor/swimlane-cache-line-blocks (#939). The PR's base is main (GitHub won't let us target a fork branch as base), so the current diff temporarily includes #939's 11 files on top of this PR's 5 files. After #939 merges, this PR's diff will auto-clean to just its own 5 files (≈75 added / 104 deleted). Review #939 first; only the 5 phase-header files are this PR's contribution:

src/a2a3/platform/include/common/l2_swimlane_profiling.h
src/a2a3/platform/include/aicpu/l2_swimlane_collector_aicpu.h
src/a2a3/platform/include/host/l2_swimlane_collector.h
src/a2a3/platform/src/aicpu/l2_swimlane_collector_aicpu.cpp
src/a2a3/platform/src/host/l2_swimlane_collector.cpp

Test plan

sim swimlane ST passes: pytest tests/st/.../dfx/l2_swimlane --platform a2a3sim --enable-l2-swimlane
sim DFX (scope_stats / tensor_dump / pmu / dep_gen) passes
pre-commit run clean
CI green (onboard + sim, a2a3)
a5 port: not in this PR (a5 has the same PhaseHeader pattern; will land via a separate a5 sweep)

coderabbitai · 2026-05-31T07:15:36Z

Walkthrough

Phase profiling metadata is consolidated from a standalone L2SwimlaneAicpuPhaseHeader into L2SwimlaneDataHeader across the swimlane collector and AICPU recording paths. AICPU init and core-assignment operations now update the shared header directly and gate on an s_phase_initialized flag. Host-side metadata reading sources from the same header location without magic validation, and memory layout calculations are updated accordingly.

Changes

Phase metadata structure consolidation

| Layer / File(s) | Summary |
|---|---|
| Phase metadata struct and layout `src/a2a3/platform/include/common/l2_swimlane_profiling.h` | `L2SwimlaneDataHeader` gains `num_phase_threads`, `num_phase_cores`, and `core_to_thread[]` fields; standalone `L2SwimlaneAicpuPhaseHeader` struct and `L2_SWIMLANE_AICPU_PHASE_MAGIC` constant are removed. Memory-size calculation and phase buffer accessors are updated to treat phase pools as immediately following the AICore task pool, eliminating intermediate phase-header offsets. |
| AICPU phase initialization and recording `src/a2a3/platform/src/aicpu/l2_swimlane_collector_aicpu.cpp`, `src/a2a3/platform/include/aicpu/l2_swimlane_collector_aicpu.h` | AICPU side introduces `s_phase_initialized` flag and cached `s_l2_swimlane_header` pointer, replacing cached phase-header pointer. `l2_swimlane_aicpu_init_phase` writes phase metadata directly into the shared header and marks initialization complete. Core-assignment functions write core mappings into the header. Phase recording functions gate on the initialized flag instead of header-pointer nullness. |
| Host-side phase metadata initialization and reading `src/a2a3/platform/src/host/l2_swimlane_collector.cpp`, `src/a2a3/platform/include/host/l2_swimlane_collector.h` | Host collector explicitly initializes phase metadata fields in `L2SwimlaneDataHeader` during setup. `read_phase_header_metadata()` sources metadata directly from the shared header instead of a separate phase header, removing magic-constant validation. `for_each_instance` gates on `num_phase_threads` from the main header, and core-to-thread mapping is derived from header-resident `num_phase_cores` and array. |
| Documentation updates `src/a2a3/platform/include/aicpu/l2_swimlane_collector_aicpu.h`, `src/a2a3/platform/include/host/l2_swimlane_collector.h`, `src/a2a3/platform/src/host/l2_swimlane_collector.cpp` | Comments across AICPU header, host collector header, and implementation clarify that phase metadata is now written to and read from `L2SwimlaneDataHeader` fields rather than a separate phase-header structure. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

hw-native-sys/simpler#939: Both PRs modify the shared swimlane profiling header (l2_swimlane_profiling.h). #939 ("Refactor: unify L2 swimlane pools around L2SwimlaneActiveHead cache line") refactors pool/layout structures, while this PR further consolidates phase metadata into the central header and updates size/accessor calculations.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and accurately summarizes the main refactoring: moving L2SwimlaneAicpuPhaseHeader fields into L2SwimlaneDataHeader. |
| Description check | ✅ Passed | The description thoroughly explains the refactoring rationale, field mappings, dropped fields, shared-memory layout changes, and gating logic changes, all directly relevant to the changeset. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

gemini-code-assist

Code Review

This pull request refactors the L2 swimlane profiling memory layout by merging L2SwimlaneAicpuPhaseHeader into L2SwimlaneDataHeader and introducing a unified L2SwimlaneActiveHead cache-line structure shared across all pool types. Feedback on these changes highlights three critical issues: first, consolidating record counters into the same cache line as the active buffer pointer and sequence number introduces hot-path cache bouncing between AICore and AICPU; second, the newly merged phase metadata fields in L2SwimlaneDataHeader are left uninitialized during collector initialization, risking garbage reads; and third, the static s_phase_initialized flag on AICPU is not reset between launches, which can leak state and cause undefined behavior.

The unified L2SwimlaneAicpuPhaseRecord (40B with a union and a phase_id discriminator) and the parse-time kAicpuOrchPhaseIdBase = 16 range gate were a vestige of when sched and orch records shared one path. Schema splits them cleanly into two record types, two BufferKinds, and two pool arrays: type-tagged at the device-side write, no parse-time discriminator on the host side.

Schema (header):
- L2SwimlaneSchedPhaseKind { Complete=0, Dispatch=1 }
- L2SwimlaneAicpuSchedPhaseRecord (40B): start_time, end_time, loop_iter, kind, tasks_processed (uint32), pop_hit, pop_miss, pad
- L2SwimlaneAicpuOrchPhaseRecord (32B): start_time, end_time, task_id, submit_idx, pad
- L2SwimlaneBufferKind: AicpuTask=0, AicpuSchedPhase=1, AicpuOrchPhase=2, AicoreTask=3 (AicoreTask shifted from 2 to 3 to accommodate the split)
- L2SwimlaneDataHeader carries num_sched_phase_threads + num_orch_phase_threads (replaces the single num_phase_threads).
- calc_perf_data_size_with_phases takes both counts; new get_sched_phase_buffer_state / get_orch_phase_buffer_state helpers replace get_phase_buffer_state.

Dropped (no compat layer):
- L2SwimlaneAicpuPhaseId enum, including legacy ids 2/3/16-24 that had been documented as "host parser maps to unknown".
- L2SwimlaneAicpuPhaseRecord / L2SwimlaneAicpuPhaseBuffer.
- kAicpuOrchPhaseIdBase / is_scheduler_phase routing.
- The unused `records_per_thread` PhaseHeader field never had a reader.

AICPU collector:
- s_aicpu_phase_pools[] split into s_sched_phase_pools[] + s_orch_phase_pools[]; same for current-buffer caches.
- l2_swimlane_aicpu_init_phase now takes (worker_count, num_sched_phase_threads, num_orch_phase_threads).
- record_phase split into record_sched_phase (with kind + pop_hit/pop_miss named, not extras) and record_orch_phase (task_id + submit_idx).
- switch_phase_buffer + acquire_phase_slot generalized into kind-parameterized templates shared by both pool types.
- flush_phase_buffers drains both pool arrays for the thread.

Host collector:
- collected_phase_records_ split into collected_sched_phase_records_ and collected_orch_phase_records_; total_phase_collected_ similarly split for clean reconcile per kind.
- copy_phase_buffer split into copy_sched_phase_buffer + copy_orch_phase_buffer.
- resolve_entry and for_each_instance route on the four BufferKinds; ProfBufferType mirrors.
- JSON emit unchanged on the wire: sched section still has "phase"/"loop_iter"/"tasks_processed"/"pop_hit"/"pop_miss"; orch section still has "phase"/"submit_idx"/"task_id"/timestamps (phase string is now hard-coded "orch_submit" since the type tag is the truth).

Scheduler call sites:
- scheduler_dispatch.cpp's three record_phase sites convert to record_sched_phase with L2SwimlaneSchedPhaseKind::Complete or ::Dispatch.

Orchestrator call site:
- pto_orchestrator.cpp CYCLE_COUNT_ORCH_SUBMIT_RECORD drops the L2SwimlaneAicpuPhaseId::ORCH_SUBMIT argument; the weak fallback signature is reduced to (uint64_t, uint64_t, uint64_t, uint32_t).

Header thread count (scheduler_cold_path.cpp):
- sched and orch pool counts computed independently:
    sched = orch_to_sched_ ? aicpu_thread_num_ : sched_thread_num_
    orch = orch_to_sched_ ? aicpu_thread_num_ : 1

Python tools:
- swimlane_converter.py and sched_overhead_analysis.py read the same field names; orch section's phase key now always equals "orch_submit" (was already the only value). No tool changes required.

Test plan:
- sim swimlane ST (test_l2_swimlane + test_l2_swimlane_mixed at --enable-l2-swimlane level 4): pass
- sim DFX (scope_stats / tensor_dump / pmu / dep_gen): pass
- pre-commit clean

Stacked on top of hw-native-sys#941 (PhaseHeader merge).
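
The split schema and the independent pool-count rule in this commit message can be sketched as below. The record sizes (40B and 32B), the enum values, and the two ternary expressions are taken from the message; the per-field integer widths, the `Sketch` type names, and the `phase_pool_counts` helper are assumptions chosen so that the stated sizes work out.

```cpp
#include <cassert>
#include <cstdint>

enum class SchedPhaseKind : std::uint32_t { Complete = 0, Dispatch = 1 };

enum class BufferKind : std::uint32_t {
    AicpuTask = 0, AicpuSchedPhase = 1, AicpuOrchPhase = 2, AicoreTask = 3
};

// Field widths are guesses that add up to the 40B stated in the message.
struct SchedPhaseRecordSketch {
    std::uint64_t start_time;
    std::uint64_t end_time;
    std::uint32_t loop_iter;
    SchedPhaseKind kind;
    std::uint32_t tasks_processed;
    std::uint32_t pop_hit;
    std::uint32_t pop_miss;
    std::uint32_t pad;
};
static_assert(sizeof(SchedPhaseRecordSketch) == 40, "sched record should be 40B");

// Field widths are guesses that add up to the 32B stated in the message.
struct OrchPhaseRecordSketch {
    std::uint64_t start_time;
    std::uint64_t end_time;
    std::uint64_t task_id;
    std::uint32_t submit_idx;
    std::uint32_t pad;
};
static_assert(sizeof(OrchPhaseRecordSketch) == 32, "orch record should be 32B");

struct PhasePoolCounts { int sched; int orch; };

// Pool counts computed independently, following the two expressions quoted
// from scheduler_cold_path.cpp.
constexpr PhasePoolCounts phase_pool_counts(bool orch_to_sched, int aicpu_thread_num,
                                            int sched_thread_num) {
    return orch_to_sched ? PhasePoolCounts{aicpu_thread_num, aicpu_thread_num}
                         : PhasePoolCounts{sched_thread_num, 1};
}
```

Because each pool array holds a single record type, the host can pick the parser from the buffer kind alone, which is why the `phase_id` discriminator becomes unnecessary.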

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)

src/a2a3/platform/src/aicpu/l2_swimlane_collector_aicpu.cpp (2)

602-614: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Publish the actual number of initialized phase pools.

Line 602 stores num_sched_threads, but Lines 609-614 initialize num_sched_threads + 1 pools. The host now uses header->num_phase_threads to enumerate phase pools, so the orchestrator pool is skipped for replenish/drain when it has its own slot.

Suggested fix

-    s_l2_swimlane_header->num_phase_threads = num_sched_threads;
-    s_l2_swimlane_header->num_phase_cores = 0;
-    memset(s_l2_swimlane_header->core_to_thread, -1, sizeof(s_l2_swimlane_header->core_to_thread));
-    s_phase_initialized = true;
-
-    // Cache per-thread record pointers and clear buffers
-    // Include all threads: scheduler + orchestrator (orchestrators may become schedulers)
     int total_threads = num_sched_threads + 1;
     if (total_threads > PLATFORM_MAX_AICPU_THREADS) {
         total_threads = PLATFORM_MAX_AICPU_THREADS;
     }
+    s_l2_swimlane_header->num_phase_threads = static_cast<uint32_t>(total_threads);
+    s_l2_swimlane_header->num_phase_cores = 0;
+    memset(s_l2_swimlane_header->core_to_thread, -1, sizeof(s_l2_swimlane_header->core_to_thread));
+    s_phase_initialized = true;
+
+    // Cache per-thread record pointers and clear buffers
+    // Include all threads: scheduler + orchestrator (orchestrators may become schedulers)

133-146: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reset phase-local singleton state at the start of l2_swimlane_aicpu_init().

These statics survive across runs and are never cleared anywhere in this file. After one phase-enabled session, a later session that skips l2_swimlane_aicpu_init_phase() can still leave s_phase_initialized == true with stale s_aicpu_phase_pools / s_current_aicpu_phase_buffers pointing at the previous shared-memory region.

Suggested fix

 void l2_swimlane_aicpu_init(int worker_count) {
     void *l2_swimlane_base = reinterpret_cast<void *>(g_platform_l2_swimlane_base);
     if (l2_swimlane_base == nullptr) {
         LOG_ERROR("l2_swimlane_data_base is NULL, cannot initialize profiling");
         return;
     }
 
+    s_phase_initialized = false;
+    s_orch_thread_idx = -1;
+    memset(s_aicpu_phase_pools, 0, sizeof(s_aicpu_phase_pools));
+    memset(s_current_aicpu_phase_buffers, 0, sizeof(s_current_aicpu_phase_buffers));
+
     s_l2_swimlane_header = get_l2_swimlane_header(l2_swimlane_base);

src/a2a3/platform/include/host/l2_swimlane_collector.h (1)

182-189: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate num_phase_threads before iterating phase pools.

This loop now trusts a device-written header field directly. If that count is stale or oversized, get_phase_buffer_state(shm, num_cores, t) walks past the allocated phase-pool tail on the host.

Suggested fix

-        const int num_phase_threads = static_cast<int>(header->num_phase_threads);
+        int num_phase_threads = static_cast<int>(header->num_phase_threads);
+        if (num_phase_threads < 0 || num_phase_threads > PLATFORM_MAX_AICPU_THREADS) {
+            LOG_ERROR(
+                "L2SwimlaneModule: invalid num_phase_threads=%d (max=%d)",
+                num_phase_threads, PLATFORM_MAX_AICPU_THREADS
+            );
+            num_phase_threads = PLATFORM_MAX_AICPU_THREADS;
+        }
         for (int t = 0; t < num_phase_threads; t++) {
             L2SwimlaneAicpuPhasePool *state = get_phase_buffer_state(shm, num_cores, t);
             cb(/*kind=*/1, &state->free_queue, sizeof(L2SwimlaneAicpuPhaseBuffer));
         }
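
One possible variant of this guard keeps the comparison unsigned and clamps to the number of pools the host actually allocated, rather than to a platform-wide maximum. This is a sketch under assumptions: `clamp_phase_threads` and `for_each_phase_pool` are hypothetical helpers, and `allocated` stands for whatever count the host used when sizing the shared-memory region.

```cpp
#include <cassert>
#include <cstdint>

// Clamp a device-published count to the number of pools the host allocated.
// Comparing as unsigned avoids a huge value turning negative after a cast.
constexpr std::uint32_t clamp_phase_threads(std::uint32_t published, std::uint32_t allocated) {
    return published > allocated ? allocated : published;
}

// Visit each phase pool index that is safe to touch; returns how many were
// visited so the caller can log a mismatch with the published count.
template <typename Fn>
std::uint32_t for_each_phase_pool(std::uint32_t published, std::uint32_t allocated, Fn &&visit) {
    const std::uint32_t n = clamp_phase_threads(published, allocated);
    for (std::uint32_t t = 0; t < n; ++t) {
        visit(t);
    }
    return n;
}
```

A caller would compare the return value with the published count and log when they differ, so a stale or oversized header value is reported instead of walked.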

📥 Commits

Reviewing files that changed from the base of the PR and between dc83d5a and b0b32d6.

📒 Files selected for processing (5)

src/a2a3/platform/include/aicpu/l2_swimlane_collector_aicpu.h
src/a2a3/platform/include/common/l2_swimlane_profiling.h
src/a2a3/platform/include/host/l2_swimlane_collector.h
src/a2a3/platform/src/aicpu/l2_swimlane_collector_aicpu.cpp
src/a2a3/platform/src/host/l2_swimlane_collector.cpp

The standalone phase header was a vestige of when phase profiling was an add-on bolted onto the shared-memory layout. Phase metadata is now co-equal with task pool metadata, so the dedicated cache line + magic gate + indirection are pure overhead.

Move the three live phase-header fields directly into the root header:
- num_sched_threads → num_phase_threads (renamed for clarity; it counts phase pools, which equals sched_thread_num or aicpu_thread_num depending on PTO2_ORCH_TO_SCHED)
- num_cores → num_phase_cores (disambiguate from the root header's pre-existing num_cores; they have different semantics)
- core_to_thread[PLATFORM_MAX_CORES]: verbatim

Dropped:
- magic (L2_SWIMLANE_AICPU_PHASE_MAGIC): redundant; host now gates on `num_phase_threads > 0` (zero-init means phase init never ran).
- records_per_thread: AICPU wrote it as PLATFORM_PHASE_RECORDS_PER_THREAD but no caller ever read it (dead field).

Shared-memory layout after the merge:
  [L2SwimlaneDataHeader (now includes phase metadata)]
  [L2SwimlaneAicpuTaskPool × num_cores]
  [L2SwimlaneAicoreTaskPool × num_cores]
  [L2SwimlaneAicpuPhasePool × num_phase_threads]   ← was preceded by header

`get_phase_header()` is deleted; `get_phase_buffer_states()` skips straight from the AicoreTaskPool array to the phase pools.

AICPU collector keeps a separate `s_phase_initialized` bool so gated paths can check init-ran without re-reading the device-shared header on the hot path. Replaces the old `s_l2_swimlane_aicpu_phase_header == nullptr` check.

Built atop hw-native-sys#939 (ActiveHead refactor).

Test plan:
- sim swimlane ST (test_l2_swimlane + test_l2_swimlane_mixed): pass
- sim DFX tests (scope_stats / tensor_dump / pmu / dep_gen): pass
- pre-commit clean

The unified L2SwimlaneAicpuPhaseRecord (40B with a union and a phase_id discriminator) and the parse-time kAicpuOrchPhaseIdBase = 16 range gate were a vestige of when sched and orch records shared one path. Schema splits them cleanly into two record types, two BufferKinds, and two pool arrays: type-tagged at the device-side write, no parse-time discriminator on the host side.

Schema (header):
- L2SwimlaneSchedPhaseKind { Complete=0, Dispatch=1 }
- L2SwimlaneAicpuSchedPhaseRecord (40B): start_time, end_time, loop_iter, kind, tasks_processed (uint32), pop_hit, pop_miss, pad
- L2SwimlaneAicpuOrchPhaseRecord (32B): start_time, end_time, task_id, submit_idx, pad
- L2SwimlaneBufferKind: AicpuTask=0, AicpuSchedPhase=1, AicpuOrchPhase=2, AicoreTask=3 (AicoreTask shifted from 2 to 3 to accommodate the split)
- L2SwimlaneDataHeader carries num_sched_phase_threads + num_orch_phase_threads (replaces the single num_phase_threads).
- calc_perf_data_size_with_phases takes both counts; new get_sched_phase_buffer_state / get_orch_phase_buffer_state helpers replace get_phase_buffer_state.

Dropped (no compat layer):
- L2SwimlaneAicpuPhaseId enum, including legacy ids 2/3/16-24 that had been documented as "host parser maps to unknown".
- L2SwimlaneAicpuPhaseRecord / L2SwimlaneAicpuPhaseBuffer.
- kAicpuOrchPhaseIdBase / is_scheduler_phase routing.
- The unused `records_per_thread` PhaseHeader field never had a reader.

AICPU collector:
- s_aicpu_phase_pools[] split into s_sched_phase_pools[] + s_orch_phase_pools[]; same for current-buffer caches.
- l2_swimlane_aicpu_init_phase now takes (worker_count, num_sched_phase_threads, num_orch_phase_threads).
- record_phase split into record_sched_phase (with kind + pop_hit/pop_miss named, not extras) and record_orch_phase (task_id + submit_idx).
- switch_phase_buffer + acquire_phase_slot generalized into kind-parameterized templates shared by both pool types.
- flush_phase_buffers drains both pool arrays for the thread.

Host collector:
- collected_phase_records_ split into collected_sched_phase_records_ and collected_orch_phase_records_; total_phase_collected_ similarly split for clean reconcile per kind.
- copy_phase_buffer split into copy_sched_phase_buffer + copy_orch_phase_buffer.
- resolve_entry and for_each_instance route on the four BufferKinds; ProfBufferType mirrors.
- JSON emit unchanged on the wire: sched section still has "phase"/"loop_iter"/"tasks_processed"/"pop_hit"/"pop_miss"; orch section still has "phase"/"submit_idx"/"task_id"/timestamps (phase string is now hard-coded "orch_submit" since the type tag is the truth).

Scheduler call sites:
- scheduler_dispatch.cpp's three record_phase sites convert to record_sched_phase with L2SwimlaneSchedPhaseKind::Complete or ::Dispatch.

Orchestrator call site:
- pto_orchestrator.cpp CYCLE_COUNT_ORCH_SUBMIT_RECORD drops the L2SwimlaneAicpuPhaseId::ORCH_SUBMIT argument; the weak fallback signature is reduced to (uint64_t, uint64_t, uint64_t, uint32_t).

Header thread count (scheduler_cold_path.cpp):
- sched and orch pool counts computed independently:
    sched = orch_to_sched_ ? aicpu_thread_num_ : sched_thread_num_
    orch = orch_to_sched_ ? aicpu_thread_num_ : 1

Python tools:
- swimlane_converter.py and sched_overhead_analysis.py read the same field names; orch section's phase key now always equals "orch_submit" (was already the only value). No tool changes required.

Test plan:
- sim swimlane ST (test_l2_swimlane + test_l2_swimlane_mixed at --enable-l2-swimlane level 4): pass
- sim DFX (scope_stats / tensor_dump / pmu / dep_gen): pass
- pre-commit clean

Stacked on top of #941 (PhaseHeader merge).

Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>
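To make the split schema described in the commit message above concrete, here is a minimal sketch of the two record types and the buffer-kind enum. The type names, field names, enum values, and the 40B/32B totals come from the commit message. The individual field widths (other than `tasks_processed` being uint32) and the enums' underlying types are assumptions chosen so the sizes add up; the real header may differ.

```cpp
#include <cstdint>

// Enum values as listed in the commit message; uint32_t underlying type is assumed.
enum class L2SwimlaneSchedPhaseKind : uint32_t { Complete = 0, Dispatch = 1 };

enum class L2SwimlaneBufferKind : uint32_t {
    AicpuTask = 0,
    AicpuSchedPhase = 1,
    AicpuOrchPhase = 2,
    AicoreTask = 3,  // shifted from 2 to 3 to make room for the split
};

// Sched phase record: 40 bytes. Field widths are assumed (2 x u64 + 6 x u32).
struct L2SwimlaneAicpuSchedPhaseRecord {
    uint64_t start_time;
    uint64_t end_time;
    uint32_t loop_iter;
    L2SwimlaneSchedPhaseKind kind;
    uint32_t tasks_processed;
    uint32_t pop_hit;
    uint32_t pop_miss;
    uint32_t pad;
};
static_assert(sizeof(L2SwimlaneAicpuSchedPhaseRecord) == 40, "sched record must be 40B");

// Orch phase record: 32 bytes. Field widths are assumed (3 x u64 + 2 x u32),
// which lines up with the (uint64_t, uint64_t, uint64_t, uint32_t) fallback signature.
struct L2SwimlaneAicpuOrchPhaseRecord {
    uint64_t start_time;
    uint64_t end_time;
    uint64_t task_id;
    uint32_t submit_idx;
    uint32_t pad;
};
static_assert(sizeof(L2SwimlaneAicpuOrchPhaseRecord) == 32, "orch record must be 32B");
```

Because each record type lives in its own pool array and carries its own buffer kind, the host can pick the right struct from the kind alone, with no `phase_id` range check at parse time.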

After #939 (pool unification), #941 (PhaseHeader merge), and #942 (split sched/orch phase records), several comments and doc sections still referenced the pre-split a2a3 layout. Audit and update:

a2a3 code/comments:
- platform_config.h: PROF_BUFFERS_PER_THREAD doc references both SchedPhaseBuffer and OrchPhaseBuffer (was: single PhaseBuffer); PROF_READYQUEUE_SIZE comment now says "four kinds"; formula bumped by 2x on the per-thread term to cover both sched and orch pool enqueues (matches host alloc which iterates both pool arrays).
- l2_swimlane_profiling.h header layout diagram: name the two split phase-thread counts.
- l2_swimlane_collector_aicpu.cpp: cross-launch reset comment now references s_sched_phase_pools / s_orch_phase_pools (was: single s_aicpu_phase_pools) and record_sched_phase / record_orch_phase.
- scheduler_dispatch.cpp / aicpu_executor.cpp: comments reference the split record types.

src/common/ shared comments (now mixed-arch):
- profiler_base.h / buffer_pool_manager.h: qualify L2SwimlaneAicpuPhaseHeader::magic example as "on a5" since the struct no longer exists on a2a3.

docs/dfx/l2-swimlane-profiling.md:
- §5.1: layout block + record list now distinguish a2a3 split shape (SchedPhaseRecord 40B + OrchPhaseRecord 32B, two pool arrays) from a5's still-unified shape (pending port).
- §5.2: a2a3 buffer-kind list updated to all four kinds (was: two); ASCII data-flow diagram redrawn to show split phase records; kBufferKinds = 4 in the L2SwimlaneModule trait description.
- §5.3 (a5): num_phase_threads / core_to_thread[] reference corrected to live in L2SwimlaneAicpuPhaseHeader on a5 (was wrongly attributed to L2SwimlaneDataHeader).
- §5.4: comparison table separates task record (identical) from phase record (diverged); ready-queue and kBufferKinds rows call out the a2a3=4 vs a5=2 split.
- §6: overhead description differentiates a2a3's per-emit SchedPhase + per-submit OrchPhase from a5's unified PhaseRecord (was: "4 phases × 40B per iteration", which described a removed shape).
- §8 FAQ: "phase records empty" entry gates a2a3 on num_{sched,orch}_phase_threads, a5 on PhaseHeader::magic.

No semantic code changes except the READYQUEUE_SIZE formula bump (adds ~8KB to the header; necessary correctness fix given the second phase pool).

Test plan:
- pre-commit clean
- onboard l2_swimlane STs (--enable-l2-swimlane --enable-dep-gen): 2 passed
- onboard paged_attention_unroll level 4: 1 passed

Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>
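The §8 FAQ item above says that on a2a3 the "phase records empty" case is gated on `num_{sched,orch}_phase_threads`. A rough sketch of what such a host-side check could look like follows. `PhaseCountsView` and both helper functions are hypothetical stand-ins for illustration only; the real L2SwimlaneDataHeader has more fields, and the actual host code may structure the check differently.

```cpp
#include <cstdint>

// Hypothetical view of just the two counts carried by the data header on a2a3.
struct PhaseCountsView {
    uint32_t num_sched_phase_threads;
    uint32_t num_orch_phase_threads;
};

// A zero count is read as "phase init never ran for this kind", relying on the
// header being zero-initialized before the device writes it.
constexpr bool sched_phase_init_ran(const PhaseCountsView &h) {
    return h.num_sched_phase_threads > 0;
}

constexpr bool orch_phase_init_ran(const PhaseCountsView &h) {
    return h.num_orch_phase_threads > 0;
}
```

Checking each count separately matches the per-kind split: a run could, in principle, have sched pools but report no orch records, and the two conditions would then be diagnosed independently.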

gemini-code-assist Bot reviewed May 31, 2026

Comment thread src/a2a3/platform/include/common/l2_swimlane_profiling.h

Comment thread src/a2a3/platform/src/host/l2_swimlane_collector.cpp

Comment thread src/a2a3/platform/src/aicpu/l2_swimlane_collector_aicpu.cpp

ChaoWao force-pushed the refactor/swimlane-merge-phase-header branch 2 times, most recently from 7dc8958 to 2474465, May 31, 2026 07:30

hw-native-sys-bot mentioned this pull request May 31, 2026

Refactor: split L2 swimlane phase records into sched + orch types #942

Merged

5 tasks

ChaoWao force-pushed the refactor/swimlane-merge-phase-header branch from 2474465 to 84f46b1, May 31, 2026 08:18

ChaoWao force-pushed the refactor/swimlane-merge-phase-header branch from 84f46b1 to b0b32d6, May 31, 2026 09:02

coderabbitai Bot reviewed May 31, 2026

ChaoWao force-pushed the refactor/swimlane-merge-phase-header branch from b0b32d6 to e6de52f, May 31, 2026 09:54

ChaoWao force-pushed the refactor/swimlane-merge-phase-header branch from e6de52f to a0e8d08, May 31, 2026 10:18

ChaoWao approved these changes May 31, 2026

ChaoWao merged commit 0331dcc into hw-native-sys:main May 31, 2026
15 checks passed

ChaoWao deleted the refactor/swimlane-merge-phase-header branch May 31, 2026 10:46

hw-native-sys-bot mentioned this pull request May 31, 2026

Doc: sync L2 swimlane refs to post-split layout #946

Merged

3 tasks

ChaoWao merged 1 commit into hw-native-sys:main from hw-native-sys-bot:refactor/swimlane-merge-phase-header
