
Refactor: merge L2SwimlaneAicpuPhaseHeader into L2SwimlaneDataHeader#941

Merged
ChaoWao merged 1 commit into
hw-native-sys:mainfrom
hw-native-sys-bot:refactor/swimlane-merge-phase-header
May 31, 2026


@hw-native-sys-bot hw-native-sys-bot commented May 31, 2026

Summary

The standalone L2SwimlaneAicpuPhaseHeader was a vestige of when phase profiling was an add-on bolted onto the shared-memory layout. Phase metadata is now co-equal with task pool metadata, so the dedicated cache line + magic gate + indirection are pure overhead.

This PR moves the three live phase-header fields directly into L2SwimlaneDataHeader:

| Old field (PhaseHeader) | New field (DataHeader) | Note |
| --- | --- | --- |
| num_sched_threads | num_phase_threads | renamed for clarity; counts phase pools; equals sched_thread_num_ or aicpu_thread_num_ depending on PTO2_ORCH_TO_SCHED |
| num_cores | num_phase_cores | disambiguated from root header's pre-existing num_cores (different semantics) |
| core_to_thread[PLATFORM_MAX_CORES] | core_to_thread[PLATFORM_MAX_CORES] | verbatim |
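
The field mapping above can be pictured with a minimal sketch. It is not the real struct: the type names, the array bound `kMaxCoresSketch`, and the `int8_t` element type are assumptions for illustration; only the three phase field names come from this PR.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in; the real PLATFORM_MAX_CORES value is not shown in this PR.
constexpr int kMaxCoresSketch = 8;

// Sketch of the merged header. Only the three phase fields are taken from the PR text.
struct DataHeaderSketch {
    uint32_t num_cores;                      // root header's pre-existing field (different semantics)
    uint32_t num_phase_threads;              // was PhaseHeader::num_sched_threads
    uint32_t num_phase_cores;                // was PhaseHeader::num_cores
    int8_t core_to_thread[kMaxCoresSketch];  // moved verbatim; element type is assumed
};

// Mimics phase init: publish the pool count and mark every core as unassigned (-1).
inline void init_phase_fields(DataHeaderSketch &h, uint32_t phase_threads) {
    h.num_phase_threads = phase_threads;
    h.num_phase_cores = 0;
    std::memset(h.core_to_thread, -1, sizeof(h.core_to_thread));
}
```

A zero-initialized `DataHeaderSketch` has `num_phase_threads == 0`, which is the state the host treats as "phase init never ran".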

Dropped

  • magic (L2_SWIMLANE_AICPU_PHASE_MAGIC): redundant — host now gates on num_phase_threads > 0 (zero-init means phase init never ran).
  • records_per_thread: AICPU wrote it as PLATFORM_PHASE_RECORDS_PER_THREAD but no caller ever read it — dead field.
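
A minimal sketch of the replacement gate, assuming the host holds a zero-initialized header before the device runs. `HeaderGateSketch` and `phase_init_ran` are hypothetical names, not identifiers from the codebase.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical reduced view of the header; only the gating field is modeled.
struct HeaderGateSketch {
    uint32_t num_phase_threads;
};

// With the magic constant gone, a non-zero pool count is the only
// evidence that the AICPU phase init ran.
inline bool phase_init_ran(const HeaderGateSketch &h) {
    return h.num_phase_threads > 0;
}
```

This relies on the header being zeroed before launch; an uninitialized header would defeat the gate, which is the concern raised in review further down.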

Shared-memory layout

[L2SwimlaneDataHeader (now includes phase metadata)]
[L2SwimlaneAicpuTaskPool × num_cores]
[L2SwimlaneAicoreTaskPool × num_cores]
[L2SwimlaneAicpuPhasePool × num_phase_threads]   ← was preceded by PhaseHeader

get_phase_header() is deleted. get_phase_buffer_states() skips straight from the AicoreTaskPool array to the phase pools.
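
The offset arithmetic implied by this layout can be sketched as follows. The struct, the function names, and the per-element byte sizes are placeholders for illustration; the real sizes come from the actual pool types and are not stated in this PR.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical per-element sizes in bytes; the real values come from sizeof() on the pool types.
struct LayoutSketch {
    std::size_t header_bytes;
    std::size_t aicpu_task_pool_bytes;
    std::size_t aicore_task_pool_bytes;
    std::size_t phase_pool_bytes;
};

// Byte offset of phase pool `thread_idx`: header, then both per-core pool
// arrays, then straight into the phase pools with no intermediate PhaseHeader.
inline std::size_t phase_pool_offset(const LayoutSketch &l, std::size_t num_cores, std::size_t thread_idx) {
    return l.header_bytes + num_cores * l.aicpu_task_pool_bytes + num_cores * l.aicore_task_pool_bytes +
           thread_idx * l.phase_pool_bytes;
}

// Total region size is the offset one past the last phase pool.
inline std::size_t total_layout_bytes(const LayoutSketch &l, std::size_t num_cores, std::size_t num_phase_threads) {
    return phase_pool_offset(l, num_cores, num_phase_threads);
}
```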

AICPU collector init-ran gate

Phase-gated AICPU paths previously checked s_l2_swimlane_aicpu_phase_header == nullptr. After the merge they check a new s_phase_initialized static bool, so the hot path doesn't re-read the device-shared header just to test init-ran.
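
A minimal sketch of the flag-based gate, assuming a single translation unit. All names are hypothetical; the explicit reset reflects the reviewers' concern about state surviving across launches rather than anything confirmed in the merged code.

```cpp
#include <cassert>
#include <cstdint>

namespace {
// Local flag: the hot path tests this instead of re-reading device-shared memory.
bool s_sketch_phase_initialized = false;
uint32_t s_sketch_records = 0;

// Called at the start of every launch so a previous run cannot leak state (assumed fix).
void sketch_reset() {
    s_sketch_phase_initialized = false;
    s_sketch_records = 0;
}

void sketch_init_phase() { s_sketch_phase_initialized = true; }

// Returns false and records nothing when phase init has not run.
bool sketch_record_phase() {
    if (!s_sketch_phase_initialized) {
        return false;
    }
    ++s_sketch_records;
    return true;
}
}  // namespace
```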

Dependency on #939

This branch is stacked on top of refactor/swimlane-cache-line-blocks (#939). The PR's base is main (GitHub won't let us target a fork branch as base), so the current diff temporarily includes #939's 11 files in addition to this PR's 5 files. After #939 merges, the diff will shrink automatically to just this PR's 5 files (≈75 added / 104 deleted). Review #939 first; only the 5 phase-header files below are this PR's contribution:

  • src/a2a3/platform/include/common/l2_swimlane_profiling.h
  • src/a2a3/platform/include/aicpu/l2_swimlane_collector_aicpu.h
  • src/a2a3/platform/include/host/l2_swimlane_collector.h
  • src/a2a3/platform/src/aicpu/l2_swimlane_collector_aicpu.cpp
  • src/a2a3/platform/src/host/l2_swimlane_collector.cpp

Test plan

  • sim swimlane ST passes: pytest tests/st/.../dfx/l2_swimlane --platform a2a3sim --enable-l2-swimlane
  • sim DFX (scope_stats / tensor_dump / pmu / dep_gen) passes
  • pre-commit run clean
  • CI green (onboard + sim, a2a3)
  • a5 port: not in this PR (a5 has the same PhaseHeader pattern; will land via a separate a5 sweep)

coderabbitai Bot commented May 31, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


Walkthrough

Phase profiling metadata is consolidated from a standalone L2SwimlaneAicpuPhaseHeader into L2SwimlaneDataHeader across the swimlane collector and AICPU recording paths. AICPU init and core-assignment operations now update the shared header directly and gate on an s_phase_initialized flag. Host-side metadata reading sources from the same header location without magic validation, and memory layout calculations are updated accordingly.

Changes

Phase metadata structure consolidation

| Layer | File(s) | Summary |
| --- | --- | --- |
| Phase metadata struct and layout | src/a2a3/platform/include/common/l2_swimlane_profiling.h | L2SwimlaneDataHeader gains num_phase_threads, num_phase_cores, and core_to_thread[] fields; standalone L2SwimlaneAicpuPhaseHeader struct and L2_SWIMLANE_AICPU_PHASE_MAGIC constant are removed. Memory-size calculation and phase buffer accessors are updated to treat phase pools as immediately following the AICore task pool, eliminating intermediate phase-header offsets. |
| AICPU phase initialization and recording | src/a2a3/platform/src/aicpu/l2_swimlane_collector_aicpu.cpp, src/a2a3/platform/include/aicpu/l2_swimlane_collector_aicpu.h | AICPU side introduces s_phase_initialized flag and cached s_l2_swimlane_header pointer, replacing cached phase-header pointer. l2_swimlane_aicpu_init_phase writes phase metadata directly into the shared header and marks initialization complete. Core-assignment functions write core mappings into the header. Phase recording functions gate on the initialized flag instead of header-pointer nullness. |
| Host-side phase metadata initialization and reading | src/a2a3/platform/src/host/l2_swimlane_collector.cpp, src/a2a3/platform/include/host/l2_swimlane_collector.h | Host collector explicitly initializes phase metadata fields in L2SwimlaneDataHeader during setup. read_phase_header_metadata() sources metadata directly from the shared header instead of a separate phase header, removing magic-constant validation. for_each_instance gates on num_phase_threads from the main header, and core-to-thread mapping is derived from header-resident num_phase_cores and array. |
| Documentation updates | src/a2a3/platform/include/aicpu/l2_swimlane_collector_aicpu.h, src/a2a3/platform/include/host/l2_swimlane_collector.h, src/a2a3/platform/src/host/l2_swimlane_collector.cpp | Comments across AICPU header, host collector header, and implementation clarify that phase metadata is now written to and read from L2SwimlaneDataHeader fields rather than a separate phase-header structure. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰 A rabbit hops through swimlane streams,
Phase headers merge like gentle dreams,
One central home for metadata's tale,
No magic needed—the truth will not fail!
Simpler, swifter, cleaner by design!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and accurately summarizes the main refactoring: moving L2SwimlaneAicpuPhaseHeader fields into L2SwimlaneDataHeader. |
| Description check | ✅ Passed | The description thoroughly explains the refactoring rationale, field mappings, dropped fields, shared-memory layout changes, and gating logic changes, all directly relevant to the changeset. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request refactors the L2 swimlane profiling memory layout by merging L2SwimlaneAicpuPhaseHeader into L2SwimlaneDataHeader and introducing a unified L2SwimlaneActiveHead cache-line structure shared across all pool types. Feedback on these changes highlights three critical issues: first, consolidating record counters into the same cache line as the active buffer pointer and sequence number introduces hot-path cache bouncing between AICore and AICPU; second, the newly merged phase metadata fields in L2SwimlaneDataHeader are left uninitialized during collector initialization, risking garbage reads; and third, the static s_phase_initialized flag on AICPU is not reset between launches, which can leak state and cause undefined behavior.

Comment thread src/a2a3/platform/include/common/l2_swimlane_profiling.h
Comment thread src/a2a3/platform/src/host/l2_swimlane_collector.cpp
Comment thread src/a2a3/platform/src/aicpu/l2_swimlane_collector_aicpu.cpp
@ChaoWao ChaoWao force-pushed the refactor/swimlane-merge-phase-header branch 2 times, most recently from 7dc8958 to 2474465 Compare May 31, 2026 07:30
@ChaoWao ChaoWao force-pushed the refactor/swimlane-merge-phase-header branch from 2474465 to 84f46b1 Compare May 31, 2026 08:18
ChaoWao added a commit to hw-native-sys-bot/simpler that referenced this pull request May 31, 2026
The unified L2SwimlaneAicpuPhaseRecord (40B with a union and a phase_id
discriminator) and the parse-time kAicpuOrchPhaseIdBase = 16 range gate
were a vestige of when sched and orch records shared one path. Schema
splits them cleanly into two record types, two BufferKinds, and two
pool arrays — type-tagged at the device-side write, no parse-time
discriminator on the host side.

Schema (header):
  - L2SwimlaneSchedPhaseKind { Complete=0, Dispatch=1 }
  - L2SwimlaneAicpuSchedPhaseRecord (40B):
      start_time, end_time, loop_iter, kind, tasks_processed (uint32),
      pop_hit, pop_miss, pad
  - L2SwimlaneAicpuOrchPhaseRecord (32B):
      start_time, end_time, task_id, submit_idx, pad
  - L2SwimlaneBufferKind:
      AicpuTask=0, AicpuSchedPhase=1, AicpuOrchPhase=2, AicoreTask=3
      (AicoreTask shifted from 2 to 3 to accommodate the split)
  - L2SwimlaneDataHeader carries num_sched_phase_threads +
    num_orch_phase_threads (replaces the single num_phase_threads).
  - calc_perf_data_size_with_phases takes both counts; new
    get_sched_phase_buffer_state / get_orch_phase_buffer_state helpers
    replace get_phase_buffer_state.
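
One way the two record types could be laid out to hit the stated 40B and 32B sizes is sketched below. The type names are hypothetical and every field width except tasks_processed (uint32) is an assumption; the orch widths are merely consistent with the (uint64_t, uint64_t, uint64_t, uint32_t) fallback signature mentioned later in this message.

```cpp
#include <cassert>
#include <cstdint>

enum class SchedPhaseKindSketch : uint32_t { Complete = 0, Dispatch = 1 };

// Mirrors the four buffer kinds listed above; AicoreTask moves from 2 to 3.
enum class BufferKindSketch : uint32_t {
    AicpuTask = 0,
    AicpuSchedPhase = 1,
    AicpuOrchPhase = 2,
    AicoreTask = 3
};

// Assumed widths: two 64-bit timestamps plus six 32-bit fields = 40 bytes.
struct SchedPhaseRecordSketch {
    uint64_t start_time;
    uint64_t end_time;
    uint32_t loop_iter;
    SchedPhaseKindSketch kind;
    uint32_t tasks_processed;
    uint32_t pop_hit;
    uint32_t pop_miss;
    uint32_t pad;
};
static_assert(sizeof(SchedPhaseRecordSketch) == 40, "sched record sketch must be 40 bytes");

// Assumed widths: three 64-bit fields plus two 32-bit fields = 32 bytes.
struct OrchPhaseRecordSketch {
    uint64_t start_time;
    uint64_t end_time;
    uint64_t task_id;
    uint32_t submit_idx;
    uint32_t pad;
};
static_assert(sizeof(OrchPhaseRecordSketch) == 32, "orch record sketch must be 32 bytes");
```

Because each record type has its own buffer kind, the host can pick the parser from the kind alone instead of inspecting a phase_id inside the record.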

Dropped (no compat layer):
  - L2SwimlaneAicpuPhaseId enum, including legacy ids 2/3/16-24 that
    had been documented as "host parser maps to unknown".
  - L2SwimlaneAicpuPhaseRecord / L2SwimlaneAicpuPhaseBuffer.
  - kAicpuOrchPhaseIdBase / is_scheduler_phase routing.
  - The unused `records_per_thread` PhaseHeader field never had a
    reader.

AICPU collector:
  - s_aicpu_phase_pools[] split into s_sched_phase_pools[] +
    s_orch_phase_pools[]; same for current-buffer caches.
  - l2_swimlane_aicpu_init_phase now takes (worker_count,
    num_sched_phase_threads, num_orch_phase_threads).
  - record_phase split into record_sched_phase (with kind +
    pop_hit/pop_miss named, not extras) and record_orch_phase
    (task_id + submit_idx).
  - switch_phase_buffer + acquire_phase_slot generalized into kind-
    parameterized templates shared by both pool types.
  - flush_phase_buffers drains both pool arrays for the thread.
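
The kind-parameterized helpers could look roughly like the sketch below, which shares one slot-acquire template between two record types. All names, the buffer shape, and the capacities are hypothetical; the real helpers also handle buffer switching, which is omitted here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical fixed-capacity buffer, parameterized on the record type.
template <typename Record, uint32_t Capacity>
struct PhaseBufferSketch {
    uint32_t count = 0;
    Record records[Capacity] = {};
};

// Shared slot acquisition: returns the next free record, or nullptr when the
// buffer is full (the real code would switch to a fresh buffer at this point).
template <typename Record, uint32_t Capacity>
Record *acquire_slot(PhaseBufferSketch<Record, Capacity> &buf) {
    if (buf.count >= Capacity) {
        return nullptr;
    }
    return &buf.records[buf.count++];
}

// Reduced stand-ins for the two record types.
struct SchedRecSketch {
    uint64_t start_time;
    uint64_t end_time;
    uint32_t pop_hit;
    uint32_t pop_miss;
};
struct OrchRecSketch {
    uint64_t start_time;
    uint64_t end_time;
    uint64_t task_id;
    uint32_t submit_idx;
};
```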

Host collector:
  - collected_phase_records_ split into collected_sched_phase_records_
    and collected_orch_phase_records_; total_phase_collected_
    similarly split for clean reconcile per kind.
  - copy_phase_buffer split into copy_sched_phase_buffer +
    copy_orch_phase_buffer.
  - resolve_entry and for_each_instance route on the four
    BufferKinds; ProfBufferType mirrors.
  - JSON emit unchanged on the wire: sched section still has
    "phase"/"loop_iter"/"tasks_processed"/"pop_hit"/"pop_miss"; orch
    section still has "phase"/"submit_idx"/"task_id"/timestamps
    (phase string is now hard-coded "orch_submit" since the type tag
    is the truth).

Scheduler call sites:
  - scheduler_dispatch.cpp's three record_phase sites convert to
    record_sched_phase with L2SwimlaneSchedPhaseKind::Complete or
    ::Dispatch.

Orchestrator call site:
  - pto_orchestrator.cpp CYCLE_COUNT_ORCH_SUBMIT_RECORD drops the
    L2SwimlaneAicpuPhaseId::ORCH_SUBMIT argument; the weak fallback
    signature is reduced to (uint64_t, uint64_t, uint64_t, uint32_t).

Header thread count (scheduler_cold_path.cpp):
  - sched and orch pool counts computed independently:
      sched = orch_to_sched_ ? aicpu_thread_num_ : sched_thread_num_
      orch  = orch_to_sched_ ? aicpu_thread_num_ : 1
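
The two expressions above, written out as a small helper. The struct and function names are hypothetical; the selection logic is copied from the lines above.

```cpp
#include <cassert>

struct PhasePoolCountsSketch {
    int sched;
    int orch;
};

// When orchestrators can become schedulers, every AICPU thread may write both
// record kinds, so both counts equal aicpu_thread_num. Otherwise only the
// scheduler threads write sched records and a single orchestrator writes orch records.
inline PhasePoolCountsSketch phase_pool_counts(bool orch_to_sched, int aicpu_thread_num, int sched_thread_num) {
    PhasePoolCountsSketch c;
    c.sched = orch_to_sched ? aicpu_thread_num : sched_thread_num;
    c.orch = orch_to_sched ? aicpu_thread_num : 1;
    return c;
}
```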

Python tools:
  - swimlane_converter.py and sched_overhead_analysis.py read the
    same field names; orch section's phase key now always equals
    "orch_submit" (was already the only value). No tool changes
    required.

Test plan:
  - sim swimlane ST (test_l2_swimlane + test_l2_swimlane_mixed at
    --enable-l2-swimlane level 4): pass
  - sim DFX (scope_stats / tensor_dump / pmu / dep_gen): pass
  - pre-commit clean

Stacked on top of hw-native-sys#941 (PhaseHeader merge).
@ChaoWao ChaoWao force-pushed the refactor/swimlane-merge-phase-header branch from 84f46b1 to b0b32d6 Compare May 31, 2026 09:02
@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
src/a2a3/platform/src/aicpu/l2_swimlane_collector_aicpu.cpp (2)

602-614: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Publish the actual number of initialized phase pools.

Line 602 stores num_sched_threads, but Lines 609-614 initialize num_sched_threads + 1 pools. The host now uses header->num_phase_threads to enumerate phase pools, so the orchestrator pool is skipped for replenish/drain when it has its own slot.

Suggested fix
-    s_l2_swimlane_header->num_phase_threads = num_sched_threads;
-    s_l2_swimlane_header->num_phase_cores = 0;
-    memset(s_l2_swimlane_header->core_to_thread, -1, sizeof(s_l2_swimlane_header->core_to_thread));
-    s_phase_initialized = true;
-
-    // Cache per-thread record pointers and clear buffers
-    // Include all threads: scheduler + orchestrator (orchestrators may become schedulers)
     int total_threads = num_sched_threads + 1;
     if (total_threads > PLATFORM_MAX_AICPU_THREADS) {
         total_threads = PLATFORM_MAX_AICPU_THREADS;
     }
+    s_l2_swimlane_header->num_phase_threads = static_cast<uint32_t>(total_threads);
+    s_l2_swimlane_header->num_phase_cores = 0;
+    memset(s_l2_swimlane_header->core_to_thread, -1, sizeof(s_l2_swimlane_header->core_to_thread));
+    s_phase_initialized = true;
+
+    // Cache per-thread record pointers and clear buffers
+    // Include all threads: scheduler + orchestrator (orchestrators may become schedulers)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/platform/src/aicpu/l2_swimlane_collector_aicpu.cpp` around lines 602
- 614, The header field s_l2_swimlane_header->num_phase_threads is set to
num_sched_threads but you actually initialize total_threads = num_sched_threads
+ 1 (capped by PLATFORM_MAX_AICPU_THREADS) and allocate that many phase pools
via get_phase_buffer_state; update the code so
s_l2_swimlane_header->num_phase_threads is assigned the actual number of
initialized pools (total_threads) after applying the cap, ensuring the host will
enumerate all created L2SwimlaneAicpuPhasePool entries and not skip the
orchestrator pool; keep the cap logic and set s_phase_initialized as before.

133-146: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reset phase-local singleton state at the start of l2_swimlane_aicpu_init().

These statics survive across runs and are never cleared anywhere in this file. After one phase-enabled session, a later session that skips l2_swimlane_aicpu_init_phase() can still leave s_phase_initialized == true with stale s_aicpu_phase_pools / s_current_aicpu_phase_buffers pointing at the previous shared-memory region.

Suggested fix
 void l2_swimlane_aicpu_init(int worker_count) {
     void *l2_swimlane_base = reinterpret_cast<void *>(g_platform_l2_swimlane_base);
     if (l2_swimlane_base == nullptr) {
         LOG_ERROR("l2_swimlane_data_base is NULL, cannot initialize profiling");
         return;
     }
 
+    s_phase_initialized = false;
+    s_orch_thread_idx = -1;
+    memset(s_aicpu_phase_pools, 0, sizeof(s_aicpu_phase_pools));
+    memset(s_current_aicpu_phase_buffers, 0, sizeof(s_current_aicpu_phase_buffers));
+
     s_l2_swimlane_header = get_l2_swimlane_header(l2_swimlane_base);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/platform/src/aicpu/l2_swimlane_collector_aicpu.cpp` around lines 133
- 146, At the start of l2_swimlane_aicpu_init() reset the phase-local singleton
state so stale data from previous runs can't persist: explicitly set
s_phase_initialized = false, clear and/or reset s_aicpu_phase_pools (e.g., clear
container and release any held pointers) and set s_current_aicpu_phase_buffers =
nullptr (or clear its map/vector as appropriate), and reset any other
phase-scoped pointers before reading the new shared-memory header
(s_l2_swimlane_header) so subsequent sessions that skip
l2_swimlane_aicpu_init_phase() won't use stale state.
src/a2a3/platform/include/host/l2_swimlane_collector.h (1)

182-189: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate num_phase_threads before iterating phase pools.

This loop now trusts a device-written header field directly. If that count is stale or oversized, get_phase_buffer_state(shm, num_cores, t) walks past the allocated phase-pool tail on the host.

Suggested fix
-        const int num_phase_threads = static_cast<int>(header->num_phase_threads);
+        int num_phase_threads = static_cast<int>(header->num_phase_threads);
+        if (num_phase_threads < 0 || num_phase_threads > PLATFORM_MAX_AICPU_THREADS) {
+            LOG_ERROR(
+                "L2SwimlaneModule: invalid num_phase_threads=%d (max=%d)",
+                num_phase_threads, PLATFORM_MAX_AICPU_THREADS
+            );
+            num_phase_threads = PLATFORM_MAX_AICPU_THREADS;
+        }
         for (int t = 0; t < num_phase_threads; t++) {
             L2SwimlaneAicpuPhasePool *state = get_phase_buffer_state(shm, num_cores, t);
             cb(/*kind=*/1, &state->free_queue, sizeof(L2SwimlaneAicpuPhaseBuffer));
         }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/platform/include/host/l2_swimlane_collector.h` around lines 182 -
189, The code trusts header->num_phase_threads and iterates without bounds
checking, which can walk past allocated phase pools; fix by validating and
clamping num_phase_threads before the loop: compute a safe max (e.g., derived
from the shared-memory layout/allocated phase-pool count or a defined constant
like MAX_PHASE_THREADS), ensure the parsed value is non-negative and <= that max
(use size_t or unsigned for comparison to avoid signed/unsigned bugs), then loop
for t from 0 to min(static_cast<int>(header->num_phase_threads), safe_max)-1 and
proceed to call get_phase_buffer_state(shm, num_cores, t) and cb; if the header
value is out of range, log or skip the extra entries instead of iterating past
bounds.
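
A minimal sketch of the clamp described in this finding, assuming the host knows how many phase pools it actually allocated. `clamp_phase_threads` and `allocated_pools` are hypothetical names; note that clamping to a compile-time maximum is only safe if that many pools were really allocated.

```cpp
#include <cassert>
#include <cstdint>

// Clamp a device-written count to the number of pools the host allocated.
// The comparison is done unsigned so a garbage value such as 0xFFFFFFFF
// cannot turn into a negative int and slip past the check.
// Precondition (assumed): allocated_pools >= 0.
inline int clamp_phase_threads(uint32_t device_value, int allocated_pools) {
    if (device_value > static_cast<uint32_t>(allocated_pools)) {
        return allocated_pools;
    }
    return static_cast<int>(device_value);
}
```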

📥 Commits

Reviewing files that changed from the base of the PR and between dc83d5a and b0b32d6.

📒 Files selected for processing (5)
  • src/a2a3/platform/include/aicpu/l2_swimlane_collector_aicpu.h
  • src/a2a3/platform/include/common/l2_swimlane_profiling.h
  • src/a2a3/platform/include/host/l2_swimlane_collector.h
  • src/a2a3/platform/src/aicpu/l2_swimlane_collector_aicpu.cpp
  • src/a2a3/platform/src/host/l2_swimlane_collector.cpp

@ChaoWao ChaoWao force-pushed the refactor/swimlane-merge-phase-header branch from b0b32d6 to e6de52f Compare May 31, 2026 09:54
ChaoWao added a commit to hw-native-sys-bot/simpler that referenced this pull request May 31, 2026
The unified L2SwimlaneAicpuPhaseRecord (40B with a union and a phase_id
discriminator) and the parse-time kAicpuOrchPhaseIdBase = 16 range gate
were a vestige of when sched and orch records shared one path. Schema
splits them cleanly into two record types, two BufferKinds, and two
pool arrays — type-tagged at the device-side write, no parse-time
discriminator on the host side.

Schema (header):
  - L2SwimlaneSchedPhaseKind { Complete=0, Dispatch=1 }
  - L2SwimlaneAicpuSchedPhaseRecord (40B):
      start_time, end_time, loop_iter, kind, tasks_processed (uint32),
      pop_hit, pop_miss, pad
  - L2SwimlaneAicpuOrchPhaseRecord (32B):
      start_time, end_time, task_id, submit_idx, pad
  - L2SwimlaneBufferKind:
      AicpuTask=0, AicpuSchedPhase=1, AicpuOrchPhase=2, AicoreTask=3
      (AicoreTask shifted from 2 to 3 to accommodate the split)
  - L2SwimlaneDataHeader carries num_sched_phase_threads +
    num_orch_phase_threads (replaces the single num_phase_threads).
  - calc_perf_data_size_with_phases takes both counts; new
    get_sched_phase_buffer_state / get_orch_phase_buffer_state helpers
    replace get_phase_buffer_state.
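
The split schema above can be sketched roughly as follows. This is an illustration, not the actual header: the field order and the 40B/32B totals come from the commit message, but every field width except `tasks_processed` (uint32) is an assumption chosen so the sizes add up.

```cpp
#include <cassert>
#include <cstdint>

// Buffer kinds after the split; the values are the ones listed above.
enum class L2SwimlaneBufferKind : uint32_t {
    AicpuTask = 0,
    AicpuSchedPhase = 1,
    AicpuOrchPhase = 2,
    AicoreTask = 3,  // was 2 before the split
};

enum class L2SwimlaneSchedPhaseKind : uint32_t { Complete = 0, Dispatch = 1 };

// Assumed widths: two 64-bit timestamps plus six 32-bit fields = 40 bytes.
struct L2SwimlaneAicpuSchedPhaseRecord {
    uint64_t start_time;
    uint64_t end_time;
    uint32_t loop_iter;
    L2SwimlaneSchedPhaseKind kind;
    uint32_t tasks_processed;  // uint32 per the commit message
    uint32_t pop_hit;
    uint32_t pop_miss;
    uint32_t pad;
};

// Assumed widths: three 64-bit fields plus two 32-bit fields = 32 bytes.
struct L2SwimlaneAicpuOrchPhaseRecord {
    uint64_t start_time;
    uint64_t end_time;
    uint64_t task_id;
    uint32_t submit_idx;
    uint32_t pad;
};

// Compile-time checks that the assumed layouts match the documented sizes.
static_assert(sizeof(L2SwimlaneAicpuSchedPhaseRecord) == 40, "sched record must be 40B");
static_assert(sizeof(L2SwimlaneAicpuOrchPhaseRecord) == 32, "orch record must be 32B");
```

Because each record type has its own BufferKind, the host can pick the right struct from the buffer kind alone, without inspecting a per-record discriminator.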

Dropped (no compat layer):
  - L2SwimlaneAicpuPhaseId enum, including legacy ids 2/3/16-24 that
    had been documented as "host parser maps to unknown".
  - L2SwimlaneAicpuPhaseRecord / L2SwimlaneAicpuPhaseBuffer.
  - kAicpuOrchPhaseIdBase / is_scheduler_phase routing.
  - The `records_per_thread` PhaseHeader field, which was unused and
    never had a reader.

AICPU collector:
  - s_aicpu_phase_pools[] split into s_sched_phase_pools[] +
    s_orch_phase_pools[]; same for current-buffer caches.
  - l2_swimlane_aicpu_init_phase now takes (worker_count,
    num_sched_phase_threads, num_orch_phase_threads).
  - record_phase split into record_sched_phase (with kind +
    pop_hit/pop_miss named, not extras) and record_orch_phase
    (task_id + submit_idx).
  - switch_phase_buffer + acquire_phase_slot generalized into kind-
    parameterized templates shared by both pool types.
  - flush_phase_buffers drains both pool arrays for the thread.
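
One way to share the slot-acquire logic between both pool types is a kind-parameterized template, sketched below. All names here (`PhaseKind`, `RecordOf`, `PhaseBuffer`, the simplified record structs) are hypothetical stand-ins; the real buffer types and the buffer-switch protocol are not shown in the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical kind tag; the real code keys on its own buffer-kind type.
enum class PhaseKind { Sched, Orch };

// Simplified stand-in records, not the real 40B/32B layouts.
struct SchedRecord { uint64_t start_time; uint64_t end_time; uint32_t kind; };
struct OrchRecord { uint64_t start_time; uint64_t end_time; uint64_t task_id; uint32_t submit_idx; };

// Trait mapping each kind to its record type.
template <PhaseKind K> struct RecordOf;
template <> struct RecordOf<PhaseKind::Sched> { using type = SchedRecord; };
template <> struct RecordOf<PhaseKind::Orch> { using type = OrchRecord; };

// Hypothetical fixed-capacity buffer holding records of one kind.
template <PhaseKind K, std::size_t Capacity>
struct PhaseBuffer {
    typename RecordOf<K>::type records[Capacity];
    uint32_t count = 0;
};

// Single implementation used for both kinds: returns the next free slot,
// or nullptr when the buffer is full (the caller would then switch buffers).
template <PhaseKind K, std::size_t Capacity>
typename RecordOf<K>::type* acquire_phase_slot(PhaseBuffer<K, Capacity>& buf) {
    if (buf.count >= Capacity) return nullptr;
    return &buf.records[buf.count++];
}
```

The kind is resolved at compile time, so each call site gets a correctly typed record pointer with no runtime branch on the record type.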

Host collector:
  - collected_phase_records_ split into collected_sched_phase_records_
    and collected_orch_phase_records_; total_phase_collected_
    similarly split for clean reconcile per kind.
  - copy_phase_buffer split into copy_sched_phase_buffer +
    copy_orch_phase_buffer.
  - resolve_entry and for_each_instance route on the four
    BufferKinds; ProfBufferType mirrors.
  - JSON emit unchanged on the wire: sched section still has
    "phase"/"loop_iter"/"tasks_processed"/"pop_hit"/"pop_miss"; orch
    section still has "phase"/"submit_idx"/"task_id"/timestamps
    (phase string is now hard-coded "orch_submit" since the type tag
    is the truth).

Scheduler call sites:
  - scheduler_dispatch.cpp's three record_phase sites convert to
    record_sched_phase with L2SwimlaneSchedPhaseKind::Complete or
    ::Dispatch.

Orchestrator call site:
  - pto_orchestrator.cpp CYCLE_COUNT_ORCH_SUBMIT_RECORD drops the
    L2SwimlaneAicpuPhaseId::ORCH_SUBMIT argument; the weak fallback
    signature is reduced to (uint64_t, uint64_t, uint64_t, uint32_t).

Header thread count (scheduler_cold_path.cpp):
  - sched and orch pool counts computed independently:
      sched = orch_to_sched_ ? aicpu_thread_num_ : sched_thread_num_
      orch  = orch_to_sched_ ? aicpu_thread_num_ : 1
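
The two expressions above can be read as a small pure function. The sketch below assumes a hypothetical `PhasePoolCounts` struct and a free function taking the three inputs as parameters; the actual code in scheduler_cold_path.cpp reads them from member variables.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type holding the two independently computed counts.
struct PhasePoolCounts {
    uint32_t sched;
    uint32_t orch;
};

// Mirrors the two ternaries from the commit message.
constexpr PhasePoolCounts phase_pool_counts(bool orch_to_sched,
                                            uint32_t aicpu_thread_num,
                                            uint32_t sched_thread_num) {
    return PhasePoolCounts{
        orch_to_sched ? aicpu_thread_num : sched_thread_num,
        orch_to_sched ? aicpu_thread_num : 1u};
}
```

With `orch_to_sched` disabled, the orch count is 1 regardless of how many scheduler threads exist; with it enabled, both counts equal `aicpu_thread_num`.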

Python tools:
  - swimlane_converter.py and sched_overhead_analysis.py read the
    same field names; orch section's phase key now always equals
    "orch_submit" (was already the only value). No tool changes
    required.

Test plan:
  - sim swimlane ST (test_l2_swimlane + test_l2_swimlane_mixed at
    --enable-l2-swimlane level 4): pass
  - sim DFX (scope_stats / tensor_dump / pmu / dep_gen): pass
  - pre-commit clean

Stacked on top of hw-native-sys#941 (PhaseHeader merge).
ChaoWao force-pushed the refactor/swimlane-merge-phase-header branch from e6de52f to a0e8d08 May 31, 2026 10:18
The standalone phase header was a vestige of when phase profiling was
an add-on bolted onto the shared-memory layout. Phase metadata is now
co-equal with task pool metadata, so the dedicated cache line + magic
gate + indirection are pure overhead.

Move the three live phase-header fields directly into the root header:
  - num_sched_threads → num_phase_threads (renamed for clarity; it
    counts phase pools, which equals sched_thread_num or
    aicpu_thread_num depending on PTO2_ORCH_TO_SCHED)
  - num_cores → num_phase_cores (disambiguate from the root header's
    pre-existing num_cores — they have different semantics)
  - core_to_thread[PLATFORM_MAX_CORES] — verbatim

Dropped:
  - magic (L2_SWIMLANE_AICPU_PHASE_MAGIC): redundant — host now gates
    on `num_phase_threads > 0` (zero-init means phase init never ran).
  - records_per_thread: AICPU wrote it as PLATFORM_PHASE_RECORDS_PER_THREAD
    but no caller ever read it — dead field.

Shared-memory layout after the merge:
  [L2SwimlaneDataHeader (now includes phase metadata)]
  [L2SwimlaneAicpuTaskPool × num_cores]
  [L2SwimlaneAicoreTaskPool × num_cores]
  [L2SwimlaneAicpuPhasePool × num_phase_threads]   ← was preceded by header

`get_phase_header()` is deleted; `get_phase_buffer_states()` skips
straight from the AicoreTaskPool array to the phase pools.
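
The offset arithmetic implied by this layout can be sketched as below. The `LayoutSizes` struct and both helper functions are hypothetical; the real code would use `sizeof` on the actual header and pool types, which are not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical per-element sizes in bytes; stand-ins for sizeof(...) on the
// real header and pool structs.
struct LayoutSizes {
    std::size_t header;
    std::size_t aicpu_task_pool;
    std::size_t aicore_task_pool;
    std::size_t phase_pool;
};

// Phase pools start right after the AicoreTaskPool array; no intermediate
// phase header is skipped any more.
constexpr std::size_t phase_pools_offset(const LayoutSizes& s, uint32_t num_cores) {
    return s.header + num_cores * s.aicpu_task_pool + num_cores * s.aicore_task_pool;
}

// Total region size including the phase pools.
constexpr std::size_t total_size(const LayoutSizes& s, uint32_t num_cores,
                                 uint32_t num_phase_threads) {
    return phase_pools_offset(s, num_cores) + num_phase_threads * s.phase_pool;
}
```

For example, with made-up sizes of 256, 64, 64, and 128 bytes and 4 cores, the phase pools would begin at byte 768.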

AICPU collector keeps a separate `s_phase_initialized` bool so gated
paths can check init-ran without re-reading the device-shared header
on the hot path. Replaces the old `s_l2_swimlane_aicpu_phase_header
== nullptr` check.
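
A minimal sketch of the two gates described in this commit message follows. Only `s_phase_initialized` and the `num_phase_threads > 0` condition come from the text; the function names and the record counter are hypothetical placeholders.

```cpp
#include <cassert>
#include <cstdint>

// Device side: local flag set once by phase init, so gated paths never
// touch the device-shared header just to learn whether init ran.
static bool s_phase_initialized = false;

// Hypothetical counter standing in for an actual record write.
static uint32_t s_records_written = 0;

// Hypothetical init hook; the real init also sets up the phase pools.
void phase_init() { s_phase_initialized = true; }

// Hypothetical gated path: does nothing until init has run.
bool record_phase_gated() {
    if (!s_phase_initialized) return false;
    ++s_records_written;
    return true;
}

// Host side: a zero-initialized header means phase init never ran, so the
// thread count doubles as the "init ran" gate that magic used to provide.
bool host_phase_init_ran(uint32_t num_phase_threads) {
    return num_phase_threads > 0;
}
```

The device-side flag is a plain local read, which is the point of keeping it separate from the shared header.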

Built atop hw-native-sys#939 (ActiveHead refactor).

Test plan:
  - sim swimlane ST (test_l2_swimlane + test_l2_swimlane_mixed): pass
  - sim DFX tests (scope_stats / tensor_dump / pmu / dep_gen): pass
  - pre-commit clean
@ChaoWao ChaoWao merged commit 0331dcc into hw-native-sys:main May 31, 2026
15 checks passed
@ChaoWao ChaoWao deleted the refactor/swimlane-merge-phase-header branch May 31, 2026 10:46
ChaoWao added a commit that referenced this pull request May 31, 2026
The unified L2SwimlaneAicpuPhaseRecord (40B with a union and a phase_id
discriminator) and the parse-time kAicpuOrchPhaseIdBase = 16 range gate
were a vestige of when sched and orch records shared one path. The
schema now splits them cleanly into two record types, two BufferKinds,
and two pool arrays: records are type-tagged at the device-side write,
so the host side needs no parse-time discriminator.

Schema (header):
  - L2SwimlaneSchedPhaseKind { Complete=0, Dispatch=1 }
  - L2SwimlaneAicpuSchedPhaseRecord (40B):
      start_time, end_time, loop_iter, kind, tasks_processed (uint32),
      pop_hit, pop_miss, pad
  - L2SwimlaneAicpuOrchPhaseRecord (32B):
      start_time, end_time, task_id, submit_idx, pad
  - L2SwimlaneBufferKind:
      AicpuTask=0, AicpuSchedPhase=1, AicpuOrchPhase=2, AicoreTask=3
      (AicoreTask shifted from 2 to 3 to accommodate the split)
  - L2SwimlaneDataHeader carries num_sched_phase_threads +
    num_orch_phase_threads (replaces the single num_phase_threads).
  - calc_perf_data_size_with_phases takes both counts; new
    get_sched_phase_buffer_state / get_orch_phase_buffer_state helpers
    replace get_phase_buffer_state.

Dropped (no compat layer):
  - L2SwimlaneAicpuPhaseId enum, including legacy ids 2/3/16-24 that
    had been documented as "host parser maps to unknown".
  - L2SwimlaneAicpuPhaseRecord / L2SwimlaneAicpuPhaseBuffer.
  - kAicpuOrchPhaseIdBase / is_scheduler_phase routing.
  - The `records_per_thread` PhaseHeader field, which was unused and
    never had a reader.

AICPU collector:
  - s_aicpu_phase_pools[] split into s_sched_phase_pools[] +
    s_orch_phase_pools[]; same for current-buffer caches.
  - l2_swimlane_aicpu_init_phase now takes (worker_count,
    num_sched_phase_threads, num_orch_phase_threads).
  - record_phase split into record_sched_phase (with kind +
    pop_hit/pop_miss named, not extras) and record_orch_phase
    (task_id + submit_idx).
  - switch_phase_buffer + acquire_phase_slot generalized into kind-
    parameterized templates shared by both pool types.
  - flush_phase_buffers drains both pool arrays for the thread.

Host collector:
  - collected_phase_records_ split into collected_sched_phase_records_
    and collected_orch_phase_records_; total_phase_collected_
    similarly split for clean reconcile per kind.
  - copy_phase_buffer split into copy_sched_phase_buffer +
    copy_orch_phase_buffer.
  - resolve_entry and for_each_instance route on the four
    BufferKinds; ProfBufferType mirrors.
  - JSON emit unchanged on the wire: sched section still has
    "phase"/"loop_iter"/"tasks_processed"/"pop_hit"/"pop_miss"; orch
    section still has "phase"/"submit_idx"/"task_id"/timestamps
    (phase string is now hard-coded "orch_submit" since the type tag
    is the truth).

Scheduler call sites:
  - scheduler_dispatch.cpp's three record_phase sites convert to
    record_sched_phase with L2SwimlaneSchedPhaseKind::Complete or
    ::Dispatch.

Orchestrator call site:
  - pto_orchestrator.cpp CYCLE_COUNT_ORCH_SUBMIT_RECORD drops the
    L2SwimlaneAicpuPhaseId::ORCH_SUBMIT argument; the weak fallback
    signature is reduced to (uint64_t, uint64_t, uint64_t, uint32_t).

Header thread count (scheduler_cold_path.cpp):
  - sched and orch pool counts computed independently:
      sched = orch_to_sched_ ? aicpu_thread_num_ : sched_thread_num_
      orch  = orch_to_sched_ ? aicpu_thread_num_ : 1

Python tools:
  - swimlane_converter.py and sched_overhead_analysis.py read the
    same field names; orch section's phase key now always equals
    "orch_submit" (was already the only value). No tool changes
    required.

Test plan:
  - sim swimlane ST (test_l2_swimlane + test_l2_swimlane_mixed at
    --enable-l2-swimlane level 4): pass
  - sim DFX (scope_stats / tensor_dump / pmu / dep_gen): pass
  - pre-commit clean

Stacked on top of #941 (PhaseHeader merge).

Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>
ChaoWao added a commit that referenced this pull request May 31, 2026
After #939 (pool unification), #941 (PhaseHeader merge), and #942
(split sched/orch phase records), several comments and doc sections
still referenced the pre-split a2a3 layout. Audit and update:

a2a3 code/comments:
- platform_config.h: PROF_BUFFERS_PER_THREAD doc references both
  SchedPhaseBuffer and OrchPhaseBuffer (was: single PhaseBuffer);
  PROF_READYQUEUE_SIZE comment now says "four kinds"; formula bumped
  by 2x on the per-thread term to cover both sched and orch pool
  enqueues (matches host alloc which iterates both pool arrays).
- l2_swimlane_profiling.h header layout diagram: name the two split
  phase-thread counts.
- l2_swimlane_collector_aicpu.cpp: cross-launch reset comment now
  references s_sched_phase_pools / s_orch_phase_pools (was: single
  s_aicpu_phase_pools) and record_sched_phase / record_orch_phase.
- scheduler_dispatch.cpp / aicpu_executor.cpp: comments reference
  the split record types.

src/common/ shared comments (now mixed-arch):
- profiler_base.h / buffer_pool_manager.h: qualify
  L2SwimlaneAicpuPhaseHeader::magic example as "on a5" since the
  struct no longer exists on a2a3.

docs/dfx/l2-swimlane-profiling.md:
- §5.1: layout block + record list now distinguish a2a3 split shape
  (SchedPhaseRecord 40B + OrchPhaseRecord 32B, two pool arrays) from
  a5's still-unified shape (pending port).
- §5.2: a2a3 buffer-kind list updated to all four kinds (was: two);
  ASCII data-flow diagram redrawn to show split phase records;
  kBufferKinds = 4 in the L2SwimlaneModule trait description.
- §5.3 (a5): num_phase_threads / core_to_thread[] reference corrected
  to live in L2SwimlaneAicpuPhaseHeader on a5 (was wrongly attributed
  to L2SwimlaneDataHeader).
- §5.4: comparison table separates task record (identical) from
  phase record (diverged); ready-queue and kBufferKinds rows
  call out the a2a3=4 vs a5=2 split.
- §6: overhead description differentiates a2a3's per-emit
  SchedPhase + per-submit OrchPhase from a5's unified PhaseRecord
  (was: "4 phases × 40B per iteration", which described a removed
  shape).
- §8 FAQ: "phase records empty" entry gates a2a3 on
  num_{sched,orch}_phase_threads, a5 on PhaseHeader::magic.

No semantic code changes except the READYQUEUE_SIZE formula bump
(adds ~8KB to the header; necessary correctness fix given the second
phase pool).

Test plan:
- pre-commit clean
- onboard l2_swimlane STs (--enable-l2-swimlane --enable-dep-gen): 2 passed
- onboard paged_attention_unroll level 4: 1 passed

Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>