Refactor: merge L2SwimlaneAicpuPhaseHeader into L2SwimlaneDataHeader #941
Code Review
This pull request refactors the L2 swimlane profiling memory layout by merging L2SwimlaneAicpuPhaseHeader into L2SwimlaneDataHeader and introducing a unified L2SwimlaneActiveHead cache-line structure shared across all pool types. Feedback on these changes highlights three critical issues: first, consolidating record counters into the same cache line as the active buffer pointer and sequence number introduces hot-path cache bouncing between AICore and AICPU; second, the newly merged phase metadata fields in L2SwimlaneDataHeader are left uninitialized during collector initialization, risking garbage reads; and third, the static s_phase_initialized flag on AICPU is not reset between launches, which can leak state and cause undefined behavior.
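The cache-bouncing concern in the first point can be illustrated with a layout sketch. This is not the PR's `L2SwimlaneActiveHead`: the struct name, the field names, the 64-byte line size, and the assumption that the counter is written far more often than the pointer and sequence number are all hypothetical. It only shows how `alignas` keeps frequently updated counters off the line that the other side polls.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache-line size; the real platform constant is not given here.
constexpr std::size_t kCacheLineBytes = 64;

// Hypothetical split layout: the buffer pointer and sequence number share
// one cache line, while the record counter gets a line of its own.
struct SplitActiveHead {
    alignas(kCacheLineBytes) std::uint64_t active_buffer_ptr;
    std::uint32_t seq;
    alignas(kCacheLineBytes) std::uint32_t record_count;
};

static_assert(offsetof(SplitActiveHead, record_count) == kCacheLineBytes,
              "counter must start on the second cache line");
static_assert(sizeof(SplitActiveHead) == 2 * kCacheLineBytes,
              "split layout occupies exactly two cache lines");
```

The trade-off is one extra cache line per pool, which is the cost the unified layout avoids.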
7dc8958 to 2474465
2474465 to 84f46b1
The unified L2SwimlaneAicpuPhaseRecord (40B with a union and a phase_id
discriminator) and the parse-time kAicpuOrchPhaseIdBase = 16 range gate
were a vestige of when sched and orch records shared one path. The schema
now splits them cleanly into two record types, two BufferKinds, and two
pool arrays: records are type-tagged at the device-side write, so the host
needs no parse-time discriminator.
Schema (header):
- L2SwimlaneSchedPhaseKind { Complete=0, Dispatch=1 }
- L2SwimlaneAicpuSchedPhaseRecord (40B):
start_time, end_time, loop_iter, kind, tasks_processed (uint32),
pop_hit, pop_miss, pad
- L2SwimlaneAicpuOrchPhaseRecord (32B):
start_time, end_time, task_id, submit_idx, pad
- L2SwimlaneBufferKind:
AicpuTask=0, AicpuSchedPhase=1, AicpuOrchPhase=2, AicoreTask=3
(AicoreTask shifted from 2 to 3 to accommodate the split)
- L2SwimlaneDataHeader carries num_sched_phase_threads +
num_orch_phase_threads (replaces the single num_phase_threads).
- calc_perf_data_size_with_phases takes both counts; new
get_sched_phase_buffer_state / get_orch_phase_buffer_state helpers
replace get_phase_buffer_state.
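The record layouts listed above can be sketched as follows. Only the field names, the 40B/32B totals, the `uint32` width of `tasks_processed`, and the enum values come from this message; every other field width is an assumption chosen so that the sizes add up, and the real header may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

enum class L2SwimlaneSchedPhaseKind : std::uint32_t { Complete = 0, Dispatch = 1 };

enum class L2SwimlaneBufferKind : std::uint32_t {
    AicpuTask = 0,
    AicpuSchedPhase = 1,
    AicpuOrchPhase = 2,
    AicoreTask = 3,  // shifted from 2 to make room for the split
};

// Assumed widths: 2 x uint64 timestamps + 6 x 32-bit fields = 40 bytes.
struct L2SwimlaneAicpuSchedPhaseRecord {
    std::uint64_t start_time;
    std::uint64_t end_time;
    std::uint32_t loop_iter;
    L2SwimlaneSchedPhaseKind kind;
    std::uint32_t tasks_processed;
    std::uint32_t pop_hit;
    std::uint32_t pop_miss;
    std::uint32_t pad;
};

// Assumed widths: 3 x uint64 + 2 x uint32 = 32 bytes.
struct L2SwimlaneAicpuOrchPhaseRecord {
    std::uint64_t start_time;
    std::uint64_t end_time;
    std::uint64_t task_id;
    std::uint32_t submit_idx;
    std::uint32_t pad;
};

static_assert(sizeof(L2SwimlaneAicpuSchedPhaseRecord) == 40, "sched record is 40B");
static_assert(sizeof(L2SwimlaneAicpuOrchPhaseRecord) == 32, "orch record is 32B");
```

Pinning the sizes with `static_assert` is one way to catch accidental layout drift between the device-side writer and the host-side reader at compile time.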
Dropped (no compat layer):
- L2SwimlaneAicpuPhaseId enum, including legacy ids 2/3/16-24 that
had been documented as "host parser maps to unknown".
- L2SwimlaneAicpuPhaseRecord / L2SwimlaneAicpuPhaseBuffer.
- kAicpuOrchPhaseIdBase / is_scheduler_phase routing.
- The `records_per_thread` PhaseHeader field, which was unused and never
had a reader.
AICPU collector:
- s_aicpu_phase_pools[] split into s_sched_phase_pools[] +
s_orch_phase_pools[]; same for current-buffer caches.
- l2_swimlane_aicpu_init_phase now takes (worker_count,
num_sched_phase_threads, num_orch_phase_threads).
- record_phase split into record_sched_phase (with kind +
pop_hit/pop_miss named, not extras) and record_orch_phase
(task_id + submit_idx).
- switch_phase_buffer + acquire_phase_slot generalized into kind-
parameterized templates shared by both pool types.
- flush_phase_buffers drains both pool arrays for the thread.
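A minimal sketch of what a kind-parameterized slot helper could look like, assuming a fixed-capacity buffer per record type. `PhaseBuffer`, `SchedRecord`, `OrchRecord`, and this `acquire_phase_slot` signature are illustrative stand-ins, not the collector's actual definitions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for the two record types.
struct SchedRecord {
    std::uint64_t start_time, end_time;
    std::uint32_t loop_iter, kind;
};
struct OrchRecord {
    std::uint64_t start_time, end_time, task_id;
    std::uint32_t submit_idx, pad;
};

// One buffer template, parameterized on the record type it holds.
template <typename Record, std::size_t Capacity>
struct PhaseBuffer {
    std::uint32_t count = 0;
    Record records[Capacity];
};

// Shared slot acquisition for both pool types: returns the next free
// record, or nullptr when the buffer is full and must be switched.
template <typename Record, std::size_t Capacity>
Record *acquire_phase_slot(PhaseBuffer<Record, Capacity> &buf) {
    if (buf.count >= Capacity) {
        return nullptr;
    }
    return &buf.records[buf.count++];
}
```

Because the record type is a template parameter, the sched and orch paths share one code path while the compiler still rejects writing one record kind into the other's buffer.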
Host collector:
- collected_phase_records_ split into collected_sched_phase_records_
and collected_orch_phase_records_; total_phase_collected_
similarly split for clean reconcile per kind.
- copy_phase_buffer split into copy_sched_phase_buffer +
copy_orch_phase_buffer.
- resolve_entry and for_each_instance route on the four
BufferKinds; ProfBufferType mirrors.
- JSON emit unchanged on the wire: sched section still has
"phase"/"loop_iter"/"tasks_processed"/"pop_hit"/"pop_miss"; orch
section still has "phase"/"submit_idx"/"task_id"/timestamps
(phase string is now hard-coded "orch_submit" since the type tag
is the truth).
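The hard-coded phase string can be sketched as below. `emit_orch_phase_json` and `OrchPhaseRecord` are hypothetical names, and the timestamp keys `start_time` / `end_time` are assumed, since this message only says "timestamps". The real emitter may order or name fields differently.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical host-side view of one collected orch record.
struct OrchPhaseRecord {
    std::uint64_t start_time;
    std::uint64_t end_time;
    std::uint64_t task_id;
    std::uint32_t submit_idx;
};

// The phase string is a constant: the record's type already says it is an
// orchestrator submit, so no per-record phase id is consulted.
std::string emit_orch_phase_json(const OrchPhaseRecord &r) {
    return std::string("{\"phase\":\"orch_submit\"") +
           ",\"submit_idx\":" + std::to_string(r.submit_idx) +
           ",\"task_id\":" + std::to_string(r.task_id) +
           ",\"start_time\":" + std::to_string(r.start_time) +
           ",\"end_time\":" + std::to_string(r.end_time) + "}";
}
```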
Scheduler call sites:
- scheduler_dispatch.cpp's three record_phase sites convert to
record_sched_phase with L2SwimlaneSchedPhaseKind::Complete or
::Dispatch.
Orchestrator call site:
- pto_orchestrator.cpp CYCLE_COUNT_ORCH_SUBMIT_RECORD drops the
L2SwimlaneAicpuPhaseId::ORCH_SUBMIT argument; the weak fallback
signature is reduced to (uint64_t, uint64_t, uint64_t, uint32_t).
Header thread count (scheduler_cold_path.cpp):
- sched and orch pool counts computed independently:
    sched = orch_to_sched_ ? aicpu_thread_num_ : sched_thread_num_
    orch  = orch_to_sched_ ? aicpu_thread_num_ : 1
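The two formulas above, written as a free function for illustration. `PhasePoolCounts` and `compute_phase_pool_counts` are hypothetical names; the actual code in scheduler_cold_path.cpp reads member variables instead of taking parameters.

```cpp
#include <cassert>
#include <cstdint>

struct PhasePoolCounts {
    std::uint32_t sched;
    std::uint32_t orch;
};

// When orchestrators can become schedulers, every AICPU thread needs a pool
// of each kind; otherwise sched pools follow the scheduler thread count and
// a single orch pool suffices.
constexpr PhasePoolCounts compute_phase_pool_counts(
    bool orch_to_sched, std::uint32_t aicpu_thread_num, std::uint32_t sched_thread_num) {
    return PhasePoolCounts{
        orch_to_sched ? aicpu_thread_num : sched_thread_num,
        orch_to_sched ? aicpu_thread_num : 1u,
    };
}
```

For example, with 4 AICPU threads and 3 scheduler threads, the counts are (3, 1) without `orch_to_sched_` and (4, 4) with it.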
Python tools:
- swimlane_converter.py and sched_overhead_analysis.py read the
same field names; orch section's phase key now always equals
"orch_submit" (was already the only value). No tool changes
required.
Test plan:
- sim swimlane ST (test_l2_swimlane + test_l2_swimlane_mixed at
--enable-l2-swimlane level 4): pass
- sim DFX (scope_stats / tensor_dump / pmu / dep_gen): pass
- pre-commit clean
Stacked on top of hw-native-sys#941 (PhaseHeader merge).
84f46b1 to b0b32d6
Caution: Some comments are outside the diff and can't be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)

src/a2a3/platform/src/aicpu/l2_swimlane_collector_aicpu.cpp (2)

602-614: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Publish the actual number of initialized phase pools.

Line 602 stores `num_sched_threads`, but Lines 609-614 initialize `num_sched_threads + 1` pools. The host now uses `header->num_phase_threads` to enumerate phase pools, so the orchestrator pool is skipped for replenish/drain when it has its own slot.

Suggested fix:

    -    s_l2_swimlane_header->num_phase_threads = num_sched_threads;
    -    s_l2_swimlane_header->num_phase_cores = 0;
    -    memset(s_l2_swimlane_header->core_to_thread, -1, sizeof(s_l2_swimlane_header->core_to_thread));
    -    s_phase_initialized = true;
    -
    -    // Cache per-thread record pointers and clear buffers
    -    // Include all threads: scheduler + orchestrator (orchestrators may become schedulers)
         int total_threads = num_sched_threads + 1;
         if (total_threads > PLATFORM_MAX_AICPU_THREADS) {
             total_threads = PLATFORM_MAX_AICPU_THREADS;
         }
    +    s_l2_swimlane_header->num_phase_threads = static_cast<uint32_t>(total_threads);
    +    s_l2_swimlane_header->num_phase_cores = 0;
    +    memset(s_l2_swimlane_header->core_to_thread, -1, sizeof(s_l2_swimlane_header->core_to_thread));
    +    s_phase_initialized = true;
    +
    +    // Cache per-thread record pointers and clear buffers
    +    // Include all threads: scheduler + orchestrator (orchestrators may become schedulers)

133-146: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Reset phase-local singleton state at the start of `l2_swimlane_aicpu_init()`.

These statics survive across runs and are never cleared anywhere in this file. After one phase-enabled session, a later session that skips `l2_swimlane_aicpu_init_phase()` can still leave `s_phase_initialized == true` with stale `s_aicpu_phase_pools` / `s_current_aicpu_phase_buffers` pointing at the previous shared-memory region.

Suggested fix:

         void l2_swimlane_aicpu_init(int worker_count) {
             void *l2_swimlane_base = reinterpret_cast<void *>(g_platform_l2_swimlane_base);
             if (l2_swimlane_base == nullptr) {
                 LOG_ERROR("l2_swimlane_data_base is NULL, cannot initialize profiling");
                 return;
             }

    +        s_phase_initialized = false;
    +        s_orch_thread_idx = -1;
    +        memset(s_aicpu_phase_pools, 0, sizeof(s_aicpu_phase_pools));
    +        memset(s_current_aicpu_phase_buffers, 0, sizeof(s_current_aicpu_phase_buffers));
    +
             s_l2_swimlane_header = get_l2_swimlane_header(l2_swimlane_base);

src/a2a3/platform/include/host/l2_swimlane_collector.h (1)

182-189: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Validate `num_phase_threads` before iterating phase pools.

This loop now trusts a device-written header field directly. If that count is stale or oversized, `get_phase_buffer_state(shm, num_cores, t)` walks past the allocated phase-pool tail on the host.

Suggested fix:

    -    const int num_phase_threads = static_cast<int>(header->num_phase_threads);
    +    int num_phase_threads = static_cast<int>(header->num_phase_threads);
    +    if (num_phase_threads < 0 || num_phase_threads > PLATFORM_MAX_AICPU_THREADS) {
    +        LOG_ERROR(
    +            "L2SwimlaneModule: invalid num_phase_threads=%d (max=%d)",
    +            num_phase_threads, PLATFORM_MAX_AICPU_THREADS
    +        );
    +        num_phase_threads = PLATFORM_MAX_AICPU_THREADS;
    +    }
         for (int t = 0; t < num_phase_threads; t++) {
             L2SwimlaneAicpuPhasePool *state = get_phase_buffer_state(shm, num_cores, t);
             cb(/*kind=*/1, &state->free_queue, sizeof(L2SwimlaneAicpuPhaseBuffer));
         }
b0b32d6 to e6de52f
e6de52f to a0e8d08
The standalone phase header was a vestige of when phase profiling was
an add-on bolted onto the shared-memory layout. Phase metadata is now
co-equal with task pool metadata, so the dedicated cache line + magic
gate + indirection are pure overhead.
Move the three live phase-header fields directly into the root header:
- num_sched_threads → num_phase_threads (renamed for clarity; it
counts phase pools, which equals sched_thread_num or
aicpu_thread_num depending on PTO2_ORCH_TO_SCHED)
- num_cores → num_phase_cores (disambiguate from the root header's
pre-existing num_cores — they have different semantics)
- core_to_thread[PLATFORM_MAX_CORES] — verbatim
Dropped:
- magic (L2_SWIMLANE_AICPU_PHASE_MAGIC): redundant — host now gates
on `num_phase_threads > 0` (zero-init means phase init never ran).
- records_per_thread: AICPU wrote it as PLATFORM_PHASE_RECORDS_PER_THREAD
but no caller ever read it — dead field.
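A sketch of the host-side gate that replaces the magic check, assuming the header has been zero-initialized. `HeaderView`, `phase_pools_to_walk`, and `kMaxPhaseThreads` (standing in for `PLATFORM_MAX_AICPU_THREADS`, with an invented value) are hypothetical. The upper clamp is not part of this commit; it follows the bounds check suggested in the review.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for PLATFORM_MAX_AICPU_THREADS; the value is made up.
constexpr std::uint32_t kMaxPhaseThreads = 8;

// Only the field relevant to the gate is modeled here.
struct HeaderView {
    std::uint32_t num_phase_threads;
};

// Returns how many phase pools the host should walk. Zero means phase init
// never ran; oversized values are clamped so the host never walks past the
// allocated pool array.
inline std::uint32_t phase_pools_to_walk(const HeaderView &header) {
    if (header.num_phase_threads == 0) {
        return 0;
    }
    if (header.num_phase_threads > kMaxPhaseThreads) {
        return kMaxPhaseThreads;
    }
    return header.num_phase_threads;
}
```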
Shared-memory layout after the merge:
    [L2SwimlaneDataHeader (now includes phase metadata)]
    [L2SwimlaneAicpuTaskPool × num_cores]
    [L2SwimlaneAicoreTaskPool × num_cores]
    [L2SwimlaneAicpuPhasePool × num_phase_threads]   ← was preceded by header
`get_phase_header()` is deleted; `get_phase_buffer_states()` skips
straight from the AicoreTaskPool array to the phase pools.
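The offset arithmetic implied by this layout can be sketched as below. The four structs are placeholders with invented sizes, and `phase_pools_offset` / `phase_pool_at` are hypothetical names rather than the real accessors; only the ordering of the regions comes from this message.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Placeholder types with made-up sizes; only their order in memory matters.
struct Header         { alignas(64) std::uint8_t bytes[256]; };
struct AicpuTaskPool  { alignas(64) std::uint8_t bytes[128]; };
struct AicoreTaskPool { alignas(64) std::uint8_t bytes[64]; };
struct AicpuPhasePool { alignas(64) std::uint8_t bytes[64]; };

// Phase pools start right after the two per-core pool arrays, with no
// intermediate phase header to skip.
constexpr std::size_t phase_pools_offset(std::size_t num_cores) {
    return sizeof(Header) + num_cores * sizeof(AicpuTaskPool) +
           num_cores * sizeof(AicoreTaskPool);
}

// Pointer to the phase pool of one thread; performs address arithmetic only.
inline AicpuPhasePool *phase_pool_at(void *shm_base, std::size_t num_cores,
                                     std::size_t thread_idx) {
    auto *bytes = static_cast<std::uint8_t *>(shm_base);
    return reinterpret_cast<AicpuPhasePool *>(bytes + phase_pools_offset(num_cores)) +
           thread_idx;
}
```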
The AICPU collector keeps a separate `s_phase_initialized` bool so gated
paths can check whether init ran without re-reading the device-shared
header on the hot path. It replaces the old
`s_l2_swimlane_aicpu_phase_header == nullptr` check.
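A minimal sketch of the flag-gated hot path. The function names and the counter are hypothetical, and resetting the flag in the base init is an assumption taken from the review feedback about state leaking between launches, not something this commit message states.

```cpp
#include <cassert>
#include <cstdint>

namespace {
// Local copy of "phase init ran", so the hot path never touches shared memory.
bool s_phase_initialized = false;
std::uint32_t s_records_written = 0;
}  // namespace

// Base init: clears phase state so a previous launch cannot leak into this one.
void collector_init() {
    s_phase_initialized = false;
    s_records_written = 0;
}

// Phase init: runs only when phase profiling is enabled for this launch.
void collector_init_phase() { s_phase_initialized = true; }

// Hot path: a single local branch decides whether to record.
bool record_phase() {
    if (!s_phase_initialized) {
        return false;
    }
    ++s_records_written;
    return true;
}
```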
Built atop hw-native-sys#939 (ActiveHead refactor).
Test plan:
- sim swimlane ST (test_l2_swimlane + test_l2_swimlane_mixed): pass
- sim DFX tests (scope_stats / tensor_dump / pmu / dep_gen): pass
- pre-commit clean
The unified L2SwimlaneAicpuPhaseRecord (40B with a union and a phase_id
discriminator) and the parse-time kAicpuOrchPhaseIdBase = 16 range gate
were a vestige of when sched and orch records shared one path. Schema
splits them cleanly into two record types, two BufferKinds, and two
pool arrays — type-tagged at the device-side write, no parse-time
discriminator on the host side.
Schema (header):
- L2SwimlaneSchedPhaseKind { Complete=0, Dispatch=1 }
- L2SwimlaneAicpuSchedPhaseRecord (40B):
start_time, end_time, loop_iter, kind, tasks_processed (uint32),
pop_hit, pop_miss, pad
- L2SwimlaneAicpuOrchPhaseRecord (32B):
start_time, end_time, task_id, submit_idx, pad
- L2SwimlaneBufferKind:
AicpuTask=0, AicpuSchedPhase=1, AicpuOrchPhase=2, AicoreTask=3
(AicoreTask shifted from 2 to 3 to accommodate the split)
- L2SwimlaneDataHeader carries num_sched_phase_threads +
num_orch_phase_threads (replaces the single num_phase_threads).
- calc_perf_data_size_with_phases takes both counts; new
get_sched_phase_buffer_state / get_orch_phase_buffer_state helpers
replace get_phase_buffer_state.
Dropped (no compat layer):
- L2SwimlaneAicpuPhaseId enum, including legacy ids 2/3/16-24 that
had been documented as "host parser maps to unknown".
- L2SwimlaneAicpuPhaseRecord / L2SwimlaneAicpuPhaseBuffer.
- kAicpuOrchPhaseIdBase / is_scheduler_phase routing.
- The unused `records_per_thread` PhaseHeader field never had a
reader.
AICPU collector:
- s_aicpu_phase_pools[] split into s_sched_phase_pools[] +
s_orch_phase_pools[]; same for current-buffer caches.
- l2_swimlane_aicpu_init_phase now takes (worker_count,
num_sched_phase_threads, num_orch_phase_threads).
- record_phase split into record_sched_phase (with kind +
pop_hit/pop_miss named, not extras) and record_orch_phase
(task_id + submit_idx).
- switch_phase_buffer + acquire_phase_slot generalized into kind-
parameterized templates shared by both pool types.
- flush_phase_buffers drains both pool arrays for the thread.
Host collector:
- collected_phase_records_ split into collected_sched_phase_records_
and collected_orch_phase_records_; total_phase_collected_
similarly split for clean reconcile per kind.
- copy_phase_buffer split into copy_sched_phase_buffer +
copy_orch_phase_buffer.
- resolve_entry and for_each_instance route on the four
BufferKinds; ProfBufferType mirrors.
- JSON emit unchanged on the wire: sched section still has
"phase"/"loop_iter"/"tasks_processed"/"pop_hit"/"pop_miss"; orch
section still has "phase"/"submit_idx"/"task_id"/timestamps
(phase string is now hard-coded "orch_submit" since the type tag
is the truth).
Scheduler call sites:
- scheduler_dispatch.cpp's three record_phase sites convert to
record_sched_phase with L2SwimlaneSchedPhaseKind::Complete or
::Dispatch.
Orchestrator call site:
- pto_orchestrator.cpp CYCLE_COUNT_ORCH_SUBMIT_RECORD drops the
L2SwimlaneAicpuPhaseId::ORCH_SUBMIT argument; the weak fallback
signature is reduced to (uint64_t, uint64_t, uint64_t, uint32_t).
Header thread count (scheduler_cold_path.cpp):
- sched and orch pool counts computed independently:
sched = orch_to_sched_ ? aicpu_thread_num_ : sched_thread_num_
orch = orch_to_sched_ ? aicpu_thread_num_ : 1
Python tools:
- swimlane_converter.py and sched_overhead_analysis.py read the
same field names; orch section's phase key now always equals
"orch_submit" (was already the only value). No tool changes
required.
Test plan:
- sim swimlane ST (test_l2_swimlane + test_l2_swimlane_mixed at
--enable-l2-swimlane level 4): pass
- sim DFX (scope_stats / tensor_dump / pmu / dep_gen): pass
- pre-commit clean
Stacked on top of hw-native-sys#941 (PhaseHeader merge).
The unified L2SwimlaneAicpuPhaseRecord (40B with a union and a phase_id
discriminator) and the parse-time kAicpuOrchPhaseIdBase = 16 range gate
were a vestige of when sched and orch records shared one path. Schema
splits them cleanly into two record types, two BufferKinds, and two
pool arrays — type-tagged at the device-side write, no parse-time
discriminator on the host side.
Schema (header):
- L2SwimlaneSchedPhaseKind { Complete=0, Dispatch=1 }
- L2SwimlaneAicpuSchedPhaseRecord (40B):
start_time, end_time, loop_iter, kind, tasks_processed (uint32),
pop_hit, pop_miss, pad
- L2SwimlaneAicpuOrchPhaseRecord (32B):
start_time, end_time, task_id, submit_idx, pad
- L2SwimlaneBufferKind:
AicpuTask=0, AicpuSchedPhase=1, AicpuOrchPhase=2, AicoreTask=3
(AicoreTask shifted from 2 to 3 to accommodate the split)
- L2SwimlaneDataHeader carries num_sched_phase_threads +
num_orch_phase_threads (replaces the single num_phase_threads).
- calc_perf_data_size_with_phases takes both counts; new
get_sched_phase_buffer_state / get_orch_phase_buffer_state helpers
replace get_phase_buffer_state.
Dropped (no compat layer):
- L2SwimlaneAicpuPhaseId enum, including legacy ids 2/3/16-24 that
had been documented as "host parser maps to unknown".
- L2SwimlaneAicpuPhaseRecord / L2SwimlaneAicpuPhaseBuffer.
- kAicpuOrchPhaseIdBase / is_scheduler_phase routing.
- The unused `records_per_thread` PhaseHeader field never had a
reader.
AICPU collector:
- s_aicpu_phase_pools[] split into s_sched_phase_pools[] +
s_orch_phase_pools[]; same for current-buffer caches.
- l2_swimlane_aicpu_init_phase now takes (worker_count,
num_sched_phase_threads, num_orch_phase_threads).
- record_phase split into record_sched_phase (with kind +
pop_hit/pop_miss named, not extras) and record_orch_phase
(task_id + submit_idx).
- switch_phase_buffer + acquire_phase_slot generalized into kind-
parameterized templates shared by both pool types.
- flush_phase_buffers drains both pool arrays for the thread.
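One way the kind-parameterized helpers could look, as a sketch only. `PhaseKind`, `PhaseTraits`, `PhaseBuffer`, and the trimmed record structs below are hypothetical stand-ins; the real collector's types, capacities, and buffer-switch logic are not shown in this message.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical, trimmed-down record types for illustration.
struct SchedRec { uint64_t start_time; uint64_t end_time; uint32_t kind; };
struct OrchRec  { uint64_t start_time; uint64_t end_time; uint64_t task_id; };

enum class PhaseKind : uint32_t { Sched = 1, Orch = 2 };

// Map each kind to its record type at compile time.
template <PhaseKind K> struct PhaseTraits;
template <> struct PhaseTraits<PhaseKind::Sched> { using Record = SchedRec; };
template <> struct PhaseTraits<PhaseKind::Orch>  { using Record = OrchRec; };

// Fixed-capacity buffer of records for one kind.
template <PhaseKind K, std::size_t Capacity>
struct PhaseBuffer {
    typename PhaseTraits<K>::Record records[Capacity];
    uint32_t count;
};

// Shared slot acquisition: returns the next free record, or nullptr when the
// buffer is full (where the real code would presumably switch buffers).
template <PhaseKind K, std::size_t Capacity>
typename PhaseTraits<K>::Record* acquire_phase_slot(PhaseBuffer<K, Capacity>& buf) {
    if (buf.count >= Capacity) {
        return nullptr;
    }
    return &buf.records[buf.count++];
}
```

The same template body serves both pool types, so the sched and orch paths differ only in the record they fill in.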
Host collector:
- collected_phase_records_ split into collected_sched_phase_records_
and collected_orch_phase_records_; total_phase_collected_
similarly split for clean reconcile per kind.
- copy_phase_buffer split into copy_sched_phase_buffer +
copy_orch_phase_buffer.
- resolve_entry and for_each_instance route on the four
BufferKinds; ProfBufferType mirrors.
- JSON emit unchanged on the wire: sched section still has
"phase"/"loop_iter"/"tasks_processed"/"pop_hit"/"pop_miss"; orch
section still has "phase"/"submit_idx"/"task_id"/timestamps
(phase string is now hard-coded "orch_submit" since the type tag
is the truth).
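To illustrate the hard-coded phase string on the orch side, here is a small sketch of an emitter. The keys "phase", "submit_idx", and "task_id" come from this message; the timestamp key names, the `OrchPhaseRecord` struct, and `emit_orch_phase_json` are assumptions for illustration, not the host collector's actual code.

```cpp
#include <cstdint>
#include <string>

// Hypothetical host-side view of one orch phase record.
struct OrchPhaseRecord {
    uint64_t start_time;
    uint64_t end_time;
    uint64_t task_id;
    uint32_t submit_idx;
};

// The record type already says "orch", so the phase string is a constant
// rather than something decoded from a phase_id.
std::string emit_orch_phase_json(const OrchPhaseRecord& r) {
    std::string out = "{\"phase\":\"orch_submit\"";
    out += ",\"submit_idx\":" + std::to_string(r.submit_idx);
    out += ",\"task_id\":" + std::to_string(r.task_id);
    out += ",\"start_time\":" + std::to_string(r.start_time);  // key name assumed
    out += ",\"end_time\":" + std::to_string(r.end_time);      // key name assumed
    out += "}";
    return out;
}
```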
Scheduler call sites:
- scheduler_dispatch.cpp's three record_phase sites convert to
record_sched_phase with L2SwimlaneSchedPhaseKind::Complete or
::Dispatch.
Orchestrator call site:
- pto_orchestrator.cpp CYCLE_COUNT_ORCH_SUBMIT_RECORD drops the
L2SwimlaneAicpuPhaseId::ORCH_SUBMIT argument; the weak fallback
signature is reduced to (uint64_t, uint64_t, uint64_t, uint32_t).
Header thread count (scheduler_cold_path.cpp):
- sched and orch pool counts computed independently:
sched = orch_to_sched_ ? aicpu_thread_num_ : sched_thread_num_
orch = orch_to_sched_ ? aicpu_thread_num_ : 1
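The two expressions above can be written as a small pure function, which makes the two configurations easy to check. `PhaseThreadCounts` and `compute_phase_thread_counts` are hypothetical names; only the two ternaries come from this message.

```cpp
#include <cstdint>

struct PhaseThreadCounts {
    uint32_t sched;
    uint32_t orch;
};

// When orchestration runs on the scheduler threads, both pools are sized to
// the full AICPU thread count; otherwise sched uses the scheduler thread
// count and orch has a single thread.
PhaseThreadCounts compute_phase_thread_counts(bool orch_to_sched,
                                              uint32_t aicpu_thread_num,
                                              uint32_t sched_thread_num) {
    PhaseThreadCounts counts;
    counts.sched = orch_to_sched ? aicpu_thread_num : sched_thread_num;
    counts.orch = orch_to_sched ? aicpu_thread_num : 1u;
    return counts;
}
```

For example, with 4 AICPU threads and 3 scheduler threads, the counts are (4, 4) when orch_to_sched is set and (3, 1) otherwise.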
Python tools:
- swimlane_converter.py and sched_overhead_analysis.py read the
same field names; orch section's phase key now always equals
"orch_submit" (was already the only value). No tool changes
required.
Test plan:
- sim swimlane ST (test_l2_swimlane + test_l2_swimlane_mixed at
--enable-l2-swimlane level 4): pass
- sim DFX (scope_stats / tensor_dump / pmu / dep_gen): pass
- pre-commit clean
Stacked on top of hw-native-sys#941 (PhaseHeader merge).
Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>
After #939 (pool unification), #941 (PhaseHeader merge), and #942 (split
sched/orch phase records), several comments and doc sections still
referenced the pre-split a2a3 layout. Audit and update:
a2a3 code/comments:
- platform_config.h: PROF_BUFFERS_PER_THREAD doc references both
SchedPhaseBuffer and OrchPhaseBuffer (was: single PhaseBuffer);
PROF_READYQUEUE_SIZE comment now says "four kinds"; formula bumped
by 2x on the per-thread term to cover both sched and orch pool
enqueues (matches host alloc which iterates both pool arrays).
- l2_swimlane_profiling.h header layout diagram: name the two split
phase-thread counts.
- l2_swimlane_collector_aicpu.cpp: cross-launch reset comment now
references s_sched_phase_pools / s_orch_phase_pools (was: single
s_aicpu_phase_pools) and record_sched_phase / record_orch_phase.
- scheduler_dispatch.cpp / aicpu_executor.cpp: comments reference the
split record types.
src/common/ shared comments (now mixed-arch):
- profiler_base.h / buffer_pool_manager.h: qualify
L2SwimlaneAicpuPhaseHeader::magic example as "on a5" since the
struct no longer exists on a2a3.
docs/dfx/l2-swimlane-profiling.md:
- §5.1: layout block + record list now distinguish a2a3 split shape
(SchedPhaseRecord 40B + OrchPhaseRecord 32B, two pool arrays) from
a5's still-unified shape (pending port).
- §5.2: a2a3 buffer-kind list updated to all four kinds (was: two);
ASCII data-flow diagram redrawn to show split phase records;
kBufferKinds = 4 in the L2SwimlaneModule trait description.
- §5.3 (a5): num_phase_threads / core_to_thread[] reference corrected
to live in L2SwimlaneAicpuPhaseHeader on a5 (was wrongly attributed
to L2SwimlaneDataHeader).
- §5.4: comparison table separates task record (identical) from phase
record (diverged); ready-queue and kBufferKinds rows call out the
a2a3=4 vs a5=2 split.
- §6: overhead description differentiates a2a3's per-emit SchedPhase +
per-submit OrchPhase from a5's unified PhaseRecord (was: "4 phases ×
40B per iteration", which described a removed shape).
- §8 FAQ: "phase records empty" entry gates a2a3 on
num_{sched,orch}_phase_threads, a5 on PhaseHeader::magic.
No semantic code changes except the READYQUEUE_SIZE formula bump (adds
~8KB to the header; necessary correctness fix given the second phase
pool).
Test plan:
- pre-commit clean
- onboard l2_swimlane STs (--enable-l2-swimlane --enable-dep-gen):
2 passed
- onboard paged_attention_unroll level 4: 1 passed
Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>
Summary
The standalone L2SwimlaneAicpuPhaseHeader was a vestige of when phase profiling was an add-on bolted onto the shared-memory layout. Phase metadata is now co-equal with task pool metadata, so the dedicated cache line + magic gate + indirection are pure overhead.
This PR moves the three live phase-header fields directly into L2SwimlaneDataHeader:
| Old (PhaseHeader) | New (DataHeader) | Notes |
|---|---|---|
| num_sched_threads | num_phase_threads | sched_thread_num_ or aicpu_thread_num_ depending on PTO2_ORCH_TO_SCHED |
| num_cores | num_phase_cores | distinct from num_cores (different semantics) |
| core_to_thread[PLATFORM_MAX_CORES] | core_to_thread[PLATFORM_MAX_CORES] | |
Dropped
- magic (L2_SWIMLANE_AICPU_PHASE_MAGIC): redundant; host now gates on num_phase_threads > 0 (zero-init means phase init never ran).
- records_per_thread: AICPU wrote it as PLATFORM_PHASE_RECORDS_PER_THREAD but no caller ever read it, so it was a dead field.
Shared-memory layout
get_phase_header() is deleted. get_phase_buffer_states() skips straight from the AicoreTaskPool array to the phase pools.
AICPU collector init-ran gate
Phase-gated AICPU paths previously checked s_l2_swimlane_aicpu_phase_header == nullptr. After the merge they check a new s_phase_initialized static bool, so the hot path doesn't re-read the device-shared header just to test init-ran.
Dependency on #939
This branch is stacked on top of refactor/swimlane-cache-line-blocks (#939). The PR's base is main (GitHub won't let us target a fork branch as base), so the current diff temporarily includes #939's 11 files on top of this PR's 5 files. After #939 merges, this PR's diff will auto-clean to just this PR's 5 files (≈75 added / 104 deleted). Review #939 first; only the 5 phase-header files are this PR's contribution:
- src/a2a3/platform/include/common/l2_swimlane_profiling.h
- src/a2a3/platform/include/aicpu/l2_swimlane_collector_aicpu.h
- src/a2a3/platform/include/host/l2_swimlane_collector.h
- src/a2a3/platform/src/aicpu/l2_swimlane_collector_aicpu.cpp
- src/a2a3/platform/src/host/l2_swimlane_collector.cpp
Test plan
- pytest tests/st/.../dfx/l2_swimlane --platform a2a3sim --enable-l2-swimlane
- pre-commit run clean
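The two init-ran gates described in this summary (host checks num_phase_threads > 0, AICPU checks a static bool) can be sketched as follows. This is an illustration under assumptions: `DataHeader`, the function names, and the explicit reset hook are hypothetical, and the reset is included only because a static flag would otherwise carry over from one launch to the next.

```cpp
#include <cstdint>

// Hypothetical slice of the merged header; zero-initialized by the host.
struct DataHeader {
    uint32_t num_phase_threads;
    uint32_t num_phase_cores;
};

// Host side: a zero thread count means phase init never ran.
bool host_phase_enabled(const DataHeader& h) {
    return h.num_phase_threads > 0;
}

// AICPU side: local flag so the hot path avoids re-reading shared memory.
static bool s_phase_initialized = false;

void aicpu_init_phase(DataHeader& h, uint32_t threads, uint32_t cores) {
    h.num_phase_threads = threads;
    h.num_phase_cores = cores;
    s_phase_initialized = true;
}

// Assumed per-launch reset so state does not leak across launches.
void aicpu_reset_phase(DataHeader& h) {
    h.num_phase_threads = 0;
    h.num_phase_cores = 0;
    s_phase_initialized = false;
}

// Returns true if a phase record would be written, false if gated out.
bool aicpu_try_record_phase() {
    if (!s_phase_initialized) {
        return false;
    }
    return true;  // real code would acquire a slot and fill the record here
}
```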