Refactor: split L2 swimlane phase records into sched + orch types#942
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis PR redesigns L2 swimlane profiling across the entire system, replacing a rotation-channel model with an active-head cache-line model and splitting unified phase profiling into separate scheduler and orchestrator streams. Changes span shared-memory contracts, AICore and AICPU device-side implementations, host collection routing, and all runtime call sites. ChangesL2 Swimlane Profiling Redesign: Rotation Channel to Active Head
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request restructures the L2 Swimlane profiling memory layout by splitting the unified phase profiling into separate, dedicated pools and record types for scheduler and orchestrator phases, while also renaming rotation-related structures to a unified active head concept. A critical issue was identified where the orchestrator phase pool offset is calculated using a dynamic runtime parameter instead of the static PLATFORM_MAX_AICPU_THREADS stride used by the host at allocation, which could lead to memory corruption or type confusion. Actionable suggestions have been provided to statically use the maximum thread count as the stride and update all corresponding call sites.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/a2a3/platform/include/common/l2_swimlane_profiling.h (1)
409-419: 💤 Low valueConsider adding alignment attribute to
L2SwimlaneAicpuSchedPhaseRecord.Unlike other record types in this file (e.g.,
L2SwimlaneAicpuTaskRecordwith__attribute__((aligned(64))),L2SwimlaneAicoreTaskRecordwith__attribute__((aligned(32)))), this 40-byte struct has no alignment attribute. While the 40B size isn't a power of two and won't naturally align to cache lines, the lack of explicit alignment may cause suboptimal memory access patterns when records are stored in arrays.If cache-line alignment is intentional (for consistency with the template buffer), consider adding
__attribute__((aligned(64)))or at minimum documenting why alignment is omitted here.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a2a3/platform/include/common/l2_swimlane_profiling.h` around lines 409 - 419, The L2SwimlaneAicpuSchedPhaseRecord struct lacks an explicit alignment like the other record types; add a matching alignment attribute (e.g., __attribute__((aligned(64)))) to struct L2SwimlaneAicpuSchedPhaseRecord and update the static_assert for sizeof(L2SwimlaneAicpuSchedPhaseRecord) to the new aligned size (64) so the layout check remains correct; alternatively, if alignment was intentionally omitted, add a comment explaining why alignment differs from L2SwimlaneAicpuTaskRecord and L2SwimlaneAicoreTaskRecord.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp`:
- Around line 855-857: The sched_phase_threads value can be zero when
sched_thread_num_ <= 0 which causes l2_swimlane_aicpu_init_phase to prime no
scheduler pools; change the computation to normalize sched_thread_num_ to the
active AICPU count when non-positive (i.e., use aicpu_thread_num_ if
sched_thread_num_ <= 0) before computing sched_phase_threads and calling
l2_swimlane_aicpu_init_phase so that scheduler phase pools are sized from the
normalized active scheduler thread count; update the calculation around
sched_phase_threads (and keep orch_phase_threads/orch_to_sched_ logic intact)
and ensure l2_swimlane_aicpu_record_sched_phase() will no longer be dropped.
---
Nitpick comments:
In `@src/a2a3/platform/include/common/l2_swimlane_profiling.h`:
- Around line 409-419: The L2SwimlaneAicpuSchedPhaseRecord struct lacks an
explicit alignment like the other record types; add a matching alignment
attribute (e.g., __attribute__((aligned(64)))) to struct
L2SwimlaneAicpuSchedPhaseRecord and update the static_assert for
sizeof(L2SwimlaneAicpuSchedPhaseRecord) to the new aligned size (64) so the
layout check remains correct; alternatively, if alignment was intentionally
omitted, add a comment explaining why alignment differs from
L2SwimlaneAicpuTaskRecord and L2SwimlaneAicoreTaskRecord.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 036d1da2-9e9b-44dc-b62f-81a61544172c
📒 Files selected for processing (14)
src/a2a3/platform/include/aicore/aicore_profiling_state.hsrc/a2a3/platform/include/aicore/l2_swimlane_collector_aicore.hsrc/a2a3/platform/include/aicpu/l2_swimlane_collector_aicpu.hsrc/a2a3/platform/include/common/l2_swimlane_profiling.hsrc/a2a3/platform/include/host/l2_swimlane_collector.hsrc/a2a3/platform/onboard/aicore/kernel.cppsrc/a2a3/platform/sim/aicore/kernel.cppsrc/a2a3/platform/src/aicpu/l2_swimlane_collector_aicpu.cppsrc/a2a3/platform/src/host/l2_swimlane_collector.cppsrc/a2a3/runtime/host_build_graph/aicore/aicore_executor.cppsrc/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cppsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cppsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cppsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp
c5c4d97 to
ffdd557
Compare
The unified L2SwimlaneAicpuPhaseRecord (40B with a union and a phase_id
discriminator) and the parse-time kAicpuOrchPhaseIdBase = 16 range gate
were a vestige of when sched and orch records shared one path. Schema
splits them cleanly into two record types, two BufferKinds, and two
pool arrays — type-tagged at the device-side write, no parse-time
discriminator on the host side.
Schema (header):
- L2SwimlaneSchedPhaseKind { Complete=0, Dispatch=1 }
- L2SwimlaneAicpuSchedPhaseRecord (40B):
start_time, end_time, loop_iter, kind, tasks_processed (uint32),
pop_hit, pop_miss, pad
- L2SwimlaneAicpuOrchPhaseRecord (32B):
start_time, end_time, task_id, submit_idx, pad
- L2SwimlaneBufferKind:
AicpuTask=0, AicpuSchedPhase=1, AicpuOrchPhase=2, AicoreTask=3
(AicoreTask shifted from 2 to 3 to accommodate the split)
- L2SwimlaneDataHeader carries num_sched_phase_threads +
num_orch_phase_threads (replaces the single num_phase_threads).
- calc_perf_data_size_with_phases takes both counts; new
get_sched_phase_buffer_state / get_orch_phase_buffer_state helpers
replace get_phase_buffer_state.
Dropped (no compat layer):
- L2SwimlaneAicpuPhaseId enum, including legacy ids 2/3/16-24 that
had been documented as "host parser maps to unknown".
- L2SwimlaneAicpuPhaseRecord / L2SwimlaneAicpuPhaseBuffer.
- kAicpuOrchPhaseIdBase / is_scheduler_phase routing.
- The unused `records_per_thread` PhaseHeader field never had a
reader.
AICPU collector:
- s_aicpu_phase_pools[] split into s_sched_phase_pools[] +
s_orch_phase_pools[]; same for current-buffer caches.
- l2_swimlane_aicpu_init_phase now takes (worker_count,
num_sched_phase_threads, num_orch_phase_threads).
- record_phase split into record_sched_phase (with kind +
pop_hit/pop_miss named, not extras) and record_orch_phase
(task_id + submit_idx).
- switch_phase_buffer + acquire_phase_slot generalized into kind-
parameterized templates shared by both pool types.
- flush_phase_buffers drains both pool arrays for the thread.
Host collector:
- collected_phase_records_ split into collected_sched_phase_records_
and collected_orch_phase_records_; total_phase_collected_
similarly split for clean reconcile per kind.
- copy_phase_buffer split into copy_sched_phase_buffer +
copy_orch_phase_buffer.
- resolve_entry and for_each_instance route on the four
BufferKinds; ProfBufferType mirrors.
- JSON emit unchanged on the wire: sched section still has
"phase"/"loop_iter"/"tasks_processed"/"pop_hit"/"pop_miss"; orch
section still has "phase"/"submit_idx"/"task_id"/timestamps
(phase string is now hard-coded "orch_submit" since the type tag
is the truth).
Scheduler call sites:
- scheduler_dispatch.cpp's three record_phase sites convert to
record_sched_phase with L2SwimlaneSchedPhaseKind::Complete or
::Dispatch.
Orchestrator call site:
- pto_orchestrator.cpp CYCLE_COUNT_ORCH_SUBMIT_RECORD drops the
L2SwimlaneAicpuPhaseId::ORCH_SUBMIT argument; the weak fallback
signature is reduced to (uint64_t, uint64_t, uint64_t, uint32_t).
Header thread count (scheduler_cold_path.cpp):
- sched and orch pool counts computed independently:
sched = orch_to_sched_ ? aicpu_thread_num_ : sched_thread_num_
orch = orch_to_sched_ ? aicpu_thread_num_ : 1
Python tools:
- swimlane_converter.py and sched_overhead_analysis.py read the
same field names; orch section's phase key now always equals
"orch_submit" (was already the only value). No tool changes
required.
Test plan:
- sim swimlane ST (test_l2_swimlane + test_l2_swimlane_mixed at
--enable-l2-swimlane level 4): pass
- sim DFX (scope_stats / tensor_dump / pmu / dep_gen): pass
- pre-commit clean
Stacked on top of hw-native-sys#941 (PhaseHeader merge).
ffdd557 to
16a36fa
Compare
After #939 (pool unification), #941 (PhaseHeader merge), and #942 (split sched/orch phase records), several comments and doc sections still referenced the pre-split a2a3 layout. Audit and update: a2a3 code/comments: - platform_config.h: PROF_BUFFERS_PER_THREAD doc references both SchedPhaseBuffer and OrchPhaseBuffer (was: single PhaseBuffer); PROF_READYQUEUE_SIZE comment now says "four kinds"; formula bumped by 2x on the per-thread term to cover both sched and orch pool enqueues (matches host alloc which iterates both pool arrays). - l2_swimlane_profiling.h header layout diagram: name the two split phase-thread counts. - l2_swimlane_collector_aicpu.cpp: cross-launch reset comment now references s_sched_phase_pools / s_orch_phase_pools (was: single s_aicpu_phase_pools) and record_sched_phase / record_orch_phase. - scheduler_dispatch.cpp / aicpu_executor.cpp: comments reference the split record types. src/common/ shared comments (now mixed-arch): - profiler_base.h / buffer_pool_manager.h: qualify L2SwimlaneAicpuPhaseHeader::magic example as "on a5" since the struct no longer exists on a2a3. docs/dfx/l2-swimlane-profiling.md: - §5.1: layout block + record list now distinguish a2a3 split shape (SchedPhaseRecord 40B + OrchPhaseRecord 32B, two pool arrays) from a5's still-unified shape (pending port). - §5.2: a2a3 buffer-kind list updated to all four kinds (was: two); ASCII data-flow diagram redrawn to show split phase records; kBufferKinds = 4 in the L2SwimlaneModule trait description. - §5.3 (a5): num_phase_threads / core_to_thread[] reference corrected to live in L2SwimlaneAicpuPhaseHeader on a5 (was wrongly attributed to L2SwimlaneDataHeader). - §5.4: comparison table separates task record (identical) from phase record (diverged); ready-queue and kBufferKinds rows call out the a2a3=4 vs a5=2 split. - §6: overhead description differentiates a2a3's per-emit SchedPhase + per-submit OrchPhase from a5's unified PhaseRecord (was: "4 phases × 40B per iteration", which described a removed shape). - §8 FAQ: "phase records empty" entry gates a2a3 on num_{sched,orch}_phase_threads, a5 on PhaseHeader::magic. No semantic code changes except the READYQUEUE_SIZE formula bump (adds ~8KB to the header; necessary correctness fix given the second phase pool). Test plan: - pre-commit clean - onboard l2_swimlane STs (--enable-l2-swimlane --enable-dep-gen): 2 passed - onboard paged_attention_unroll level 4: 1 passed Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>
Summary
The unified
L2SwimlaneAicpuPhaseRecord(40B with a union and aphase_iddiscriminator) and the parse-timekAicpuOrchPhaseIdBase = 16range gate were vestiges of when sched and orch records shared one path. This PR splits them cleanly into two record types, two BufferKinds, and two pool arrays — type-tagged at the device-side write, no parse-time discriminator on the host side.Schema
L2SwimlaneSchedPhaseKind(enum)Complete=0,Dispatch=1L2SwimlaneAicpuSchedPhaseRecordstart_time,end_time,loop_iter,kind,tasks_processed(uint32),pop_hit,pop_miss, padL2SwimlaneAicpuOrchPhaseRecordstart_time,end_time,task_id,submit_idx, padL2SwimlaneBufferKindis nowAicpuTask=0, AicpuSchedPhase=1, AicpuOrchPhase=2, AicoreTask=3(AicoreTask shifted from 2 to 3).L2SwimlaneDataHeadercarriesnum_sched_phase_threads + num_orch_phase_threads(replaces the singlenum_phase_threads).calc_perf_data_size_with_phasestakes both counts.Dropped (no compat layer)
L2SwimlaneAicpuPhaseIdenum (and legacy ids2/3/16-24that had been documented as "host parser maps to unknown")L2SwimlaneAicpuPhaseRecord/L2SwimlaneAicpuPhaseBufferkAicpuOrchPhaseIdBase/is_scheduler_phaseroutingrecords_per_threadPhaseHeader field (already dead — never had a reader)AICPU collector changes
s_aicpu_phase_pools[]→s_sched_phase_pools[]+s_orch_phase_pools[]l2_swimlane_aicpu_init_phase(worker_count, num_sched_phase_threads, num_orch_phase_threads)record_phasesplit into:record_sched_phase(thread_idx, kind, start, end, loop_iter, tasks_processed, pop_hit, pop_miss)— named extras, no unionrecord_orch_phase(start, end, task_id, submit_idx)— uses cacheds_orch_thread_idxswitch_phase_buffer+ newacquire_phase_slotgeneralized into kind-parameterized templates shared by both pool typesflush_phase_buffersdrains both pool arrays for the threadHost collector changes
collected_phase_records_split intocollected_sched_phase_records_+collected_orch_phase_records_;total_phase_collected_similarly split for clean per-kind reconcilecopy_phase_buffersplit intocopy_sched_phase_buffer+copy_orch_phase_bufferresolve_entryandfor_each_instanceroute on the four BufferKinds;ProfBufferTypemirrorsphase/loop_iter/tasks_processed/pop_hit/pop_miss. Orch section:phase/submit_idx/task_id/timestamps. The orchphasestring is now hard-coded"orch_submit"since the device type tag is the truth.Caller migrations
record_sched_phasewithL2SwimlaneSchedPhaseKind::Completeor::Dispatch.CYCLE_COUNT_ORCH_SUBMIT_RECORD: drops thephase_idargument; weak fallback signature reduced.sched = orch_to_sched_ ? aicpu : sched,orch = orch_to_sched_ ? aicpu : 1.Python tools
swimlane_converter.pyandsched_overhead_analysis.pyread the same field names; the orchphasekey always equalled"orch_submit"already, so no tool changes needed.Dependency
Stacked on top of
refactor/swimlane-merge-phase-header(#941). The PR base ismain(GitHub won't let us target a fork branch), so the diff temporarily includes both #939 and #941's changes plus this PR's 8 files. After #941 lands, diff auto-cleans to D-only.Test plan
test_l2_swimlane+test_l2_swimlane_mixedat--enable-l2-swimlanelevel 4): passpre-commit runclean