Refactor: split L2 swimlane phase records into sched + orch types #942

Merged
ChaoWao merged 1 commit into hw-native-sys:main from hw-native-sys-bot:refactor/swimlane-split-phase-records
May 31, 2026
Conversation

@hw-native-sys-bot
Collaborator

Summary

The unified L2SwimlaneAicpuPhaseRecord (40B with a union and a phase_id discriminator) and the parse-time kAicpuOrchPhaseIdBase = 16 range gate were vestiges of when sched and orch records shared one path. This PR splits them cleanly into two record types, two BufferKinds, and two pool arrays — type-tagged at the device-side write, no parse-time discriminator on the host side.

Schema

| Type | Size | Fields |
| --- | --- | --- |
| L2SwimlaneSchedPhaseKind (enum) | uint32 | Complete=0, Dispatch=1 |
| L2SwimlaneAicpuSchedPhaseRecord | 40B | start_time, end_time, loop_iter, kind, tasks_processed (uint32), pop_hit, pop_miss, pad |
| L2SwimlaneAicpuOrchPhaseRecord | 32B | start_time, end_time, task_id, submit_idx, pad |
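
As a rough illustration of the schema above, the two record types could be laid out as follows. This is only a sketch: the field order comes from the table, but every field width other than `kind` and `tasks_processed` (both stated as uint32) is an assumption chosen so the structs reach the stated 40B and 32B sizes. The 64-bit `task_id` is inferred from the weak fallback signature quoted in the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scheduler phase discriminator; values taken from the schema table.
enum class L2SwimlaneSchedPhaseKind : uint32_t { Complete = 0, Dispatch = 1 };

// 40-byte scheduler record. Widths of the timestamps, loop_iter,
// pop_hit, pop_miss and pad are assumptions, not taken from the header.
struct L2SwimlaneAicpuSchedPhaseRecord {
    uint64_t start_time;
    uint64_t end_time;
    uint32_t loop_iter;
    L2SwimlaneSchedPhaseKind kind;
    uint32_t tasks_processed;
    uint32_t pop_hit;
    uint32_t pop_miss;
    uint32_t pad;
};

// 32-byte orchestrator record. No discriminator field is needed because
// the record type itself identifies the phase.
struct L2SwimlaneAicpuOrchPhaseRecord {
    uint64_t start_time;
    uint64_t end_time;
    uint64_t task_id;     // assumed 64-bit
    uint32_t submit_idx;  // assumed 32-bit
    uint32_t pad;
};

// Layout checks matching the sizes stated in the schema table.
static_assert(sizeof(L2SwimlaneAicpuSchedPhaseRecord) == 40, "sched record must be 40B");
static_assert(sizeof(L2SwimlaneAicpuOrchPhaseRecord) == 32, "orch record must be 32B");
```

With these assumed widths neither struct needs a union, which is the point of the split: each record carries only the fields its producer actually writes.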

L2SwimlaneBufferKind is now AicpuTask=0, AicpuSchedPhase=1, AicpuOrchPhase=2, AicoreTask=3 (AicoreTask shifted from 2 to 3).
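
The renumbering matters for any consumer that hard-coded `2` for AICore task buffers, since that value now means an orchestrator phase buffer. A minimal sketch of the enum, assuming a `uint32_t` underlying type; `is_phase_kind` is a hypothetical helper, not something the PR defines:

```cpp
#include <cassert>
#include <cstdint>

// Enumerator values as listed in the PR description; the underlying
// type is an assumption.
enum class L2SwimlaneBufferKind : uint32_t {
    AicpuTask = 0,
    AicpuSchedPhase = 1,
    AicpuOrchPhase = 2,
    AicoreTask = 3,  // was 2 before the split
};

// Hypothetical helper: phase buffers are indexed per thread, task
// buffers per core, so routing code has to tell the two groups apart.
constexpr bool is_phase_kind(L2SwimlaneBufferKind kind) {
    return kind == L2SwimlaneBufferKind::AicpuSchedPhase ||
           kind == L2SwimlaneBufferKind::AicpuOrchPhase;
}

static_assert(is_phase_kind(L2SwimlaneBufferKind::AicpuOrchPhase), "orch is a phase kind");
static_assert(!is_phase_kind(L2SwimlaneBufferKind::AicoreTask), "task kinds are not phases");
```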

L2SwimlaneDataHeader carries num_sched_phase_threads + num_orch_phase_threads (replaces the single num_phase_threads). calc_perf_data_size_with_phases takes both counts.

Dropped (no compat layer)

  • L2SwimlaneAicpuPhaseId enum (and legacy ids 2/3/16-24 that had been documented as "host parser maps to unknown")
  • L2SwimlaneAicpuPhaseRecord / L2SwimlaneAicpuPhaseBuffer
  • kAicpuOrchPhaseIdBase / is_scheduler_phase routing
  • records_per_thread PhaseHeader field (already dead — never had a reader)

AICPU collector changes

  • s_aicpu_phase_pools[] → s_sched_phase_pools[] + s_orch_phase_pools[]
  • l2_swimlane_aicpu_init_phase(worker_count, num_sched_phase_threads, num_orch_phase_threads)
  • record_phase split into:
    • record_sched_phase(thread_idx, kind, start, end, loop_iter, tasks_processed, pop_hit, pop_miss) — named extras, no union
    • record_orch_phase(start, end, task_id, submit_idx) — uses cached s_orch_thread_idx
  • switch_phase_buffer + new acquire_phase_slot generalized into kind-parameterized templates shared by both pool types
  • flush_phase_buffers drains both pool arrays for the thread
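
The "kind-parameterized templates shared by both pool types" idea can be sketched as one slot-acquisition template instantiated for each record type. Everything below is a simplified model under stated assumptions: `PhaseBuffer`, the shortened type names, and the drop-on-full policy are hypothetical, and the functions take a buffer reference where the real ones take a thread index and rotate buffers through pools.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

enum class SchedPhaseKind : uint32_t { Complete = 0, Dispatch = 1 };

struct SchedPhaseRecord {
    uint64_t start;
    uint64_t end;
    uint32_t loop_iter;
    SchedPhaseKind kind;
    uint32_t tasks_processed;
    uint32_t pop_hit;
    uint32_t pop_miss;
    uint32_t pad;
};

struct OrchPhaseRecord {
    uint64_t start;
    uint64_t end;
    uint64_t task_id;
    uint32_t submit_idx;
    uint32_t pad;
};

// Hypothetical fixed-capacity buffer standing in for one pool entry.
template <typename Record, std::size_t Capacity>
struct PhaseBuffer {
    std::array<Record, Capacity> records{};
    std::size_t count = 0;
    std::size_t dropped = 0;
};

// One acquisition routine shared by both record types. A full buffer
// counts a drop here; the real code would switch buffers instead.
template <typename Record, std::size_t Capacity>
Record* acquire_phase_slot(PhaseBuffer<Record, Capacity>& buf) {
    if (buf.count == Capacity) {
        ++buf.dropped;
        return nullptr;
    }
    return &buf.records[buf.count++];
}

// Named parameters replace the old union of "extras".
template <std::size_t Capacity>
bool record_sched_phase(PhaseBuffer<SchedPhaseRecord, Capacity>& buf, SchedPhaseKind kind,
                        uint64_t start, uint64_t end, uint32_t loop_iter,
                        uint32_t tasks_processed, uint32_t pop_hit, uint32_t pop_miss) {
    SchedPhaseRecord* slot = acquire_phase_slot(buf);
    if (slot == nullptr) return false;
    *slot = SchedPhaseRecord{start, end, loop_iter, kind, tasks_processed, pop_hit, pop_miss, 0};
    return true;
}

template <std::size_t Capacity>
bool record_orch_phase(PhaseBuffer<OrchPhaseRecord, Capacity>& buf, uint64_t start,
                       uint64_t end, uint64_t task_id, uint32_t submit_idx) {
    OrchPhaseRecord* slot = acquire_phase_slot(buf);
    if (slot == nullptr) return false;
    *slot = OrchPhaseRecord{start, end, task_id, submit_idx, 0};
    return true;
}
```

Because the record type is a template parameter, a sched record cannot be written into an orch buffer by mistake; the compiler rejects the call rather than relying on a runtime phase_id check.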

Host collector changes

  • collected_phase_records_ split into collected_sched_phase_records_ + collected_orch_phase_records_; total_phase_collected_ similarly split for clean per-kind reconcile
  • copy_phase_buffer split into copy_sched_phase_buffer + copy_orch_phase_buffer
  • resolve_entry and for_each_instance route on the four BufferKinds; ProfBufferType mirrors
  • JSON emit: same field names on the wire. Sched section: phase/loop_iter/tasks_processed/pop_hit/pop_miss. Orch section: phase/submit_idx/task_id/timestamps. The orch phase string is now hard-coded "orch_submit" since the device type tag is the truth.
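
A minimal sketch of how the host-side phase strings could be derived once the record type is the source of truth. The strings "complete", "dispatch", and "orch_submit" come from this PR's description and review walkthrough; the function names, the "unknown" fallback, and the tiny emitter are hypothetical. Timestamp keys are left out because their wire names are not given here.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

enum class L2SwimlaneSchedPhaseKind : uint32_t { Complete = 0, Dispatch = 1 };

// Scheduler phase name derived from the record's kind field.
inline const char* sched_phase_name(L2SwimlaneSchedPhaseKind kind) {
    switch (kind) {
        case L2SwimlaneSchedPhaseKind::Complete: return "complete";
        case L2SwimlaneSchedPhaseKind::Dispatch: return "dispatch";
    }
    return "unknown";  // assumed fallback for out-of-range values
}

// Orchestrator records have a single phase, so the name is a constant.
inline const char* orch_phase_name() { return "orch_submit"; }

// Hypothetical emitter covering only the keys named in the PR text.
inline std::string emit_orch_phase_json(uint64_t task_id, uint32_t submit_idx) {
    return std::string("{\"phase\":\"") + orch_phase_name() +
           "\",\"submit_idx\":" + std::to_string(submit_idx) +
           ",\"task_id\":" + std::to_string(task_id) + "}";
}
```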

Caller migrations

  • scheduler_dispatch.cpp (3 sites): convert to record_sched_phase with L2SwimlaneSchedPhaseKind::Complete or ::Dispatch.
  • pto_orchestrator.cpp CYCLE_COUNT_ORCH_SUBMIT_RECORD: drops the phase_id argument; weak fallback signature reduced.
  • scheduler_cold_path.cpp: sched and orch pool counts computed independently — sched = orch_to_sched_ ? aicpu : sched, orch = orch_to_sched_ ? aicpu : 1.
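
The pool-count rule in the last bullet can be written out as a small pure function. This is a sketch only: `PhaseThreadCounts` and `phase_thread_counts` are hypothetical names, and the signed `int` parameters are an assumption based on the member names in the commit message.

```cpp
#include <cassert>

// Hypothetical result type holding the two independently computed counts.
struct PhaseThreadCounts {
    int sched;
    int orch;
};

// sched = orch_to_sched ? aicpu : sched
// orch  = orch_to_sched ? aicpu : 1
constexpr PhaseThreadCounts phase_thread_counts(bool orch_to_sched, int aicpu_thread_num,
                                                int sched_thread_num) {
    return orch_to_sched ? PhaseThreadCounts{aicpu_thread_num, aicpu_thread_num}
                         : PhaseThreadCounts{sched_thread_num, 1};
}

static_assert(phase_thread_counts(true, 4, 3).orch == 4, "orch follows aicpu count");
static_assert(phase_thread_counts(false, 4, 3).orch == 1, "single orch thread otherwise");
```

Note that with `orch_to_sched` false and a non-positive `sched_thread_num`, this rule yields zero scheduler pools, which is the case the CodeRabbit inline comment later in this thread raises.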

Python tools

swimlane_converter.py and sched_overhead_analysis.py read the same field names; the orch phase key always equalled "orch_submit" already, so no tool changes needed.

Dependency

Stacked on top of refactor/swimlane-merge-phase-header (#941). The PR base is main (GitHub won't let us target a fork branch), so the diff temporarily includes both #939 and #941's changes plus this PR's 8 files. After #941 lands, diff auto-cleans to D-only.

Test plan

  • sim swimlane ST (test_l2_swimlane + test_l2_swimlane_mixed at --enable-l2-swimlane level 4): pass
  • sim DFX (scope_stats / tensor_dump / pmu / dep_gen): pass
  • pre-commit run clean
  • CI green (onboard + sim, a2a3)
  • a5 port: not in this PR (a5 has its own phase layout — separate sweep)

@coderabbitai

coderabbitai Bot commented May 31, 2026

Review Change Stack

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: dec39909-88a8-46cc-8fcf-32a63bcadbb6

📝 Walkthrough


This PR redesigns L2 swimlane profiling across the entire system, replacing a rotation-channel model with an active-head cache-line model and splitting unified phase profiling into separate scheduler and orchestrator streams. Changes span shared-memory contracts, AICore and AICPU device-side implementations, host collection routing, and all runtime call sites.

Changes

L2 Swimlane Profiling Redesign: Rotation Channel to Active Head

| Layer | File(s) | Summary |
| --- | --- | --- |
| Shared data contracts and memory layout | src/a2a3/platform/include/common/l2_swimlane_profiling.h | Introduces L2SwimlaneActiveHead cache-line struct holding current buffer pointer/sequence and accounting; restructures L2SwimlaneAicpuTaskPool and L2SwimlaneAicoreTaskPool around head+free_queue; adds separate L2SwimlaneAicpuSchedPhaseRecord and L2SwimlaneAicpuOrchPhaseRecord types; updates L2SwimlaneBufferKind to split phase into sched/orch kinds; adjusts layout sizing and accessor helpers. |
| AICore-side profiling accessor API | src/a2a3/platform/include/aicore/aicore_profiling_state.h, l2_swimlane_collector_aicore.h, src/a2a3/platform/onboard/aicore/kernel.cpp, src/a2a3/platform/sim/aicore/kernel.cpp | Replaces set/get_l2_swimlane_aicore_rotation*() with set/get_l2_swimlane_aicore_head_slot/head(); updates L2SwimlaneAicoreLocalState to cache current_buf_seq instead of generation; changes l2_swimlane_aicore_record_task() signature to accept L2SwimlaneActiveHead*; adds lazy-resolution caching in kernel implementations with TLS state. |
| AICPU phase profiling API refactor | src/a2a3/platform/include/aicpu/l2_swimlane_collector_aicpu.h | Splits l2_swimlane_aicpu_record_phase() into l2_swimlane_aicpu_record_sched_phase() and l2_swimlane_aicpu_record_orch_phase() with distinct signatures; updates l2_swimlane_aicpu_init_phase() to accept separate num_sched_phase_threads and num_orch_phase_threads instead of single thread count. |
| Host-side buffer routing infrastructure | src/a2a3/platform/include/host/l2_swimlane_collector.h | Expands ProfBufferType enum to 4 kinds, separating phase into sched and orch; updates L2SwimlaneModule::resolve_entry() to validate and route buffers by kind with thread-based indexing for phases and core-based for tasks; updates for_each_instance() to enumerate sched and orch phase pools separately. |
| AICPU device-side initialization and buffer rotation | src/a2a3/platform/src/aicpu/l2_swimlane_collector_aicpu.cpp | Refactors buffer management to use head channel: derives device head-address table from rotation-table base, writes per-core head addresses instead of rotation addresses, initializes and updates state->head.current_buf_ptr/seq during buffer switches with memory barriers, and updates accounting under head.dropped_record_count and head.total_record_count. |
| Phase profiling implementation: sched/orch separation | src/a2a3/platform/src/aicpu/l2_swimlane_collector_aicpu.cpp | Major refactor replacing single phase pool with separate per-thread scheduler and orchestrator pools; introduces template-based pool priming and slot acquisition helpers; splits recording into l2_swimlane_aicpu_record_sched_phase() and l2_swimlane_aicpu_record_orch_phase() with separate per-thread state and buffer caches. |
| Host-side phase collection, accounting, and JSON export | src/a2a3/platform/src/host/l2_swimlane_collector.cpp | Splits collected_phase_records_ into collected_sched_phase_records_ and collected_orch_phase_records_; adds distinct copy_sched_phase_buffer() and copy_orch_phase_buffer() drain handlers; updates reconcile_counters() to handle sched and orch kinds separately using head fields; rewrites phase JSON emission with kind-to-string mapping for "complete"/"dispatch" scheduler phases and "orch_submit" orchestrator phases. |
| Runtime executor integration | src/a2a3/runtime/host_build_graph/aicore/aicore_executor.cpp, src/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp | Updates lazy-resolution to call get_l2_swimlane_aicore_head() instead of rotation variant; initializes cached_buf_seq to UINT32_MAX to force first-task cache miss; passes l2_swimlane_head pointer to l2_swimlane_aicore_record_task(). |
| Scheduler and orchestrator phase recording updates | src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp, scheduler_cold_path.cpp, src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp | Updates scheduler profiling calls to emit L2SwimlaneSchedPhaseKind::{Complete,Dispatch} via l2_swimlane_aicpu_record_sched_phase(); updates phase pool init to compute separate sched and orch thread counts; updates orchestrator submit recording to call new l2_swimlane_aicpu_record_orch_phase() signature without phase-id parameter. |
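
The "initializes cached_buf_seq to UINT32_MAX to force first-task cache miss" detail from the walkthrough can be illustrated with a small model. This is a sketch under assumptions: the field names follow the walkthrough, but `ActiveHead`, `LocalState`, `resolve_buffer`, the field widths, and the ordering are hypothetical, and the memory barriers and cross-core visibility handling the real code needs are omitted.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the cache-line head published by the AICPU side.
struct alignas(64) ActiveHead {
    uint64_t current_buf_ptr;
    uint32_t current_buf_seq;
    uint32_t dropped_record_count;
    uint32_t total_record_count;
};

static_assert(sizeof(ActiveHead) == 64, "head is padded to one 64B cache line");

// Per-core cache of the last buffer seen by the AICore side.
struct LocalState {
    uint64_t cached_buf_ptr = 0;
    uint32_t cached_buf_seq = UINT32_MAX;  // sentinel: the first lookup always misses
    uint32_t refreshes = 0;                // counts cache misses, for illustration only
};

// Re-read the buffer pointer only when the sequence number has changed.
inline uint64_t resolve_buffer(const ActiveHead& head, LocalState& local) {
    if (local.cached_buf_seq != head.current_buf_seq) {
        local.cached_buf_ptr = head.current_buf_ptr;
        local.cached_buf_seq = head.current_buf_seq;
        ++local.refreshes;
    }
    return local.cached_buf_ptr;
}
```

In this model the writer bumps `current_buf_seq` on every buffer switch, so the reader only refreshes its cached pointer when the sequence actually moves.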

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • hw-native-sys/simpler#916: Continues the same rotation-to-head channel refactor for L2 swimlane profiling, introducing the initial L2SwimlaneActiveHead structure and API changes that this PR builds upon.

  • hw-native-sys/simpler#932: Related through L2 swimlane symbol binding in the sim DeviceRunner; coordinates dynamic loading of the renamed AICore profiling accessors (set/get_l2_swimlane_aicore_head_slot/head).

Poem

A swimmer glides through new channels wide,
Head-first now, with sequence as guide,
Phases split like left and right,
Orchestrated, scheduled—profiled just right. 🏊‍♂️✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.87% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The PR title clearly and concisely summarizes the main refactoring: splitting L2 swimlane phase records into separate scheduler and orchestrator types. |
| Description check | ✅ Passed | The PR description is comprehensive and directly related to the changeset, detailing the schema changes, dropped items, code restructuring, and migration paths for all affected components. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.




@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request restructures the L2 Swimlane profiling memory layout by splitting the unified phase profiling into separate, dedicated pools and record types for scheduler and orchestrator phases, while also renaming rotation-related structures to a unified active head concept. A critical issue was identified where the orchestrator phase pool offset is calculated using a dynamic runtime parameter instead of the static PLATFORM_MAX_AICPU_THREADS stride used by the host at allocation, which could lead to memory corruption or type confusion. Actionable suggestions have been provided to statically use the maximum thread count as the stride and update all corresponding call sites.
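
To make the stride concern concrete, here is a small sketch of how a dynamic base and a static base can disagree. It assumes the sched pool array precedes the orch pool array in one contiguous region; that layout, the constant values, and the function names are all hypothetical. `kMaxAicpuThreads` stands in for PLATFORM_MAX_AICPU_THREADS, whose real value is not given here.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kMaxAicpuThreads = 8;  // assumed stand-in value
constexpr std::size_t kPoolBytes = 256;      // hypothetical per-thread pool size

// Host-style layout: the sched region always spans the maximum thread
// count, so the orch region starts at a fixed offset.
constexpr std::size_t orch_pool_offset_static(std::size_t orch_thread) {
    return (kMaxAicpuThreads + orch_thread) * kPoolBytes;
}

// Flagged pattern: the orch base depends on the runtime sched thread
// count, so it only matches the host when that count equals the maximum.
constexpr std::size_t orch_pool_offset_dynamic(std::size_t num_sched_threads,
                                               std::size_t orch_thread) {
    return (num_sched_threads + orch_thread) * kPoolBytes;
}

static_assert(orch_pool_offset_dynamic(kMaxAicpuThreads, 0) == orch_pool_offset_static(0),
              "the two computations agree only at the maximum thread count");
```

With three active scheduler threads, the dynamic form places orch thread 0 at byte 768, inside the region this layout reserves for sched pools (bytes 0 to 2047). That overlap is the kind of type confusion the review describes.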

Comment thread src/a2a3/platform/include/common/l2_swimlane_profiling.h
Comment thread src/a2a3/platform/include/host/l2_swimlane_collector.h
Comment thread src/a2a3/platform/include/host/l2_swimlane_collector.h Outdated
Comment thread src/a2a3/platform/src/aicpu/l2_swimlane_collector_aicpu.cpp
Comment thread src/a2a3/platform/src/host/l2_swimlane_collector.cpp
Comment thread src/a2a3/platform/src/host/l2_swimlane_collector.cpp
Comment thread src/a2a3/platform/src/host/l2_swimlane_collector.cpp

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/a2a3/platform/include/common/l2_swimlane_profiling.h (1)

409-419: 💤 Low value

Consider adding alignment attribute to L2SwimlaneAicpuSchedPhaseRecord.

Unlike other record types in this file (e.g., L2SwimlaneAicpuTaskRecord with __attribute__((aligned(64))), L2SwimlaneAicoreTaskRecord with __attribute__((aligned(32)))), this 40-byte struct has no alignment attribute. While the 40B size isn't a power of two and won't naturally align to cache lines, the lack of explicit alignment may cause suboptimal memory access patterns when records are stored in arrays.

If cache-line alignment is intentional (for consistency with the template buffer), consider adding __attribute__((aligned(64))) or at minimum documenting why alignment is omitted here.
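
The size cost of the suggested attribute is easy to check. The sketch below uses the standard `alignas(64)` in place of `__attribute__((aligned(64)))` and hypothetical struct names; field widths are assumed as elsewhere in this thread, and the 4096-byte buffer is an arbitrary figure for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// 40-byte layout without an explicit alignment attribute.
struct SchedPhaseRecordPlain {
    uint64_t start_time;
    uint64_t end_time;
    uint32_t loop_iter;
    uint32_t kind;
    uint32_t tasks_processed;
    uint32_t pop_hit;
    uint32_t pop_miss;
    uint32_t pad;
};

// Same fields with cache-line alignment: sizeof is rounded up to 64.
struct alignas(64) SchedPhaseRecordAligned {
    uint64_t start_time;
    uint64_t end_time;
    uint32_t loop_iter;
    uint32_t kind;
    uint32_t tasks_processed;
    uint32_t pop_hit;
    uint32_t pop_miss;
    uint32_t pad;
};

static_assert(sizeof(SchedPhaseRecordPlain) == 40, "unaligned record stays 40B");
static_assert(sizeof(SchedPhaseRecordAligned) == 64, "aligned record grows to 64B");

// Records that fit in a hypothetical 4096-byte buffer under each layout.
constexpr std::size_t kBufferBytes = 4096;
constexpr std::size_t kPlainPerBuffer = kBufferBytes / sizeof(SchedPhaseRecordPlain);
constexpr std::size_t kAlignedPerBuffer = kBufferBytes / sizeof(SchedPhaseRecordAligned);
```

Under these assumptions alignment trades about 37% of buffer capacity (102 records down to 64) for records that never straddle a cache line, which is why the existing 40B `static_assert` would have to change if the attribute were added.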

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a2a3/platform/include/common/l2_swimlane_profiling.h` around lines 409 -
419, The L2SwimlaneAicpuSchedPhaseRecord struct lacks an explicit alignment like
the other record types; add a matching alignment attribute (e.g.,
__attribute__((aligned(64)))) to struct L2SwimlaneAicpuSchedPhaseRecord and
update the static_assert for sizeof(L2SwimlaneAicpuSchedPhaseRecord) to the new
aligned size (64) so the layout check remains correct; alternatively, if
alignment was intentionally omitted, add a comment explaining why alignment
differs from L2SwimlaneAicpuTaskRecord and L2SwimlaneAicoreTaskRecord.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp`:
- Around line 855-857: The sched_phase_threads value can be zero when
sched_thread_num_ <= 0 which causes l2_swimlane_aicpu_init_phase to prime no
scheduler pools; change the computation to normalize sched_thread_num_ to the
active AICPU count when non-positive (i.e., use aicpu_thread_num_ if
sched_thread_num_ <= 0) before computing sched_phase_threads and calling
l2_swimlane_aicpu_init_phase so that scheduler phase pools are sized from the
normalized active scheduler thread count; update the calculation around
sched_phase_threads (and keep orch_phase_threads/orch_to_sched_ logic intact)
and ensure l2_swimlane_aicpu_record_sched_phase() will no longer be dropped.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 036d1da2-9e9b-44dc-b62f-81a61544172c

📥 Commits

Reviewing files that changed from the base of the PR and between a536a2a and 984cfad.

📒 Files selected for processing (14)
  • src/a2a3/platform/include/aicore/aicore_profiling_state.h
  • src/a2a3/platform/include/aicore/l2_swimlane_collector_aicore.h
  • src/a2a3/platform/include/aicpu/l2_swimlane_collector_aicpu.h
  • src/a2a3/platform/include/common/l2_swimlane_profiling.h
  • src/a2a3/platform/include/host/l2_swimlane_collector.h
  • src/a2a3/platform/onboard/aicore/kernel.cpp
  • src/a2a3/platform/sim/aicore/kernel.cpp
  • src/a2a3/platform/src/aicpu/l2_swimlane_collector_aicpu.cpp
  • src/a2a3/platform/src/host/l2_swimlane_collector.cpp
  • src/a2a3/runtime/host_build_graph/aicore/aicore_executor.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp

@ChaoWao ChaoWao force-pushed the refactor/swimlane-split-phase-records branch 8 times, most recently from c5c4d97 to ffdd557 Compare May 31, 2026 10:50
The unified L2SwimlaneAicpuPhaseRecord (40B with a union and a phase_id
discriminator) and the parse-time kAicpuOrchPhaseIdBase = 16 range gate
were vestiges of when sched and orch records shared one path. This
commit splits them cleanly into two record types, two BufferKinds, and
two pool arrays: records are type-tagged at the device-side write, with
no parse-time discriminator on the host side.

Schema (header):
  - L2SwimlaneSchedPhaseKind { Complete=0, Dispatch=1 }
  - L2SwimlaneAicpuSchedPhaseRecord (40B):
      start_time, end_time, loop_iter, kind, tasks_processed (uint32),
      pop_hit, pop_miss, pad
  - L2SwimlaneAicpuOrchPhaseRecord (32B):
      start_time, end_time, task_id, submit_idx, pad
  - L2SwimlaneBufferKind:
      AicpuTask=0, AicpuSchedPhase=1, AicpuOrchPhase=2, AicoreTask=3
      (AicoreTask shifted from 2 to 3 to accommodate the split)
  - L2SwimlaneDataHeader carries num_sched_phase_threads +
    num_orch_phase_threads (replaces the single num_phase_threads).
  - calc_perf_data_size_with_phases takes both counts; new
    get_sched_phase_buffer_state / get_orch_phase_buffer_state helpers
    replace get_phase_buffer_state.

Dropped (no compat layer):
  - L2SwimlaneAicpuPhaseId enum, including legacy ids 2/3/16-24 that
    had been documented as "host parser maps to unknown".
  - L2SwimlaneAicpuPhaseRecord / L2SwimlaneAicpuPhaseBuffer.
  - kAicpuOrchPhaseIdBase / is_scheduler_phase routing.
  - `records_per_thread` PhaseHeader field (already dead; it never
    had a reader).

AICPU collector:
  - s_aicpu_phase_pools[] split into s_sched_phase_pools[] +
    s_orch_phase_pools[]; same for current-buffer caches.
  - l2_swimlane_aicpu_init_phase now takes (worker_count,
    num_sched_phase_threads, num_orch_phase_threads).
  - record_phase split into record_sched_phase (with kind +
    pop_hit/pop_miss named, not extras) and record_orch_phase
    (task_id + submit_idx).
  - switch_phase_buffer + acquire_phase_slot generalized into kind-
    parameterized templates shared by both pool types.
  - flush_phase_buffers drains both pool arrays for the thread.

Host collector:
  - collected_phase_records_ split into collected_sched_phase_records_
    and collected_orch_phase_records_; total_phase_collected_
    similarly split for clean reconcile per kind.
  - copy_phase_buffer split into copy_sched_phase_buffer +
    copy_orch_phase_buffer.
  - resolve_entry and for_each_instance route on the four
    BufferKinds; ProfBufferType mirrors.
  - JSON emit unchanged on the wire: sched section still has
    "phase"/"loop_iter"/"tasks_processed"/"pop_hit"/"pop_miss"; orch
    section still has "phase"/"submit_idx"/"task_id"/timestamps
    (phase string is now hard-coded "orch_submit" since the type tag
    is the truth).

Scheduler call sites:
  - scheduler_dispatch.cpp's three record_phase sites convert to
    record_sched_phase with L2SwimlaneSchedPhaseKind::Complete or
    ::Dispatch.

Orchestrator call site:
  - pto_orchestrator.cpp CYCLE_COUNT_ORCH_SUBMIT_RECORD drops the
    L2SwimlaneAicpuPhaseId::ORCH_SUBMIT argument; the weak fallback
    signature is reduced to (uint64_t, uint64_t, uint64_t, uint32_t).

Header thread count (scheduler_cold_path.cpp):
  - sched and orch pool counts computed independently:
      sched = orch_to_sched_ ? aicpu_thread_num_ : sched_thread_num_
      orch  = orch_to_sched_ ? aicpu_thread_num_ : 1

Python tools:
  - swimlane_converter.py and sched_overhead_analysis.py read the
    same field names; orch section's phase key now always equals
    "orch_submit" (was already the only value). No tool changes
    required.

Test plan:
  - sim swimlane ST (test_l2_swimlane + test_l2_swimlane_mixed at
    --enable-l2-swimlane level 4): pass
  - sim DFX (scope_stats / tensor_dump / pmu / dep_gen): pass
  - pre-commit clean

Stacked on top of hw-native-sys#941 (PhaseHeader merge).
@ChaoWao ChaoWao force-pushed the refactor/swimlane-split-phase-records branch from ffdd557 to 16a36fa Compare May 31, 2026 11:24
@ChaoWao ChaoWao merged commit c4d1005 into hw-native-sys:main May 31, 2026
15 checks passed
@ChaoWao ChaoWao deleted the refactor/swimlane-split-phase-records branch May 31, 2026 11:47
ChaoWao added a commit that referenced this pull request May 31, 2026
After #939 (pool unification), #941 (PhaseHeader merge), and #942
(split sched/orch phase records), several comments and doc sections
still referenced the pre-split a2a3 layout. Audit and update:

a2a3 code/comments:
- platform_config.h: PROF_BUFFERS_PER_THREAD doc references both
  SchedPhaseBuffer and OrchPhaseBuffer (was: single PhaseBuffer);
  PROF_READYQUEUE_SIZE comment now says "four kinds"; formula bumped
  by 2x on the per-thread term to cover both sched and orch pool
  enqueues (matches host alloc which iterates both pool arrays).
- l2_swimlane_profiling.h header layout diagram: name the two split
  phase-thread counts.
- l2_swimlane_collector_aicpu.cpp: cross-launch reset comment now
  references s_sched_phase_pools / s_orch_phase_pools (was: single
  s_aicpu_phase_pools) and record_sched_phase / record_orch_phase.
- scheduler_dispatch.cpp / aicpu_executor.cpp: comments reference
  the split record types.

src/common/ shared comments (now mixed-arch):
- profiler_base.h / buffer_pool_manager.h: qualify
  L2SwimlaneAicpuPhaseHeader::magic example as "on a5" since the
  struct no longer exists on a2a3.

docs/dfx/l2-swimlane-profiling.md:
- §5.1: layout block + record list now distinguish a2a3 split shape
  (SchedPhaseRecord 40B + OrchPhaseRecord 32B, two pool arrays) from
  a5's still-unified shape (pending port).
- §5.2: a2a3 buffer-kind list updated to all four kinds (was: two);
  ASCII data-flow diagram redrawn to show split phase records;
  kBufferKinds = 4 in the L2SwimlaneModule trait description.
- §5.3 (a5): num_phase_threads / core_to_thread[] reference corrected
  to live in L2SwimlaneAicpuPhaseHeader on a5 (was wrongly attributed
  to L2SwimlaneDataHeader).
- §5.4: comparison table separates task record (identical) from
  phase record (diverged); ready-queue and kBufferKinds rows
  call out the a2a3=4 vs a5=2 split.
- §6: overhead description differentiates a2a3's per-emit
  SchedPhase + per-submit OrchPhase from a5's unified PhaseRecord
  (was: "4 phases × 40B per iteration", which described a removed
  shape).
- §8 FAQ: "phase records empty" entry gates a2a3 on
  num_{sched,orch}_phase_threads, a5 on PhaseHeader::magic.

No semantic code changes except the READYQUEUE_SIZE formula bump
(adds ~8KB to the header; necessary correctness fix given the second
phase pool).

Test plan:
- pre-commit clean
- onboard l2_swimlane STs (--enable-l2-swimlane --enable-dep-gen): 2 passed
- onboard paged_attention_unroll level 4: 1 passed

Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>

2 participants