Skip to content

[WIP] A5 performance opt#906

Open
poursoul wants to merge 15 commits into
mainfrom
a5-performance-opt
Open

[WIP] A5 performance opt#906
poursoul wants to merge 15 commits into
mainfrom
a5-performance-opt

Conversation

@poursoul
Copy link
Copy Markdown
Collaborator

No description provided.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 30, 2026

Review Change Stack

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5e8db49d-6041-4d36-b6f9-ebf340eb1de7

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

The PR introduces the M5 cross-cluster performance recovery approach through conditional SM publishing and a dedicated wiring thread, refactors platform register access to use inline headers for better performance, implements CPU-deterministic thread role assignment, adds lifecycle profiling to task slots, and optimizes paged attention kernel and orchestration patterns.

Changes

M5 Cross-Cluster Scheduler and Platform Infrastructure

Layer / File(s) Summary
M5 Design Documentation
M5_DESIGN.md
Design document describes three coordinated mechanisms: short-circuiting sync_to_sm() for non-local schedulers, sched-0 conditional publish loop, and tensor-map cleanup compensation. Includes measured results, failure iterations, and enablement conditions.
Platform Register Access Refactoring
src/a5/platform/include/aicpu/platform_regs.h, src/a5/platform/onboard/aicpu/inner_platform_regs.h, src/a5/platform/onboard/aicpu/inner_platform_regs.cpp, src/a5/platform/sim/aicpu/inner_platform_regs.h
Register read/write operations moved from out-of-line .cpp functions to inline header implementations for both onboard and simulation platforms. Platform regs header updated to document the include-based dispatch pattern.
CPU-Affinity-Based Thread Role Assignment
src/a5/platform/include/aicpu/platform_aicpu_affinity.h, src/a5/platform/onboard/aicpu/platform_aicpu_affinity.cpp, src/a5/platform/sim/aicpu/platform_aicpu_affinity.cpp, src/a5/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
Replaces cluster-topology inference with explicit CPU allowlist matching for deterministic thread role assignment. AicpuExecutor uses platform_aicpu_affinity_thread_idx() for role classification with updated thread layout computation for degenerate cases.
Task Lifecycle Profiling Fields
src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2_types.h, src/a5/platform/include/common/l2_perf_profiling.h, src/a5/platform/include/common/platform_config.h
Extends PTO2TaskSlotState with fanin_zero_cycles and enter_global_queue_cycles profiling timestamps. Updates L2PerfRecord struct and increases platform max thread count to 8.
Scheduler SM Publishing and Dedicated Wiring Thread
src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h, src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_context.h, src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp
Implements conditional SM publishing via throttled sync_to_sm() (publishes only when advancing by K=16), introduces new wiring_thread_run() entry point for dedicated wiring work, and updates ring pointer advancement using atomic CAS lock. Tasks now stamp lifecycle timestamps when becoming ready.
Scheduler Dispatch and Completion Optimizations
src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_completion.cpp, src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp, src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_types.h
Optimizes COND register polling with precomputed pointers and direct dereferencing. Adds fast-path in deferred completion. Restricts mailbox polling to selected scheduler threads. Introduces comprehensive DIAG_DISPATCH_RAMP diagnostic instrumentation.
Performance Profiling Collection Pipeline
src/a5/platform/include/aicpu/l2_perf_collector_aicpu.h, src/a5/platform/src/aicpu/l2_perf_collector_aicpu.cpp, src/a5/platform/src/host/l2_perf_collector.cpp, src/a5/runtime/host_build_graph/aicpu/aicpu_executor.cpp, src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp
Extends profiling functions to capture fanin_zero_time and enter_global_queue_time. Updates phase buffer initialization with num_total_aicpu_threads. Exports new timing fields to JSON. Cold-path logging extended for diagnostic emission.

Paged Attention Kernel and Orchestration Optimization

Layer / File(s) Summary
QK Matmul Double-Buffered Computation
examples/a5/tensormap_and_ringbuffer/paged_attention_unroll_manual_scope/kernels/aic/aic_qk_matmul.cpp
Implements double-buffered L1 storage for QK matmul B-tile computation. Preloads first kj tile, alternates buffers each iteration, and prefetches next kj during compute overlap. Adds explicit pipe_barrier(PIPE_ALL) to ensure datapath completion.
Two-Pass Orchestration Pattern
examples/a5/tensormap_and_ringbuffer/paged_attention_unroll_manual_scope/kernels/orchestration/paged_attention_orch.cpp, tests/st/a5/tensormap_and_ringbuffer/paged_attention_unroll/kernels/orchestration/paged_attention_orch.cpp
Refactors orchestration from depth-first per-(b,q) to two-pass MANUAL-scope model. Pass 1 pre-emits all QK tasks and records outputs/metadata. Pass 2 emits SF/PV/UP consuming pre-recorded QK outputs. Enables better task batching and scheduler visibility.
Test Configuration Updates
examples/a5/tensormap_and_ringbuffer/paged_attention_unroll_manual_scope/test_paged_attention_unroll.py, tests/st/a5/tensormap_and_ringbuffer/paged_attention_unroll/test_paged_attention_unroll.py
Updates aicpu_thread_num from 4 to 6 in both example and test configurations to align with M5 thread layout.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

Poem

🐰 Through clusters we hop, with queues aligned so bright,
SM publishes throttled, wiring threads set right,
Buffers double-dance in dual-tile delight,
Profiling timestamps track every task's flight!

🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

Check name Status Explanation Resolution
Description check ⚠️ Warning No pull request description was provided by the author; the PR lacks any explanation of its purpose, changes, or rationale. Add a detailed pull request description explaining the objectives, key changes, and testing performed for this multi-file performance optimization changeset.
Docstring Coverage ⚠️ Warning Docstring coverage is 45.61% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title '[WIP] A5 performance opt' is vague and generic, using non-descriptive abbreviations without conveying the specific changes made. Replace with a descriptive title that summarizes the main change, such as 'Add M5 design doc and optimize ring publishing with profiling support' or similar.
✅ Passed checks (2 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the M5 optimization design, which restructures the runtime execution with a dedicated wiring thread, role-based CPU affinity mapping, and precomputed register pointers to minimize cross-cluster cache-invalidation overhead. It also implements double-buffered prefetching in the QK matmul kernel and a two-pass task emission strategy in the orchestration layer. The review feedback highlights several critical issues: a mismatch in the 'ALLOWED_CPUS' affinity mask that would drop active threads, a missing read memory barrier ('rmb()') on ARM64 after reading the MMIO 'COND' register, potential stack buffer overflows in the orchestration files if task counts exceed 'MAX_BQ', and a potential hang risk for small workloads due to the batched 'sync_to_sm' threshold.

// surviving thread by its CANN-dispatched cpu_id (we do NOT call
// sched_setaffinity), so cpu 12 / cpu 14 must already be in CANN's launch
// set for this layout to populate all 4 sched slots.
static constexpr int32_t ALLOWED_CPUS[] = {7, 8, 3, 4, 5, 6};
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

There is a critical mismatch between the ALLOWED_CPUS array and the placement design described in the comments. The comment specifies that the orchestrator is on cpu 9, the wiring thread is on cpu 7, and the 4 schedulers are on cpu 11, 12, 13, 14. However, the ALLOWED_CPUS array is defined as {7, 8, 3, 4, 5, 6}. This mismatch will cause the orchestrator (cpu 9) and all schedulers (cpu 11, 12, 13, 14) to be incorrectly dropped by the affinity gate, completely breaking the M5 runtime execution on real hardware. Please update the array to match the intended CPU IDs.

Suggested change
static constexpr int32_t ALLOWED_CPUS[] = {7, 8, 3, 4, 5, 6};
static constexpr int32_t ALLOWED_CPUS[] = {11, 12, 13, 14, 7, 9};

// read_reg() function call + reg_offset() switch + add per poll. The
// pointer was set up once during handshake (see scheduler_cold_path.cpp
// assign loop next to core_exec_states_[i].reg_addr).
uint64_t reg_val = static_cast<uint64_t>(*cond_ptrs_[core_id]);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

On weakly-ordered architectures like ARM64, a read memory barrier is required between reading the MMIO COND register and subsequently reading Normal memory (such as the task/slot states or deferred slabs in complete_slot_task). Without an explicit barrier, the CPU can speculatively load stale data from Normal memory before the completion flag in the COND register is actually verified, leading to potential data corruption or intermittent functional failures.

        uint64_t reg_val = static_cast<uint64_t>(*cond_ptrs_[core_id]);
        rmb();
References
  1. On weakly-ordered architectures (such as ARM64), ensure that a read memory barrier (e.g., rmb()) is explicitly placed between an MMIO read (Device memory, such as a status or completion register) and a subsequent Normal memory read (such as a data slot) when there is no data or address dependency between them. This prevents the CPU from speculatively loading stale data before the completion flag is verified.

Comment on lines +224 to +227
sij_buf_arr[n_bq].set(qk_outs.get_ref(0));
n_blocks_arr[n_bq] = n_blocks;
valid_len_arr[n_bq] = valid_len_last;
n_bq++;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-high high

There is a potential stack buffer overflow vulnerability here. If the number of emitted QK tasks (n_bq) exceeds MAX_BQ (256) due to larger input parameters, writing to sij_buf_arr, n_blocks_arr, and valid_len_arr will result in out-of-bounds memory access and stack corruption. We should defensively clamp the count or check bounds to ensure we do not exceed the maximum capacity of the local buffers.

                    if (n_bq < MAX_BQ) {
                        sij_buf_arr[n_bq].set(qk_outs.get_ref(0));
                        n_blocks_arr[n_bq] = n_blocks;
                        valid_len_arr[n_bq] = valid_len_last;
                        n_bq++;
                    }
References
  1. Defensively clamp counts or sizes read from shared memory to the maximum capacity of local buffers to prevent stack overflows caused by potential memory corruption or builder bypasses.

Comment on lines +219 to +223
sij_buf_arr[n_bq].set(qk_outs.get_ref(0));
qk_task_ids[n_bq] = qk_outs.task_id();
n_blocks_arr[n_bq] = n_blocks;
valid_len_arr[n_bq] = valid_len_last;
n_bq++;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-high high

There is a potential stack buffer overflow vulnerability here. If the number of emitted QK tasks (n_bq) exceeds MAX_BQ (256) due to larger input parameters, writing to sij_buf_arr, qk_task_ids, n_blocks_arr, and valid_len_arr will result in out-of-bounds memory access and stack corruption. We should defensively clamp the count or check bounds to ensure we do not exceed the maximum capacity of the local buffers.

                    if (n_bq < MAX_BQ) {
                        sij_buf_arr[n_bq].set(qk_outs.get_ref(0));
                        qk_task_ids[n_bq] = qk_outs.task_id();
                        n_blocks_arr[n_bq] = n_blocks;
                        valid_len_arr[n_bq] = valid_len_last;
                        n_bq++;
                    }
References
  1. Defensively clamp counts or sizes read from shared memory to the maximum capacity of local buffers to prevent stack overflows caused by potential memory corruption or builder bypasses.

Comment on lines +612 to +617
void sync_to_sm() {
constexpr int32_t PUBLISH_INTERVAL_K = 16;
if (last_task_alive - last_published_to_sm < PUBLISH_INTERVAL_K) return;
ring->fc.last_task_alive.store(last_task_alive, std::memory_order_release);
last_published_to_sm = last_task_alive;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Throttling sync_to_sm to PUBLISH_INTERVAL_K = 16 can cause hangs or timeouts if the orchestrator or host does synchronous waiting on task completion and the number of tasks is less than 16. While wiring_thread_run has a final force-publish when the entire run completes, any intermediate synchronous waits or smaller workloads (where total_tasks < 16) will hang indefinitely because the completed tasks are never published to the shared memory. Consider adding a fallback mechanism to publish when the counter has been stale for a certain period or if the remaining task count is small.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/a5/platform/src/host/l2_perf_collector.cpp (1)

554-599: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Include the new lifecycle timestamps in the JSON time origin.

base_time_cycles is still computed only from start_time, dispatch_time, and phase starts. fanin_zero_time and enter_global_queue_time can happen earlier than both, so the new subtraction can underflow and emit huge bogus timestamps.

Suggested fix
     for (const auto &tagged : tagged_records) {
         if (tagged.record->start_time < base_time_cycles) {
             base_time_cycles = tagged.record->start_time;
         }
         if (tagged.record->dispatch_time > 0 && tagged.record->dispatch_time < base_time_cycles) {
             base_time_cycles = tagged.record->dispatch_time;
         }
+        if (tagged.record->fanin_zero_time > 0 && tagged.record->fanin_zero_time < base_time_cycles) {
+            base_time_cycles = tagged.record->fanin_zero_time;
+        }
+        if (tagged.record->enter_global_queue_time > 0 &&
+            tagged.record->enter_global_queue_time < base_time_cycles) {
+            base_time_cycles = tagged.record->enter_global_queue_time;
+        }
     }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a5/platform/src/host/l2_perf_collector.cpp` around lines 554 - 599, The
base time origin computation (base_time_cycles) only checks start_time,
dispatch_time and phase pr.start_time but must also consider
record.fanin_zero_time and record.enter_global_queue_time to avoid underflow;
update the initial loop over tagged_records and the has_phase_data_ loop to also
compare record.fanin_zero_time and record.enter_global_queue_time (only when >0)
against base_time_cycles and set base_time_cycles accordingly, using the same
pattern as for dispatch_time and start_time so all timestamps are anchored
consistently.
src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp (1)

299-341: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Stamp enter_global_queue_cycles on these direct requeue paths.

Tasks that were still scheduler-local up to this point keep enter_global_queue_cycles == 0, and these pushes bypass flush_local_bufs(), which is the only place that fixes that up. The new lifecycle profiling will therefore report “never entered global queue” for sync-start or partially claimed tasks that actually spill to the global queue here.

♻️ Proposed fix
+            auto push_global_ready = [&](PTO2TaskSlotState *s) {
+                if (s->enter_global_queue_cycles == 0) {
+                    s->enter_global_queue_cycles = get_sys_cnt_aicpu();
+                }
+                sched_->ready_queues[static_cast<int32_t>(shape)].push(s);
+            };
+
             if (slot_state->active_mask.requires_sync_start()) {
                 if (is_pending) {
-                    sched_->ready_queues[static_cast<int32_t>(shape)].push(slot_state);
+                    push_global_ready(slot_state);
                     continue;
                 }
                 int32_t available = cores.count();
                 if (available < slot_state->logical_block_num) {
                     if (!enter_drain_mode(slot_state, slot_state->logical_block_num)) {
-                        sched_->ready_queues[static_cast<int32_t>(shape)].push(slot_state);
+                        push_global_ready(slot_state);
                     }
                     for (int rem = bi + 1; rem < got; rem++) {
-                        sched_->ready_queues[static_cast<int32_t>(shape)].push(batch[rem]);
+                        push_global_ready(batch[rem]);
                     }
                     entered_drain = true;
                     break;
                 }
             }
@@
             if (slot_state->next_block_idx < slot_state->logical_block_num) {
-                sched_->ready_queues[static_cast<int32_t>(shape)].push(slot_state);
+                push_global_ready(slot_state);
             }
🧹 Nitpick comments (2)
src/a5/platform/src/aicpu/l2_perf_collector_aicpu.cpp (1)

402-406: ⚡ Quick win

Fail fast instead of silently truncating num_total_aicpu_threads.

The new parameter is supposed to cover every phase-emitting exec index. Clamping it here leaves the higher thread indices uninitialized, which turns an unsupported layout into missing phase data later. Reject the config explicitly instead of masking it.

Proposed fix
     int total_threads =
         (num_total_aicpu_threads > 0) ? num_total_aicpu_threads : num_sched_threads + 1;
     if (total_threads > PLATFORM_MAX_AICPU_THREADS) {
-        total_threads = PLATFORM_MAX_AICPU_THREADS;
+        LOG_ERROR(
+            "Phase profiling thread count %d exceeds PLATFORM_MAX_AICPU_THREADS=%d",
+            total_threads, PLATFORM_MAX_AICPU_THREADS
+        );
+        return;
     }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a5/platform/src/aicpu/l2_perf_collector_aicpu.cpp` around lines 402 -
406, Currently the code silently clamps num_total_aicpu_threads into
total_threads when it exceeds PLATFORM_MAX_AICPU_THREADS; instead validate and
fail fast: in the block that computes total_threads (using symbols
num_total_aicpu_threads, num_sched_threads, total_threads) check if
num_total_aicpu_threads > PLATFORM_MAX_AICPU_THREADS and if so return an error
or log and abort configuration load (do not overwrite or clamp), otherwise
compute total_threads as before; ensure any caller of this function handles the
error path so an unsupported config is rejected rather than producing missing
phase data.
src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_types.h (1)

397-398: ⚡ Quick win

Make DIAG_DISPATCH_RAMP overrideable from the build.

This header hard-resets the macro to 0, so a global -DDIAG_DISPATCH_RAMP=1 or an earlier TU-level define gets ignored. Wrapping the default in #ifndef keeps the current behavior while preserving the advertised compile-time switch.

Proposed fix
-#define DIAG_DISPATCH_RAMP 0
+#ifndef DIAG_DISPATCH_RAMP
+#define DIAG_DISPATCH_RAMP 0
+#endif
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_types.h`
around lines 397 - 398, The macro DIAG_DISPATCH_RAMP is being unconditionally
set to 0 which prevents users from overriding it via -D flags; change the header
to only define a default when not already defined (i.e., replace the
unconditional "`#define` DIAG_DISPATCH_RAMP 0" with a guarded default using
`#ifndef` DIAG_DISPATCH_RAMP / `#define` DIAG_DISPATCH_RAMP 0 / `#endif`) so that an
external -DDIAG_DISPATCH_RAMP=1 or a prior TU-level define will take effect
while preserving the current default behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@examples/a5/tensormap_and_ringbuffer/paged_attention_unroll_manual_scope/kernels/orchestration/paged_attention_orch.cpp`:
- Around line 161-173: The local fixed-size arrays (MAX_BQ, sij_buf_arr,
qk_task_ids, n_blocks_arr, valid_len_arr) can overflow because n_bq can grow
beyond 256; change to allocate sized-by-input (e.g., compute required_bq from
batch, q_loop, ceil(bn_this_batch/N_UNROLL)) and use a dynamic container
(std::vector or new[]) or, if dynamic allocation is unacceptable, add a bounds
check before every n_bq++ that fails fast (throw/log and return) when n_bq >=
MAX_BQ; update all places that push into these arrays (references to n_bq,
sij_buf_arr, qk_task_ids, n_blocks_arr, valid_len_arr in the PTO2_SCOPE /
pre-emit QKs code) to use the new sized storage or the fail-fast check.

In `@M5_DESIGN.md`:
- Around line 187-225: The documentation still pins M5 to aicpu_thread_num=4 and
old src/a2a3 paths; update the M5_DESIGN.md text to reflect the new
implementation under src/a5 and the added wiring-thread role: replace the
aicpu_thread_num=4 claim with the new thread-role layout (mention the dedicated
wiring-thread), change all referenced file names/locations to the new src/a5
counterparts, and adjust examples/notes that point to former symbols so they
reference the actual implemented symbols (PTO2_CLUSTER_MODE=5, wiring-thread
role, try_cleanup_all_rings_from_sm, g_only_thread0_advances,
pto_scheduler/pto_tensormap/scheduler_dispatch equivalents) so debugging and
repro lead to the correct code.

In `@src/a5/platform/include/aicpu/platform_aicpu_affinity.h`:
- Around line 21-25: Update the public comment describing the executor index
mapping so callers know the wiring slot is at ALLOWED_CPU_COUNT-2: state that
the highest index (ALLOWED_CPU_COUNT-1) is the orchestrator, the next-highest
index (ALLOWED_CPU_COUNT-2) is the wiring slot, and all lower indices [0 ..
ALLOWED_CPU_COUNT-3] are scheduler slots in declaration order of ALLOWED_CPUS
(use ALLOWED_CPU_COUNT and ALLOWED_CPUS identifiers in the comment to make the
mapping explicit).

In `@src/a5/platform/onboard/aicpu/platform_aicpu_affinity.cpp`:
- Around line 90-97: The code indexes fixed-size arrays s_thread_cpu,
s_thread_survive, and s_thread_exec_idx using total_launched without validating
it against MAX_GATE_THREADS; add a bounds check wherever you iterate to
total_launched (the initialization loop and the scan loop) to ensure
total_launched <= MAX_GATE_THREADS and either clamp iterations to
MAX_GATE_THREADS or return/log an error and abort the launch; use
MAX_GATE_THREADS as the upper bound for loop counters before indexing those
arrays and prefer an explicit check (if (total_launched > MAX_GATE_THREADS) { /*
handle error or cap */ }) to prevent buffer overruns.
- Around line 98-131: platform_aicpu_affinity_gate currently ignores the
logical_count parameter and marks every thread on any ALLOWED_CPUS as surviving;
update the function (platform_aicpu_affinity_gate) to honor logical_count by
computing an allowed_target = min(logical_count, ALLOWED_CPU_COUNT) and only
assigning up to allowed_target execution slots: when scanning threads
(s_thread_cpu) and comparing to ALLOWED_CPUS, increment allowed_cnt and set
s_thread_survive[tid] and s_thread_exec_idx[tid] only while allowed_cnt <
allowed_target (assign exec indices in 0..allowed_target-1), and stop assigning
once allowed_target is reached so extra matching threads are not marked
surviving or given exec indices.

In `@src/a5/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp`:
- Around line 195-201: The current branch uses aicpu_thread_num_ >= 2 to reserve
a wiring thread, which for aicpu_thread_num_ == 2 yields sched_thread_num_ == 0
and leaves thread 0 without a role; change the condition to require >= 3 so that
for 2-thread configs sched_thread_num_ is set to aicpu_thread_num_ - 1 (i.e., 1)
and the sched can take over wiring as intended. Update the same logic at the
other occurrence (the block around lines 241-242) so both places compute
sched_thread_num_ using "if (aicpu_thread_num_ >= 3) sched_thread_num_ =
aicpu_thread_num_ - 2; else sched_thread_num_ = aicpu_thread_num_ - 1;" ensuring
thread roles derived from is_wiring, is_orch, and thread_idx are correct.

In `@src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h`:
- Around line 598-617: The batched publish in sync_to_sm() can leave SM cursor
stale indefinitely; change sync_to_sm() to publish immediately when staleness
exceeds a maximum fallback threshold: keep the existing PUBLISH_INTERVAL_K
fast-path but add a secondary condition (e.g., a MAX_STALE_INTERVAL or
timestamp-based max_stale_ms) that forces
ring->fc.last_task_alive.store(last_task_alive, std::memory_order_release) and
updates last_published_to_sm even if (last_task_alive - last_published_to_sm) <
PUBLISH_INTERVAL_K; locate sync_to_sm(), PUBLISH_INTERVAL_K, last_task_alive,
last_published_to_sm and the ring->fc.last_task_alive.store call to implement
the fallback and ensure the fallback threshold is configurable and documented in
the comment.

In
`@src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp`:
- Around line 1011-1050: The wiring thread exits its main loop when completed_
flips but that can happen before late on_task_release() flushes finish, so final
advance_ring_pointers() and SM publish can miss CONSUMED transitions; to fix,
introduce a pending_release counter (incremented in on_task_release) or an
equivalent drain flag in the scheduler and make the cold-path wait for that
counter to reach zero (or call a new sched method like
wait_for_post_release_drain()) before performing the final per-ring advance and
forced SM publish; update references in the wiring code that call
drain_wiring_queue(), the completed_ check, and the final loop to wait on
pending_release == 0 (or use sched_->wait_for_post_release_drain()) and ensure
on_task_release decrements the counter so the final advance_ring_pointers(),
rss.ring->fc.last_task_alive.store(...) and rss.last_published_to_sm are
executed only after all post-loop releases are drained.

In
`@src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp`:
- Around line 649-658: The current is_mailbox_poller check hard-codes
"thread_idx >= 2" and can yield zero pollers when sched_thread_num_ <= 2; update
the gating so at least one mailbox poller exists for small-thread layouts:
change the is_mailbox_poller computation in scheduler_dispatch.cpp to select
thread_idx == 0 when sched_thread_num_ <= 2, otherwise keep the existing
(thread_idx >= 2 && thread_idx < sched_thread_num_) behavior; keep the rest of
the guard that checks rt_, rt_->aicore_mailbox and async_wait_list
(count/pending_completion_count) so tasks waiting on
async_wait_list.poll_and_complete() are always served.

---

Outside diff comments:
In `@src/a5/platform/src/host/l2_perf_collector.cpp`:
- Around line 554-599: The base time origin computation (base_time_cycles) only
checks start_time, dispatch_time and phase pr.start_time but must also consider
record.fanin_zero_time and record.enter_global_queue_time to avoid underflow;
update the initial loop over tagged_records and the has_phase_data_ loop to also
compare record.fanin_zero_time and record.enter_global_queue_time (only when >0)
against base_time_cycles and set base_time_cycles accordingly, using the same
pattern as for dispatch_time and start_time so all timestamps are anchored
consistently.

---

Nitpick comments:
In `@src/a5/platform/src/aicpu/l2_perf_collector_aicpu.cpp`:
- Around line 402-406: Currently the code silently clamps
num_total_aicpu_threads into total_threads when it exceeds
PLATFORM_MAX_AICPU_THREADS; instead validate and fail fast: in the block that
computes total_threads (using symbols num_total_aicpu_threads,
num_sched_threads, total_threads) check if num_total_aicpu_threads >
PLATFORM_MAX_AICPU_THREADS and if so return an error or log and abort
configuration load (do not overwrite or clamp), otherwise compute total_threads
as before; ensure any caller of this function handles the error path so an
unsupported config is rejected rather than producing missing phase data.

In `@src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_types.h`:
- Around line 397-398: The macro DIAG_DISPATCH_RAMP is being unconditionally set
to 0 which prevents users from overriding it via -D flags; change the header to
only define a default when not already defined (i.e., replace the unconditional
"`#define` DIAG_DISPATCH_RAMP 0" with a guarded default using `#ifndef`
DIAG_DISPATCH_RAMP / `#define` DIAG_DISPATCH_RAMP 0 / `#endif`) so that an external
-DDIAG_DISPATCH_RAMP=1 or a prior TU-level define will take effect while
preserving the current default behavior.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2d31e45c-3292-4405-8de5-704d7308439f

📥 Commits

Reviewing files that changed from the base of the PR and between ab989e0 and 1cf1689.

📒 Files selected for processing (27)
  • M5_DESIGN.md
  • examples/a5/tensormap_and_ringbuffer/paged_attention_unroll_manual_scope/kernels/aic/aic_qk_matmul.cpp
  • examples/a5/tensormap_and_ringbuffer/paged_attention_unroll_manual_scope/kernels/orchestration/paged_attention_orch.cpp
  • examples/a5/tensormap_and_ringbuffer/paged_attention_unroll_manual_scope/test_paged_attention_unroll.py
  • src/a5/platform/include/aicpu/l2_perf_collector_aicpu.h
  • src/a5/platform/include/aicpu/platform_aicpu_affinity.h
  • src/a5/platform/include/aicpu/platform_regs.h
  • src/a5/platform/include/common/l2_perf_profiling.h
  • src/a5/platform/include/common/platform_config.h
  • src/a5/platform/onboard/aicpu/inner_platform_regs.cpp
  • src/a5/platform/onboard/aicpu/inner_platform_regs.h
  • src/a5/platform/onboard/aicpu/platform_aicpu_affinity.cpp
  • src/a5/platform/sim/aicpu/inner_platform_regs.h
  • src/a5/platform/sim/aicpu/platform_aicpu_affinity.cpp
  • src/a5/platform/src/aicpu/l2_perf_collector_aicpu.cpp
  • src/a5/platform/src/host/l2_perf_collector.cpp
  • src/a5/runtime/host_build_graph/aicpu/aicpu_executor.cpp
  • src/a5/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2_types.h
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_completion.cpp
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_context.h
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_types.h
  • tests/st/a5/tensormap_and_ringbuffer/paged_attention_unroll/kernels/orchestration/paged_attention_orch.cpp
  • tests/st/a5/tensormap_and_ringbuffer/paged_attention_unroll/test_paged_attention_unroll.py
💤 Files with no reviewable changes (1)
  • src/a5/platform/onboard/aicpu/inner_platform_regs.cpp

Comment on lines +161 to +173
static constexpr size_t MAX_BQ = 256;
struct TensorSlot {
alignas(alignof(Tensor)) unsigned char buf[sizeof(Tensor)];
Tensor &get() { return *reinterpret_cast<Tensor *>(buf); }
void set(const Tensor &src) { new (buf) Tensor(src); }
};

PTO2_SCOPE(PTO2ScopeMode::MANUAL) { // outer scope: pre-emit QKs
TensorSlot sij_buf_arr[MAX_BQ];
PTO2TaskId qk_task_ids[MAX_BQ];
uint64_t n_blocks_arr[MAX_BQ];
uint64_t valid_len_arr[MAX_BQ];
size_t n_bq = 0;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Guard the pre-emitted QK slot arrays against overflow.

MAX_BQ is fixed at 256, but n_bq grows with batch * q_loop * ceil(bn_this_batch / N_UNROLL). Once that exceeds 256, these writes run past sij_buf_arr, qk_task_ids, n_blocks_arr, and valid_len_arr. The updated 256-batch case already consumes every slot, so any larger batch or extra (q,bn) partition will corrupt the stack. Please size this storage from the input bounds or fail fast before n_bq++.

Also applies to: 219-223

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@examples/a5/tensormap_and_ringbuffer/paged_attention_unroll_manual_scope/kernels/orchestration/paged_attention_orch.cpp`
around lines 161 - 173, The local fixed-size arrays (MAX_BQ, sij_buf_arr,
qk_task_ids, n_blocks_arr, valid_len_arr) can overflow because n_bq can grow
beyond 256; change to allocate sized-by-input (e.g., compute required_bq from
batch, q_loop, ceil(bn_this_batch/N_UNROLL)) and use a dynamic container
(std::vector or new[]) or, if dynamic allocation is unacceptable, add a bounds
check before every n_bq++ that fails fast (throw/log and return) when n_bq >=
MAX_BQ; update all places that push into these arrays (references to n_bq,
sij_buf_arr, qk_task_ids, n_blocks_arr, valid_len_arr in the PTO2_SCOPE /
pre-emit QKs code) to use the new sized storage or the fail-fast check.

Comment thread M5_DESIGN.md
Comment on lines +187 to +225
启用 `PTO2_CLUSTER_MODE=5` 的前提:

1. AICPU 拓扑允许 2+2 分布(每 cluster ≥ 2 个 active 线程)
2. CANN 9.0+(其它版本 affinity gate 行为可能不同)
3. `aicpu_thread_num=4`(其它配置的 thread_idx/role 分配未验证)

在 A3 当前 CANN 上(4+2 ideal),M5 实际上**没必要** —— M0 default 已经把 4 个全放同 cluster,没跨 cluster 开销。M5 是为**未来受限拓扑**准备的解。

## 已知限制

- **sched 0 的 publish 是 lazy 的**:依赖 sched 0 iter 的频率。如果 sched 0 因为各种原因 stall(如长时间没 ready task),SM 会大幅滞后,orch 的 spin 会变长。实测在 paged_attention 上没观察到,但 workload 不同可能暴露。
- **C2 Total 仍比 M0 高 3.6%**:剩余开销主要在 sched 侧,远端 2 个 sched 仍然在 cluster B 上跑完成处理 + 跟 sched 0 同步。这部分是 M5 无法解的本质开销(必须有线程在远端)。
- **`try_cleanup_all_rings_from_sm` 跑全部 4 个 ring**:实际 paged_attention 只用 1 ring,3 次无效循环。可以加 ring 数缓存优化,但影响小。
- **`g_only_thread0_advances` 是 thread_local**:每个 sched 线程都要在 `run()` 入口设置。当前 `aicpu_executor::run()` 已经做了。

## 后续工作

1. 把 `try_cleanup_all_rings_from_sm` 改成 lazy(仅当 pool 真用尽才扫所有 ring)
2. sched 0 publish 可以再加 epoch counter,避免长时间没事时永远不 publish 的极端情况
3. 在真正 3+3 受限硬件上验证 —— A3 现在的 4+2 拓扑只能模拟

## 相关文件

```text
src/a2a3/runtime/tensormap_and_ringbuffer/
├── runtime/
│ ├── scheduler/
│ │ ├── pto_scheduler.h # sync_to_sm 短路;g_only_thread0_advances 声明
│ │ ├── pto_scheduler.cpp # g_only_thread0_advances 定义
│ │ └── scheduler_dispatch.cpp # sched 0 conditional publish
│ ├── pto_tensormap.h # new_entry spin
│ └── shared/
│ └── pto_tensormap.cpp # try_cleanup_all_rings_from_sm + cap
├── aicpu/
│ └── aicpu_executor.cpp # g_only_thread0_advances 赋值
├── runtime/
│ └── runtime.h # cluster_mode 字段
└── host/
└── runtime_maker.cpp # PTO2_CLUSTER_MODE env 读取
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update the documented topology and file references.

This section still anchors M5 to aicpu_thread_num=4 and src/a2a3/..., but the supplied implementation in this PR is under src/a5/... and adds a dedicated wiring-thread role. Leaving the old paths/layout here will send debugging and repro work to the wrong code.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@M5_DESIGN.md` around lines 187 - 225, The documentation still pins M5 to
aicpu_thread_num=4 and old src/a2a3 paths; update the M5_DESIGN.md text to
reflect the new implementation under src/a5 and the added wiring-thread role:
replace the aicpu_thread_num=4 claim with the new thread-role layout (mention
the dedicated wiring-thread), change all referenced file names/locations to the
new src/a5 counterparts, and adjust examples/notes that point to former symbols
so they reference the actual implemented symbols (PTO2_CLUSTER_MODE=5,
wiring-thread role, try_cleanup_all_rings_from_sm, g_only_thread0_advances,
pto_scheduler/pto_tensormap/scheduler_dispatch equivalents) so debugging and
repro lead to the correct code.

Comment on lines +21 to +25
// Deterministic executor index assigned to the surviving thread by the gate.
// Range [0, ALLOWED_CPU_COUNT-1]; -1 if this thread was dropped or the gate
// was bypassed (ALLOWED_CPU_COUNT >= total_launched).
// Highest index (ALLOWED_CPU_COUNT-1) is the orchestrator slot; lower indices
// are scheduler slots in declaration order of ALLOWED_CPUS.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Document the wiring slot in the public contract.

The comment says the highest index is orch and all lower indices are schedulers, but the supplied executor now treats ALLOWED_CPU_COUNT - 2 as the wiring slot. Please reflect that here so callers do not misclassify the returned index.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a5/platform/include/aicpu/platform_aicpu_affinity.h` around lines 21 -
25, Update the public comment describing the executor index mapping so callers
know the wiring slot is at ALLOWED_CPU_COUNT-2: state that the highest index
(ALLOWED_CPU_COUNT-1) is the orchestrator, the next-highest index
(ALLOWED_CPU_COUNT-2) is the wiring slot, and all lower indices [0 ..
ALLOWED_CPU_COUNT-3] are scheduler slots in declaration order of ALLOWED_CPUS
(use ALLOWED_CPU_COUNT and ALLOWED_CPUS identifiers in the comment to make the
mapping explicit).

Comment on lines +90 to 97
static constexpr int32_t MAX_GATE_THREADS = 16;
static std::atomic<int32_t> s_cpu_written{0};
static std::atomic<int32_t> s_gate_init{0};
static std::atomic<int32_t> s_gate_ready{0};

static int32_t s_thread_cpu[MAX_GATE_THREADS];
static bool s_thread_survive[MAX_GATE_THREADS];
static int32_t s_thread_exec_idx[MAX_GATE_THREADS];

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate total_launched before indexing the fixed gate buffers.

s_thread_cpu, s_thread_survive, and s_thread_exec_idx are all sized to MAX_GATE_THREADS, but the initialization and scan loops run to total_launched unconditionally. Any launch count above 16 will walk past those arrays and corrupt shared gate state.

Also applies to: 118-124

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a5/platform/onboard/aicpu/platform_aicpu_affinity.cpp` around lines 90 -
97, The code indexes fixed-size arrays s_thread_cpu, s_thread_survive, and
s_thread_exec_idx using total_launched without validating it against
MAX_GATE_THREADS; add a bounds check wherever you iterate to total_launched (the
initialization loop and the scan loop) to ensure total_launched <=
MAX_GATE_THREADS and either clamp iterations to MAX_GATE_THREADS or return/log
an error and abort the launch; use MAX_GATE_THREADS as the upper bound for loop
counters before indexing those arrays and prefer an explicit check (if
(total_launched > MAX_GATE_THREADS) { /* handle error or cap */ }) to prevent
buffer overruns.

Comment on lines +98 to +131
bool platform_aicpu_affinity_gate(int32_t /*logical_count*/, int32_t total_launched) {
if (ALLOWED_CPU_COUNT >= total_launched) {
tl_exec_idx = -1;
return true;
}

// Assign thread index
int32_t idx = s_reported.fetch_add(1, std::memory_order_acq_rel);

// Report CPU
#if defined(__aarch64__)
int32_t cpu = sched_getcpu();
#elif defined(__x86_64__)
#if defined(__aarch64__) || defined(__x86_64__)
int32_t cpu = sched_getcpu();
#else
int32_t cpu = -1;
#endif
LOG_INFO_V0("AICPU affinity gate: thread idx=%d sched_getcpu=%d", idx, cpu);

int32_t normalized_cpu = -1;
if (cpu >= 0) {
if (cpu < 63) {
s_cpumask.fetch_or(1ULL << cpu, std::memory_order_release);
}
normalized_cpu = cpu % AICPU_CORES_PER_CHIP;
}
if (idx < MAX_GATE_THREADS) {
s_thread_cpu[idx] = normalized_cpu;
}
if (idx < MAX_GATE_THREADS) s_thread_cpu[idx] = cpu;
s_cpu_written.fetch_add(1, std::memory_order_release);
while (s_cpu_written.load(std::memory_order_acquire) < total_launched) {}

// Barrier: wait until all total_launched threads have reported
while (popcount64(s_cpumask.load(std::memory_order_acquire)) < total_launched &&
s_reported.load(std::memory_order_acquire) < total_launched) {}

// CAS winner does cluster classification
int32_t expected = 0;
if (s_gate_init.compare_exchange_strong(expected, 1, std::memory_order_acq_rel, std::memory_order_acquire)) {
// Initialize survive flags
for (int32_t i = 0; i < total_launched; ++i) {
s_thread_survive[i] = false;
s_thread_exec_idx[i] = -1;
}

struct ClusterInfo {
int32_t count{0};
int32_t tids[MAX_GATE_THREADS];
};
ClusterInfo clusters[MAX_CLUSTERS];

int32_t allowed_cnt = 0;
for (int32_t tid = 0; tid < total_launched; ++tid) {
int32_t c = s_thread_cpu[tid];
if (c < 0) continue;
int32_t cluster_id = c / CPUS_PER_CLUSTER;
if (cluster_id < 0 || cluster_id >= MAX_CLUSTERS) continue;
ClusterInfo &info = clusters[cluster_id];
if (info.count < MAX_GATE_THREADS) info.tids[info.count++] = tid;
for (int32_t a = 0; a < ALLOWED_CPU_COUNT; ++a) {
if (c == ALLOWED_CPUS[a]) {
s_thread_survive[tid] = true;
s_thread_exec_idx[tid] = a;
allowed_cnt++;
break;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Honor logical_count when deciding which threads survive.

platform_aicpu_affinity_gate() now survives every thread whose CPU matches ALLOWED_CPUS, even when the runtime requested fewer active threads. That can hand AicpuExecutor exec indices outside its configured sched/wiring/orch layout, leaving extra threads with no role but still participating in shutdown bookkeeping.

Suggested fix
-bool platform_aicpu_affinity_gate(int32_t /*logical_count*/, int32_t total_launched) {
+bool platform_aicpu_affinity_gate(int32_t logical_count, int32_t total_launched) {
@@
-            for (int32_t a = 0; a < ALLOWED_CPU_COUNT; ++a) {
+            for (int32_t a = 0; a < ALLOWED_CPU_COUNT; ++a) {
                 if (c == ALLOWED_CPUS[a]) {
-                    s_thread_survive[tid] = true;
-                    s_thread_exec_idx[tid] = a;
-                    allowed_cnt++;
+                    if (a < logical_count) {
+                        s_thread_survive[tid] = true;
+                        s_thread_exec_idx[tid] = a;
+                        allowed_cnt++;
+                    }
                     break;
                 }
             }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a5/platform/onboard/aicpu/platform_aicpu_affinity.cpp` around lines 98 -
131, platform_aicpu_affinity_gate currently ignores the logical_count parameter
and marks every thread on any ALLOWED_CPUS as surviving; update the function
(platform_aicpu_affinity_gate) to honor logical_count by computing an
allowed_target = min(logical_count, ALLOWED_CPU_COUNT) and only assigning up to
allowed_target execution slots: when scanning threads (s_thread_cpu) and
comparing to ALLOWED_CPUS, increment allowed_cnt and set s_thread_survive[tid]
and s_thread_exec_idx[tid] only while allowed_cnt < allowed_target (assign exec
indices in 0..allowed_target-1), and stop assigning once allowed_target is
reached so extra matching threads are not marked surviving or given exec
indices.

Comment on lines +195 to +201
if (aicpu_thread_num_ >= 2) {
sched_thread_num_ = aicpu_thread_num_ - 2;
} else {
// Degenerate config (no wiring thread room) — fall back to old layout
// for back-compat; the sched takes over wiring inside its main loop.
sched_thread_num_ = aicpu_thread_num_ - 1;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

The “no wiring slot” fallback drops one thread in 2-thread configs.

When aicpu_thread_num_ == 2, this computes sched_thread_num_ = 0, is_wiring = false, and is_orch = true only for thread_idx == 1. That leaves thread 0 with no role at all, which contradicts the back-compat comment and turns a 2-thread run into a single active worker.

Suggested fix
-    if (aicpu_thread_num_ >= 2) {
+    if (aicpu_thread_num_ >= 3) {
         sched_thread_num_ = aicpu_thread_num_ - 2;
     } else {
         // Degenerate config (no wiring thread room) — fall back to old layout
         // for back-compat; the sched takes over wiring inside its main loop.
         sched_thread_num_ = aicpu_thread_num_ - 1;
@@
-    const bool is_orch = (thread_idx == aicpu_thread_num_ - 1);
-    const bool is_wiring = (aicpu_thread_num_ >= 3) && (thread_idx == aicpu_thread_num_ - 2);
+    const bool is_wiring = (aicpu_thread_num_ >= 3) && (thread_idx == sched_thread_num_);
+    const bool is_orch =
+        (thread_idx == ((aicpu_thread_num_ >= 3) ? (aicpu_thread_num_ - 1) : sched_thread_num_));

Also applies to: 241-242

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a5/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp` around
lines 195 - 201, The current branch uses aicpu_thread_num_ >= 2 to reserve a
wiring thread, which for aicpu_thread_num_ == 2 yields sched_thread_num_ == 0
and leaves thread 0 without a role; change the condition to require >= 3 so that
for 2-thread configs sched_thread_num_ is set to aicpu_thread_num_ - 1 (i.e., 1)
and the sched can take over wiring as intended. Update the same logic at the
other occurrence (the block around lines 241-242) so both places compute
sched_thread_num_ using "if (aicpu_thread_num_ >= 3) sched_thread_num_ =
aicpu_thread_num_ - 2; else sched_thread_num_ = aicpu_thread_num_ - 1;" ensuring
thread roles derived from is_wiring, is_orch, and thread_idx are correct.

Comment on lines +598 to +617
// directa-pubonly: K=16 batched publish.
//
// Each advance still pushes the local `last_task_alive` counter
// forward, but the SM cache line is only stored when local has
// advanced PUBLISH_INTERVAL_K positions past the most recently
// published value. This cuts cache-invalidate traffic on orch's
// SM read path by ~K× without changing total work.
//
// Safety: for paged_attention_unroll Case1, task_window_size
// (16384) >> task_count (1024), so orch backpressure on slot
// reuse never fires. PTO2_TENSORMAP_POOL_SIZE (65536) >> 1280
// also rules out tensormap cleanup pressure. Workloads that
// approach those bounds need an explicit fallback (publish even
// with delta < K when stale long enough). Not present here.
void sync_to_sm() {
constexpr int32_t PUBLISH_INTERVAL_K = 16;
if (last_task_alive - last_published_to_sm < PUBLISH_INTERVAL_K) return;
ring->fc.last_task_alive.store(last_task_alive, std::memory_order_release);
last_published_to_sm = last_task_alive;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Add a fallback when batched SM publishing gets too stale.

This now throttles last_task_alive publication unconditionally, and the comment already states the implementation is only safe for the current paged-attention headroom and that the fallback is “not present here.” On workloads that get closer to slot-reuse or cleanup pressure, reclaimable slots can stay hidden behind a stale SM cursor and block orchestrator progress.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h`
around lines 598 - 617, The batched publish in sync_to_sm() can leave SM cursor
stale indefinitely; change sync_to_sm() to publish immediately when staleness
exceeds a maximum fallback threshold: keep the existing PUBLISH_INTERVAL_K
fast-path but add a secondary condition (e.g., a MAX_STALE_INTERVAL or
timestamp-based max_stale_ms) that forces
ring->fc.last_task_alive.store(last_task_alive, std::memory_order_release) and
updates last_published_to_sm even if (last_task_alive - last_published_to_sm) <
PUBLISH_INTERVAL_K; locate sync_to_sm(), PUBLISH_INTERVAL_K, last_task_alive,
last_published_to_sm and the ring->fc.last_task_alive.store call to implement
the fallback and ensure the fallback threshold is configurable and documented in
the comment.

Comment on lines +1011 to +1050
while (!completed_.load(std::memory_order_acquire)) {
// 1. Drain orch wiring queue (force-drain after orch_done so any
// last-second submissions don't stall).
sched_->drain_wiring_queue(orchestrator_done_);

// 2. Advance per-ring last_task_alive past CONSUMED slots and publish
// to SM. Wiring is the sole writer; the advance_lock CAS still
// happens for now (cheap when uncontended) — followup commit can
// drop it entirely.
for (int32_t r = 0; r < PTO2_MAX_RING_DEPTH; r++) {
auto &rss = sched_->ring_sched_states[r];
if (rss.ring == nullptr) continue;
int32_t expected = 0;
if (rss.advance_lock.compare_exchange_strong(
expected, 1, std::memory_order_acquire, std::memory_order_relaxed
)) {
rss.advance_ring_pointers();
rss.advance_lock.store(0, std::memory_order_release);
}
}
}

// Final drain pass: orch may have finished but a few tasks may still be
// waiting to be advanced/reclaimed for the SM final publish. Once
// completed_ is set there's no more work to wire, but flush the last
// advances so orch sees the final last_task_alive.
for (int32_t r = 0; r < PTO2_MAX_RING_DEPTH; r++) {
auto &rss = sched_->ring_sched_states[r];
if (rss.ring == nullptr) continue;
int32_t expected = 0;
if (rss.advance_lock.compare_exchange_strong(
expected, 1, std::memory_order_acquire, std::memory_order_relaxed
)) {
rss.advance_ring_pointers();
// Force a final SM publish regardless of the K=16 throttle so the
// orchestrator (next round / shutdown) sees the latest counter.
rss.ring->fc.last_task_alive.store(rss.last_task_alive, std::memory_order_release);
rss.last_published_to_sm = rss.last_task_alive;
rss.advance_lock.store(0, std::memory_order_release);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Keep the wiring thread alive until post-loop releases are drained.

completed_ flips on task-completion count, but scheduler threads still run their tail on_task_release() flush after breaking out of resolve_and_dispatch(). Those late releases can transition more slots to CONSUMED after the final advance_ring_pointers() pass here, leaving last_task_alive and the forced SM publish permanently behind.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp`
around lines 1011 - 1050, The wiring thread exits its main loop when completed_
flips but that can happen before late on_task_release() flushes finish, so final
advance_ring_pointers() and SM publish can miss CONSUMED transitions; to fix,
introduce a pending_release counter (incremented in on_task_release) or an
equivalent drain flag in the scheduler and make the cold-path wait for that
counter to reach zero (or call a new sched method like
wait_for_post_release_drain()) before performing the final per-ring advance and
forced SM publish; update references in the wiring code that call
drain_wiring_queue(), the completed_ check, and the final loop to wait on
pending_release == 0 (or use sched_->wait_for_post_release_drain()) and ensure
on_task_release decrements the counter so the final advance_ring_pointers(),
rss.ring->fc.last_task_alive.store(...) and rss.last_published_to_sm are
executed only after all post-loop releases are drained.

Comment on lines +649 to 658
// directa-stage2: restrict aicore_mailbox polling to remote-cluster
// scheds (exec_idx 2/3 in directa 1+4: cluster 3 on die 1). Local
// cluster scheds (exec_idx 0/1, cluster 2 — same cluster as orch)
// skip the mailbox entirely. This keeps the mailbox cache line
// resident in the remote cluster, removes 2 of 4 readers from the
// try_lock contention, and frees the local-cluster scheds to focus
// on dispatch and ring advance.
bool is_mailbox_poller = (thread_idx >= 2 && thread_idx < sched_thread_num_);
if (is_mailbox_poller && rt_ != nullptr && rt_->aicore_mailbox != nullptr &&
(sched_->async_wait_list.count > 0 || sched_->async_wait_list.pending_completion_count > 0)) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Leave at least one mailbox poller in non-directa or small-thread layouts.

This gate hard-codes the current thread_idx >= 2 layout into generic scheduler logic. With 1-2 scheduler threads it selects zero pollers, so any task waiting on async_wait_list.poll_and_complete() can stall forever.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp`
around lines 649 - 658, The current is_mailbox_poller check hard-codes
"thread_idx >= 2" and can yield zero pollers when sched_thread_num_ <= 2; update
the gating so at least one mailbox poller exists for small-thread layouts:
change the is_mailbox_poller computation in scheduler_dispatch.cpp to select
thread_idx == 0 when sched_thread_num_ <= 2, otherwise keep the existing
(thread_idx >= 2 && thread_idx < sched_thread_num_) behavior; keep the rest of
the guard that checks rt_, rt_->aicore_mailbox and async_wait_list
(count/pending_completion_count) so tasks waiting on
async_wait_list.poll_and_complete() are always served.

Comment on lines +168 to +181
static constexpr size_t MAX_BQ = 256;
struct TensorSlot {
alignas(alignof(Tensor)) unsigned char buf[sizeof(Tensor)];
Tensor &get() { return *reinterpret_cast<Tensor *>(buf); }
void set(const Tensor &src) { new (buf) Tensor(src); }
};

PTO2_SCOPE() { // outer scope: pre-emit QKs
// Per-(b,q) carry: QK output + n_blocks + valid_len_last needed
// by SF in pass 2.
TensorSlot sij_buf_arr[MAX_BQ];
uint64_t n_blocks_arr[MAX_BQ];
uint64_t valid_len_arr[MAX_BQ];
size_t n_bq = 0;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Remove the fixed 256-entry ceiling from this mirrored two-pass storage.

This helper has the same hard cap as the manual-scope variant, but the number of saved QK outputs still scales with batch * q_loop * ceil(bn_this_batch / N_UNROLL). Any test case above today's exact-256 workload will write past sij_buf_arr, n_blocks_arr, and valid_len_arr. Please keep this copy dynamically sized or add the same bound check here too.

Also applies to: 224-227

majin0824 and others added 15 commits May 30, 2026 10:10
Wires up the M5 1 orch + 4 sched topology end-to-end so paged_attention_unroll
can be benchmarked with per-segment orchestrator timing.

Why: prior gate used cpu_id/4 die grouping that is wrong under SMT, and
PTO2_ORCH_PROFILING was off by default so phase-level orch breakdown was
unavailable. Need a reproducible 1+4 baseline before iterating on M5
sync-to-SM placement.

How to apply: this is a baseline branch for benchmarking only — do not merge
into main without revisiting PLATFORM_MAX_AICPU_THREADS_JUST_FOR_LAUNCH (now
14, may affect other tests) and the hardcoded ALLOWED_CPUS table (only valid
on the Full SKU we measured).

Changes:
- affinity gate rewritten to allowlist style (matches simpler_wc pattern)
  with ALLOWED_CPUS={7,8,3,5,9} for 1 orch + 4 sched on Full A5 device 0;
  sched 0/1 SMT-pair on phy 4 in orch_cluster, sched 2/3 on remote die
  cluster 1, orch alone on phy 5 in orch_cluster
- platform_aicpu_affinity_thread_idx() added so M5 sync logic can branch
  on deterministic exec_idx
- PLATFORM_MAX_AICPU_THREADS_JUST_FOR_LAUNCH 7 -> 14 to force CANN to
  dispatch to all worker threads (gate cannot reliably hit 5 allowlist
  cpu_ids when only 7 of 14 workers are invoked)
- PTO2_ORCH_PROFILING default 0 -> 1 to emit Thread X orchestrator
  phase breakdown in device debug logs
- paged_attention_unroll Case1 aicpu_thread_num 4 -> 5 to match the
  5-entry allowlist
- M5_DESIGN.md captures the design rationale (PG variant tables, gate
  algorithm, M5 sync intent)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Change ALLOWED_CPUS from {7, 8, 3, 5, 9} to {7, 8, 11, 13, 9}.

sched 2: cpu 3 (die 0 cluster 1 phy 2 ht 0) → cpu 11 (die 1 cluster 3 phy 6 ht 0)
sched 3: cpu 5 (die 0 cluster 1 phy 3 ht 0) → cpu 13 (die 1 cluster 3 phy 7 ht 0)

Rationale: every sched↔orch sync line in the baseline 1+4 went cross-die
(sched 2/3 wrote SM `last_task_alive`, polled SM `current_task_index`,
polled `aicore_mailbox` — all coherence loads/stores on the dispatch
loop's critical path). Moving both to die-1 cluster 3 collapses those
to within-die cross-cluster (~3× faster on Ascend).

Cost: AICore mapping is NOT remapped, so sched 2/3 now dispatch
cross-die to die-0 AICores via MMIO (write-buffered, async). The trade
is "sync-coherence cross-die" → "async-MMIO cross-die".

Open question for stage 2: aicore_mailbox is shared (one ring buffer
in SM); die-0 AICores writing it produces cross-die snoops when
sched 2/3 (now on die 1) poll. If P4 becomes the new bottleneck,
mailbox would need per-die partitioning.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…urement

Mirrors the noprof toggle used to measure baseline / mainscheduler /
ideal in the 100-round trimmed-mean comparison. PTO2_PROFILING=1 stays
on so orch_cost / sched_cost (Level 1) are still emitted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
In directa-stage1 (single-die 1+4, all sched on die 1), every sched
still polled aicore_mailbox each loop iteration via poll_and_complete.
That kept all 4 threads contending on the try_lock and bouncing the
mailbox cache line across cluster 2 and cluster 3 — even though we
moved the scheds to a common die.

Restrict the poll to remote-cluster scheds (exec_idx 2/3 = cluster 3
in directa). Local-cluster scheds (exec_idx 0/1 = cluster 2, same
cluster as orch) skip the entire poll-and-complete block:

- try_lock contention drops from 4 readers to 2
- mailbox cache line settles in cluster 3 instead of bouncing
- local sched freed from mailbox memory accesses every iteration
- AICore writes (some cross-die from die-0 AICores, some local from
  die-1 AICores) all land in a single owner cluster

Pre-req: aicpu_executor.cpp now derives thread_idx from
platform_aicpu_affinity_thread_idx() (CPU-deterministic exec_idx) so
the gate predicate maps reliably to cluster.

If a remote-cluster sched falls behind on draining, completions back
up in the mailbox ring. Mailbox capacity should comfortably absorb
1024-task workloads; back-pressure surfaces would need explicit
testing on larger graphs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t change

Only-change vs directa-stage2: throttle RingSchedState::sync_to_sm()
to publish SM `last_task_alive` once per K=16 local advances.

- pto_scheduler.h: RingSchedState gets `int32_t last_published_to_sm{0}`
  (lock-protected: only touched while advance_lock CAS is held).
  sync_to_sm early-returns when delta < 16.

No proxy publish, no main thread, no placement change. Whoever
advances ring r at the K=16 boundary writes SM (could be sched 0..3
depending on who completes the corresponding tasks). orch's SM read
path stays mixed (rings 0/1 cluster 2 local, rings 2/3 cluster 3
within-die cross-cluster) but each line is invalidated ~16× less
frequently — slashes the cache-bouncing rate without moving cross-
cluster work onto the critical sched threads.

For paged_attention_unroll Case1: 256 advances per ring / 16 = 16
publishes per ring (clean multiple). No trailing-tail visibility
problem. Larger workloads or smaller task_window_size cases would
need an explicit fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…no async completions

paged_attention_unroll's kernels never call register_completion_condition,
so the per-task register_deferred call always returns NotDeferred after a
wasted CAS on async_wait_list.busy. With 4 sched threads bursting into
register_deferred at task completion, the do { try_lock } while (Skipped)
loop serialises across a cross-cluster shared cache line — each
contended CAS is not just ~150ns of snoop, it's snoop + serialised
re-try across 3 losing sched threads.

Fast-path checks the per-core slab count + the global wait list and
pending-completion counts. When all three are zero, the CAS is skipped.
Race-safe: AICore publishes slab.count BEFORE the FIN state the sched
reads via the COND register, so deferred_count==0 here means the kernel
never registered for this task.

100r noprof @ block_dim=24:
  max sched_cost  1423.38us -> 1288.54us  (-134.8us, -9.5%)
  orch_cost       728.85us -> 729.27us    (noise)
  per-thread     T0/1/2/3 all drop ~133us uniformly

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… enter_global_queue)

Adds two AICPU-side timestamps to every task's L2 perf record so post-hoc
analysis can isolate where time goes between "orch finished setting up" and
"AICore actually starts running":

  fanin_zero_us         when the fanin counter hit 0 (task became ready)
  enter_global_queue_us when the task first entered a globally-visible queue
                        (0 = sched-local buf consumed by same sched directly)

Together with the existing dispatch_time_us / start_time_us / end_time_us /
finish_time_us fields, this lets a script break the post-orch latency into
six clean stages per task:

  L1 orch_proc        orch_fanin.end − orch_alloc.start
  L2 dep_wait         fanin_zero − orch_fanin.end       (wiring queue + dep)
  L3a local_buf_stage enter_global_queue − fanin_zero   (0 if direct-consume)
  L3b global_q_wait   dispatch − enter_global_queue
  L4 head_oh          AIC start − dispatch
  L5 aic_exec         AIC end − start
  L6 fin_observe      sched-observed FIN − AIC end

Instrumentation points:
- PTO2TaskSlotState gains two uint64_t cycle counters in the cold (second)
  cache line; the hot 64-byte cache line layout is preserved.
- wire_task stamps both fields when fanin is already satisfied at orch wiring
  (direct push to global ready queue, used by no-fanin tasks like QK).
- release_fanin_and_check_ready stamps fanin_zero on the last-producer path;
  enter_global_queue is stamped only if the push fell through to the global
  queue (local_buf full or local_bufs == nullptr).
- flush_local_bufs in dispatch_ready_tasks stamps enter_global_queue for any
  task whose release path only set fanin_zero (was held in sched-local buf
  until this point).
- L2PerfRecord gains the two fields at the end of the struct, after fanout[],
  so the AICore staging-ring layout — start_time / end_time / task_id — keeps
  its existing offsets and the dcci(SINGLE_CACHE_LINE) invariant holds.
- l2_perf_aicpu_complete_record signature carries the new timestamps from
  scheduler_completion.cpp through to the host JSON exporter.

Overhead: two unconditional stores on the cold completion path, no hot-path
change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an opt-in diagnostic probe inside the sched main loop that captures
scheduler ramp-up behaviour and classifies "ready + idle + no dispatch"
stall events. Compile-gated by DIAG_DISPATCH_RAMP (default 0); when off
all probe code compiles out.

Per-sched probe (#if DIAG_DISPATCH_RAMP):
- Samples first 128 iters per sched: ts, ready(M/A/V), idle(M/A/V),
  pend(A/V), did_dispatch — dumped at sched_end via LOG_INFO_V9.
- Three iter-bucket counters: blocked (ready+idle but no dispatch),
  starved (no ready but idle), saturated (ready but no idle).
- Stall classification on iters where (any_idle_aic_global > 0
  && ready_aic > 0 && !dispatched_aic): drain / complete_long /
  pop_race / mix_residual / cross-sched-view (my idle = 0).
- Peer-idle PENDING gate accounting: how often the gate fires with
  my-pending-free > 0 and ready_aic > 0.
- Pop-race counter inside dispatch_shape AIC-IDLE: queue.size() > 0
  observed, pop_batch returns 0 (a peer drained first).

Why this exists / what we learned:
- Used to validate the peer-idle PENDING gate. Gate-OFF run showed
  50% AIC tasks shift to PENDING path with ~22us head_overhead each;
  removing the gate regressed wall by +61us. Gate is correct as-is.
- Stall classification confirmed ~1.4% of sched iters hit the
  "ready+idle+no-dispatch" condition, dominated by pop_race (53%) and
  cross-sched-view (37%), with R1/R2/R5 effectively zero on the
  paged_attention_unroll workload.

Probe lives in three files but its data structures (DiagRampState,
DiagRampSample) are declared in scheduler_types.h so the cold-path
dumper can reach them. Storage is a single
DiagRampState[MAX_AICPU_THREADS] in scheduler_dispatch.cpp.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…, stub)

Reshuffles the A5 die-1 placement to introduce a dedicated wiring thread.
The thread is launched but does no work yet — sched 0 still drains the
wiring queue. Step 1 just validates the placement, role identification,
and CANN dispatch coverage of the new cpu set; subsequent commits will
move drain_wiring_queue / advance_ring_pointers / sync_to_sm /
dep_pool.reclaim into the wiring thread.

New ALLOWED_CPUS layout (die 1, 6 threads):

  cluster 2 (orch + wiring, graph state cluster-local):
    cpu 9 (phy 5 ht 0) → orch
    cpu 7 (phy 4 ht 0) → wiring          [new role]

  cluster 3 (4 schedulers, 2 SMT pairs):
    cpu 11 (phy 6 ht 0) → sched 0
    cpu 12 (phy 6 ht 1) → sched 1        [SMT pair w/ sched 0]
    cpu 13 (phy 7 ht 0) → sched 2
    cpu 14 (phy 7 ht 1) → sched 3        [SMT pair w/ sched 2]

Confirmed on device 0: cpu 12 and cpu 14 are in CANN's natural dispatch
set under PLATFORM_MAX_AICPU_THREADS_JUST_FOR_LAUNCH=7 (14 worker threads
spawn, 6 survive the gate by cpu match). The affinity gate now classifies
exec_idx == N-1 as orch, exec_idx == N-2 as wiring, and the rest as
schedulers.

Runtime dispatch (aicpu_executor.cpp):
- sched_thread_num_ = aicpu_thread_num_ - 2 when aicpu_thread_num_ >= 3
  (subtract both orch and wiring); falls back to -1 in degenerate configs
  for back-compat.
- Three-way role classification: is_orch / is_wiring / sched.
- Wiring branch is a stub: waits for runtime_init_ready_, then spins on
  SPIN_WAIT_HINT until sched_ctx_.is_completed(). No work moved yet.

Case1 aicpu_thread_num bumped from 5 to 6 to claim the new wiring slot.

Net result, 100r noprof @ bd36 (paged_attention_unroll Case1):
  max sched_cost  1247.32us -> 1224.75us   (-22.6us, -1.8%)
  orch_cost        767.40us ->  796.32us   (+29us, noise/cluster pressure)
  per-thread sched_cost  1244-1248us -> 1221-1224us  (4 schedulers now
                                                       uniform to within 3us)

The improvement comes purely from collapsing inter-sched coherence
traffic (has_idle peer reads, ready_queue dequeue CAS, async_wait_list
busy bit) into a single cluster — sched 0/1 (cluster 2) ↔ sched 2/3
(cluster 3) cross-cluster snoop chain is gone. Wiring stub on cluster 2
phy 4 is wasted real estate until step 2 puts work there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…read

Hand off all ring-management state to the wiring thread that step 1
introduced. Sched threads no longer touch the wiring queue or the
last_task_alive advance path; the wiring thread is the sole writer for
those, eliminating the advance_lock CAS chain that all 4 sched used to
contend on.

What moves:
- drain_wiring_queue (orch → ready_queues): was sched 0 only, interleaved
  with its dispatch loop; now driven continuously by wiring_thread_run.
  Removed from scheduler_dispatch.cpp's main loop.
- advance_ring_pointers (last_task_alive walk + reset_for_reuse): was
  called from check_and_handle_consumed on every fanout-release path with
  an advance_lock try-CAS dance across 4 sched; now invoked only by the
  wiring thread, one ring at a time. CAS removed from sched-side
  check_and_handle_consumed in both PTO2_SCHED_PROFILING and non-profiling
  variants; the advance_lock itself stays as a defensive no-op (cheap
  uncontended; followup commit may drop it entirely).
- sync_to_sm (K=16 SM publish): rides along inside advance_ring_pointers,
  so it follows automatically. The final-drain pass in wiring_thread_run
  force-publishes once after completed_ so the orchestrator's next round /
  shutdown sees the latest counter regardless of the K=16 throttle.
- dep_pool.reclaim: already called lazily from inside drain_wiring_queue;
  since drain runs on wiring, reclaim does too.

Sched threads now only:
- poll AIC COND for FIN
- on_mixed_task_complete + fanout release (push to local_buf / global queue)
- CAS task_state to CONSUMED in check_and_handle_consumed
- dispatch_ready_tasks

The wiring loop polls task_state on its ring walk and detects the CONSUMED
transition. Same coherence cost as before (sched writes the CAS, wiring
reads), just without the extra advance_lock CAS by sched.

Net result, 100r noprof @ bd36 (paged_attention_unroll Case1):
  max sched_cost  1224.75us -> 1150.55us  (-74.2us, -6.1%)
  orch_cost        796.32us ->  852.75us  (+56us; wiring thread now
                                            hot in cluster 2, some cache
                                            pressure on orch)
  per-thread sched_cost  1221-1224us -> 1149-1150us
  All 4 sched threads now within 1us of each other (single-writer
  advance removed the asymmetric serial point).

Cumulative vs upstream main baseline (1447.5us): -297us, -20.5%.
Within 1.7us of A3 max sched_cost on the same workload (1148.8us).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rop dead backoff

Two small cleanups on top of the wiring-thread split:

1. l2_perf_aicpu_init_phase used to allocate (num_sched_threads + 1) phase
   buffers, assuming a single non-sched thread (the orchestrator). With the
   wiring thread inserted at exec_idx == sched_thread_num_, the
   orchestrator now sits at exec_idx == sched_thread_num_ + 1, so its
   phase buffer was never allocated and orchestrator phase records (level
   4) were silently dropped — the swimlane JSON came back without an
   "aicpu_orchestrator_phases" section. Add an optional
   num_total_aicpu_threads argument; the scheduler-side init call now
   passes aicpu_thread_num_ so every exec_idx gets its buffer.

2. wiring_thread_run had an idle-backoff scaffold (SPIN_BUDGET, idle_spins
   counter, SPIN_WAIT_HINT() on every 64 idle iters). On A5 onboard
   SPIN_WAIT_HINT() is a no-op, and in steady state wiring is rarely idle
   anyway (each iter either wires a task or advances a ring). Tested a
   real backoff (NOP-loop pause when idle) — it did not move orch_cost
   (cluster-2 pressure comes from wiring's *stores* to last_task_alive /
   wiring_queue.dequeue_pos, not from its reads) and cost ~10us on
   max sched_cost. Drop the scaffold entirely; the loop body is now just
   drain + advance.

100r noprof @ bd36 unchanged within noise (max sched_cost ≈ 1160us,
orch_cost ≈ 850us, all 4 sched uniform to within 1us).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The unroll_manual_scope QK kernel on a5 was left behind when the unroll
variant gained the L1 double-buffered B-tile prefetch (two bMatTile
buffers; TLOAD next kj into the alternate buffer while TMATMUL+TSTORE
runs on the current one). a2a3 has the same optimisation in both unroll
and unroll_manual_scope; only a5 unroll_manual_scope was missing it.

Result: per-task QK exec on a5 manual_scope drops from ~96.5us to
~74.5us (-22us, -23%). Aggregate AICore wall on paged_attention_unroll
Case1 (batch=256, bd36) drops ~40us, and max sched_cost drops ~60us.

This is a straight kernel sync; no semantics change. The body is copied
verbatim from
tests/st/a5/tensormap_and_ringbuffer/paged_attention_unroll/kernels/aic/aic_qk_matmul.cpp.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Switches the orchestration entry of both paged_attention_unroll (auto
scope) and paged_attention_unroll_manual_scope (manual scope) from the
original depth-first emission (QK, SF, PV, UP per (b_idx, q_idx)) to a
two-pass scheme:
- Outer scope: pass 1 — emit all 256 QKs back-to-back.
- 256 inner scopes, one per (b_idx, q_idx): pass 2 — emit SF, PV, UP
  referencing the QK output (sij_buf) captured during pass 1.

Why: under depth-first, orch emits QK_i then SF_i / PV_i / UP_i which
all wait for QK_i to finish before they can dispatch. The wiring thread
sees a thin ready-queue (one QK per ~3us of orch time). With pass 1
batched, wiring observes 256 QKs in the queue in ~200us — the AICore
ramp-up wave starts sooner.

Storage: Tensor's default constructor is private, so the QK outputs are
held across passes in a TensorSlot[256] parallel array using placement
new (no Tensor[N] aggregate initialisation is possible). MAX_BQ is
fixed at 256 — Case1's batch*q_loop*bn_per upper bound — to keep stack
under ~36KB. (An earlier 1024-cap version put 1.28MB on stack and ran
~450us slower.)

Manual-scope variant additionally captures each QK's PTO2TaskId in a
parallel array so SF can wire its explicit set_dependencies({qk_task_id})
across the outer/inner scope boundary.

Wiring thread:
  manual_scope's `aicpu_thread_num` is bumped from 4 to 6 to claim the
  wiring slot introduced in commit d604dff0; auto-scope variant already
  had it at 5 (now 6 via the previous refactor).

Measurements, 100r noprof @ bd36, paged_attention_unroll Case1:
                                 max sched_cost   orch_cost
  unroll        before              1247us           767us
                this commit         1133us          ~980us  ← -114us wall
  manual_scope  before (old QK)     1247us         ~  -   us
                this commit (a9fcdb73 + QK kernel sync)
                                    1109us           924us  ← -138us wall

(orch_cost rises because wiring becomes active earlier and shares
cluster 2 with orch — extra cache invalidations on
wiring_queue.enqueue_pos and ring->fc.last_task_alive lines. orch is
still well under max sched_cost so it does not extend the wall.)

a2a3 / sim variants are untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bb98e969 (directa-pubonly-nobarrier: per-task lifecycle timestamps)
added fanin_zero_cycles + enter_global_queue_cycles to
l2_perf_aicpu_complete_record, taking the parameter list from 10 to 12
args. The directa branch updated the tensormap_and_ringbuffer caller
(scheduler_completion.cpp) but missed the host_build_graph runtime's
own aicpu_executor.cpp — those 4 call sites still passed the 10-arg
form, breaking the build the moment main was merged (main also brought
in the new signature, plus tensor_dump.h's dependency on
host_build_graph headers, so we can't just drop the runtime).

Pass 0 / 0 for fanin_zero / enter_global_queue since host_build_graph
doesn't track those timestamps. Pure compile fix — no perf change on
the tensormap_and_ringbuffer test path (this runtime isn't exercised
by paged_attention_unroll).
The directa-stage1 / directa-pubonly-nobarrier work hardcoded
ALLOWED_CPUS for the Full SKU (14-logical-cpu, SMT-enabled), where
wiring/orch land on cluster 2 (cpu 7/9 = phy 4/5 ht 0) and 4 sched on
cluster 3 (cpu 11/12/13/14 = phy 6/7 SMT pairs). Our dev box is the
9-cpu SKU: only cpu 0..8 exist (cpu 0 = system TS, cpu 1+2 = HCCL
is_share, cpu 3..8 = AICPU compute, no SMT on phy 2..7). Running the
Full-SKU layout here drops 7 of 8 launched threads at the gate (only
cpu 7 happens to be in ALLOWED_CPUS) and triggers a 2-second AICPU
stream-sync timeout.

Two local config changes to make this branch's gate work on the dev
box without changing the gate's filter-style design:

- ALLOWED_CPUS {11,12,13,14,7,9} -> {7,8,3,4,5,6}: same role-priority
  ordering convention (indices 0..N-3 = sched, N-2 = wiring, N-1 =
  orch), so wiring=cpu 5 (phy 4) + orch=cpu 6 (phy 5) stay on adjacent
  phys, and the 4 sched land on cpu 7/8/3/4. Result matches the
  cluster-1 placement that opt-a5-pa's cluster-based gate computes on
  this box, just hardcoded.

- PLATFORM_MAX_AICPU_THREADS_JUST_FOR_LAUNCH 14 -> 8: directa's gate
  barrier spins on `s_cpu_written < total_launched`. CANN dispatches
  at most 8 worker threads on our 9-cpu SKU (physIndex 1..8; cpu 0 is
  OS-reserved and never receives an exec invocation), so a value of 14
  hangs the gate forever waiting for the last 6 threads.

Both are local-box workarounds; if/when this branch needs to ship to
Full SKU again, revert the two literal changes and re-launch on a real
14-cpu device.
@poursoul poursoul force-pushed the a5-performance-opt branch from 1cf1689 to d9f7d6e Compare May 30, 2026 02:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant