Add: dual-slot AICPU dispatch payload and two-phase pipelining scheduler #477
Open
zhusy54 wants to merge 1 commit into hw-native-sys:main from
Code Review
This pull request introduces a dual-slot dispatch mechanism to the AICPU executor, allowing for task pipelining by tracking both running and pending tasks per core. It updates the CoreTracker and CoreExecState to manage these slots and implements a two-phase dispatch logic that prioritizes idle cores before filling pending slots. Additionally, the AICore performance collector was optimized to use a caller-maintained write index, reducing cache invalidation overhead. A fix for simulation environments was also included to prevent payload corruption. Review feedback identifies opportunities to optimize the scheduler's hot path by removing redundant bitmask refreshes.
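The two-phase selection described above (idle cores first, then cores whose pending slot is still free) can be sketched with plain bitmasks. This is a minimal illustration under stated assumptions, not the PR's code: `SlotMasks`, `idle_cores`, `pending_only_cores`, and `pick_core` are hypothetical names, and the real CoreTracker may organize its bit states differently.

```cpp
#include <cstdint>

// Hypothetical per-core occupancy masks: bit i set means that slot on core i is taken.
struct SlotMasks {
    uint64_t running_occupied;  // core i has a task executing
    uint64_t pending_occupied;  // core i has a task queued behind the running one
};

// Phase 1 candidates: valid cores with both slots free.
constexpr uint64_t idle_cores(SlotMasks m, uint64_t valid) {
    return valid & ~m.running_occupied & ~m.pending_occupied;
}

// Phase 2 candidates: valid cores that are running but whose pending slot is free.
constexpr uint64_t pending_only_cores(SlotMasks m, uint64_t valid) {
    return valid & m.running_occupied & ~m.pending_occupied;
}

// Pick the lowest-numbered candidate: idle cores first, then free pending slots.
// Returns -1 when every valid core already has both slots occupied.
inline int pick_core(SlotMasks m, uint64_t valid) {
    uint64_t candidates = idle_cores(m, valid);
    if (candidates == 0) candidates = pending_only_cores(m, valid);
    if (candidates == 0) return -1;
    int index = 0;
    while ((candidates & 1u) == 0) {
        candidates >>= 1;
        ++index;
    }
    return index;
}
```

Trying idle cores first keeps work spread across cores; a pending slot is only used once no core is completely free, which is when queuing behind a running task pays off.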
src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
4abf76b to aa95038
… with dual-watermark deferred task release
- Introduce two-slot dispatch payload (slot 0 / slot 1) for AICPU
- Implement two-phase pipelining: dispatch phase and execute phase
- Add dual-watermark mechanism for deferred task release
Summary
payload while AICore is still executing the current one (true pipelining)
is occupied, eliminating idle cycles between consecutive kernel launches
Key Changes
- s_pto2_payload_per_core: extended from single-buffer to [RUNTIME_MAX_WORKER][2]; slot selected by reg_task_id & 1u, consistent between AICPU (write) and AICore (read)
- CoreExecState: add parallel running/pending field pairs (slot_state, reg_task_id, subslot, dispatch_timestamp); rename executing_* → running_*
- CoreTracker: add pending_occupied_ BitStates with get_idle_cluster_offset_states (both slots free) and get_pending_only_cluster_offset_states (core running, pending slot free) for two-phase dispatch
- decide_slot_transition() pure function to decode register events into SlotTransition flags; extract complete_slot_task() helper for the completion hot path
- pipe_barrier(PIPE_ALL) before kernel execution and select exec_payload via payload + (task_id & 1u); simulation no-op fallback added
- TASK_ID_MASK
Testing
- a2a3sim and hardware paths
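The parity-based slot selection listed under Key Changes can be illustrated as follows. This is a sketch under assumptions: `Payload`, `kMaxWorkers`, `g_payload`, `write_payload`, and `read_payload` are hypothetical stand-ins (the PR's actual array is s_pto2_payload_per_core, sized by RUNTIME_MAX_WORKER, and its payload layout is not shown here). It also assumes task IDs dispatched to one core are consecutive, so the running and pending tasks always map to different slots.

```cpp
#include <cstdint>

constexpr int kMaxWorkers = 4;  // stand-in for RUNTIME_MAX_WORKER (value assumed)

// Hypothetical payload layout, for illustration only.
struct Payload {
    uint32_t task_id;
    uint32_t arg;
};

// Two payload buffers per core: one for the running task, one for the pending task.
static Payload g_payload[kMaxWorkers][2];

// Both sides derive the slot from the low bit of the task ID.
constexpr uint32_t slot_of(uint32_t reg_task_id) { return reg_task_id & 1u; }

// Writer side (AICPU in the PR): fill the slot that matches the task ID's parity.
inline void write_payload(int core, uint32_t reg_task_id, uint32_t arg) {
    g_payload[core][slot_of(reg_task_id)] = Payload{reg_task_id, arg};
}

// Reader side (AICore in the PR): same parity rule, written as base + (task_id & 1u).
inline const Payload& read_payload(int core, uint32_t task_id) {
    const Payload* base = g_payload[core];
    return *(base + (task_id & 1u));
}
```

Because consecutive IDs alternate between slot 0 and slot 1, the writer can prepare the next task's payload while the reader is still using the current one, without the two sides exchanging any extra slot index.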