Perf(runtime): defer fanout wiring to scheduler via wiring queue #496
Open
poursoul wants to merge 1 commit into hw-native-sys:main
Code Review
This pull request moves fanout wiring from the orchestrator's submission hot path to a deferred wiring queue managed by the scheduler to reduce memory bus pressure. Feedback identifies a critical race condition where dep_pool_mark is assigned after tasks are pushed to the ready queue, which could cause incorrect memory reclamation. Other suggestions include restoring error observability during dependency pool initialization and optimizing the frequency of reclamation checks by grouping tasks by ring.
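The ordering problem raised in the review can be illustrated with a minimal sketch. It assumes that `dep_pool_mark` records a dependency-pool position that a later reader uses for reclamation, and it models "pushed to the ready queue" with a single atomic flag. The `Slot` type and both helper functions are hypothetical and are not taken from the PR. The point is only that the mark must be written before the task becomes visible to other threads.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical slot record; the real slot state in the PR has more fields.
struct Slot {
    uint32_t dep_pool_mark = 0;          // assumed: pool position used for reclamation
    std::atomic<bool> published{false};  // stand-in for "pushed to the ready queue"
};

// Wiring side: write the mark first, then publish with release ordering so
// that any thread observing the publication also observes the mark.
inline void wire_and_publish(Slot& s, uint32_t pool_top) {
    s.dep_pool_mark = pool_top;
    s.published.store(true, std::memory_order_release);
}

// Consumer side: the acquire load pairs with the release store above.
// Returns false while the slot has not been published yet.
inline bool try_read_mark(const Slot& s, uint32_t& out) {
    if (!s.published.load(std::memory_order_acquire)) {
        return false;
    }
    out = s.dep_pool_mark;
    return true;
}
```

If the two statements in `wire_and_publish` were swapped, a consumer could see the task as published while still reading a stale mark, which is the kind of window the review describes.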
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_scheduler.h
src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_scheduler.h
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_scheduler.cpp
src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_scheduler.cpp
b938314 to 45baf73
Move fanout edge construction (fanout_lock acquisition, dep_pool allocation, early_finished check, and ready-queue push) from the orchestrator's submit hot path to a dedicated wiring queue drained by scheduler thread 0. This reduces cross-core L2 cache and memory bus contention between orchestrator and scheduler threads.

Key changes:
- Orchestrator submit (STEP 6) now only stores fanin metadata in payload and increments producers' fanout_count (no lock needed)
- New PTO2SchedulerState::drain_wiring_queue() method handles all fanout wiring asynchronously
- dep_pool ownership moved from PTO2RingSet to RingSchedState, exclusively managed by scheduler thread 0
- Slot state initialization consolidated into pto2_prepare_task()
- Scheduler profiling extended with wiring phase statistics
- Fix pre-existing MD040/MD060 markdown lint errors in touched docs

Measured on paged_attention_unroll (Case1, 100 rounds):
entry_cost: 914 -> 739 us (-19%)
sched_cost: 1143 -> 1148 us (no regression)
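The split between the submit path and the deferred wiring step can be sketched as follows. This is a single-threaded illustration under simplifying assumptions: the `Task` and `Runtime` types, their fields, and the `submit`/`complete` helpers are hypothetical, plain `std::deque` stands in for the real queues, and no locks or dependency pool are modelled. Only the name `drain_wiring_queue` is borrowed from the commit message, and its body here is not the PR's implementation.

```cpp
#include <cassert>
#include <deque>
#include <utility>
#include <vector>

// Hypothetical, simplified task record.
struct Task {
    std::vector<int> fanin;   // producer ids, stored at submit time
    std::vector<int> fanout;  // consumer edges, built later by the scheduler
    int fanout_count = 0;     // bumped at submit, no edge construction needed
    int pending = 0;          // unfinished producers, computed during wiring
    bool finished = false;
};

struct Runtime {
    std::vector<Task> tasks;
    std::deque<int> wiring_queue;  // tasks waiting for fanout wiring
    std::deque<int> ready_queue;   // tasks whose producers have all finished

    // Submit hot path: store fanin, bump producers' fanout_count, defer the rest.
    int submit(std::vector<int> fanin) {
        int id = static_cast<int>(tasks.size());
        tasks.emplace_back();
        tasks[id].fanin = std::move(fanin);
        for (int p : tasks[id].fanin) {
            tasks[p].fanout_count++;
        }
        wiring_queue.push_back(id);
        return id;
    }

    // Scheduler side: build fanout edges, skip producers that already
    // finished, and push tasks with no outstanding producers as ready.
    void drain_wiring_queue() {
        while (!wiring_queue.empty()) {
            int id = wiring_queue.front();
            wiring_queue.pop_front();
            Task& t = tasks[id];
            for (int p : t.fanin) {
                if (tasks[p].finished) continue;  // early-finished producer
                tasks[p].fanout.push_back(id);
                t.pending++;
            }
            if (t.pending == 0) ready_queue.push_back(id);
        }
    }

    // Completion: release consumers whose last producer just finished.
    void complete(int id) {
        tasks[id].finished = true;
        for (int c : tasks[id].fanout) {
            if (--tasks[c].pending == 0) ready_queue.push_back(c);
        }
    }
};
```

In this sketch a task cannot become ready until the scheduler has drained it from the wiring queue, which is the trade-off the commit message describes: the submit path gets cheaper, and the wiring work moves to the scheduler side.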
45baf73 to c0a95cf
ChaoWao approved these changes on Apr 10, 2026