Perf(runtime): defer fanout wiring to scheduler via wiring queue #496

Open
poursoul wants to merge 1 commit into hw-native-sys:main from poursoul:refactor-fanin


@poursoul poursoul commented Apr 9, 2026

Move fanout edge construction (fanout_lock acquisition, dep_pool allocation, early_finished check, and ready-queue push) from the orchestrator's submit hot path to a dedicated wiring queue drained by scheduler thread 0. This reduces cross-core L2 cache and memory bus contention between orchestrator and scheduler threads.

Key changes:

  • Orchestrator submit (STEP 6) now only stores fanin metadata in payload and increments producers' fanout_count (no lock needed)
  • New PTO2SchedulerState::drain_wiring_queue() method handles all fanout wiring asynchronously
  • dep_pool ownership moved from PTO2RingSet to RingSchedState, exclusively managed by scheduler thread 0
  • Slot state initialization consolidated into pto2_prepare_task()
  • Scheduler profiling extended with wiring phase statistics
  • Fix pre-existing MD040/MD060 markdown lint errors in touched docs
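The split described above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `Task`, `WiringQueue`, `Runtime`, `submit`, and `complete` are hypothetical stand-ins, and only the name `drain_wiring_queue` is borrowed from the description. It assumes one orchestrator thread and a single scheduler thread, so the `fanout_lock` and `dep_pool` allocation from the real code are left out.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical task record; field names only loosely follow the PR text.
struct Task {
    std::vector<uint32_t> fanin;            // producer ids, stored at submit time
    std::vector<uint32_t> fanout;           // consumer ids, wired later by the scheduler
    std::atomic<uint32_t> fanout_count{0};  // bumped by the orchestrator without a lock
    uint32_t pending = 0;                   // unfinished producers (scheduler-only)
    bool finished = false;                  // scheduler-only in this sketch
};

// Single-producer / single-consumer ring: orchestrator pushes, scheduler pops.
template <std::size_t N>
class WiringQueue {
    static_assert(N > 0 && (N & (N - 1)) == 0, "capacity must be a power of two");
    uint32_t slots_[N];
    std::atomic<std::size_t> head_{0};  // advanced by the consumer
    std::atomic<std::size_t> tail_{0};  // advanced by the producer
public:
    bool push(uint32_t id) {
        std::size_t t = tail_.load(std::memory_order_relaxed);
        if (t - head_.load(std::memory_order_acquire) == N) return false;  // full
        slots_[t & (N - 1)] = id;
        tail_.store(t + 1, std::memory_order_release);  // publish the slot
        return true;
    }
    bool pop(uint32_t& id) {
        std::size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire)) return false;  // empty
        id = slots_[h & (N - 1)];
        head_.store(h + 1, std::memory_order_release);
        return true;
    }
};

struct Runtime {
    std::vector<Task> tasks;
    WiringQueue<64> wiring;
    std::vector<uint32_t> ready;  // stand-in for the scheduler's ready queue

    explicit Runtime(std::size_t n) : tasks(n) {}

    // Orchestrator hot path: record fanin, bump counters, enqueue. No edge wiring.
    // A full queue is only reported here; a real implementation would need to
    // back off before touching the counters.
    bool submit(uint32_t id, std::vector<uint32_t> producers) {
        for (uint32_t p : producers)
            tasks[p].fanout_count.fetch_add(1, std::memory_order_relaxed);
        tasks[id].fanin = std::move(producers);
        return wiring.push(id);
    }

    // Scheduler side: build fanout edges and push tasks with no live producers.
    std::size_t drain_wiring_queue() {
        std::size_t wired = 0;
        uint32_t id;
        while (wiring.pop(id)) {
            Task& t = tasks[id];
            t.pending = 0;
            for (uint32_t p : t.fanin) {
                if (tasks[p].finished) continue;  // producer finished early: no edge
                tasks[p].fanout.push_back(id);
                ++t.pending;
            }
            if (t.pending == 0) ready.push_back(id);
            ++wired;
        }
        return wired;
    }

    // Scheduler side: a finished task releases its wired consumers.
    void complete(uint32_t id) {
        tasks[id].finished = true;
        for (uint32_t c : tasks[id].fanout)
            if (--tasks[c].pending == 0) ready.push_back(c);
    }
};
```

In this sketch the submit path touches only its own task's payload, the producers' atomic counters, and one queue slot, which is the kind of reduction in shared-line traffic the description is aiming for.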

Measured on paged_attention_unroll (Case1, 100 rounds):
entry_cost: 914 -> 739 us (-19%)
sched_cost: 1143 -> 1148 us (no regression)

| Example | Base (us) | HEAD (us) | Delta (us) | Change (%) |
|---|---|---|---|---|
| alternating_matmul_add | 916.0 | 785.9 | -130.1 | -14.20% |
| (orch) | 915.8 | 785.5 | -130.3 | -14.23% |
| benchmark_bgemm | 728.8 | 733.0 | +4.2 | +0.58% |
| (orch) | 697.2 | 689.7 | -7.5 | -1.08% |
| paged_attention_unroll (Case1) | 1146.3 | 1154.4 | +8.1 | +0.71% |
| (orch) | 934.4 | 733.1 | -201.3 | -21.54% |
| paged_attention_unroll (Case2) | 554.5 | 532.5 | -22.0 | -3.97% |
| (orch) | 412.3 | 306.9 | -105.4 | -25.56% |
| batch_paged_attention | 3165.6 | 2834.9 | -330.7 | -10.45% |
| (orch) | 2529.9 | 1877.1 | -652.8 | -25.80% |

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request moves fanout wiring from the orchestrator's submission hot path to a deferred wiring queue managed by the scheduler to reduce memory bus pressure. Feedback identifies a critical race condition where dep_pool_mark is assigned after tasks are pushed to the ready queue, which could cause incorrect memory reclamation. Other suggestions include restoring error observability during dependency pool initialization and optimizing the frequency of reclamation checks by grouping tasks by ring.
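The ordering concern raised in the review can be illustrated with a small sketch. Everything here is assumed for illustration: `Slot`, `DepPool`, `wire_and_publish`, and `retire` are hypothetical, and only the field name `dep_pool_mark` comes from the review text. The idea is that the mark must be written before the task becomes visible to any thread that might retire it and reclaim pool entries.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical slot state; the PR's real layout is not shown in this thread.
struct Slot {
    uint32_t dep_pool_mark = 0;          // pool position to reclaim up to on retire
    std::atomic<bool> published{false};  // stands in for "visible via the ready queue"
};

// Hypothetical bump allocator with a reclamation cursor.
struct DepPool {
    uint32_t top = 0;   // next free entry
    uint32_t tail = 0;  // entries below this are reclaimed
    uint32_t alloc(uint32_t n) { uint32_t at = top; top += n; return at; }
    void reclaim_to(uint32_t mark) { if (mark > tail) tail = mark; }
};

// Scheduler side: allocate edges, record the mark, and only then publish.
void wire_and_publish(Slot& slot, DepPool& pool, uint32_t edges,
                      std::vector<Slot*>& ready) {
    pool.alloc(edges);
    slot.dep_pool_mark = pool.top;                          // 1. record the mark
    slot.published.store(true, std::memory_order_release);  // 2. then publish
    ready.push_back(&slot);
}

// Consumer side: the acquire load pairs with the release store above, so a
// published slot is guaranteed to carry its final mark.
void retire(Slot& slot, DepPool& pool) {
    if (!slot.published.load(std::memory_order_acquire)) return;
    pool.reclaim_to(slot.dep_pool_mark);
}
```

If steps 1 and 2 were swapped, a consumer could retire the slot while `dep_pool_mark` still held a stale value, which is the kind of incorrect reclamation the review describes.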
