Fix #897: cross-chip init barrier so peer-rank init skew never charges against per-rank stream sync budget#935
Conversation
… never charges against the per-rank stream sync budget ChipWorker.init has a long right tail under PTO2_RING_HEAP=4 GiB (per-rank device_malloc(16 GiB) pulls host long-tail into the critical path). Without coordination, a fast-init chip starts its first launch_aicore_kernel + aclrtSyncStream(PLATFORM_STREAM_SYNC_TIMEOUT_MS) window N seconds before a slower peer reaches the same point. Any cross-rank wait inside the dispatched op (HCCL notify, distributed allreduce, etc.) charges the slow peer's remaining init time against the fast peer's 2 s host stream sync budget — the cascade observed as the residual 1/30 fail after PR hw-native-sys#930. Add a new MailboxState::INIT_DONE (value 6). Chip child writes it once ChipWorker.init returns; parent's _start_hierarchical spin-waits for every chip child to reach INIT_DONE before dispatching anything (CTRL_PREPARE or TASK_READY), then resets each child to IDLE so the standard dispatch state machine resumes from the canonical "ready for work" state. Effect: all chip childrens' first aclrtSyncStream windows open at the same wall-clock instant, so the 2 s budget covers only actual op execution rather than absorbing peer init skew. ## Verification Same task-submit + locked-NPU + 30-round protocol as PR hw-native-sys#930, on pypto's tests/st/distributed/test_l3_allreduce.py with the issue hw-native-sys#897 trigger env (PTO2_RING_HEAP=4 GiB, PTO2_RING_TASK_WINDOW=131072, PTO2_RING_DEP_POOL=131072). | variant | pass | fail | sched_error_code=100 | host stream sync timeout | | -------------------- | ---- | ---- | -------------------- | ------------------------ | | upstream/main | 15 | 15 | 15 | (subsumed) | | PR hw-native-sys#930 alone | 29 | 1 | 0 | 1 | | PR hw-native-sys#930 + this PR | 30 | 0 | 0 | 0 | Composes with PR hw-native-sys#930 — they address different layers of the same distributed-init-skew failure mode: - PR hw-native-sys#930 eliminates the false-positive AICPU scheduler latch (`sched_error_code=100`) that fires from any idle thread at ~92 ms. - This PR eliminates the underlying skew so the outer host-side timer (`PLATFORM_STREAM_SYNC_TIMEOUT_MS = 2000`) also has no input event to fire on. Either fix alone leaves a residual; both together drop the failure rate to 0 under the issue's reproduction conditions. ## Scope The same `INIT_DONE` signaling could be wired into `_sub_worker_loop` and `_child_worker_loop` (L4+) for symmetry. Skipped here because sub workers do no device init (Python-only loops) and the L4+ inner Worker race is unobserved; addressing it would be a follow-up keyed to its own repro. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
|
Warning Review limit reached
More reviews will be available in 37 minutes and 31 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
…NNING task + wall-clock budget + deadlock fallback Issue hw-native-sys#897 documents an idle-thread false-latch in the AICPU scheduler: any thread reaching MAX_IDLE_ITERATIONS (~92 ms wall at ~115 ns/iter) unconditionally writes sched_error_code=PTO2_ERROR_SCHEDULER_TIMEOUT (100) to the shared header, which sibling threads see in handle_orchestrator_exit and abort. In distributed runs this routinely kills the runtime even when the only thread with a RUNNING task is making correct progress, because a peer rank's host-side device_malloc tail-latency easily exceeds 92 ms. Three changes, applied symmetrically to a2a3 and a5 tensormap_and_ringbuffer: 1. Replace the per-thread iteration-count timeout with a wall-clock budget (SCHEDULER_TIMEOUT_MS=5000 → SCHEDULER_TIMEOUT_CYCLES via the existing 50 MHz sys-cnt frequency). Iteration counts run at per-thread spin speed: a pure-idle thread hits MAX_IDLE_ITERATIONS in ~92 ms while a sibling that polls a COND register on a RUNNING task takes ~200 ms for the same iteration count. Letting the fast spinner race ahead and declare global fatal kills the slower-but-correct poller mid-poll. Wall-clock unifies the deadline. STALL_LOG_INTERVAL still drives per-thread observability on iteration count (diagnostic cadence, not load-bearing). 2. Gate the call to handle_timeout_exit on self_owns_running_task(thread_idx). A thread without an owned RUNNING task has no first-hand evidence of a stuck dispatch and must not declare global fatal on its own idle clock. If the wall-clock budget elapses on a thread with no evidence, refresh last_progress_ts so the next budget cycle starts fresh; the STALL diagnostic above still fires periodically so observability is preserved while the false-positive latch is suppressed. The thread that does own a stuck task will reach the budget on its own polls and latch with valid evidence (or recover when the COND register flips). 3. Second latch branch for "pre-dispatch / WAIT-only deadlock" (CodeRabbit feedback): if the wall-clock budget elapses AND no thread anywhere owns a RUNNING task AND tasks remain unfinished (completed_tasks_ < total_tasks_), latch. Without this, a genuine dependency-cycle deadlock with zero RUNNING work would spin forever — every ownerless thread would just refresh its timestamp and never latch, and the host-side aclrtSyncStream timeout is the only outer detector. The new no_thread_owns_running_task() helper makes the global-stuck condition cheap to test on the cold timeout path. The 5 s budget is sized to cover worst-observed distributed-init skew (reported as ~196 ms in hw-native-sys#897 round 12; PTO2_RING_HEAP=4 GiB pushes host-side device_malloc tail latency into the same range) plus typical HCCL wait, with margin. PLATFORM_STREAM_SYNC_TIMEOUT_MS=2000 on the host side remains as the outer hang detector for cases the AICPU never reaches. ## Verification Repro on shared a2a3 runner via task-submit, locked to a clean pair of NPUs, 30 rounds each, identical env from hw-native-sys#897: PTO2_RING_HEAP=4294967296 PTO2_RING_TASK_WINDOW=131072 PTO2_RING_DEP_POOL=131072 Test: pypto's tests/st/distributed/test_l3_allreduce.py (the exact case named in hw-native-sys#897). Baseline = upstream/main (pre-fix); fix = this PR's branch. baseline 30 rounds: pass=15 fail=15 sched_error_code=100=15 this PR alone (initial): pass=29 fail=1 sched_error_code=100=0 this PR (with CodeRabbit changes): pass=30 fail=0 sched_error_code=100=0 (combined with hw-native-sys#935 chip-init barrier) The 1/30 residual from the initial version was a host-side aclrtSyncStream(2 s) timeout — the next outer timer down the stack, independent of the scheduler latch this PR fixes. Combining with hw-native-sys#935 (cross-chip init barrier) eliminates that residual at source. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…NNING task + wall-clock budget + deadlock fallback Issue hw-native-sys#897 documents an idle-thread false-latch in the AICPU scheduler: any thread reaching MAX_IDLE_ITERATIONS (~92 ms wall at ~115 ns/iter) unconditionally writes sched_error_code=PTO2_ERROR_SCHEDULER_TIMEOUT (100) to the shared header, which sibling threads see in handle_orchestrator_exit and abort. In distributed runs this routinely kills the runtime even when the only thread with a RUNNING task is making correct progress, because a peer rank's host-side device_malloc tail-latency easily exceeds 92 ms. Three changes, applied symmetrically to a2a3 and a5 tensormap_and_ringbuffer: 1. Replace the per-thread iteration-count timeout with a wall-clock budget (SCHEDULER_TIMEOUT_MS=5000 → SCHEDULER_TIMEOUT_CYCLES via the existing 50 MHz sys-cnt frequency). Iteration counts run at per-thread spin speed: a pure-idle thread hits MAX_IDLE_ITERATIONS in ~92 ms while a sibling that polls a COND register on a RUNNING task takes ~200 ms for the same iteration count. Letting the fast spinner race ahead and declare global fatal kills the slower-but-correct poller mid-poll. Wall-clock unifies the deadline. STALL_LOG_INTERVAL still drives per-thread observability on iteration count (diagnostic cadence, not load-bearing). 2. Gate the call to handle_timeout_exit on self_owns_running_task(thread_idx). A thread without an owned RUNNING task has no first-hand evidence of a stuck dispatch and must not declare global fatal on its own idle clock. If the wall-clock budget elapses on a thread with no evidence, refresh last_progress_ts so the next budget cycle starts fresh; the STALL diagnostic above still fires periodically so observability is preserved while the false-positive latch is suppressed. The thread that does own a stuck task will reach the budget on its own polls and latch with valid evidence (or recover when the COND register flips). 3. Second latch branch for "pre-dispatch / WAIT-only deadlock" (CodeRabbit feedback): if the wall-clock budget elapses AND no thread anywhere owns a RUNNING task AND tasks remain unfinished (completed_tasks_ < total_tasks_), latch. Without this, a genuine dependency-cycle deadlock with zero RUNNING work would spin forever — every ownerless thread would just refresh its timestamp and never latch, and the host-side aclrtSyncStream timeout is the only outer detector. The new no_thread_owns_running_task() helper makes the global-stuck condition cheap to test on the cold timeout path. The 5 s budget is sized to cover worst-observed distributed-init skew (reported as ~196 ms in hw-native-sys#897 round 12; PTO2_RING_HEAP=4 GiB pushes host-side device_malloc tail latency into the same range) plus typical HCCL wait, with margin. PLATFORM_STREAM_SYNC_TIMEOUT_MS=2000 on the host side remains as the outer hang detector for cases the AICPU never reaches. ## Verification Repro on shared a2a3 runner via task-submit, locked to a clean pair of NPUs, 30 rounds each, identical env from hw-native-sys#897: PTO2_RING_HEAP=4294967296 PTO2_RING_TASK_WINDOW=131072 PTO2_RING_DEP_POOL=131072 Test: pypto's tests/st/distributed/test_l3_allreduce.py (the exact case named in hw-native-sys#897). Baseline = upstream/main (pre-fix); fix = this PR's branch. baseline 30 rounds: pass=15 fail=15 sched_error_code=100=15 this PR alone (initial): pass=29 fail=1 sched_error_code=100=0 this PR (with CodeRabbit changes): pass=30 fail=0 sched_error_code=100=0 (combined with hw-native-sys#935 chip-init barrier) The 1/30 residual from the initial version was a host-side aclrtSyncStream(2 s) timeout — the next outer timer down the stack, independent of the scheduler latch this PR fixes. Combining with hw-native-sys#935 (cross-chip init barrier) eliminates that residual at source. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…wall-clock budget + deadlock fallback (#930) Issue #897 documents an idle-thread false-latch in the AICPU scheduler: any thread reaching MAX_IDLE_ITERATIONS (~92 ms wall at ~115 ns/iter) unconditionally writes sched_error_code=PTO2_ERROR_SCHEDULER_TIMEOUT (100) to the shared header, which sibling threads see in handle_orchestrator_exit and abort. In distributed runs this routinely kills the runtime even when the only thread with a RUNNING task is making correct progress, because a peer rank's host-side device_malloc tail-latency easily exceeds 92 ms. Three changes, applied symmetrically to a2a3 and a5 tensormap_and_ringbuffer: 1. Replace the per-thread iteration-count timeout with a wall-clock budget (SCHEDULER_TIMEOUT_MS=5000 → SCHEDULER_TIMEOUT_CYCLES via the existing 50 MHz sys-cnt frequency). Iteration counts run at per-thread spin speed: a pure-idle thread hits MAX_IDLE_ITERATIONS in ~92 ms while a sibling that polls a COND register on a RUNNING task takes ~200 ms for the same iteration count. Letting the fast spinner race ahead and declare global fatal kills the slower-but-correct poller mid-poll. Wall-clock unifies the deadline. STALL_LOG_INTERVAL still drives per-thread observability on iteration count (diagnostic cadence, not load-bearing). 2. Gate the call to handle_timeout_exit on self_owns_running_task(thread_idx). A thread without an owned RUNNING task has no first-hand evidence of a stuck dispatch and must not declare global fatal on its own idle clock. If the wall-clock budget elapses on a thread with no evidence, refresh last_progress_ts so the next budget cycle starts fresh; the STALL diagnostic above still fires periodically so observability is preserved while the false-positive latch is suppressed. The thread that does own a stuck task will reach the budget on its own polls and latch with valid evidence (or recover when the COND register flips). 3. Second latch branch for "pre-dispatch / WAIT-only deadlock" (CodeRabbit feedback): if the wall-clock budget elapses AND no thread anywhere owns a RUNNING task AND tasks remain unfinished (completed_tasks_ < total_tasks_), latch. Without this, a genuine dependency-cycle deadlock with zero RUNNING work would spin forever — every ownerless thread would just refresh its timestamp and never latch, and the host-side aclrtSyncStream timeout is the only outer detector. The new no_thread_owns_running_task() helper makes the global-stuck condition cheap to test on the cold timeout path. The 5 s budget is sized to cover worst-observed distributed-init skew (reported as ~196 ms in #897 round 12; PTO2_RING_HEAP=4 GiB pushes host-side device_malloc tail latency into the same range) plus typical HCCL wait, with margin. PLATFORM_STREAM_SYNC_TIMEOUT_MS=2000 on the host side remains as the outer hang detector for cases the AICPU never reaches. ## Verification Repro on shared a2a3 runner via task-submit, locked to a clean pair of NPUs, 30 rounds each, identical env from #897: PTO2_RING_HEAP=4294967296 PTO2_RING_TASK_WINDOW=131072 PTO2_RING_DEP_POOL=131072 Test: pypto's tests/st/distributed/test_l3_allreduce.py (the exact case named in #897). Baseline = upstream/main (pre-fix); fix = this PR's branch. baseline 30 rounds: pass=15 fail=15 sched_error_code=100=15 this PR alone (initial): pass=29 fail=1 sched_error_code=100=0 this PR (with CodeRabbit changes): pass=30 fail=0 sched_error_code=100=0 (combined with #935 chip-init barrier) The 1/30 residual from the initial version was a host-side aclrtSyncStream(2 s) timeout — the next outer timer down the stack, independent of the scheduler latch this PR fixes. Combining with #935 (cross-chip init barrier) eliminates that residual at source. Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>
Composes with #930 to fully close out #897.
The skew leak
ChipWorker.inithas a long right tail underPTO2_RING_HEAP=4 GiB(per-rankdevice_malloc(16 GiB)pulls host-side long-tail into the critical path). Without coordination, a fast-init chip starts its firstlaunch_aicore_kernel+aclrtSyncStream(PLATFORM_STREAM_SYNC_TIMEOUT_MS)window N seconds before a slow peer reaches the same point. Any cross-rank wait inside the dispatched op (HCCL notify, distributed allreduce, etc.) then charges the slow peer's remaining init time against the fast peer's 2 s host stream sync budget — which is the cascade observed as the residual 1/30 fail after #930.Before this fix, the 2 s
PLATFORM_STREAM_SYNC_TIMEOUT_MSwas load-bearing for two semantically different things: actual AICore op execution time and cross-rank init skew. The second item shouldn't be in that budget at all — init logically happens before the op starts.What changes
MailboxState::INIT_DONE(value 6) — child writes it onceChipWorker.initreturns successfully._start_hierarchicalspin-waits for every chip child to reachINIT_DONEbefore dispatching anything (CTRL_PREPARE,TASK_READY, …), then resets each child toIDLEso the standard dispatch state machine resumes from the canonical "ready for work" state.aclrtSyncStreamwindows open at the same wall-clock instant. The 2 s budget covers only actual op execution rather than absorbing peer init skew.Verification (task-submit + locked NPU pair, 30 rounds, identical env from #897)
pypto'stests/st/distributed/test_l3_allreduce.pywithPTO2_RING_HEAP=4 GiB+PTO2_RING_TASK_WINDOW=131072+PTO2_RING_DEP_POOL=131072:sched_error_code=100aclrtSyncStreamtimeoutWhy this is a separate PR from #930
#930 and this PR address the same #897 failure mode at different layers:
sched_error_code=100) — any idle thread reaching the 92 ms iter cap was killing the whole runtime regardless of whether it owned a stuck task. That's a correctness bug independent of init skew (an idle thread shouldn't be able to declare global fatal on its own observation).PLATFORM_STREAM_SYNC_TIMEOUT_MS = 2000) also has no input event to fire on. That's an architectural fix — host-side init time should never have been chargeable to a per-op timer.Either fix alone leaves a residual; both together drop the failure rate to 0 under #897's reproduction conditions.
Splitting also lets each fix be reviewed on its own merits: #930 is a hot-loop change in the AICPU scheduler (small but load-bearing); this PR is a startup-path Python change with one new mailbox state (low-risk).
Scope
The same
INIT_DONEsignaling could be wired into_sub_worker_loopand_child_worker_loop(L4+) for symmetry. Skipped here because sub workers do no device init (Python-only loops) and the L4+ inner Worker race is unobserved; addressing it would be a follow-up keyed to its own repro.Relationship to other open PRs
0xDEADBEEFCtrl regs placeholder + retry halMemCtl EACCES) — independent, addresses a different 507046 path ondev=11. No file overlap.Test plan
test_l3_allreducewith [Bug] Idle scheduler thread independently latches PTO2_ERROR_SCHEDULER_TIMEOUT, causing fatal cascade in distributed runs #897 trigger env → 30 pass / 0 fail (combined with Fix #897: gate AICPU scheduler timeout latch on owned RUNNING task + wall-clock budget #930)sched_error_code=100and host-sideStream sync timeout: stream=AICPU— expect both to remain at zero🤖 Generated with Claude Code