Fix #897: cross-chip init barrier so peer-rank init skew never charges against per-rank stream sync budget by hw-native-sys-bot · Pull Request #935 · hw-native-sys/simpler

hw-native-sys-bot · 2026-05-31T04:31:20Z

Composes with #930 to fully close out #897.

The skew leak

ChipWorker.init has a long right tail under PTO2_RING_HEAP=4 GiB (per-rank device_malloc(16 GiB) pulls host-side long-tail into the critical path). Without coordination, a fast-init chip starts its first launch_aicore_kernel + aclrtSyncStream(PLATFORM_STREAM_SYNC_TIMEOUT_MS) window N seconds before a slow peer reaches the same point. Any cross-rank wait inside the dispatched op (HCCL notify, distributed allreduce, etc.) then charges the slow peer's remaining init time against the fast peer's 2 s host stream sync budget — which is the cascade observed as the residual 1/30 fail after #930.

Before this fix, the 2 s PLATFORM_STREAM_SYNC_TIMEOUT_MS was load-bearing for two semantically different things: actual AICore op execution time and cross-rank init skew. The second item shouldn't be in that budget at all — init logically happens before the op starts.

What changes

New MailboxState::INIT_DONE (value 6) — child writes it once ChipWorker.init returns successfully.
Parent's _start_hierarchical spin-waits for every chip child to reach INIT_DONE before dispatching anything (CTRL_PREPARE, TASK_READY, …), then resets each child to IDLE so the standard dispatch state machine resumes from the canonical "ready for work" state.
Net effect: all chip childrens' first aclrtSyncStream windows open at the same wall-clock instant. The 2 s budget covers only actual op execution rather than absorbing peer init skew.

Verification (task-submit + locked NPU pair, 30 rounds, identical env from #897)

pypto's tests/st/distributed/test_l3_allreduce.py with PTO2_RING_HEAP=4 GiB + PTO2_RING_TASK_WINDOW=131072 + PTO2_RING_DEP_POOL=131072:

variant	pass	fail	`sched_error_code=100`	host `aclrtSyncStream` timeout
upstream/main	15	15	15	(subsumed)
#930 alone	29	1	0	1
#930 + this PR	30	0	0	0

Why this is a separate PR from #930

#930 and this PR address the same #897 failure mode at different layers:

Fix #897: gate AICPU scheduler timeout latch on owned RUNNING task + wall-clock budget #930: eliminates the AICPU scheduler's false-positive fatal latch (sched_error_code=100) — any idle thread reaching the 92 ms iter cap was killing the whole runtime regardless of whether it owned a stuck task. That's a correctness bug independent of init skew (an idle thread shouldn't be able to declare global fatal on its own observation).
This PR: eliminates the underlying init skew so the outer host-side timer (PLATFORM_STREAM_SYNC_TIMEOUT_MS = 2000) also has no input event to fire on. That's an architectural fix — host-side init time should never have been chargeable to a per-op timer.

Either fix alone leaves a residual; both together drop the failure rate to 0 under #897's reproduction conditions.

Splitting also lets each fix be reviewed on its own merits: #930 is a hot-loop change in the AICPU scheduler (small but load-bearing); this PR is a startup-path Python change with one new mailbox state (low-risk).

Scope

The same INIT_DONE signaling could be wired into _sub_worker_loop and _child_worker_loop (L4+) for symmetry. Skipped here because sub workers do no device init (Python-only loops) and the L4+ inner Worker race is unobserved; addressing it would be a follow-up keyed to its own repro.

Relationship to other open PRs

Fix: drop 0xDEADBEEF Ctrl regs placeholder + retry halMemCtl EACCES race #925 (drop 0xDEADBEEF Ctrl regs placeholder + retry halMemCtl EACCES) — independent, addresses a different 507046 path on dev=11. No file overlap.
Fix #897: gate AICPU scheduler timeout latch on owned RUNNING task + wall-clock budget #930 (gate AICPU scheduler timeout latch on owned RUNNING task + wall-clock budget) — companion to this PR; same [Bug] Idle scheduler thread independently latches PTO2_ERROR_SCHEDULER_TIMEOUT, causing fatal cascade in distributed runs #897 root failure mode, different layer. No file overlap; both should land. Test plan above measures combined behaviour.

Test plan

CI green
Manual: 30× test_l3_allreduce with [Bug] Idle scheduler thread independently latches PTO2_ERROR_SCHEDULER_TIMEOUT, causing fatal cascade in distributed runs #897 trigger env → 30 pass / 0 fail (combined with Fix #897: gate AICPU scheduler timeout latch on owned RUNNING task + wall-clock budget #930)
Watch CI flake rate over 1-2 weeks for sched_error_code=100 and host-side Stream sync timeout: stream=AICPU — expect both to remain at zero

🤖 Generated with Claude Code

… never charges against the per-rank stream sync budget ChipWorker.init has a long right tail under PTO2_RING_HEAP=4 GiB (per-rank device_malloc(16 GiB) pulls host long-tail into the critical path). Without coordination, a fast-init chip starts its first launch_aicore_kernel + aclrtSyncStream(PLATFORM_STREAM_SYNC_TIMEOUT_MS) window N seconds before a slower peer reaches the same point. Any cross-rank wait inside the dispatched op (HCCL notify, distributed allreduce, etc.) charges the slow peer's remaining init time against the fast peer's 2 s host stream sync budget — the cascade observed as the residual 1/30 fail after PR hw-native-sys#930. Add a new MailboxState::INIT_DONE (value 6). Chip child writes it once ChipWorker.init returns; parent's _start_hierarchical spin-waits for every chip child to reach INIT_DONE before dispatching anything (CTRL_PREPARE or TASK_READY), then resets each child to IDLE so the standard dispatch state machine resumes from the canonical "ready for work" state. Effect: all chip childrens' first aclrtSyncStream windows open at the same wall-clock instant, so the 2 s budget covers only actual op execution rather than absorbing peer init skew. ## Verification Same task-submit + locked-NPU + 30-round protocol as PR hw-native-sys#930, on pypto's tests/st/distributed/test_l3_allreduce.py with the issue hw-native-sys#897 trigger env (PTO2_RING_HEAP=4 GiB, PTO2_RING_TASK_WINDOW=131072, PTO2_RING_DEP_POOL=131072). | variant | pass | fail | sched_error_code=100 | host stream sync timeout | | -------------------- | ---- | ---- | -------------------- | ------------------------ | | upstream/main | 15 | 15 | 15 | (subsumed) | | PR hw-native-sys#930 alone | 29 | 1 | 0 | 1 | | PR hw-native-sys#930 + this PR | 30 | 0 | 0 | 0 | Composes with PR hw-native-sys#930 — they address different layers of the same distributed-init-skew failure mode: - PR hw-native-sys#930 eliminates the false-positive AICPU scheduler latch (`sched_error_code=100`) that fires from any idle thread at ~92 ms. - This PR eliminates the underlying skew so the outer host-side timer (`PLATFORM_STREAM_SYNC_TIMEOUT_MS = 2000`) also has no input event to fire on. Either fix alone leaves a residual; both together drop the failure rate to 0 under the issue's reproduction conditions. ## Scope The same `INIT_DONE` signaling could be wired into `_sub_worker_loop` and `_child_worker_loop` (L4+) for symmetry. Skipped here because sub workers do no device init (Python-only loops) and the L4+ inner Worker race is unobserved; addressing it would be a follow-up keyed to its own repro. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gemini-code-assist · 2026-05-31T04:31:23Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

coderabbitai · 2026-05-31T04:31:26Z

Warning

Review limit reached

@hw-native-sys-bot, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 37 minutes and 31 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c5055c8f-9195-49a0-876b-ba3187a2acb5

📥 Commits

Reviewing files that changed from the base of the PR and between cee40dd and 5ad2334.

📒 Files selected for processing (2)

python/simpler/worker.py
src/common/hierarchical/worker_manager.h

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

…NNING task + wall-clock budget + deadlock fallback Issue hw-native-sys#897 documents an idle-thread false-latch in the AICPU scheduler: any thread reaching MAX_IDLE_ITERATIONS (~92 ms wall at ~115 ns/iter) unconditionally writes sched_error_code=PTO2_ERROR_SCHEDULER_TIMEOUT (100) to the shared header, which sibling threads see in handle_orchestrator_exit and abort. In distributed runs this routinely kills the runtime even when the only thread with a RUNNING task is making correct progress, because a peer rank's host-side device_malloc tail-latency easily exceeds 92 ms. Three changes, applied symmetrically to a2a3 and a5 tensormap_and_ringbuffer: 1. Replace the per-thread iteration-count timeout with a wall-clock budget (SCHEDULER_TIMEOUT_MS=5000 → SCHEDULER_TIMEOUT_CYCLES via the existing 50 MHz sys-cnt frequency). Iteration counts run at per-thread spin speed: a pure-idle thread hits MAX_IDLE_ITERATIONS in ~92 ms while a sibling that polls a COND register on a RUNNING task takes ~200 ms for the same iteration count. Letting the fast spinner race ahead and declare global fatal kills the slower-but-correct poller mid-poll. Wall-clock unifies the deadline. STALL_LOG_INTERVAL still drives per-thread observability on iteration count (diagnostic cadence, not load-bearing). 2. Gate the call to handle_timeout_exit on self_owns_running_task(thread_idx). A thread without an owned RUNNING task has no first-hand evidence of a stuck dispatch and must not declare global fatal on its own idle clock. If the wall-clock budget elapses on a thread with no evidence, refresh last_progress_ts so the next budget cycle starts fresh; the STALL diagnostic above still fires periodically so observability is preserved while the false-positive latch is suppressed. The thread that does own a stuck task will reach the budget on its own polls and latch with valid evidence (or recover when the COND register flips). 3. Second latch branch for "pre-dispatch / WAIT-only deadlock" (CodeRabbit feedback): if the wall-clock budget elapses AND no thread anywhere owns a RUNNING task AND tasks remain unfinished (completed_tasks_ < total_tasks_), latch. Without this, a genuine dependency-cycle deadlock with zero RUNNING work would spin forever — every ownerless thread would just refresh its timestamp and never latch, and the host-side aclrtSyncStream timeout is the only outer detector. The new no_thread_owns_running_task() helper makes the global-stuck condition cheap to test on the cold timeout path. The 5 s budget is sized to cover worst-observed distributed-init skew (reported as ~196 ms in hw-native-sys#897 round 12; PTO2_RING_HEAP=4 GiB pushes host-side device_malloc tail latency into the same range) plus typical HCCL wait, with margin. PLATFORM_STREAM_SYNC_TIMEOUT_MS=2000 on the host side remains as the outer hang detector for cases the AICPU never reaches. ## Verification Repro on shared a2a3 runner via task-submit, locked to a clean pair of NPUs, 30 rounds each, identical env from hw-native-sys#897: PTO2_RING_HEAP=4294967296 PTO2_RING_TASK_WINDOW=131072 PTO2_RING_DEP_POOL=131072 Test: pypto's tests/st/distributed/test_l3_allreduce.py (the exact case named in hw-native-sys#897). Baseline = upstream/main (pre-fix); fix = this PR's branch. baseline 30 rounds: pass=15 fail=15 sched_error_code=100=15 this PR alone (initial): pass=29 fail=1 sched_error_code=100=0 this PR (with CodeRabbit changes): pass=30 fail=0 sched_error_code=100=0 (combined with hw-native-sys#935 chip-init barrier) The 1/30 residual from the initial version was a host-side aclrtSyncStream(2 s) timeout — the next outer timer down the stack, independent of the scheduler latch this PR fixes. Combining with hw-native-sys#935 (cross-chip init barrier) eliminates that residual at source. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…wall-clock budget + deadlock fallback (#930) Issue #897 documents an idle-thread false-latch in the AICPU scheduler: any thread reaching MAX_IDLE_ITERATIONS (~92 ms wall at ~115 ns/iter) unconditionally writes sched_error_code=PTO2_ERROR_SCHEDULER_TIMEOUT (100) to the shared header, which sibling threads see in handle_orchestrator_exit and abort. In distributed runs this routinely kills the runtime even when the only thread with a RUNNING task is making correct progress, because a peer rank's host-side device_malloc tail-latency easily exceeds 92 ms. Three changes, applied symmetrically to a2a3 and a5 tensormap_and_ringbuffer: 1. Replace the per-thread iteration-count timeout with a wall-clock budget (SCHEDULER_TIMEOUT_MS=5000 → SCHEDULER_TIMEOUT_CYCLES via the existing 50 MHz sys-cnt frequency). Iteration counts run at per-thread spin speed: a pure-idle thread hits MAX_IDLE_ITERATIONS in ~92 ms while a sibling that polls a COND register on a RUNNING task takes ~200 ms for the same iteration count. Letting the fast spinner race ahead and declare global fatal kills the slower-but-correct poller mid-poll. Wall-clock unifies the deadline. STALL_LOG_INTERVAL still drives per-thread observability on iteration count (diagnostic cadence, not load-bearing). 2. Gate the call to handle_timeout_exit on self_owns_running_task(thread_idx). A thread without an owned RUNNING task has no first-hand evidence of a stuck dispatch and must not declare global fatal on its own idle clock. If the wall-clock budget elapses on a thread with no evidence, refresh last_progress_ts so the next budget cycle starts fresh; the STALL diagnostic above still fires periodically so observability is preserved while the false-positive latch is suppressed. The thread that does own a stuck task will reach the budget on its own polls and latch with valid evidence (or recover when the COND register flips). 3. Second latch branch for "pre-dispatch / WAIT-only deadlock" (CodeRabbit feedback): if the wall-clock budget elapses AND no thread anywhere owns a RUNNING task AND tasks remain unfinished (completed_tasks_ < total_tasks_), latch. Without this, a genuine dependency-cycle deadlock with zero RUNNING work would spin forever — every ownerless thread would just refresh its timestamp and never latch, and the host-side aclrtSyncStream timeout is the only outer detector. The new no_thread_owns_running_task() helper makes the global-stuck condition cheap to test on the cold timeout path. The 5 s budget is sized to cover worst-observed distributed-init skew (reported as ~196 ms in #897 round 12; PTO2_RING_HEAP=4 GiB pushes host-side device_malloc tail latency into the same range) plus typical HCCL wait, with margin. PLATFORM_STREAM_SYNC_TIMEOUT_MS=2000 on the host side remains as the outer hang detector for cases the AICPU never reaches. ## Verification Repro on shared a2a3 runner via task-submit, locked to a clean pair of NPUs, 30 rounds each, identical env from #897: PTO2_RING_HEAP=4294967296 PTO2_RING_TASK_WINDOW=131072 PTO2_RING_DEP_POOL=131072 Test: pypto's tests/st/distributed/test_l3_allreduce.py (the exact case named in #897). Baseline = upstream/main (pre-fix); fix = this PR's branch. baseline 30 rounds: pass=15 fail=15 sched_error_code=100=15 this PR alone (initial): pass=29 fail=1 sched_error_code=100=0 this PR (with CodeRabbit changes): pass=30 fail=0 sched_error_code=100=0 (combined with #935 chip-init barrier) The 1/30 residual from the initial version was a host-side aclrtSyncStream(2 s) timeout — the next outer timer down the stack, independent of the scheduler latch this PR fixes. Combining with #935 (cross-chip init barrier) eliminates that residual at source. Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>

This was referenced May 31, 2026

Fix: drop 0xDEADBEEF Ctrl regs placeholder + retry halMemCtl EACCES race #925

Merged

Fix #897: gate AICPU scheduler timeout latch on owned RUNNING task + wall-clock budget #930

Merged

ChaoWao approved these changes May 31, 2026

View reviewed changes

ChaoWao merged commit cd60051 into hw-native-sys:main May 31, 2026
16 checks passed

ChaoWao deleted the fix/chip-init-barrier branch May 31, 2026 06:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix #897: cross-chip init barrier so peer-rank init skew never charges against per-rank stream sync budget#935

Fix #897: cross-chip init barrier so peer-rank init skew never charges against per-rank stream sync budget#935
ChaoWao merged 1 commit into
hw-native-sys:mainfrom
hw-native-sys-bot:fix/chip-init-barrier

hw-native-sys-bot commented May 31, 2026

Uh oh!

gemini-code-assist Bot commented May 31, 2026

Uh oh!

coderabbitai Bot commented May 31, 2026

Review limit reached

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

hw-native-sys-bot commented May 31, 2026

The skew leak

What changes

Verification (task-submit + locked NPU pair, 30 rounds, identical env from #897)

Why this is a separate PR from #930

Scope

Relationship to other open PRs

Test plan

Uh oh!

gemini-code-assist Bot commented May 31, 2026

Uh oh!

coderabbitai Bot commented May 31, 2026

Review limit reached

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants