Skip to content

Fix #897: cross-chip init barrier so peer-rank init skew never charges against per-rank stream sync budget#935

Merged
ChaoWao merged 1 commit into
hw-native-sys:mainfrom
hw-native-sys-bot:fix/chip-init-barrier
May 31, 2026
Merged

Fix #897: cross-chip init barrier so peer-rank init skew never charges against per-rank stream sync budget#935
ChaoWao merged 1 commit into
hw-native-sys:mainfrom
hw-native-sys-bot:fix/chip-init-barrier

Conversation

@hw-native-sys-bot
Copy link
Copy Markdown
Collaborator

Composes with #930 to fully close out #897.

The skew leak

ChipWorker.init has a long right tail under PTO2_RING_HEAP=4 GiB (per-rank device_malloc(16 GiB) pulls host-side long-tail into the critical path). Without coordination, a fast-init chip starts its first launch_aicore_kernel + aclrtSyncStream(PLATFORM_STREAM_SYNC_TIMEOUT_MS) window N seconds before a slow peer reaches the same point. Any cross-rank wait inside the dispatched op (HCCL notify, distributed allreduce, etc.) then charges the slow peer's remaining init time against the fast peer's 2 s host stream sync budget — which is the cascade observed as the residual 1/30 fail after #930.

Before this fix, the 2 s PLATFORM_STREAM_SYNC_TIMEOUT_MS was load-bearing for two semantically different things: actual AICore op execution time and cross-rank init skew. The second item shouldn't be in that budget at all — init logically happens before the op starts.

What changes

  • New MailboxState::INIT_DONE (value 6) — child writes it once ChipWorker.init returns successfully.
  • Parent's _start_hierarchical spin-waits for every chip child to reach INIT_DONE before dispatching anything (CTRL_PREPARE, TASK_READY, …), then resets each child to IDLE so the standard dispatch state machine resumes from the canonical "ready for work" state.
  • Net effect: all chip childrens' first aclrtSyncStream windows open at the same wall-clock instant. The 2 s budget covers only actual op execution rather than absorbing peer init skew.

Verification (task-submit + locked NPU pair, 30 rounds, identical env from #897)

pypto's tests/st/distributed/test_l3_allreduce.py with PTO2_RING_HEAP=4 GiB + PTO2_RING_TASK_WINDOW=131072 + PTO2_RING_DEP_POOL=131072:

variant pass fail sched_error_code=100 host aclrtSyncStream timeout
upstream/main 15 15 15 (subsumed)
#930 alone 29 1 0 1
#930 + this PR 30 0 0 0

Why this is a separate PR from #930

#930 and this PR address the same #897 failure mode at different layers:

  • Fix #897: gate AICPU scheduler timeout latch on owned RUNNING task + wall-clock budget #930: eliminates the AICPU scheduler's false-positive fatal latch (sched_error_code=100) — any idle thread reaching the 92 ms iter cap was killing the whole runtime regardless of whether it owned a stuck task. That's a correctness bug independent of init skew (an idle thread shouldn't be able to declare global fatal on its own observation).
  • This PR: eliminates the underlying init skew so the outer host-side timer (PLATFORM_STREAM_SYNC_TIMEOUT_MS = 2000) also has no input event to fire on. That's an architectural fix — host-side init time should never have been chargeable to a per-op timer.

Either fix alone leaves a residual; both together drop the failure rate to 0 under #897's reproduction conditions.

Splitting also lets each fix be reviewed on its own merits: #930 is a hot-loop change in the AICPU scheduler (small but load-bearing); this PR is a startup-path Python change with one new mailbox state (low-risk).

Scope

The same INIT_DONE signaling could be wired into _sub_worker_loop and _child_worker_loop (L4+) for symmetry. Skipped here because sub workers do no device init (Python-only loops) and the L4+ inner Worker race is unobserved; addressing it would be a follow-up keyed to its own repro.

Relationship to other open PRs

Test plan

🤖 Generated with Claude Code

… never charges against the per-rank stream sync budget

ChipWorker.init has a long right tail under PTO2_RING_HEAP=4 GiB
(per-rank device_malloc(16 GiB) pulls host long-tail into the critical
path). Without coordination, a fast-init chip starts its first
launch_aicore_kernel + aclrtSyncStream(PLATFORM_STREAM_SYNC_TIMEOUT_MS)
window N seconds before a slower peer reaches the same point. Any
cross-rank wait inside the dispatched op (HCCL notify, distributed
allreduce, etc.) charges the slow peer's remaining init time against
the fast peer's 2 s host stream sync budget — the cascade observed as
the residual 1/30 fail after PR hw-native-sys#930.

Add a new MailboxState::INIT_DONE (value 6). Chip child writes it once
ChipWorker.init returns; parent's _start_hierarchical spin-waits for
every chip child to reach INIT_DONE before dispatching anything
(CTRL_PREPARE or TASK_READY), then resets each child to IDLE so the
standard dispatch state machine resumes from the canonical "ready for
work" state.

Effect: all chip childrens' first aclrtSyncStream windows open at the
same wall-clock instant, so the 2 s budget covers only actual op
execution rather than absorbing peer init skew.

## Verification

Same task-submit + locked-NPU + 30-round protocol as PR hw-native-sys#930, on
pypto's tests/st/distributed/test_l3_allreduce.py with the issue
hw-native-sys#897 trigger env (PTO2_RING_HEAP=4 GiB, PTO2_RING_TASK_WINDOW=131072,
PTO2_RING_DEP_POOL=131072).

| variant              | pass | fail | sched_error_code=100 | host stream sync timeout |
| -------------------- | ---- | ---- | -------------------- | ------------------------ |
| upstream/main        | 15   | 15   | 15                   | (subsumed)               |
| PR hw-native-sys#930 alone        | 29   | 1    | 0                    | 1                        |
| PR hw-native-sys#930 + this PR    | 30   | 0    | 0                    | 0                        |

Composes with PR hw-native-sys#930 — they address different layers of the same
distributed-init-skew failure mode:
- PR hw-native-sys#930 eliminates the false-positive AICPU scheduler latch
  (`sched_error_code=100`) that fires from any idle thread at ~92 ms.
- This PR eliminates the underlying skew so the outer host-side timer
  (`PLATFORM_STREAM_SYNC_TIMEOUT_MS = 2000`) also has no input event
  to fire on.

Either fix alone leaves a residual; both together drop the failure
rate to 0 under the issue's reproduction conditions.

## Scope

The same `INIT_DONE` signaling could be wired into `_sub_worker_loop`
and `_child_worker_loop` (L4+) for symmetry. Skipped here because
sub workers do no device init (Python-only loops) and the L4+ inner
Worker race is unobserved; addressing it would be a follow-up keyed
to its own repro.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gemini-code-assist
Copy link
Copy Markdown

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 31, 2026

Warning

Review limit reached

@hw-native-sys-bot, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 37 minutes and 31 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c5055c8f-9195-49a0-876b-ba3187a2acb5

📥 Commits

Reviewing files that changed from the base of the PR and between cee40dd and 5ad2334.

📒 Files selected for processing (2)
  • python/simpler/worker.py
  • src/common/hierarchical/worker_manager.h

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

ChaoWao added a commit to hw-native-sys-bot/simpler that referenced this pull request May 31, 2026
…NNING task + wall-clock budget + deadlock fallback

Issue hw-native-sys#897 documents an idle-thread false-latch in the AICPU scheduler:
any thread reaching MAX_IDLE_ITERATIONS (~92 ms wall at ~115 ns/iter)
unconditionally writes sched_error_code=PTO2_ERROR_SCHEDULER_TIMEOUT (100)
to the shared header, which sibling threads see in handle_orchestrator_exit
and abort. In distributed runs this routinely kills the runtime even when
the only thread with a RUNNING task is making correct progress, because a
peer rank's host-side device_malloc tail-latency easily exceeds 92 ms.

Three changes, applied symmetrically to a2a3 and a5 tensormap_and_ringbuffer:

1. Replace the per-thread iteration-count timeout with a wall-clock budget
   (SCHEDULER_TIMEOUT_MS=5000 → SCHEDULER_TIMEOUT_CYCLES via the existing
   50 MHz sys-cnt frequency). Iteration counts run at per-thread spin speed:
   a pure-idle thread hits MAX_IDLE_ITERATIONS in ~92 ms while a sibling
   that polls a COND register on a RUNNING task takes ~200 ms for the same
   iteration count. Letting the fast spinner race ahead and declare global
   fatal kills the slower-but-correct poller mid-poll. Wall-clock unifies
   the deadline. STALL_LOG_INTERVAL still drives per-thread observability
   on iteration count (diagnostic cadence, not load-bearing).

2. Gate the call to handle_timeout_exit on self_owns_running_task(thread_idx).
   A thread without an owned RUNNING task has no first-hand evidence of a
   stuck dispatch and must not declare global fatal on its own idle clock.
   If the wall-clock budget elapses on a thread with no evidence, refresh
   last_progress_ts so the next budget cycle starts fresh; the STALL
   diagnostic above still fires periodically so observability is preserved
   while the false-positive latch is suppressed. The thread that does own
   a stuck task will reach the budget on its own polls and latch with
   valid evidence (or recover when the COND register flips).

3. Second latch branch for "pre-dispatch / WAIT-only deadlock" (CodeRabbit
   feedback): if the wall-clock budget elapses AND no thread anywhere
   owns a RUNNING task AND tasks remain unfinished
   (completed_tasks_ < total_tasks_), latch. Without this, a genuine
   dependency-cycle deadlock with zero RUNNING work would spin forever —
   every ownerless thread would just refresh its timestamp and never
   latch, and the host-side aclrtSyncStream timeout is the only outer
   detector. The new no_thread_owns_running_task() helper makes the
   global-stuck condition cheap to test on the cold timeout path.

The 5 s budget is sized to cover worst-observed distributed-init skew
(reported as ~196 ms in hw-native-sys#897 round 12; PTO2_RING_HEAP=4 GiB pushes
host-side device_malloc tail latency into the same range) plus typical
HCCL wait, with margin. PLATFORM_STREAM_SYNC_TIMEOUT_MS=2000 on the host
side remains as the outer hang detector for cases the AICPU never reaches.

## Verification

Repro on shared a2a3 runner via task-submit, locked to a clean pair of
NPUs, 30 rounds each, identical env from hw-native-sys#897:
  PTO2_RING_HEAP=4294967296 PTO2_RING_TASK_WINDOW=131072
  PTO2_RING_DEP_POOL=131072

Test: pypto's tests/st/distributed/test_l3_allreduce.py (the exact case
named in hw-native-sys#897). Baseline = upstream/main (pre-fix); fix = this PR's branch.

  baseline 30 rounds:                 pass=15 fail=15 sched_error_code=100=15
  this PR alone (initial):            pass=29 fail=1  sched_error_code=100=0
  this PR (with CodeRabbit changes):  pass=30 fail=0  sched_error_code=100=0
                                      (combined with hw-native-sys#935 chip-init barrier)

The 1/30 residual from the initial version was a host-side
aclrtSyncStream(2 s) timeout — the next outer timer down the stack,
independent of the scheduler latch this PR fixes. Combining with hw-native-sys#935
(cross-chip init barrier) eliminates that residual at source.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ChaoWao ChaoWao merged commit cd60051 into hw-native-sys:main May 31, 2026
16 checks passed
@ChaoWao ChaoWao deleted the fix/chip-init-barrier branch May 31, 2026 06:42
ChaoWao added a commit to hw-native-sys-bot/simpler that referenced this pull request May 31, 2026
…NNING task + wall-clock budget + deadlock fallback

Issue hw-native-sys#897 documents an idle-thread false-latch in the AICPU scheduler:
any thread reaching MAX_IDLE_ITERATIONS (~92 ms wall at ~115 ns/iter)
unconditionally writes sched_error_code=PTO2_ERROR_SCHEDULER_TIMEOUT (100)
to the shared header, which sibling threads see in handle_orchestrator_exit
and abort. In distributed runs this routinely kills the runtime even when
the only thread with a RUNNING task is making correct progress, because a
peer rank's host-side device_malloc tail-latency easily exceeds 92 ms.

Three changes, applied symmetrically to a2a3 and a5 tensormap_and_ringbuffer:

1. Replace the per-thread iteration-count timeout with a wall-clock budget
   (SCHEDULER_TIMEOUT_MS=5000 → SCHEDULER_TIMEOUT_CYCLES via the existing
   50 MHz sys-cnt frequency). Iteration counts run at per-thread spin speed:
   a pure-idle thread hits MAX_IDLE_ITERATIONS in ~92 ms while a sibling
   that polls a COND register on a RUNNING task takes ~200 ms for the same
   iteration count. Letting the fast spinner race ahead and declare global
   fatal kills the slower-but-correct poller mid-poll. Wall-clock unifies
   the deadline. STALL_LOG_INTERVAL still drives per-thread observability
   on iteration count (diagnostic cadence, not load-bearing).

2. Gate the call to handle_timeout_exit on self_owns_running_task(thread_idx).
   A thread without an owned RUNNING task has no first-hand evidence of a
   stuck dispatch and must not declare global fatal on its own idle clock.
   If the wall-clock budget elapses on a thread with no evidence, refresh
   last_progress_ts so the next budget cycle starts fresh; the STALL
   diagnostic above still fires periodically so observability is preserved
   while the false-positive latch is suppressed. The thread that does own
   a stuck task will reach the budget on its own polls and latch with
   valid evidence (or recover when the COND register flips).

3. Second latch branch for "pre-dispatch / WAIT-only deadlock" (CodeRabbit
   feedback): if the wall-clock budget elapses AND no thread anywhere
   owns a RUNNING task AND tasks remain unfinished
   (completed_tasks_ < total_tasks_), latch. Without this, a genuine
   dependency-cycle deadlock with zero RUNNING work would spin forever —
   every ownerless thread would just refresh its timestamp and never
   latch, and the host-side aclrtSyncStream timeout is the only outer
   detector. The new no_thread_owns_running_task() helper makes the
   global-stuck condition cheap to test on the cold timeout path.

The 5 s budget is sized to cover worst-observed distributed-init skew
(reported as ~196 ms in hw-native-sys#897 round 12; PTO2_RING_HEAP=4 GiB pushes
host-side device_malloc tail latency into the same range) plus typical
HCCL wait, with margin. PLATFORM_STREAM_SYNC_TIMEOUT_MS=2000 on the host
side remains as the outer hang detector for cases the AICPU never reaches.

## Verification

Repro on shared a2a3 runner via task-submit, locked to a clean pair of
NPUs, 30 rounds each, identical env from hw-native-sys#897:
  PTO2_RING_HEAP=4294967296 PTO2_RING_TASK_WINDOW=131072
  PTO2_RING_DEP_POOL=131072

Test: pypto's tests/st/distributed/test_l3_allreduce.py (the exact case
named in hw-native-sys#897). Baseline = upstream/main (pre-fix); fix = this PR's branch.

  baseline 30 rounds:                 pass=15 fail=15 sched_error_code=100=15
  this PR alone (initial):            pass=29 fail=1  sched_error_code=100=0
  this PR (with CodeRabbit changes):  pass=30 fail=0  sched_error_code=100=0
                                      (combined with hw-native-sys#935 chip-init barrier)

The 1/30 residual from the initial version was a host-side
aclrtSyncStream(2 s) timeout — the next outer timer down the stack,
independent of the scheduler latch this PR fixes. Combining with hw-native-sys#935
(cross-chip init barrier) eliminates that residual at source.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ChaoWao added a commit that referenced this pull request May 31, 2026
…wall-clock budget + deadlock fallback (#930)

Issue #897 documents an idle-thread false-latch in the AICPU scheduler:
any thread reaching MAX_IDLE_ITERATIONS (~92 ms wall at ~115 ns/iter)
unconditionally writes sched_error_code=PTO2_ERROR_SCHEDULER_TIMEOUT (100)
to the shared header, which sibling threads see in handle_orchestrator_exit
and abort. In distributed runs this routinely kills the runtime even when
the only thread with a RUNNING task is making correct progress, because a
peer rank's host-side device_malloc tail-latency easily exceeds 92 ms.

Three changes, applied symmetrically to a2a3 and a5 tensormap_and_ringbuffer:

1. Replace the per-thread iteration-count timeout with a wall-clock budget
   (SCHEDULER_TIMEOUT_MS=5000 → SCHEDULER_TIMEOUT_CYCLES via the existing
   50 MHz sys-cnt frequency). Iteration counts run at per-thread spin speed:
   a pure-idle thread hits MAX_IDLE_ITERATIONS in ~92 ms while a sibling
   that polls a COND register on a RUNNING task takes ~200 ms for the same
   iteration count. Letting the fast spinner race ahead and declare global
   fatal kills the slower-but-correct poller mid-poll. Wall-clock unifies
   the deadline. STALL_LOG_INTERVAL still drives per-thread observability
   on iteration count (diagnostic cadence, not load-bearing).

2. Gate the call to handle_timeout_exit on self_owns_running_task(thread_idx).
   A thread without an owned RUNNING task has no first-hand evidence of a
   stuck dispatch and must not declare global fatal on its own idle clock.
   If the wall-clock budget elapses on a thread with no evidence, refresh
   last_progress_ts so the next budget cycle starts fresh; the STALL
   diagnostic above still fires periodically so observability is preserved
   while the false-positive latch is suppressed. The thread that does own
   a stuck task will reach the budget on its own polls and latch with
   valid evidence (or recover when the COND register flips).

3. Second latch branch for "pre-dispatch / WAIT-only deadlock" (CodeRabbit
   feedback): if the wall-clock budget elapses AND no thread anywhere
   owns a RUNNING task AND tasks remain unfinished
   (completed_tasks_ < total_tasks_), latch. Without this, a genuine
   dependency-cycle deadlock with zero RUNNING work would spin forever —
   every ownerless thread would just refresh its timestamp and never
   latch, and the host-side aclrtSyncStream timeout is the only outer
   detector. The new no_thread_owns_running_task() helper makes the
   global-stuck condition cheap to test on the cold timeout path.

The 5 s budget is sized to cover worst-observed distributed-init skew
(reported as ~196 ms in #897 round 12; PTO2_RING_HEAP=4 GiB pushes
host-side device_malloc tail latency into the same range) plus typical
HCCL wait, with margin. PLATFORM_STREAM_SYNC_TIMEOUT_MS=2000 on the host
side remains as the outer hang detector for cases the AICPU never reaches.

## Verification

Repro on shared a2a3 runner via task-submit, locked to a clean pair of
NPUs, 30 rounds each, identical env from #897:
  PTO2_RING_HEAP=4294967296 PTO2_RING_TASK_WINDOW=131072
  PTO2_RING_DEP_POOL=131072

Test: pypto's tests/st/distributed/test_l3_allreduce.py (the exact case
named in #897). Baseline = upstream/main (pre-fix); fix = this PR's branch.

  baseline 30 rounds:                 pass=15 fail=15 sched_error_code=100=15
  this PR alone (initial):            pass=29 fail=1  sched_error_code=100=0
  this PR (with CodeRabbit changes):  pass=30 fail=0  sched_error_code=100=0
                                      (combined with #935 chip-init barrier)

The 1/30 residual from the initial version was a host-side
aclrtSyncStream(2 s) timeout — the next outer timer down the stack,
independent of the scheduler latch this PR fixes. Combining with #935
(cross-chip init barrier) eliminates that residual at source.

Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants