Fix #897: gate AICPU scheduler timeout latch on owned RUNNING task + wall-clock budget#930
Conversation
📝 WalkthroughWalkthroughThis PR addresses false-positive scheduler timeout crashes in distributed runs by replacing iteration-count timeouts with wall-clock timeouts and gating fatal-latch to threads that own cores with active RUNNING tasks. Both a2a3 and a5 scheduler variants receive identical changes. ChangesScheduler Timeout with Ownership Gating
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request replaces the per-thread iteration-count cap for declaring a scheduler timeout with a wall-clock budget of 5 seconds across the a2a3 and a5 runtimes. It introduces the self_owns_running_task helper to ensure that only threads owning a core with an active RUNNING task can trigger a global fatal timeout, preventing false-positive latches caused by startup skew or idle spinning in distributed environments. There are no review comments, and we have no additional feedback to provide.
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp`:
- Around line 766-787: The timeout suppression currently resets last_progress_ts
whenever the local thread does not own RUNNING work, which hides global
deadlocks when completed_tasks_ < total_tasks_ and no thread owns any RUNNING
slot; change the branch so that before simply refreshing last_progress_ts you
check for the global no-RUNNING condition and, if true and tasks remain, invoke
the same fatal path (handle_timeout_exit(...)) for this thread. Concretely:
inside the if (get_sys_cnt_aicpu() - last_progress_ts >
SCHEDULER_TIMEOUT_CYCLES) block, after the self_owns_running_task(thread_idx)
check fails, add a check like "if (completed_tasks_ < total_tasks_ &&
no_thread_owns_running()) return handle_timeout_exit(...)" (implement
no_thread_owns_running by scanning scheduler thread slots or using the existing
global running-count helper), otherwise keep updating last_progress_ts as
before.
In
`@src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_types.h`:
- Around line 63-65: Replace the hardcoded constexpr SCHEDULER_TIMEOUT_MS (5000)
with a configurable knob: read a compile-time macro PTO2_SCHED_TIMEOUT_MS (fall
back to 5000 if undefined) or a runtime-config value and compute
SCHEDULER_TIMEOUT_CYCLES from that value (using PLATFORM_PROF_SYS_CNT_FREQ as
before); update uses of SCHEDULER_TIMEOUT_MS and SCHEDULER_TIMEOUT_CYCLES so
they reference the new configurable symbol (e.g., PTO2_SCHED_TIMEOUT_MS or a
getter like get_scheduler_timeout_ms()) and ensure the multiplication still
casts to uint64_t to avoid overflow.
In
`@src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp`:
- Around line 762-783: The timeout reset currently suppresses fatal timeouts
when no thread owns a RUNNING slot, which hides true deadlocks; update the
timeout branch so that after detecting get_sys_cnt_aicpu() - last_progress_ts >
SCHEDULER_TIMEOUT_CYCLES you still call handle_timeout_exit(...) when there are
unfinished tasks and no scheduler thread owns any RUNNING slot. Concretely: keep
the existing self_owns_running_task(thread_idx) -> handle_timeout_exit(...)
path, add an else-if that checks (completed_tasks_ < total_tasks_) and that no
thread owns a RUNNING slot (implement by scanning the scheduler state or adding
a helper like any_thread_owns_running() if none exists), and call
handle_timeout_exit(...) in that case; only when neither condition holds should
you refresh last_progress_ts = get_sys_cnt_aicpu(). Ensure you use the same
handle_timeout_exit signature and existing symbols (last_progress_ts,
SCHEDULER_TIMEOUT_CYCLES, get_sys_cnt_aicpu(), completed_tasks_, total_tasks_,
self_owns_running_task(thread_idx)).
In `@src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_types.h`:
- Around line 64-66: The timeout is hardcoded via constexpr SCHEDULER_TIMEOUT_MS
and SCHEDULER_TIMEOUT_CYCLES; change these to be computed at runtime and
configurable (e.g., via environment variable or a runtime config API) with a
sensible default of 5000 ms. Replace constexpr SCHEDULER_TIMEOUT_MS with a
function or non-constexpr getter (e.g., getSchedulerTimeoutMs()) that reads a
config/env var and falls back to 5000, and recompute SCHEDULER_TIMEOUT_CYCLES
from that runtime value using PLATFORM_PROF_SYS_CNT_FREQ (e.g.,
getSchedulerTimeoutCycles() = uint64_t(getSchedulerTimeoutMs()) *
(PLATFORM_PROF_SYS_CNT_FREQ / 1000)). Ensure callers use the new getter(s)
instead of the removed constexpr symbols.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 89487ddc-5ded-4b95-8ece-70ff7393e82e
📒 Files selected for processing (8)
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cppsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_context.hsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cppsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_types.hsrc/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cppsrc/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_context.hsrc/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cppsrc/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_types.h
1e50009 to
008b787
Compare
…s against the per-rank stream sync budget (#935) ChipWorker.init has a long right tail under PTO2_RING_HEAP=4 GiB (per-rank device_malloc(16 GiB) pulls host long-tail into the critical path). Without coordination, a fast-init chip starts its first launch_aicore_kernel + aclrtSyncStream(PLATFORM_STREAM_SYNC_TIMEOUT_MS) window N seconds before a slower peer reaches the same point. Any cross-rank wait inside the dispatched op (HCCL notify, distributed allreduce, etc.) charges the slow peer's remaining init time against the fast peer's 2 s host stream sync budget — the cascade observed as the residual 1/30 fail after PR #930. Add a new MailboxState::INIT_DONE (value 6). Chip child writes it once ChipWorker.init returns; parent's _start_hierarchical spin-waits for every chip child to reach INIT_DONE before dispatching anything (CTRL_PREPARE or TASK_READY), then resets each child to IDLE so the standard dispatch state machine resumes from the canonical "ready for work" state. Effect: all chip childrens' first aclrtSyncStream windows open at the same wall-clock instant, so the 2 s budget covers only actual op execution rather than absorbing peer init skew. ## Verification Same task-submit + locked-NPU + 30-round protocol as PR #930, on pypto's tests/st/distributed/test_l3_allreduce.py with the issue #897 trigger env (PTO2_RING_HEAP=4 GiB, PTO2_RING_TASK_WINDOW=131072, PTO2_RING_DEP_POOL=131072). | variant | pass | fail | sched_error_code=100 | host stream sync timeout | | -------------------- | ---- | ---- | -------------------- | ------------------------ | | upstream/main | 15 | 15 | 15 | (subsumed) | | PR #930 alone | 29 | 1 | 0 | 1 | | PR #930 + this PR | 30 | 0 | 0 | 0 | Composes with PR #930 — they address different layers of the same distributed-init-skew failure mode: - PR #930 eliminates the false-positive AICPU scheduler latch (`sched_error_code=100`) that fires from any idle thread at ~92 ms. - This PR eliminates the underlying skew so the outer host-side timer (`PLATFORM_STREAM_SYNC_TIMEOUT_MS = 2000`) also has no input event to fire on. Either fix alone leaves a residual; both together drop the failure rate to 0 under the issue's reproduction conditions. ## Scope The same `INIT_DONE` signaling could be wired into `_sub_worker_loop` and `_child_worker_loop` (L4+) for symmetry. Skipped here because sub workers do no device init (Python-only loops) and the L4+ inner Worker race is unobserved; addressing it would be a follow-up keyed to its own repro. Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>
…NNING task + wall-clock budget + deadlock fallback Issue hw-native-sys#897 documents an idle-thread false-latch in the AICPU scheduler: any thread reaching MAX_IDLE_ITERATIONS (~92 ms wall at ~115 ns/iter) unconditionally writes sched_error_code=PTO2_ERROR_SCHEDULER_TIMEOUT (100) to the shared header, which sibling threads see in handle_orchestrator_exit and abort. In distributed runs this routinely kills the runtime even when the only thread with a RUNNING task is making correct progress, because a peer rank's host-side device_malloc tail-latency easily exceeds 92 ms. Three changes, applied symmetrically to a2a3 and a5 tensormap_and_ringbuffer: 1. Replace the per-thread iteration-count timeout with a wall-clock budget (SCHEDULER_TIMEOUT_MS=5000 → SCHEDULER_TIMEOUT_CYCLES via the existing 50 MHz sys-cnt frequency). Iteration counts run at per-thread spin speed: a pure-idle thread hits MAX_IDLE_ITERATIONS in ~92 ms while a sibling that polls a COND register on a RUNNING task takes ~200 ms for the same iteration count. Letting the fast spinner race ahead and declare global fatal kills the slower-but-correct poller mid-poll. Wall-clock unifies the deadline. STALL_LOG_INTERVAL still drives per-thread observability on iteration count (diagnostic cadence, not load-bearing). 2. Gate the call to handle_timeout_exit on self_owns_running_task(thread_idx). A thread without an owned RUNNING task has no first-hand evidence of a stuck dispatch and must not declare global fatal on its own idle clock. If the wall-clock budget elapses on a thread with no evidence, refresh last_progress_ts so the next budget cycle starts fresh; the STALL diagnostic above still fires periodically so observability is preserved while the false-positive latch is suppressed. The thread that does own a stuck task will reach the budget on its own polls and latch with valid evidence (or recover when the COND register flips). 3. Second latch branch for "pre-dispatch / WAIT-only deadlock" (CodeRabbit feedback): if the wall-clock budget elapses AND no thread anywhere owns a RUNNING task AND tasks remain unfinished (completed_tasks_ < total_tasks_), latch. Without this, a genuine dependency-cycle deadlock with zero RUNNING work would spin forever — every ownerless thread would just refresh its timestamp and never latch, and the host-side aclrtSyncStream timeout is the only outer detector. The new no_thread_owns_running_task() helper makes the global-stuck condition cheap to test on the cold timeout path. The 5 s budget is sized to cover worst-observed distributed-init skew (reported as ~196 ms in hw-native-sys#897 round 12; PTO2_RING_HEAP=4 GiB pushes host-side device_malloc tail latency into the same range) plus typical HCCL wait, with margin. PLATFORM_STREAM_SYNC_TIMEOUT_MS=2000 on the host side remains as the outer hang detector for cases the AICPU never reaches. ## Verification Repro on shared a2a3 runner via task-submit, locked to a clean pair of NPUs, 30 rounds each, identical env from hw-native-sys#897: PTO2_RING_HEAP=4294967296 PTO2_RING_TASK_WINDOW=131072 PTO2_RING_DEP_POOL=131072 Test: pypto's tests/st/distributed/test_l3_allreduce.py (the exact case named in hw-native-sys#897). Baseline = upstream/main (pre-fix); fix = this PR's branch. baseline 30 rounds: pass=15 fail=15 sched_error_code=100=15 this PR alone (initial): pass=29 fail=1 sched_error_code=100=0 this PR (with CodeRabbit changes): pass=30 fail=0 sched_error_code=100=0 (combined with hw-native-sys#935 chip-init barrier) The 1/30 residual from the initial version was a host-side aclrtSyncStream(2 s) timeout — the next outer timer down the stack, independent of the scheduler latch this PR fixes. Combining with hw-native-sys#935 (cross-chip init barrier) eliminates that residual at source. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
008b787 to
cbcc3a1
Compare
Fixes #897.
Summary
>= MAX_IDLE_ITERATIONS) with a wall-clock budget (SCHEDULER_TIMEOUT_MS = 5000). Iteration count is per-thread spin-speed-sensitive — a pure-idle thread hits the cap ~2× faster than a sibling that polls a RUNNING task — so the fastest spinner racing ahead is what drove [Bug] Idle scheduler thread independently latches PTO2_ERROR_SCHEDULER_TIMEOUT, causing fatal cascade in distributed runs #897's false-positive cascade.handle_timeout_exitonself_owns_running_task(thread_idx). A thread without an owned RUNNING task has no first-hand evidence its dispatch is stuck. If the wall-clock budget elapses on such a thread, refreshlast_progress_tsand keep spinning; the existing STALL diagnostic still fires periodically, so observability is preserved.src/{a2a3,a5}/runtime/tensormap_and_ringbuffer/runtime/scheduler/.Verification (on shared a2a3 runner via task-submit, locked NPU pair, identical env)
Repro = pypto's
tests/st/distributed/test_l3_allreduce.py, the case named in #897, with the env vars #897 lists as triggers:sched_error_code=100countaclrtSynchronizeStreamWithTimeout(2 s)— that's the next outer hang detector down the stack (PLATFORM_STREAM_SYNC_TIMEOUT_MS=2000), independent of the scheduler-side latch this PR addresses. A follow-up may want to extend the host timeout for the same skew-tolerance reason; out of scope here.sched_error_code=100across 30 rounds with the fix; this is the exact signature [Bug] Idle scheduler thread independently latches PTO2_ERROR_SCHEDULER_TIMEOUT, causing fatal cascade in distributed runs #897 documented.Why these two changes specifically (not the alternatives in #897 body)
#897's author proposed three options (skip-latch-if-idle / barrier counter / wall-clock budget). The minimal correct fix is both the idle gate AND the wall-clock budget — independently, neither is sufficient:
Together: only a thread that both observed the global wall-clock budget elapse and owns a RUNNING task it has been polling will latch. The barrier-counter option (proposed #2 in the issue body) was not adopted — adding shared atomics on the fatal path is heavier than the gate-and-poll structure here, and the semantics around "all threads at idle limit" is harder to reason about during normal completion vs. true hang.
Out of scope
aclrtSyncStreamWithTimeouttimeout (the 1/30 in the verify column above). That's a separate skew-tolerance bug at the outer layer.PR #925(drop 0xDEADBEEF Ctrl placeholder + retry halMemCtl EACCES race) handled a different 507046 path on dev=11 specifically; unrelated to [Bug] Idle scheduler thread independently latches PTO2_ERROR_SCHEDULER_TIMEOUT, causing fatal cascade in distributed runs #897.Test plan
test_l3_allreducewith [Bug] Idle scheduler thread independently latches PTO2_ERROR_SCHEDULER_TIMEOUT, causing fatal cascade in distributed runs #897 trigger env → 29 pass / 1 fail with zerosched_error_code=100(vs. baseline 15/15 fail with 15 latch events)sched_error_code=100occurrences — expect to drop to zero🤖 Generated with Claude Code
Update: residual 1/30 also fixed by companion PR #935
After this PR, the same repro shows 29/30 pass; the 1 remaining fail is a host-side
aclrtSyncStream(2 s)timeout — the outer timer in the same #897 cascade, exposed once the inner scheduler latch is suppressed. Root cause:ChipWorker.initskew across chip_processes bleeds peer-rank init time into per-rank op sync budget.#935 adds a cross-chip init barrier (
MailboxState::INIT_DONE+ parent spin-wait) that eliminates the skew at source. With #930 + #935, the same 30-round repro shows 30/30 pass, zerosched_error_code=100, zero host stream sync timeouts.Two PRs because the layers are genuinely independent:
Both should land. Order doesn't matter; no file overlap.
Related
dev=11placeholder fill); independent.Update: CodeRabbit feedback addressed (commit 008b787)
scheduler_dispatch.cpp: if the wall-clock budget elapses andno_thread_owns_running_task()(new helper) andcompleted_tasks_ < total_tasks_, latch fatal. Without this, a dependency-cycle deadlock with zero RUNNING work would spin forever — every ownerless thread would just refresh its timestamp and never latch. Re-verified: same [Bug] Idle scheduler thread independently latches PTO2_ERROR_SCHEDULER_TIMEOUT, causing fatal cascade in distributed runs #897 30-round repro now passes 30/30 (was 29/30 before, the residual was independent host-side 2 s sync; cleared by composing with Fix #897: cross-chip init barrier so peer-rank init skew never charges against per-rank stream sync budget #935).PTO2_SCHED_TIMEOUT_MSenv knob) — skipped. The 5 s default validates 30/30 on the [Bug] Idle scheduler thread independently latches PTO2_ERROR_SCHEDULER_TIMEOUT, causing fatal cascade in distributed runs #897 repro (combined with Fix #897: cross-chip init barrier so peer-rank init skew never charges against per-rank stream sync budget #935) — no evidence the budget needs to differ per-fabric. Per.claude/rules/discipline.mdand CLAUDE.md, adding a config surface without a workload that requires it is premature. Easy follow-up if a future fabric trips the default.Update: #935 merged (2026-05-31)
The companion PR #935 (chip-init barrier) has been merged. This PR is independent (no file overlap) and lands the AICPU-side correctness fix on top. The Verification table above showed 30/30 with both fixes composed; with #935 already in main, #930 alone is sufficient to keep the #897 cascade closed in steady state and to handle any new init-skew sources that may arise later.