Skip to content

Fix #897: gate AICPU scheduler timeout latch on owned RUNNING task + wall-clock budget#930

Merged
ChaoWao merged 1 commit into
hw-native-sys:mainfrom
hw-native-sys-bot:fix/scheduler-idle-thread-false-latch
May 31, 2026
Merged

Fix #897: gate AICPU scheduler timeout latch on owned RUNNING task + wall-clock budget#930
ChaoWao merged 1 commit into
hw-native-sys:mainfrom
hw-native-sys-bot:fix/scheduler-idle-thread-false-latch

Conversation

@hw-native-sys-bot
Copy link
Copy Markdown
Collaborator

@hw-native-sys-bot hw-native-sys-bot commented May 31, 2026

Fixes #897.

Summary

  • Replace per-thread iteration-count fatal timeout (>= MAX_IDLE_ITERATIONS) with a wall-clock budget (SCHEDULER_TIMEOUT_MS = 5000). Iteration count is per-thread spin-speed-sensitive — a pure-idle thread hits the cap ~2× faster than a sibling that polls a RUNNING task — so the fastest spinner racing ahead is what drove [Bug] Idle scheduler thread independently latches PTO2_ERROR_SCHEDULER_TIMEOUT, causing fatal cascade in distributed runs #897's false-positive cascade.
  • Gate the call to handle_timeout_exit on self_owns_running_task(thread_idx). A thread without an owned RUNNING task has no first-hand evidence its dispatch is stuck. If the wall-clock budget elapses on such a thread, refresh last_progress_ts and keep spinning; the existing STALL diagnostic still fires periodically, so observability is preserved.
  • Symmetric across src/{a2a3,a5}/runtime/tensormap_and_ringbuffer/runtime/scheduler/.

Verification (on shared a2a3 runner via task-submit, locked NPU pair, identical env)

Repro = pypto's tests/st/distributed/test_l3_allreduce.py, the case named in #897, with the env vars #897 lists as triggers:

PTO2_RING_HEAP=4294967296
PTO2_RING_TASK_WINDOW=131072
PTO2_RING_DEP_POOL=131072
pass fail sched_error_code=100 count
baseline @ upstream/main (39c0445) 15 15 15
this PR 29 1 0

Why these two changes specifically (not the alternatives in #897 body)

#897's author proposed three options (skip-latch-if-idle / barrier counter / wall-clock budget). The minimal correct fix is both the idle gate AND the wall-clock budget — independently, neither is sufficient:

  • Idle gate alone: the thread that owns the RUNNING task would still latch on its own iteration count, which is slower than the idle spinners by ~2× but still tripping on legitimate HCCL waits > 92 ms (× per-thread spin rate). False positives just move from idle threads to polling threads.
  • Wall-clock alone: would correctly delay the fatal but still allow any thread to latch on no first-hand evidence. The 5 s budget still expires while a peer's HCCL wait is in flight.

Together: only a thread that both observed the global wall-clock budget elapse and owns a RUNNING task it has been polling will latch. The barrier-counter option (proposed #2 in the issue body) was not adopted — adding shared atomics on the fatal path is heavier than the gate-and-poll structure here, and the semantics around "all threads at idle limit" is harder to reason about during normal completion vs. true hang.

Out of scope

Test plan

🤖 Generated with Claude Code


Update: residual 1/30 also fixed by companion PR #935

After this PR, the same repro shows 29/30 pass; the 1 remaining fail is a host-side aclrtSyncStream(2 s) timeout — the outer timer in the same #897 cascade, exposed once the inner scheduler latch is suppressed. Root cause: ChipWorker.init skew across chip_processes bleeds peer-rank init time into per-rank op sync budget.

#935 adds a cross-chip init barrier (MailboxState::INIT_DONE + parent spin-wait) that eliminates the skew at source. With #930 + #935, the same 30-round repro shows 30/30 pass, zero sched_error_code=100, zero host stream sync timeouts.

Two PRs because the layers are genuinely independent:

Both should land. Order doesn't matter; no file overlap.

Related


Update: CodeRabbit feedback addressed (commit 008b787)


Update: #935 merged (2026-05-31)

The companion PR #935 (chip-init barrier) has been merged. This PR is independent (no file overlap) and lands the AICPU-side correctness fix on top. The Verification table above showed 30/30 with both fixes composed; with #935 already in main, #930 alone is sufficient to keep the #897 cascade closed in steady state and to handle any new init-skew sources that may arise later.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 31, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

This PR addresses false-positive scheduler timeout crashes in distributed runs by replacing iteration-count timeouts with wall-clock timeouts and gating fatal-latch to threads that own cores with active RUNNING tasks. Both a2a3 and a5 scheduler variants receive identical changes.

Changes

Scheduler Timeout with Ownership Gating

Layer / File(s) Summary
Timeout constants and configuration
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_types.h, src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_types.h
SCHEDULER_TIMEOUT_MS (5000ms) and SCHEDULER_TIMEOUT_CYCLES constants replace per-iteration caps, enabling wall-clock measurement to accommodate distributed startup skew.
Ownership check helper
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_context.h, src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp, src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_context.h, src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp
self_owns_running_task(thread_idx) method declared and implemented in both variants to check if a thread owns any core with non-null running_slot_state, gating the fatal condition.
Wall-clock timeout dispatch logic
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp, src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp
Per-thread last_progress_ts tracking replaces iteration count; cycle-budget timeout only calls handle_timeout_exit when self_owns_running_task(thread_idx) is true, otherwise refreshes timestamp and continues spinning to prevent idle peer threads from triggering false fatal crashes.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • hw-native-sys/simpler#872: Extends handle_timeout_exit to accept last-progress for shutdown stall snapshots; complements the wall-clock timeout trigger changes.

Poem

🐰 A scheduler once counted its spins,
Till idle threads crashed with timeout chins.
Now wall-clock and care—
Ownership fair—
No false latch, and everyone wins!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The PR title directly reflects the two main changes: gating AICPU scheduler timeout latch on owned RUNNING task and implementing a wall-clock budget, which align with the changeset's core modifications.
Linked Issues check ✅ Passed The code changes fully implement the linked issue #897 objectives: adding self_owns_running_task() gate, replacing iteration-count timeout with wall-clock budget (SCHEDULER_TIMEOUT_MS/SCHEDULER_TIMEOUT_CYCLES), and gating handle_timeout_exit on ownership checks.
Out of Scope Changes check ✅ Passed All modifications are directly scoped to the linked issue #897: new helper method, timeout constant definitions, and dispatch logic changes for the gate-and-wall-clock mechanism. No unrelated refactoring or feature additions detected.
Description check ✅ Passed The pull request description is highly detailed and directly related to the changeset, explaining the rationale for replacing iteration-count timeouts with wall-clock budgets and gating timeout latches on owned RUNNING tasks.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request replaces the per-thread iteration-count cap for declaring a scheduler timeout with a wall-clock budget of 5 seconds across the a2a3 and a5 runtimes. It introduces the self_owns_running_task helper to ensure that only threads owning a core with an active RUNNING task can trigger a global fatal timeout, preventing false-positive latches caused by startup skew or idle spinning in distributed environments. There are no review comments, and we have no additional feedback to provide.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp`:
- Around line 766-787: The timeout suppression currently resets last_progress_ts
whenever the local thread does not own RUNNING work, which hides global
deadlocks when completed_tasks_ < total_tasks_ and no thread owns any RUNNING
slot; change the branch so that before simply refreshing last_progress_ts you
check for the global no-RUNNING condition and, if true and tasks remain, invoke
the same fatal path (handle_timeout_exit(...)) for this thread. Concretely:
inside the if (get_sys_cnt_aicpu() - last_progress_ts >
SCHEDULER_TIMEOUT_CYCLES) block, after the self_owns_running_task(thread_idx)
check fails, add a check like "if (completed_tasks_ < total_tasks_ &&
no_thread_owns_running()) return handle_timeout_exit(...)" (implement
no_thread_owns_running by scanning scheduler thread slots or using the existing
global running-count helper), otherwise keep updating last_progress_ts as
before.

In
`@src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_types.h`:
- Around line 63-65: Replace the hardcoded constexpr SCHEDULER_TIMEOUT_MS (5000)
with a configurable knob: read a compile-time macro PTO2_SCHED_TIMEOUT_MS (fall
back to 5000 if undefined) or a runtime-config value and compute
SCHEDULER_TIMEOUT_CYCLES from that value (using PLATFORM_PROF_SYS_CNT_FREQ as
before); update uses of SCHEDULER_TIMEOUT_MS and SCHEDULER_TIMEOUT_CYCLES so
they reference the new configurable symbol (e.g., PTO2_SCHED_TIMEOUT_MS or a
getter like get_scheduler_timeout_ms()) and ensure the multiplication still
casts to uint64_t to avoid overflow.

In
`@src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp`:
- Around line 762-783: The timeout reset currently suppresses fatal timeouts
when no thread owns a RUNNING slot, which hides true deadlocks; update the
timeout branch so that after detecting get_sys_cnt_aicpu() - last_progress_ts >
SCHEDULER_TIMEOUT_CYCLES you still call handle_timeout_exit(...) when there are
unfinished tasks and no scheduler thread owns any RUNNING slot. Concretely: keep
the existing self_owns_running_task(thread_idx) -> handle_timeout_exit(...)
path, add an else-if that checks (completed_tasks_ < total_tasks_) and that no
thread owns a RUNNING slot (implement by scanning the scheduler state or adding
a helper like any_thread_owns_running() if none exists), and call
handle_timeout_exit(...) in that case; only when neither condition holds should
you refresh last_progress_ts = get_sys_cnt_aicpu(). Ensure you use the same
handle_timeout_exit signature and existing symbols (last_progress_ts,
SCHEDULER_TIMEOUT_CYCLES, get_sys_cnt_aicpu(), completed_tasks_, total_tasks_,
self_owns_running_task(thread_idx)).

In `@src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_types.h`:
- Around line 64-66: The timeout is hardcoded via constexpr SCHEDULER_TIMEOUT_MS
and SCHEDULER_TIMEOUT_CYCLES; change these to be computed at runtime and
configurable (e.g., via environment variable or a runtime config API) with a
sensible default of 5000 ms. Replace constexpr SCHEDULER_TIMEOUT_MS with a
function or non-constexpr getter (e.g., getSchedulerTimeoutMs()) that reads a
config/env var and falls back to 5000, and recompute SCHEDULER_TIMEOUT_CYCLES
from that runtime value using PLATFORM_PROF_SYS_CNT_FREQ (e.g.,
getSchedulerTimeoutCycles() = uint64_t(getSchedulerTimeoutMs()) *
(PLATFORM_PROF_SYS_CNT_FREQ / 1000)). Ensure callers use the new getter(s)
instead of the removed constexpr symbols.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 89487ddc-5ded-4b95-8ece-70ff7393e82e

📥 Commits

Reviewing files that changed from the base of the PR and between 39c0445 and 2bf0d35.

📒 Files selected for processing (8)
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_context.h
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_types.h
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_context.h
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_types.h

Comment thread src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpp Outdated
@ChaoWao ChaoWao force-pushed the fix/scheduler-idle-thread-false-latch branch 2 times, most recently from 1e50009 to 008b787 Compare May 31, 2026 06:35
ChaoWao added a commit that referenced this pull request May 31, 2026
…s against the per-rank stream sync budget (#935)

ChipWorker.init has a long right tail under PTO2_RING_HEAP=4 GiB
(per-rank device_malloc(16 GiB) pulls host long-tail into the critical
path). Without coordination, a fast-init chip starts its first
launch_aicore_kernel + aclrtSyncStream(PLATFORM_STREAM_SYNC_TIMEOUT_MS)
window N seconds before a slower peer reaches the same point. Any
cross-rank wait inside the dispatched op (HCCL notify, distributed
allreduce, etc.) charges the slow peer's remaining init time against
the fast peer's 2 s host stream sync budget — the cascade observed as
the residual 1/30 fail after PR #930.

Add a new MailboxState::INIT_DONE (value 6). Chip child writes it once
ChipWorker.init returns; parent's _start_hierarchical spin-waits for
every chip child to reach INIT_DONE before dispatching anything
(CTRL_PREPARE or TASK_READY), then resets each child to IDLE so the
standard dispatch state machine resumes from the canonical "ready for
work" state.

Effect: all chip childrens' first aclrtSyncStream windows open at the
same wall-clock instant, so the 2 s budget covers only actual op
execution rather than absorbing peer init skew.

## Verification

Same task-submit + locked-NPU + 30-round protocol as PR #930, on
pypto's tests/st/distributed/test_l3_allreduce.py with the issue
#897 trigger env (PTO2_RING_HEAP=4 GiB, PTO2_RING_TASK_WINDOW=131072,
PTO2_RING_DEP_POOL=131072).

| variant              | pass | fail | sched_error_code=100 | host stream sync timeout |
| -------------------- | ---- | ---- | -------------------- | ------------------------ |
| upstream/main        | 15   | 15   | 15                   | (subsumed)               |
| PR #930 alone        | 29   | 1    | 0                    | 1                        |
| PR #930 + this PR    | 30   | 0    | 0                    | 0                        |

Composes with PR #930 — they address different layers of the same
distributed-init-skew failure mode:
- PR #930 eliminates the false-positive AICPU scheduler latch
  (`sched_error_code=100`) that fires from any idle thread at ~92 ms.
- This PR eliminates the underlying skew so the outer host-side timer
  (`PLATFORM_STREAM_SYNC_TIMEOUT_MS = 2000`) also has no input event
  to fire on.

Either fix alone leaves a residual; both together drop the failure
rate to 0 under the issue's reproduction conditions.

## Scope

The same `INIT_DONE` signaling could be wired into `_sub_worker_loop`
and `_child_worker_loop` (L4+) for symmetry. Skipped here because
sub workers do no device init (Python-only loops) and the L4+ inner
Worker race is unobserved; addressing it would be a follow-up keyed
to its own repro.

Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>
…NNING task + wall-clock budget + deadlock fallback

Issue hw-native-sys#897 documents an idle-thread false-latch in the AICPU scheduler:
any thread reaching MAX_IDLE_ITERATIONS (~92 ms wall at ~115 ns/iter)
unconditionally writes sched_error_code=PTO2_ERROR_SCHEDULER_TIMEOUT (100)
to the shared header, which sibling threads see in handle_orchestrator_exit
and abort. In distributed runs this routinely kills the runtime even when
the only thread with a RUNNING task is making correct progress, because a
peer rank's host-side device_malloc tail-latency easily exceeds 92 ms.

Three changes, applied symmetrically to a2a3 and a5 tensormap_and_ringbuffer:

1. Replace the per-thread iteration-count timeout with a wall-clock budget
   (SCHEDULER_TIMEOUT_MS=5000 → SCHEDULER_TIMEOUT_CYCLES via the existing
   50 MHz sys-cnt frequency). Iteration counts run at per-thread spin speed:
   a pure-idle thread hits MAX_IDLE_ITERATIONS in ~92 ms while a sibling
   that polls a COND register on a RUNNING task takes ~200 ms for the same
   iteration count. Letting the fast spinner race ahead and declare global
   fatal kills the slower-but-correct poller mid-poll. Wall-clock unifies
   the deadline. STALL_LOG_INTERVAL still drives per-thread observability
   on iteration count (diagnostic cadence, not load-bearing).

2. Gate the call to handle_timeout_exit on self_owns_running_task(thread_idx).
   A thread without an owned RUNNING task has no first-hand evidence of a
   stuck dispatch and must not declare global fatal on its own idle clock.
   If the wall-clock budget elapses on a thread with no evidence, refresh
   last_progress_ts so the next budget cycle starts fresh; the STALL
   diagnostic above still fires periodically so observability is preserved
   while the false-positive latch is suppressed. The thread that does own
   a stuck task will reach the budget on its own polls and latch with
   valid evidence (or recover when the COND register flips).

3. Second latch branch for "pre-dispatch / WAIT-only deadlock" (CodeRabbit
   feedback): if the wall-clock budget elapses AND no thread anywhere
   owns a RUNNING task AND tasks remain unfinished
   (completed_tasks_ < total_tasks_), latch. Without this, a genuine
   dependency-cycle deadlock with zero RUNNING work would spin forever —
   every ownerless thread would just refresh its timestamp and never
   latch, and the host-side aclrtSyncStream timeout is the only outer
   detector. The new no_thread_owns_running_task() helper makes the
   global-stuck condition cheap to test on the cold timeout path.

The 5 s budget is sized to cover worst-observed distributed-init skew
(reported as ~196 ms in hw-native-sys#897 round 12; PTO2_RING_HEAP=4 GiB pushes
host-side device_malloc tail latency into the same range) plus typical
HCCL wait, with margin. PLATFORM_STREAM_SYNC_TIMEOUT_MS=2000 on the host
side remains as the outer hang detector for cases the AICPU never reaches.

## Verification

Repro on shared a2a3 runner via task-submit, locked to a clean pair of
NPUs, 30 rounds each, identical env from hw-native-sys#897:
  PTO2_RING_HEAP=4294967296 PTO2_RING_TASK_WINDOW=131072
  PTO2_RING_DEP_POOL=131072

Test: pypto's tests/st/distributed/test_l3_allreduce.py (the exact case
named in hw-native-sys#897). Baseline = upstream/main (pre-fix); fix = this PR's branch.

  baseline 30 rounds:                 pass=15 fail=15 sched_error_code=100=15
  this PR alone (initial):            pass=29 fail=1  sched_error_code=100=0
  this PR (with CodeRabbit changes):  pass=30 fail=0  sched_error_code=100=0
                                      (combined with hw-native-sys#935 chip-init barrier)

The 1/30 residual from the initial version was a host-side
aclrtSyncStream(2 s) timeout — the next outer timer down the stack,
independent of the scheduler latch this PR fixes. Combining with hw-native-sys#935
(cross-chip init barrier) eliminates that residual at source.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ChaoWao ChaoWao force-pushed the fix/scheduler-idle-thread-false-latch branch from 008b787 to cbcc3a1 Compare May 31, 2026 06:47
@ChaoWao ChaoWao merged commit 0309e78 into hw-native-sys:main May 31, 2026
16 checks passed
@ChaoWao ChaoWao deleted the fix/scheduler-idle-thread-false-latch branch May 31, 2026 07:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Idle scheduler thread independently latches PTO2_ERROR_SCHEDULER_TIMEOUT, causing fatal cascade in distributed runs

2 participants