Fix: drop 0xDEADBEEF Ctrl regs placeholder + retry halMemCtl EACCES race #925
PR hw-native-sys#710 added a silent fallback to 0xDEADBEEF placeholder addresses in get_aicore_regs(AIC_CTRL) when halMemCtl rejected the query, on the claim that "the dispatch path does not actually dereference these addresses". The claim is wrong: platform_init_aicore_regs and platform_deinit_aicore_regs do raw MMIO writes/reads through these addresses (FAST_PATH_ENABLE, DATA_MAIN_BASE, COND), so any chip that hit the fallback dispatched its first AICore task to 0xDEADBEEF. AICore never reached FAST_PATH_OPEN, the AICPU stream hung, and the host surfaced ACL_ERROR_RT_STREAM_SYNC_TIMEOUT (507046) after ~2 s. This matches the recurring `device_id=11` stream-sync timeout flake on st-onboard-a2a3 (e.g. 2026-05-29 sdma_async_completion_demo, the 2026-05-30 *_distributed[4] sequence).

The placeholder fix is symmetric across a2a3 + a5: propagate the HAL rc instead of synthesizing addresses. On a5, get_aicore_regs gains an int return type; the prior `host_regs.empty()` guard never fired because get_aicore_reg_info pre-resizes the vector.

The upstream rc=13 (EACCES) on a2a3 has its own root cause: when 4 chip_processes for the same a2a3 runner fork concurrently (4-device distributed cases), a narrow driver-side serialization window for halMemCtl(AIC_CTRL) drops one request, empirically always dev=11. A short bounded retry (50 ms × 3) absorbs the race window without masking permanent failure modes. The retry applies only on a2a3; a5 uses halResMap and has not exhibited the symptom.

PR hw-native-sys#890's per-device staged-SO naming fix addressed a different 507046 path (paired-die simpler_inner_<fp>.so file race producing a simpler_aicpu_exec 0x2a cascade). The two bugs share the surface code but are independent; both fixes are needed to keep the flake rate down to zero.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
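The bounded retry described in the commit message can be sketched roughly as below. This is an illustration, not the PR's code: `query_with_bounded_retry`, the constants, and the `query` callable (standing in for the real `halMemCtl` call, whose signature is not shown here) are all hypothetical, and reading "50 ms × 3" as three attempts in total is an assumption.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical constants; the PR text only states "50 ms x 3".
constexpr int kEaccesRc = 13;     // rc the PR reports for EACCES
constexpr int kMaxAttempts = 3;   // assumption: three attempts in total
constexpr std::chrono::milliseconds kRetryDelay{50};

// `query` stands in for the halMemCtl(AIC_CTRL) request; it returns the HAL rc.
inline int query_with_bounded_retry(const std::function<int()>& query,
                                    std::chrono::milliseconds delay = kRetryDelay) {
    int rc = 0;
    for (int attempt = 1; attempt <= kMaxAttempts; ++attempt) {
        rc = query();
        if (rc != kEaccesRc) {
            return rc;  // success, or a failure that is not the transient EACCES
        }
        if (attempt < kMaxAttempts) {
            std::this_thread::sleep_for(delay);  // wait out the race window
        }
    }
    return rc;  // still EACCES after the last attempt: propagate, do not mask
}
```

Only rc=13 is retried, so any other failure returns immediately, and a persistent EACCES still reaches the caller after the last attempt.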
Code Review
This pull request removes the fallback placeholder address generation logic upon HAL failure in both a2a3 and a5 platform implementations of get_aicore_regs, opting instead to propagate the failure code unconditionally to prevent potential deadlocks. Additionally, a retry mechanism has been introduced in a2a3's get_aicore_reg_info to handle transient EACCES (rc=13) errors during concurrent chip bring-up. There are no review comments, so we have no feedback to provide.
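The rc propagation summarized above can be illustrated with a small sketch, which also shows why an `empty()` check on a pre-resized vector cannot detect a failed query. `get_aicore_regs_sketch`, its parameters, the register count, and the addresses are made up for illustration; the real `get_aicore_regs` / `get_aicore_reg_info` signatures are not shown in this PR text.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the HAL-backed lookup. `hal_rc` simulates the rc
// that the underlying HAL call would return.
inline int get_aicore_regs_sketch(std::vector<std::uint64_t>& regs, int hal_rc) {
    // Pre-resize, as the PR says get_aicore_reg_info does: from here on
    // regs.empty() is false whether or not the query succeeds.
    regs.resize(4);
    if (hal_rc != 0) {
        return hal_rc;  // propagate the failure; no placeholder addresses
    }
    for (std::size_t i = 0; i < regs.size(); ++i) {
        regs[i] = 0x1000u * (i + 1);  // made-up addresses for illustration
    }
    return 0;
}
```

A caller that checks the returned rc sees the failure directly, whereas a caller that only tests `regs.empty()` would proceed with unusable entries.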
Summary
- Drop the `0xDEADBEEF` placeholder fallback in `host_regs.cpp` (a2a3 + a5). The dispatch path actually dereferences these addresses via `write_reg`/`read_reg` in `platform_init_aicore_regs`/`platform_deinit_aicore_regs`; the placeholder fill silently produced a 2 s AICPU stream sync timeout (507046) on the affected chip. Propagate the HAL rc instead.
- Retry `halMemCtl` rc=13 (EACCES) in the a2a3 path (50 ms × 3). Concurrent chip_process fork on the 4-device runner consistently loses one driver-side serialization window, empirically always on `dev=11`. The retry absorbs the race without masking permanent failure modes.
Root cause
PR #710 introduced the placeholder fallback with the claim "the dispatch path does not actually dereference these addresses"; that claim is wrong. `src/a2a3/platform/src/aicpu/platform_regs.cpp:62-91` (`platform_init_aicore_regs`/`platform_deinit_aicore_regs`) issues raw MMIO via `write_reg`/`read_reg`. With placeholder addresses, the AICore handshake never completes; the AICPU stream stalls and the host surfaces `ACL_ERROR_RT_STREAM_SYNC_TIMEOUT` (507046) after `PLATFORM_STREAM_SYNC_TIMEOUT_MS=2000`.
The upstream EACCES has its own cause: when N chip_processes for the same a2a3 runner fork concurrently (4-device `*_distributed[4]` tests are the most reliable trigger), a narrow driver serialization window for `halMemCtl(ADDR_MAP_TYPE_REG_AIC_CTRL)` returns 13 on one of them. The failure has been consistently on `dev=11` across 2026-05-29 and 2026-05-30 CI runs on multiple branches.
This is independent of PR #890's per-device staged-SO fix, which addressed a different 507046 path (paired-die `simpler_inner_<fp>.so` file race producing a `simpler_aicpu_exec` 0x2a cascade). Same surface code, different mechanism.
Symptom evidence (pre-fix)
(Sample: `st-onboard-a2a3` 2026-05-29 09:13 / 2026-05-30 09:23 / 09:24.)
Why this surfaces as a Spmd/sdma/distributed flake
507046 is consistently surfaced as the failing test's first error, but the actual victim is whichever case happened to be running on `dev=11` when the race fires; subsequent cases on the borked chip cascade with 507899/507901 from `aclrtSynchronizeStream` during teardown.
Scope
- `void → int` return on `get_aicore_regs` closes a dead-code path (the prior `host_regs.empty()` guard never fired because `get_aicore_reg_info` pre-resizes the vector).
Test plan
- `st-onboard-a2a3` passes on this PR (CI)
- `*_distributed[4]` family ≥10× on the runner; expect zero 507046 with `device_id=11` (was: ~3% in the 200 most recent runs sampled before fix)
- `ut-a2a3` + `ut-a5` (host_regs is reachable from offline tooling)
🤖 Generated with Claude Code
Relationship to other open PRs
This PR addresses a different 507046 path from the #897 family:
- `halMemCtl` race on `dev=11` → placeholder dispatch addresses → AICPU stream sync timeout (507046). `st-onboard-a2a3` specific; independent of distributed runs.
- […] `sched_error_code=100`); the inner timer in the #897 cascade (#897: "[Bug] Idle scheduler thread independently latches PTO2_ERROR_SCHEDULER_TIMEOUT, causing fatal cascade in distributed runs").
No file overlap with #930 or #935. Safe to land independently in any order.