
Fix: drop 0xDEADBEEF Ctrl regs placeholder + retry halMemCtl EACCES race #925

Merged
ChaoWao merged 1 commit into hw-native-sys:main from hw-native-sys-bot:fix/halmemctl-dev11-race on May 31, 2026

hw-native-sys-bot (Collaborator) commented May 30, 2026

Summary

  • Drop the 0xDEADBEEF placeholder fallback in host_regs.cpp (a2a3 + a5). The dispatch path actually dereferences these via write_reg/read_reg in platform_init_aicore_regs / platform_deinit_aicore_regs — placeholder fill silently produced a 2 s AICPU stream sync timeout (507046) on the affected chip. Propagate HAL rc instead.
  • Add a bounded retry on halMemCtl rc=13 (EACCES) in the a2a3 path (50 ms × 3). Concurrent chip_process fork on the 4-device runner consistently loses one driver-side serialization window — empirically always on dev=11. The retry absorbs the race without masking permanent failure modes.
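
The bounded retry described above can be sketched roughly as follows. This is an illustration only, not the code merged in host_regs.cpp: `query_with_bounded_retry`, the constants, and the `std::function` stand-in for the halMemCtl query are hypothetical names, and reading "50 ms × 3" as "up to three retries after the first attempt" is an assumption.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical constants mirroring the PR description (50 ms x 3 on rc=13).
constexpr int kRcEacces = 13;
constexpr int kMaxRetries = 3;  // assumption: retries after the first attempt
constexpr std::chrono::milliseconds kRetryDelay{50};

// Calls `query` (a stand-in for the halMemCtl register query) and retries
// only while it returns EACCES. Any other rc, success or failure, is
// returned immediately so permanent errors are not masked.
// Returns the last rc observed.
int query_with_bounded_retry(const std::function<int()>& query,
                             std::chrono::milliseconds delay = kRetryDelay,
                             int max_retries = kMaxRetries) {
    assert(max_retries >= 0);
    int rc = query();
    for (int attempt = 0; attempt < max_retries && rc == kRcEacces; ++attempt) {
        std::this_thread::sleep_for(delay);  // let the driver window pass
        rc = query();
    }
    return rc;
}
```

Restricting the retry to a single return code and a fixed attempt budget is what keeps the worst-case added latency small (about 150 ms under these assumptions) while still surfacing a persistent rc=13 to the caller.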

Root cause

PR #710 introduced the placeholder fallback with the claim "the dispatch path does not actually dereference these addresses" — that claim is wrong. src/a2a3/platform/src/aicpu/platform_regs.cpp:62-91 (platform_init_aicore_regs / platform_deinit_aicore_regs) issues raw MMIO via write_reg/read_reg. With placeholder addresses, the AICore handshake never completes; the AICPU stream stalls and the host surfaces ACL_ERROR_RT_STREAM_SYNC_TIMEOUT (507046) after PLATFORM_STREAM_SYNC_TIMEOUT_MS=2000.

The upstream EACCES has its own cause: when N chip_processes for the same a2a3 runner fork concurrently (4-device *_distributed[4] tests are the most reliable trigger), a narrow driver serialization window for halMemCtl(ADDR_MAP_TYPE_REG_AIC_CTRL) returns 13 on one of them. The failure has been consistently on dev=11 across 2026-05-29 and 2026-05-30 CI runs on multiple branches.

This is independent of PR #890's per-device staged-SO fix — that addressed a different 507046 path (paired-die simpler_inner_<fp>.so file race producing a simpler_aicpu_exec 0x2a cascade). Same surface code, different mechanism.

Symptom evidence (pre-fix)

[ERROR] get_aicore_reg_info: halMemCtl failed with rc=13
[ERROR] get_aicore_regs(AIC_CTRL): halMemCtl failed: 13, using placeholder addresses
[ERROR] run: Stream sync timeout: stream=AICPU timeout_ms=2000 device_id=11
[ERROR] destroy_comm_stream: aclrtSynchronizeStream during stream teardown failed: 507901
RuntimeError: run_prepared failed with code 507046

(Sample: `st-onboard-a2a3` 2026-05-29 09:13 / 2026-05-30 09:23 / 09:24.)

Why this surfaces as a Spmd/sdma/distributed flake

507046 is consistently surfaced as the failing test's first error, but the actual victim is whichever case happened to be running on dev=11 when the race fires; subsequent cases on the borked chip cascade with 507899/507901 from aclrtSynchronizeStream during teardown.

Scope

  • Only the a2a3 onboard path is touched for the retry (that is where the symptom has been observed).
  • a5 gets the placeholder removal symmetrically; the void → int return on get_aicore_regs closes a dead-code path (the prior host_regs.empty() guard never fired because get_aicore_reg_info pre-resizes the vector).
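
To illustrate why the old `host_regs.empty()` guard could never fire, here is a minimal sketch. All names (`fake_get_reg_info`, `old_style_detects_failure`, `new_style_get_regs`, `kNumCores`) are hypothetical stand-ins, and the simulated `hal_rc` parameter replaces the real HAL call; only the pre-resize behavior and the void → int change come from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kNumCores = 4;  // hypothetical core count for the sketch

// Stand-in for get_aicore_reg_info: it sizes the output vector first, then
// "queries" the HAL. `hal_rc` simulates the HAL result (0 = success).
int fake_get_reg_info(std::vector<std::uint64_t>& regs, int hal_rc) {
    regs.resize(kNumCores);  // pre-resize happens before the HAL call
    assert(regs.size() == kNumCores);
    if (hal_rc != 0) return hal_rc;  // entries stay zero-filled on failure
    for (std::size_t i = 0; i < regs.size(); ++i) regs[i] = 0x1000 * (i + 1);
    return 0;
}

// Old shape: the rc is dropped and failure is inferred from empty().
// Because of the pre-resize this returns false even when the HAL failed.
bool old_style_detects_failure(int hal_rc) {
    std::vector<std::uint64_t> regs;
    (void)fake_get_reg_info(regs, hal_rc);
    return regs.empty();
}

// New shape: the rc is returned, and no addresses are handed back on failure.
int new_style_get_regs(std::vector<std::uint64_t>& regs, int hal_rc) {
    const int rc = fake_get_reg_info(regs, hal_rc);
    if (rc != 0) regs.clear();
    return rc;
}
```

In this sketch `old_style_detects_failure(13)` reports no failure, whereas `new_style_get_regs` hands rc=13 back to the caller, which can then abort initialization instead of programming registers through invalid addresses.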

Test plan

  • st-onboard-a2a3 passes on this PR (CI)
  • Manual: re-run the *_distributed[4] family ≥10× on the runner; expect zero 507046 with device_id=11 (was: ~3% in the 200 most recent runs sampled before fix)
  • Sanity: ut-a2a3 + ut-a5 (host_regs is reachable from offline tooling)
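
As a rough aid for sizing the manual rerun, the sketch below computes how likely a run of clean results is if the flake were still present. It assumes independent runs and takes the ~3% pre-fix rate quoted above as the per-run failure probability; both function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Probability of seeing zero failures in `runs` independent runs when each
// run fails with probability `p`.
double prob_all_pass(double p, int runs) {
    assert(p >= 0.0 && p <= 1.0 && runs >= 0);
    return std::pow(1.0 - p, runs);
}

// Smallest number of runs such that a flake of rate `p`, if still present,
// would have shown up at least once with probability >= `confidence`.
int runs_for_confidence(double p, double confidence) {
    assert(p > 0.0 && p < 1.0 && confidence > 0.0 && confidence < 1.0);
    return static_cast<int>(
        std::ceil(std::log(1.0 - confidence) / std::log(1.0 - p)));
}
```

Under these assumptions, `prob_all_pass(0.03, 10)` is about 0.74, and `runs_for_confidence(0.03, 0.95)` is 99, so the "≥10×" lower bound is best read as a minimum rather than a strong statistical check.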

🤖 Generated with Claude Code


Relationship to other open PRs

This PR addresses a different 507046 path from the #897 family:

No file overlap with #930 or #935. Safe to land independently in any order.

PR hw-native-sys#710 added a silent fallback to 0xDEADBEEF placeholder addresses in
get_aicore_regs(AIC_CTRL) when halMemCtl rejected the query, on the
claim that "the dispatch path does not actually dereference these
addresses". The claim is wrong: platform_init_aicore_regs and
platform_deinit_aicore_regs do raw MMIO writes/reads through these
addresses (FAST_PATH_ENABLE, DATA_MAIN_BASE, COND), so any chip that
hit the fallback dispatched its first AICore task to 0xDEADBEEF —
AICore never reached FAST_PATH_OPEN, the AICPU stream hung, and the
host surfaced ACL_ERROR_RT_STREAM_SYNC_TIMEOUT (507046) after ~2 s.
This matches the recurring `device_id=11` stream-sync timeout flake
on st-onboard-a2a3 (e.g. 2026-05-29 sdma_async_completion_demo, the
2026-05-30 *_distributed[4] sequence).

The placeholder fix is symmetric across a2a3 + a5: propagate the HAL
rc instead of synthesizing addresses. On a5, get_aicore_regs gains an
int return type — the prior `host_regs.empty()` guard never fired
because get_aicore_reg_info pre-resizes the vector.

The upstream rc=13 (EACCES) on a2a3 has its own root cause: when 4
chip_processes for the same a2a3 runner fork concurrently (4-device
distributed cases), a narrow driver-side serialization window for
halMemCtl(AIC_CTRL) drops one request — empirically always dev=11.
A short bounded retry (50 ms × 3) absorbs the race window without
masking permanent failure modes. Only on a2a3; a5 uses halResMap and
has not exhibited the symptom.

PR hw-native-sys#890's per-device staged-SO naming fix addressed a different
507046 path (paired-die simpler_inner_<fp>.so file race producing a
simpler_aicpu_exec 0x2a cascade). The two bugs share the surface
code but are independent; both fixes need to be present to bring the
flake rate down to zero.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
coderabbitai Bot commented May 30, 2026

Warning

Review limit reached

@hw-native-sys-bot, we couldn't start this review because you've reached your PR review rate limit.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f3cb38e3-7cac-402a-b263-14f963b07090

📥 Commits

Reviewing files that changed from the base of the PR and between c830d3a and 29355bd.

📒 Files selected for processing (2)
  • src/a2a3/platform/onboard/host/host_regs.cpp
  • src/a5/platform/onboard/host/host_regs.cpp


gemini-code-assist Bot left a comment

Code Review

This pull request removes the fallback placeholder address generation logic upon HAL failure in both a2a3 and a5 platform implementations of get_aicore_regs, opting instead to propagate the failure code unconditionally to prevent potential deadlocks. Additionally, a retry mechanism has been introduced in a2a3's get_aicore_reg_info to handle transient EACCES (rc=13) errors during concurrent chip bring-up. There are no review comments, so we have no feedback to provide.

ChaoWao merged commit 31d5409 into hw-native-sys:main on May 31, 2026
16 checks passed
ChaoWao deleted the fix/halmemctl-dev11-race branch on May 31, 2026 01:49