Add: port comm + deferred completion to a5 onboard by jvjhfhg · Pull Request #823 · hw-native-sys/simpler

jvjhfhg · 2026-05-19T12:22:42Z

- Mirror comm_hccl.cpp from a2a3 onboard host (HCCL backend with DIY IPC windows). SDMA workspace overlay is added in the follow-up commit so this base alone does not depend on PTO_ISA_ROOT or libnnopbase, and does not invoke aclnnShmemSdmaStarsQuery at comm_init -- which keeps non-SDMA comm demos unaffected by the current CANN-9.x SDMA-on-a5 gap.
- Graft ensure_acl_ready / create_comm_stream / destroy_comm_stream into a5 DeviceRunner and gate aclrtResetDevice + aclFinalize on acl_ready_ in finalize(); preserve raw rtDeviceReset for pure rt-layer callers.
- Replace pto_runtime_c_api.cpp comm/ACL stubs with forwarding implementations; comm_* C ABI now comes from comm_hccl.cpp.
- Upgrade a5 trb deferred-completion runtime from counter-only to pluggable backend-ops design: CompletionCondition gains completion_type/addr/retired fields, CompletionBackendOps table routes COMPLETION_TYPE_{COUNTER,SDMA_EVENT_RECORD}, scheduler invalidates counter cache lines before polling and retires satisfied conditions.
- Copy backend/sdma/{kernel,scheduler}.h to a5 (kernel-side, dormant until a kernel registers a SDMA condition; a5 pto-isa already exposes SDMA via PTO_NPU_ARCH_A5).
- a5 onboard CMakeLists adds hcomm find_library (FATAL_ERROR on miss).
- Fix Stride ambiguity in async_notify_demo kernels (pto:: qualifier to disambiguate from bisheng's enum class Stride).
- Enable a5 in allreduce_distributed and test_platform_comm platform marks; parametrize the latter via st_platform.
- Convert ported runtime headers to #pragma once on both arches so aicore_completion_mailbox.h / pto_completion_token.h / pto_async_{wait,kernel_api}.h / backend/sdma/*.h are now byte-identical across a2a3 and a5.
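
For readers unfamiliar with the backend-ops pattern named above, here is a minimal host-side sketch of how a per-type ops table might route polling and retire satisfied conditions. It is not the PR's code: the `expected` field, the `poll` signature, `invalidate_cache_line`, `kBackendOps`, and `poll_and_retire` are all hypothetical, and only the names `CompletionCondition`, `CompletionBackendOps`, `completion_type`, `addr`, `retired`, and the two `COMPLETION_TYPE_*` values come from the description.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Names taken from the PR description; numeric values are assumptions.
enum CompletionType : uint32_t {
    COMPLETION_TYPE_COUNTER = 0,
    COMPLETION_TYPE_SDMA_EVENT_RECORD = 1,
    COMPLETION_TYPE_COUNT
};

struct CompletionCondition {
    CompletionType completion_type;
    uint64_t addr;      // address of the word the backend polls
    uint64_t expected;  // hypothetical: counter value that satisfies the condition
    bool retired;       // set once the condition has been observed satisfied
};

// Hypothetical ops table entry: one poll function per completion type.
struct CompletionBackendOps {
    bool (*poll)(const CompletionCondition &cond);
};

// Stand-in for a device cache-invalidate intrinsic; a no-op on the host.
inline void invalidate_cache_line(uint64_t /*addr*/) {}

inline uint64_t load_word(uint64_t addr) {
    return *reinterpret_cast<const volatile uint64_t *>(static_cast<uintptr_t>(addr));
}

static bool poll_counter(const CompletionCondition &c) {
    invalidate_cache_line(c.addr);  // drop any stale copy before reading the counter
    return load_word(c.addr) >= c.expected;
}

static bool poll_sdma_event(const CompletionCondition &c) {
    // Assumption for illustration: a non-zero word means the event was recorded.
    return load_word(c.addr) != 0;
}

static const CompletionBackendOps kBackendOps[COMPLETION_TYPE_COUNT] = {
    {poll_counter},
    {poll_sdma_event},
};

// Polls every non-retired condition through its backend and retires the
// satisfied ones. Returns how many conditions are still pending.
inline size_t poll_and_retire(CompletionCondition *conds, size_t n) {
    size_t pending = 0;
    for (size_t i = 0; i < n; ++i) {
        CompletionCondition &c = conds[i];
        if (c.retired) continue;
        assert(c.completion_type < COMPLETION_TYPE_COUNT);
        if (kBackendOps[c.completion_type].poll(c)) {
            c.retired = true;
        } else {
            ++pending;
        }
    }
    return pending;
}
```

Keeping the dispatch in a table means a new completion type only needs a new enum value and one table entry; the scheduler loop itself stays unchanged.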

Verified: a2a3 onboard, a5 onboard, a5 sim trb runtime builds all clean. No hardware tests run.

gemini-code-assist

Code Review

This pull request implements the HCCL backend for distributed communication on the a5 platform, replacing previous stubs with a functional implementation using ACL IPC primitives. It introduces a symmetric memory pool, updates the DeviceRunner for ACL lifecycle management, and refactors the runtime scheduler to support both counter-based and SDMA event record completion types. Additionally, header guards are modernized to "#pragma once" across several files. Feedback identifies a high-severity issue in the scheduler where the async_ctx.completion_entries array lacks necessary cache invalidation before processing, potentially leading to stale data reads from Global Memory.
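
To make the review feedback concrete, the sketch below shows one way to compute which cache lines cover an array such as `async_ctx.completion_entries` so that each line can be invalidated before the entries are read. It is an illustration only: the 64-byte line size, `LineSpan`, `cache_lines_covering`, and `invalidate_range` are assumptions, and the real invalidate primitive on the device is represented by a caller-supplied callback.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache line size; the real value depends on the target core.
constexpr uint64_t kCacheLineBytes = 64;
static_assert((kCacheLineBytes & (kCacheLineBytes - 1)) == 0, "line size must be a power of two");

struct LineSpan {
    uint64_t first_line_addr;  // address of the first cache line touched
    size_t line_count;         // number of consecutive lines touched
};

// Returns the cache lines overlapped by the byte range [addr, addr + bytes).
inline LineSpan cache_lines_covering(uint64_t addr, size_t bytes) {
    const uint64_t mask = ~(kCacheLineBytes - 1);
    const uint64_t first = addr & mask;
    if (bytes == 0) return {first, 0};
    assert(addr + bytes - 1 >= addr);  // guard against address wraparound
    const uint64_t last = (addr + bytes - 1) & mask;
    return {first, static_cast<size_t>((last - first) / kCacheLineBytes + 1)};
}

// Calls invalidate_line(line_addr) once per cache line covering the range.
// invalidate_line is a hypothetical stand-in for the platform primitive.
template <typename InvalidateFn>
inline void invalidate_range(uint64_t addr, size_t bytes, InvalidateFn invalidate_line) {
    const LineSpan span = cache_lines_covering(addr, bytes);
    for (size_t i = 0; i < span.line_count; ++i) {
        invalidate_line(span.first_line_addr + i * kCacheLineBytes);
    }
}
```

A scheduler following this idea would call something like `invalidate_range` on the whole entries array before iterating over it, rather than invalidating only the counter words it polls.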

The orch-only allocate_domain path (hw-native-sys#817) dropped the comm_alloc_windows step, which was the only caller of aclrtDeviceEnablePeerAccess. domain_alloc_via_ipc still skipped enabling P2P on the now-false assumption the base alloc did it, so on a5/CANN-9.x the device-pair route is never opened.

The IPC VA import still succeeds (host setup + the alloc/release UT pass), but kernel-level cross-chip writes never land, so peer TWAIT/notification waits spin until PTO2_ERROR_SCHEDULER_TIMEOUT (surfaced as ACL 507018).

Add the EnablePeerAccess + PeerAccessStatus poll loop (idempotent, per device-pair) to domain_alloc_via_ipc, mirroring alloc_windows_via_ipc. Applied to both a5 and a2a3 backends (kept byte-identical). a2a3 does not manifest the bug -- it enables the route implicitly -- so its hunk is a defensive safety net.

Verified on a5 (cards 2,3): allreduce_distributed[2] and async_notify_demo go from timeout to PASS; comm UTs still PASS. a2a3 (cards 8,9): allreduce[2] PASS pre- and post-fix.
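
The shape of an idempotent, per-device-pair enable-then-poll loop can be sketched as follows. This is not the code in domain_alloc_via_ipc: `PeerAccessHooks`, `PeerRouteCache`, `ensure_open`, the status value `1` meaning "route open", and the poll budget are all hypothetical, and the real ACL calls are deliberately replaced by injected callbacks so that no ACL signature is assumed here.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <set>
#include <utility>

// Hypothetical hooks standing in for the enable and status-query calls.
// Both return 0 on success; query_status writes 1 to *status once the route is open.
struct PeerAccessHooks {
    std::function<int(int32_t peer)> enable;
    std::function<int(int32_t peer, int32_t *status)> query_status;
};

// Remembers which (local, peer) routes were already opened so repeated
// allocations do not re-enable or re-poll the same device pair.
class PeerRouteCache {
public:
    // Returns true once the local -> peer route is reported open.
    bool ensure_open(int32_t local, int32_t peer, const PeerAccessHooks &hooks,
                     int max_polls = 1000) {
        assert(max_polls > 0);
        if (local == peer) return true;  // no route needed to reach ourselves
        const std::pair<int32_t, int32_t> key{local, peer};
        if (opened_.count(key) != 0) return true;  // idempotent fast path
        if (hooks.enable(peer) != 0) return false;
        for (int i = 0; i < max_polls; ++i) {
            int32_t status = 0;
            if (hooks.query_status(peer, &status) != 0) return false;
            if (status == 1) {
                opened_.insert(key);
                return true;
            }
            // A real implementation would likely sleep briefly between polls.
        }
        return false;  // route did not open within the poll budget
    }

private:
    std::set<std::pair<int32_t, int32_t>> opened_;
};
```

Caching per pair rather than per device matters because the same local device may import windows from several peers, and each route has to be opened separately.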

a5 onboard CI exposes only 2 NPUs. The 4-rank allreduce case (device_count(4)) trips the resource-phase pre-flight static check (parallel_scheduler.py), which aborts the entire phase -- taking the 2-rank case (and every other L3 example job) down with it, so a5 onboard got zero L3 example coverage.

Split the >2-rank case into its own function so a5 can be dropped via the function-level platforms mark (the harness deselects by that mark, not by per-param marks). 2-rank runs everywhere incl. a5 onboard; 4-rank stays on a2a3 hardware + both sims.

Verified on a5 (cards 2,3): 4-rank deselected, 2-rank PASS, no abort.

gemini-code-assist · 2026-05-29T06:33:51Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

coderabbitai · 2026-05-29T06:33:54Z

Warning

Review limit reached

@jvjhfhg, we couldn't start this review because you've reached your PR review rate limit.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9891fc70-b350-406d-b26d-cfb581508065

📥 Commits

Reviewing files that changed from the base of the PR and between c22db51 and 8b4e3e7.

📒 Files selected for processing (26)

examples/a5/tensormap_and_ringbuffer/async_notify_demo/kernels/aiv/kernel_consumer.cpp
examples/a5/tensormap_and_ringbuffer/async_notify_demo/kernels/aiv/kernel_producer_notify.cpp
examples/workers/l3/allreduce_distributed/test_allreduce.py
src/a2a3/platform/onboard/host/comm_hccl.cpp
src/a2a3/platform/onboard/host/device_runner.cpp
src/a2a3/platform/onboard/host/device_runner.h
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/aicore_completion_mailbox.h
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/backend/sdma/sdma_completion_kernel.h
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/backend/sdma/sdma_completion_scheduler.h
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_async_kernel_api.h
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_async_wait.h
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_completion_token.h
src/a5/platform/onboard/host/CMakeLists.txt
src/a5/platform/onboard/host/comm_hccl.cpp
src/a5/platform/onboard/host/device_runner.cpp
src/a5/platform/onboard/host/device_runner.h
src/a5/platform/onboard/host/pto_runtime_c_api.cpp
src/a5/runtime/tensormap_and_ringbuffer/runtime/aicore_completion_mailbox.h
src/a5/runtime/tensormap_and_ringbuffer/runtime/backend/sdma/sdma_completion_kernel.h
src/a5/runtime/tensormap_and_ringbuffer/runtime/backend/sdma/sdma_completion_scheduler.h
src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_async_kernel_api.h
src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_async_wait.h
src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_completion_token.h
src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h
tests/ut/py/test_worker/test_dynamic_alloc_hw.py
tests/ut/py/test_worker/test_platform_comm.py

gemini-code-assist Bot reviewed May 19, 2026

Comment thread src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_async_wait.h

jvjhfhg force-pushed the feat/comm-a5 branch 13 times, most recently from 2b35678 to 4de3393 Compare May 26, 2026 08:35

jvjhfhg marked this pull request as draft May 28, 2026 07:19

jvjhfhg added 3 commits May 29, 2026 14:30

jvjhfhg marked this pull request as ready for review May 29, 2026 06:33

jvjhfhg force-pushed the feat/comm-a5 branch from 4de3393 to 8b4e3e7 Compare May 29, 2026 06:34

jvjhfhg wants to merge 3 commits into hw-native-sys:main from jvjhfhg:feat/comm-a5
