Add: port comm + deferred completion to a5 onboard #823
Code Review
This pull request implements the HCCL backend for distributed communication on the a5 platform, replacing previous stubs with a functional implementation using ACL IPC primitives. It introduces a symmetric memory pool, updates the DeviceRunner for ACL lifecycle management, and refactors the runtime scheduler to support both counter-based and SDMA event record completion types. Additionally, header guards are modernized to "#pragma once" across several files. Feedback identifies a high-severity issue in the scheduler where the async_ctx.completion_entries array lacks necessary cache invalidation before processing, potentially leading to stale data reads from Global Memory.
- Mirror comm_hccl.cpp from a2a3 onboard host (HCCL backend with DIY
IPC windows). SDMA workspace overlay is added in the follow-up
commit so this base alone does not depend on PTO_ISA_ROOT or
libnnopbase, and does not invoke aclnnShmemSdmaStarsQuery at
comm_init -- which keeps non-SDMA comm demos unaffected by the
current CANN-9.x SDMA-on-a5 gap.
- Graft ensure_acl_ready / create_comm_stream / destroy_comm_stream
into a5 DeviceRunner and gate aclrtResetDevice + aclFinalize on
acl_ready_ in finalize(); preserve raw rtDeviceReset for pure
rt-layer callers.
- Replace pto_runtime_c_api.cpp comm/ACL stubs with forwarding
implementations; comm_* C ABI now comes from comm_hccl.cpp.
- Upgrade a5 trb deferred-completion runtime from counter-only to
pluggable backend-ops design: CompletionCondition gains
completion_type/addr/retired fields, CompletionBackendOps table
routes COMPLETION_TYPE_{COUNTER,SDMA_EVENT_RECORD}, scheduler
invalidates counter cache lines before polling and retires
satisfied conditions.
- Copy backend/sdma/{kernel,scheduler}.h to a5 (kernel-side, dormant
until a kernel registers a SDMA condition; a5 pto-isa already
exposes SDMA via PTO_NPU_ARCH_A5).
- a5 onboard CMakeLists adds hcomm find_library (FATAL_ERROR on
miss).
- Fix Stride ambiguity in async_notify_demo kernels (pto:: qualifier
to disambiguate from bisheng's enum class Stride).
- Enable a5 in allreduce_distributed and test_platform_comm platform
marks; parametrize the latter via st_platform.
- Convert ported runtime headers to #pragma once on both arches so
aicore_completion_mailbox.h / pto_completion_token.h /
pto_async_{wait,kernel_api}.h / backend/sdma/*.h are now byte-
identical across a2a3 and a5.
Verified: a2a3 onboard, a5 onboard, a5 sim trb runtime builds all
clean. No hardware tests run.
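
The backend-ops routing described in the commit message can be sketched roughly as follows. This is a minimal host-side illustration, not the runtime's actual code: only the names CompletionCondition, CompletionBackendOps, COMPLETION_TYPE_COUNTER, COMPLETION_TYPE_SDMA_EVENT_RECORD, and the completion_type/addr/retired fields come from the commit message. The `expected` field, the function signatures, the satisfaction predicates, and `host_invalidate` (a no-op stand-in for a real cache invalidate) are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Type tags named in the commit message; the numeric values are assumed.
enum CompletionType : uint32_t {
    COMPLETION_TYPE_COUNTER = 0,
    COMPLETION_TYPE_SDMA_EVENT_RECORD = 1,
    COMPLETION_TYPE_COUNT
};

// Field names follow the commit message; `expected` is hypothetical.
struct CompletionCondition {
    CompletionType completion_type;
    const volatile uint32_t *addr;  // location the backend polls
    uint32_t expected;              // hypothetical target value
    bool retired;                   // set once satisfied; never polled again
};

// One entry per completion type (assumed shape of the ops table).
struct CompletionBackendOps {
    void (*invalidate)(const volatile void *addr, size_t bytes);
    bool (*is_satisfied)(const CompletionCondition &c);
};

// Stand-in for a device cache invalidate; does nothing on the host.
inline void host_invalidate(const volatile void *, size_t) {}

// Assumed predicates: a counter reaches a threshold, an event record
// matches an exact value.
inline bool counter_satisfied(const CompletionCondition &c) {
    return *c.addr >= c.expected;
}
inline bool sdma_event_satisfied(const CompletionCondition &c) {
    return *c.addr == c.expected;
}

// Table indexed by completion_type.
static const CompletionBackendOps kBackendOps[COMPLETION_TYPE_COUNT] = {
    {host_invalidate, counter_satisfied},
    {host_invalidate, sdma_event_satisfied},
};

// Polls every unretired condition once and returns how many are still pending.
inline size_t poll_conditions(CompletionCondition *conds, size_t n) {
    size_t pending = 0;
    for (size_t i = 0; i < n; ++i) {
        CompletionCondition &c = conds[i];
        if (c.retired) continue;
        assert(c.completion_type < COMPLETION_TYPE_COUNT);
        const CompletionBackendOps &ops = kBackendOps[c.completion_type];
        ops.invalidate(c.addr, sizeof(uint32_t));  // drop stale lines before reading
        if (ops.is_satisfied(c)) {
            c.retired = true;
        } else {
            ++pending;
        }
    }
    return pending;
}
```

Keeping the invalidate step inside the per-type ops means a new completion type can bring its own cache handling without touching the scheduler loop; whether the real table is split this way is not stated in the PR.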
The orch-only allocate_domain path (hw-native-sys#817) dropped the comm_alloc_windows step, which was the only caller of aclrtDeviceEnablePeerAccess. domain_alloc_via_ipc still skipped enabling P2P on the now-false assumption that the base alloc had done it, so on a5/CANN-9.x the device-pair route is never opened. The IPC VA import still succeeds (host setup + the alloc/release UT pass), but kernel-level cross-chip writes never land, so peer TWAIT/notification waits spin until PTO2_ERROR_SCHEDULER_TIMEOUT (surfaced as ACL 507018).

Add the EnablePeerAccess + PeerAccessStatus poll loop (idempotent, per device-pair) to domain_alloc_via_ipc, mirroring alloc_windows_via_ipc. Applied to both a5 and a2a3 backends (kept byte-identical). a2a3 does not manifest the bug -- it enables the route implicitly -- so its hunk is a defensive safety net.

Verified on a5 (cards 2,3): allreduce_distributed[2] and async_notify_demo go from timeout to PASS; comm UTs still PASS. a2a3 (cards 8,9): allreduce[2] PASS pre- and post-fix.
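
The enable-then-poll pattern from this commit can be sketched as below. It is an illustration under stated assumptions rather than the code in domain_alloc_via_ipc: `PeerAccessDriver`, `PeerRouteCache`, and `ensure` are hypothetical names, and the two driver hooks stand in for the ACL enable and status calls, whose exact signatures are not shown in the PR. A real loop would likely also sleep between polls.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <set>
#include <utility>

// Hypothetical driver hooks. A real backend would forward these to the
// ACL peer-access enable and status calls; both return 0 on success.
struct PeerAccessDriver {
    std::function<int(int32_t peer)> enable;
    std::function<int(int32_t local, int32_t peer, bool *ready)> status;
};

// Remembers which (local, peer) routes are already open so that repeated
// calls for the same device pair are idempotent.
class PeerRouteCache {
public:
    // Returns 0 once the route is open, a driver error code on failure,
    // or -1 if the route is still not ready after max_polls status checks.
    int ensure(const PeerAccessDriver &drv, int32_t local, int32_t peer,
               int max_polls) {
        if (local == peer) return 0;  // no route needed to self
        const std::pair<int32_t, int32_t> key(local, peer);
        if (enabled_.count(key) != 0) return 0;  // already opened earlier
        if (int rc = drv.enable(peer)) return rc;
        for (int i = 0; i < max_polls; ++i) {
            bool ready = false;
            if (int rc = drv.status(local, peer, &ready)) return rc;
            if (ready) {
                enabled_.insert(key);
                return 0;
            }
            // A real implementation would back off here before re-polling.
        }
        return -1;
    }

private:
    std::set<std::pair<int32_t, int32_t>> enabled_;
};
```

Calling `ensure` once per peer before importing the IPC handles means a missing route shows up as a host-side error at allocation time instead of as a scheduler timeout inside the kernel.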
a5 onboard CI exposes only 2 NPUs. The 4-rank allreduce case (device_count(4)) trips the resource-phase pre-flight static check (parallel_scheduler.py), which aborts the entire phase -- taking the 2-rank case (and every other L3 example job) down with it, so a5 onboard got zero L3 example coverage.

Split the >2-rank case into its own function so a5 can be dropped via the function-level platforms mark (the harness deselects by that mark, not by per-param marks). 2-rank runs everywhere, including a5 onboard; 4-rank stays on a2a3 hardware + both sims.

Verified on a5 (cards 2,3): 4-rank deselected, 2-rank PASS, no abort.