Fix: aicpu-kernel-launch hardware end-to-end issues#929
Conversation
📝 WalkthroughWalkthroughThis PR updates the aicpu-kernel-launch example to change the device-side HAL query from ChangesHAL Query Update: CORE_NUM to OS_SCHED
Possibly Related PRs
Estimated Code Review Effort🎯 2 (Simple) | ⏱️ ~12 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request updates the AICPU kernel launch example to query AICPU + OS_SCHED instead of AICPU + CORE_NUM via halGetDeviceInfo, as the latter is restricted inside AICPU OS processes. It also modifies the host-side code to use mkstemps to ensure the temporary JSON file has a .json extension, which is required by CANN's rtsBinaryLoadFromFile. Feedback on the changes points out a potential portability issue in launch_hello.cpp where got.hal_value is cast to long and printed with %lx, which can cause truncation on platforms where long is 32-bit (e.g., Windows). It is recommended to cast to unsigned long long and use %llx instead.
Two bugs surfaced when first running launch_hello on a3 hardware.
Neither was caught by the link-smoke CI (the link smoke compiles +
links the binary, doesn't actually run it).
1. `rtsBinaryLoadFromFile` returned rc=107000 on the JSON descriptor.
Root cause: I switched the temp file from a predictable
/tmp/simpler_inner_<fp>_<pid>.json (which the working
aicpu-device-query uses) to plain `mkstemp("/tmp/simpler_inner_XXXXXX")`
with no .json suffix. CANN's rtsBinaryLoadFromFile rejects paths
without the .json extension. Switched to `mkstemps("/tmp/simpler_inner_XXXXXX.json", 5)`
which keeps the suffix while still randomising the basename
(the security-against-symlink-attacks property the original
mkstemp change introduced).
2. The HAL call `halGetDeviceInfo(AICPU, CORE_NUM)` returns rc=3
from inside an AICPU OS process — the query is "used in device"
restricted on both a3 and a5. My printf in launch_hello claimed
the expected output was `rc=0 val=6` referencing host-side
rtGetAiCpuCount, but that count is only host-readable. Switched
the device kernel to `halGetDeviceInfo(AICPU, OS_SCHED)`, which
returns 0x1 on both a3 and a5 (AICPU OS owns cpu_id 0, per the
device-side probe writeup in src/{a2a3,a5}/docs/hardware.md).
The chosen query now matches a documented device-side fact, the
demo's HAL output line is meaningful, and the printf can mark
the result `OK` when it matches `0x1`.
Verified end-to-end on a3 (Ascend910_9392, device 4) via task-submit:
=== device=4 hello_aicpu HelloResult ===
magic = 0xdeadbeefc0ffee01 OK
echoed_token = 0xa58fd475f00dbeef OK
hal AICPU+OS_SCHED rc=0 val=0x1 OK
README updated: pipeline diagram now shows `halGetDeviceInfo(AICPU,
OS_SCHED) -> 0x1` and the expected-output sample reflects the new
HAL call + adds a paragraph pointing at the device-side probe
writeup for why 0x1 is the right answer.
e39f965 to
5f537f8
Compare
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tools/cann-examples/aicpu-kernel-launch/device/hello_aicpu.cpp (1)
34-34:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winStale
CORE_NUMreference in the output schema comment.This
hal_rcdoc comment still describes theCORE_NUMquery while the kernel now issuesOS_SCHED. Since this file is explicitly a copy-paste template, the leftover reference is misleading.📝 Proposed fix
-// int32 hal_rc // halGetDeviceInfo(AICPU, CORE_NUM) rc +// int32 hal_rc // halGetDeviceInfo(AICPU, OS_SCHED) rc🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/cann-examples/aicpu-kernel-launch/device/hello_aicpu.cpp` at line 34, Update the stale comment describing the hal_rc output schema: replace the mention of CORE_NUM with the current query name OS_SCHED and adjust the brief description accordingly so hal_rc documents the result of halGetDeviceInfo(AICPU, OS_SCHED) (or the equivalent kernel OS_SCHED query) instead of CORE_NUM; locate the comment that references hal_rc in hello_aicpu.cpp and edit the text to accurately reflect OS_SCHED semantics.tools/cann-examples/aicpu-kernel-launch/README.md (1)
106-106:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winStale
CORE_NUMreference in the I/O contract block.The
hal_rcfield comment still referencesCORE_NUM, which is now inconsistent with the rest of this README (pipeline diagram, expected output) after the OS_SCHED migration.📝 Proposed fix
- int32_t hal_rc; // halGetDeviceInfo(AICPU, CORE_NUM) rc + int32_t hal_rc; // halGetDeviceInfo(AICPU, OS_SCHED) rc🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/cann-examples/aicpu-kernel-launch/README.md` at line 106, The comment for the hal_rc variable is stale — it references CORE_NUM but the code/README now uses the OS_SCHED migration; update the hal_rc comment to reflect the correct halGetDeviceInfo parameter (replace CORE_NUM with the current constant used after the OS_SCHED migration, e.g., OS_SCHED) so the comment on hal_rc and the halGetDeviceInfo(AICPU, ...) call matches the pipeline diagram and expected output.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@tools/cann-examples/aicpu-kernel-launch/device/hello_aicpu.cpp`:
- Line 34: Update the stale comment describing the hal_rc output schema: replace
the mention of CORE_NUM with the current query name OS_SCHED and adjust the
brief description accordingly so hal_rc documents the result of
halGetDeviceInfo(AICPU, OS_SCHED) (or the equivalent kernel OS_SCHED query)
instead of CORE_NUM; locate the comment that references hal_rc in
hello_aicpu.cpp and edit the text to accurately reflect OS_SCHED semantics.
In `@tools/cann-examples/aicpu-kernel-launch/README.md`:
- Line 106: The comment for the hal_rc variable is stale — it references
CORE_NUM but the code/README now uses the OS_SCHED migration; update the hal_rc
comment to reflect the correct halGetDeviceInfo parameter (replace CORE_NUM with
the current constant used after the OS_SCHED migration, e.g., OS_SCHED) so the
comment on hal_rc and the halGetDeviceInfo(AICPU, ...) call matches the pipeline
diagram and expected output.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: fc87f89a-0432-412a-bfaf-10d3f4e9a256
📒 Files selected for processing (3)
tools/cann-examples/aicpu-kernel-launch/README.mdtools/cann-examples/aicpu-kernel-launch/device/hello_aicpu.cpptools/cann-examples/aicpu-kernel-launch/host/launch_hello.cpp
Summary
Two bugs in #923 that the link-smoke CI step couldn't catch (it
compiles + links the binary but doesn't actually run it). Surfaced when
I ran
launch_helloon a3 hardware for the first time post-merge.rtsBinaryLoadFromFilereturned rc=107000 on the JSON descriptor. The temp file I generated with plainmkstemp(\"/tmp/simpler_inner_XXXXXX\")had no.jsonsuffix; CANN rejects descriptors without that extension. (The workingaicpu-device-querytool uses/tmp/simpler_inner_<fp>_<pid>.json, hence why nobody noticed.)mkstemps(\"/tmp/simpler_inner_XXXXXX.json\", 5)— keeps the suffix while still randomising the basename, preserving the symlink-attack hardening that introduced the originalmkstempchange.halGetDeviceInfo(AICPU, CORE_NUM)returns rc=3 from inside an AICPU OS process — the query is\"used in device\"restricted on both a3 and a5. The original printf inlaunch_helloclaimed the expected output wasrc=0 val=6, referencing the host-sidertGetAiCpuCount, which is wrong for device-side.halGetDeviceInfo(AICPU, OS_SCHED), which returns0x1on both a3 and a5 (AICPU OS owns cpu_id 0, per the device-side probe writeup insrc/{a2a3,a5}/docs/hardware.md). The chosen query now matches a documented device-side fact, and the printf can markrc=0 val=0x1asOK.Verified on a3
Ascend910_9392, device 4, viatask-submit:README updated: the pipeline diagram and expected-output sample
reflect the new HAL call, and a short paragraph points at the
device-side probe writeup for why
0x1is the right answer.Why the link-smoke CI didn't catch this
The
ut-a2a3/ut-a5smoke step builds the device SO (cross-compile)need a locked device. Could be added as a separate
st-onboard-*stepif we want CI to validate the end-to-end run, but that's a follow-up.
Test plan
launch_helloon a3 hardware viatask-submit→ all three checksOKut-a2a3+ut-a5build smokes still pass with the new code