Skip to content

Fix: aicpu-kernel-launch hardware end-to-end issues#929

Merged
ChaoWao merged 1 commit into
hw-native-sys:mainfrom
hw-native-sys-bot:fix/aicpu-kernel-launch-hardware-fixes
May 31, 2026
Merged

Fix: aicpu-kernel-launch hardware end-to-end issues#929
ChaoWao merged 1 commit into
hw-native-sys:mainfrom
hw-native-sys-bot:fix/aicpu-kernel-launch-hardware-fixes

Conversation

@hw-native-sys-bot
Copy link
Copy Markdown
Collaborator

Summary

Two bugs in #923 that the link-smoke CI step couldn't catch (it
compiles + links the binary but doesn't actually run it). Surfaced when
I ran launch_hello on a3 hardware for the first time post-merge.

# Bug Fix
1 rtsBinaryLoadFromFile returned rc=107000 on the JSON descriptor. The temp file I generated with plain mkstemp(\"/tmp/simpler_inner_XXXXXX\") had no .json suffix; CANN rejects descriptors without that extension. (The working aicpu-device-query tool uses /tmp/simpler_inner_<fp>_<pid>.json, hence why nobody noticed.) Switched to mkstemps(\"/tmp/simpler_inner_XXXXXX.json\", 5) — keeps the suffix while still randomising the basename, preserving the symlink-attack hardening that introduced the original mkstemp change.
2 halGetDeviceInfo(AICPU, CORE_NUM) returns rc=3 from inside an AICPU OS process — the query is \"used in device\" restricted on both a3 and a5. The original printf in launch_hello claimed the expected output was rc=0 val=6, referencing the host-side rtGetAiCpuCount, which is wrong for device-side. Switched the device kernel to halGetDeviceInfo(AICPU, OS_SCHED), which returns 0x1 on both a3 and a5 (AICPU OS owns cpu_id 0, per the device-side probe writeup in src/{a2a3,a5}/docs/hardware.md). The chosen query now matches a documented device-side fact, and the printf can mark rc=0 val=0x1 as OK.

Verified on a3

Ascend910_9392, device 4, via task-submit:

[bootstrap] inner SO landed at /usr/lib64/aicpu_kernels/0/aicpu_kernels_device/simpler_inner_771d034f21df8b91_4.so
[bootstrap] fp=771d034f21df8b91 (token=a58fd475f00dbeef)

=== device=4  hello_aicpu HelloResult ===
  magic         = 0xdeadbeefc0ffee01  OK
  echoed_token  = 0xa58fd475f00dbeef  OK (expected 0xa58fd475f00dbeef)
  hal AICPU+OS_SCHED  rc=0 val=0x1  OK

README updated: the pipeline diagram and expected-output sample
reflect the new HAL call, and a short paragraph points at the
device-side probe writeup for why 0x1 is the right answer.

Why the link-smoke CI didn't catch this

The ut-a2a3 / ut-a5 smoke step builds the device SO (cross-compile)

  • host launcher (link), but does not run the binary — that would
    need a locked device. Could be added as a separate st-onboard-* step
    if we want CI to validate the end-to-end run, but that's a follow-up.

Test plan

  • Run launch_hello on a3 hardware via task-submit → all three checks OK
  • CI: ut-a2a3 + ut-a5 build smokes still pass with the new code
  • Run on a5 (deferred — same dispatcher pipeline, lower risk of a5-specific divergence)

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 31, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

This PR updates the aicpu-kernel-launch example to change the device-side HAL query from halGetDeviceInfo(AICPU, CORE_NUM) returning 6 to halGetDeviceInfo(AICPU, OS_SCHED) returning 0x1. Device-side kernel code, host-side verification, and documentation are updated consistently. The host also improves temp file creation by using mkstemps with a .json suffix.

Changes

HAL Query Update: CORE_NUM to OS_SCHED

Layer / File(s) Summary
Device-side HAL query implementation
tools/cann-examples/aicpu-kernel-launch/device/hello_aicpu.cpp
Device kernel comments and simpler_aicpu_run implementation switch from querying CORE_NUM to OS_SCHED (constant kInfoOsSched = 5), with updated documentation describing the expected 0x1 bitmask result.
Host-side verification and temp file improvements
tools/cann-examples/aicpu-kernel-launch/host/launch_hello.cpp
HAL verification is updated to expect OS_SCHED result 0x1, and temporary JSON descriptor file creation switches from mkstemp to mkstemps with .json suffix.
README documentation updates
tools/cann-examples/aicpu-kernel-launch/README.md
Pipeline diagram and successful output documentation are updated to show halGetDeviceInfo(AICPU, OS_SCHED) with expected result 0x1 and explanatory notes for a3 and a5 hardware.

Possibly Related PRs

  • hw-native-sys/simpler#923: Updates the newly added aicpu-kernel-launch example to switch halGetDeviceInfo from AICPU+CORE_NUM to AICPU+OS_SCHED with adjusted expected outputs.

Estimated Code Review Effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 A query takes flight, from core count to sched,
Device speaks to host with new values instead,
OS_SCHED now whispers 0x1 so true,
On a3 and a5, the hardware rings through,
With temp files now safely named .json throughout!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately captures the main change: fixing hardware end-to-end issues in the aicpu-kernel-launch example, which aligns with the two bugs addressed in the changeset.
Description check ✅ Passed The description is detailed and directly related to the changeset, explaining the two bugs found during hardware testing and their fixes, with verification results and test plan.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the AICPU kernel launch example to query AICPU + OS_SCHED instead of AICPU + CORE_NUM via halGetDeviceInfo, as the latter is restricted inside AICPU OS processes. It also modifies the host-side code to use mkstemps to ensure the temporary JSON file has a .json extension, which is required by CANN's rtsBinaryLoadFromFile. Feedback on the changes points out a potential portability issue in launch_hello.cpp where got.hal_value is cast to long and printed with %lx, which can cause truncation on platforms where long is 32-bit (e.g., Windows). It is recommended to cast to unsigned long long and use %llx instead.

Comment thread tools/cann-examples/aicpu-kernel-launch/host/launch_hello.cpp
Two bugs surfaced when first running launch_hello on a3 hardware.
Neither was caught by the link-smoke CI (the link smoke compiles +
links the binary, doesn't actually run it).

1. `rtsBinaryLoadFromFile` returned rc=107000 on the JSON descriptor.
   Root cause: I switched the temp file from a predictable
   /tmp/simpler_inner_<fp>_<pid>.json (which the working
   aicpu-device-query uses) to plain `mkstemp("/tmp/simpler_inner_XXXXXX")`
   with no .json suffix. CANN's rtsBinaryLoadFromFile rejects paths
   without the .json extension. Switched to `mkstemps("/tmp/simpler_inner_XXXXXX.json", 5)`
   which keeps the suffix while still randomising the basename
   (the security-against-symlink-attacks property the original
   mkstemp change introduced).

2. The HAL call `halGetDeviceInfo(AICPU, CORE_NUM)` returns rc=3
   from inside an AICPU OS process — the query is "used in device"
   restricted on both a3 and a5. My printf in launch_hello claimed
   the expected output was `rc=0 val=6` referencing host-side
   rtGetAiCpuCount, but that count is only host-readable. Switched
   the device kernel to `halGetDeviceInfo(AICPU, OS_SCHED)`, which
   returns 0x1 on both a3 and a5 (AICPU OS owns cpu_id 0, per the
   device-side probe writeup in src/{a2a3,a5}/docs/hardware.md).
   The chosen query now matches a documented device-side fact, the
   demo's HAL output line is meaningful, and the printf can mark
   the result `OK` when it matches `0x1`.

Verified end-to-end on a3 (Ascend910_9392, device 4) via task-submit:

  === device=4  hello_aicpu HelloResult ===
    magic         = 0xdeadbeefc0ffee01  OK
    echoed_token  = 0xa58fd475f00dbeef  OK
    hal AICPU+OS_SCHED  rc=0 val=0x1  OK

README updated: pipeline diagram now shows `halGetDeviceInfo(AICPU,
OS_SCHED) -> 0x1` and the expected-output sample reflects the new
HAL call + adds a paragraph pointing at the device-side probe
writeup for why 0x1 is the right answer.
@ChaoWao ChaoWao force-pushed the fix/aicpu-kernel-launch-hardware-fixes branch from e39f965 to 5f537f8 Compare May 31, 2026 02:13
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tools/cann-examples/aicpu-kernel-launch/device/hello_aicpu.cpp (1)

34-34: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Stale CORE_NUM reference in the output schema comment.

This hal_rc doc comment still describes the CORE_NUM query while the kernel now issues OS_SCHED. Since this file is explicitly a copy-paste template, the leftover reference is misleading.

📝 Proposed fix
-//     int32  hal_rc        // halGetDeviceInfo(AICPU, CORE_NUM) rc
+//     int32  hal_rc        // halGetDeviceInfo(AICPU, OS_SCHED) rc
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/cann-examples/aicpu-kernel-launch/device/hello_aicpu.cpp` at line 34,
Update the stale comment describing the hal_rc output schema: replace the
mention of CORE_NUM with the current query name OS_SCHED and adjust the brief
description accordingly so hal_rc documents the result of
halGetDeviceInfo(AICPU, OS_SCHED) (or the equivalent kernel OS_SCHED query)
instead of CORE_NUM; locate the comment that references hal_rc in
hello_aicpu.cpp and edit the text to accurately reflect OS_SCHED semantics.
tools/cann-examples/aicpu-kernel-launch/README.md (1)

106-106: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Stale CORE_NUM reference in the I/O contract block.

The hal_rc field comment still references CORE_NUM, which is now inconsistent with the rest of this README (pipeline diagram, expected output) after the OS_SCHED migration.

📝 Proposed fix
-    int32_t  hal_rc;                 // halGetDeviceInfo(AICPU, CORE_NUM) rc
+    int32_t  hal_rc;                 // halGetDeviceInfo(AICPU, OS_SCHED) rc
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/cann-examples/aicpu-kernel-launch/README.md` at line 106, The comment
for the hal_rc variable is stale — it references CORE_NUM but the code/README
now uses the OS_SCHED migration; update the hal_rc comment to reflect the
correct halGetDeviceInfo parameter (replace CORE_NUM with the current constant
used after the OS_SCHED migration, e.g., OS_SCHED) so the comment on hal_rc and
the halGetDeviceInfo(AICPU, ...) call matches the pipeline diagram and expected
output.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@tools/cann-examples/aicpu-kernel-launch/device/hello_aicpu.cpp`:
- Line 34: Update the stale comment describing the hal_rc output schema: replace
the mention of CORE_NUM with the current query name OS_SCHED and adjust the
brief description accordingly so hal_rc documents the result of
halGetDeviceInfo(AICPU, OS_SCHED) (or the equivalent kernel OS_SCHED query)
instead of CORE_NUM; locate the comment that references hal_rc in
hello_aicpu.cpp and edit the text to accurately reflect OS_SCHED semantics.

In `@tools/cann-examples/aicpu-kernel-launch/README.md`:
- Line 106: The comment for the hal_rc variable is stale — it references
CORE_NUM but the code/README now uses the OS_SCHED migration; update the hal_rc
comment to reflect the correct halGetDeviceInfo parameter (replace CORE_NUM with
the current constant used after the OS_SCHED migration, e.g., OS_SCHED) so the
comment on hal_rc and the halGetDeviceInfo(AICPU, ...) call matches the pipeline
diagram and expected output.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fc87f89a-0432-412a-bfaf-10d3f4e9a256

📥 Commits

Reviewing files that changed from the base of the PR and between 55c3d8b and 5f537f8.

📒 Files selected for processing (3)
  • tools/cann-examples/aicpu-kernel-launch/README.md
  • tools/cann-examples/aicpu-kernel-launch/device/hello_aicpu.cpp
  • tools/cann-examples/aicpu-kernel-launch/host/launch_hello.cpp

@ChaoWao ChaoWao merged commit 39c0445 into hw-native-sys:main May 31, 2026
29 of 31 checks passed
@ChaoWao ChaoWao deleted the fix/aicpu-kernel-launch-hardware-fixes branch May 31, 2026 02:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants