84 changes: 52 additions & 32 deletions src/a5/docs/hardware.md
@@ -50,7 +50,7 @@ this repo.
the spec table is the variant this repo's runtime targets. Check
the actual `Ascend950*.ini` for your SoC to confirm.

## Three views of "how many cores": observation + calibrated inference
## Three views of "how many cores": observation + device-side ground truth

a5's HAL exposes more layers than a3 does. The same `halGetDeviceInfo`
call surface has **different semantics** on a5 vs a3 — do not assume
@@ -63,47 +63,67 @@ HAL counts mean the same thing across generations.
| `rtGetAiCpuCount` | **6** | — | — |
| `aclrtGetDeviceInfo(ACL_DEV_ATTR_AICPU_CORE_NUM)` | **6** | — | — |
| CANN ini `ai_cpu_cnt` / `ai_core_cnt` / `vector_core_cnt` | (per-SKU, see ini) | (per-SKU) | (per-SKU) |
| `halGetDeviceInfo(AICPU, CORE_NUM)` | **8** | — | — |
| `halGetDeviceInfo(AICPU, OCCUPY)` | `0x1fe` (**9-bit** mask, 8 set) | — | — |
| `halGetDeviceInfo(AICPU, CORE_NUM)` host-side | **8** | — | — |
| `halGetDeviceInfo(AICPU, OCCUPY)` host-side | `0x1fe` (**9-bit** mask, 8 set: bits 1..8) | — | — |
| `halGetDeviceInfo(AICPU, IN_USED)` | **8** | — | — |
| `halGetDeviceInfo(AICORE, CORE_NUM)` | — | **36** (per device, = 2 dies × 18) | — |
| `halGetDeviceInfo(AICORE, DIE_NUM)` | — | **2** | — |
| `halGetDeviceInfo(VECTOR_CORE, CORE_NUM)` | — | — | **72** (per device) |
| DSMI `SOC_INFO+CPU_TOPO` | **9 logical CPUs** (8 physical + 1 hyperthread on phy_cpu_id 1) | — | — |

### Two-layer AICPU reservation on a5

`9 → 8 → 6` shows two distinct reservations stacked:

1. **9 logical CPUs** (DSMI CPU_TOPO total): 8 physical Taishan cores
on this die, one of which is hyperthreaded into 2 logical CPUs.
2. **8 in HAL OCCUPY mask** (`0x1fe = 0b111111110`): bit 0 is cleared,
bits 1–8 set. Whatever owns cpu_id 0 — likely the lowest-level
firmware / hypervisor — is below the HAL's view entirely.
3. **6 in `rtGetAiCpuCount`**: the additional 2 cores between HAL's
"occupied 8" and runtime's "user-visible 6" are most plausibly
AICPU-OS-reserved or PG-disabled, by analogy with a3 where a
device-side probe confirmed `OS_SCHED = 0x1` (1 OS core) + the
remaining gap cpu_id is PG fab-disabled.
(See [`src/a2a3/docs/hardware.md`](../../a2a3/docs/hardware.md#device-side-probe-resolves-the-aicpu-question)
for the technique that resolved the a3 question.)

**The a3-equivalent question on a5 is not yet resolved**:
`tools/cann-examples/aicpu-device-query/` should be run on a5 hardware
to read `AICPU + OS_SCHED` from inside an AICPU OS process — that one
bit pattern will tell us how many of the 2 "missing" cores between
HAL's 8 and runtime's 6 are OS-reserved (the rest being PG-disabled).
Until that probe runs on a5, the two-layer breakdown above is
**inference by analogy**, not direct measurement. Likewise the role of
cpu_id 0 (cleared in OCCUPY) — firmware-only / RAS / boot — remains
inferred until a device-side query covers it.
### Device-side probe resolves the AICPU question

CANN's `halGetDeviceInfo` exposes some queries (notably
`MODULE_TYPE_AICPU + INFO_TYPE_OS_SCHED`) that are flagged "used in
device" in the header — they only succeed when called from device-side
AICPU code, not from the host. The `tools/cann-examples/aicpu-device-query/`
companion tool uploads a small inner SO via the dispatcher bootstrap path,
runs HAL queries from inside an AICPU OS process, and reads results
back through GM. On this a5 host (`Ascend950PR_9599`) with local device
id 0 it returns:

| Query | Result | Interpretation |
| ----- | ------ | -------------- |
| `AICPU + OS_SCHED` | `0x1` | **AICPU OS owns exactly cpu_id 0** (single bit) |
| `AICPU + OCCUPY` (device-side) | `0x1f8 = 0b111111000` | **6 cores in the AICPU user pool at cpu_id 3..8** — not the `0x1fe` seen host-side. The 2-bit divergence (bits 1, 2) is the key new finding. |
| `AICPU + PF_OCCUPY` | `0x1f8` | identical to device-side OCCUPY → no SR-IOV / vNPU slicing |
| `AICPU + PF_CORE_NUM` | `6` | PF-view count matches user view → no virtualization |
| `AICPU + CORE_NUM` (device-side) | rc=3 | unlike a3, a5 restricts this query device-side — use `PF_CORE_NUM` instead |
| `CCPU + OCCUPY` | `0x1` | CCPU owns 1 core in its own namespace |
| `DCPU/TSCPU + OCCUPY`, `+ CORE_NUM` | rc=3 | module-level access restricted device-side (same as a3) |

The host-side / device-side OCCUPY divergence is **a5-specific**: on a3
both views return the same `0xfc`. On a5, the host side reports 8 enabled
cores (`0x1fe`), but the device-side AICPU OS exposes only 6 to its user
kernel pool (`0x1f8`). The 2-bit gap (bits 1, 2) exactly matches the lone
hyperthread pair that DSMI CPU_TOPO reports on phy_cpu_id 1: the AICPU OS
keeps the SMT-paired logical CPUs for itself rather than dispatching user
kernels onto them.
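
The mask arithmetic behind that comparison can be checked with a few lines of Python. This is only an illustrative sketch: the two constants are the OCCUPY values quoted in the tables above, and `set_bits` is a hypothetical helper, not part of HAL or of the probe tool.

```python
def set_bits(mask: int) -> list[int]:
    """Return the bit positions (cpu_ids) that are set in an OCCUPY-style mask."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


HOST_OCCUPY = 0x1FE    # host-side AICPU OCCUPY, as quoted above
DEVICE_OCCUPY = 0x1F8  # device-side AICPU OCCUPY, as quoted above

host_ids = set_bits(HOST_OCCUPY)
device_ids = set_bits(DEVICE_OCCUPY)
# cpu_ids visible host-side but missing from the device-side user pool
withheld_ids = set_bits(HOST_OCCUPY & ~DEVICE_OCCUPY)

print(host_ids)      # [1, 2, 3, 4, 5, 6, 7, 8]
print(device_ids)    # [3, 4, 5, 6, 7, 8]
print(withheld_ids)  # [1, 2]
```

The `withheld_ids` result is the 2-bit gap discussed above; on a3, where both views return `0xfc`, the same expression yields an empty list.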

Combined with the absence of any vNPU mode (`is_virtual: no` via ACL),
the AICPU side splits as:

| Slot | Owner | Evidence |
| ---- | ----- | -------- |
| cpu_id 0 | AICPU OS scheduler | OS_SCHED bit 0 = 1 (device-side probe); cleared in host-side OCCUPY by design (OS scheduler is exposed via OS_SCHED, not OCCUPY) |
| cpu_id 1, 2 | Hyperthread pair on phy_cpu_id 1, withheld from the user pool by the AICPU OS | Present in host-side OCCUPY (`0x1fe`), so they are **not** PG fab-disabled (a fab-disabled core is cleared in every view, as cpu_id 1 was on a3). Absent from device-side AICPU OCCUPY (`0x1f8`) and from CCPU OCCUPY (`0x1`). DSMI CPU_TOPO labels exactly this pair as the chip's only SMT pair. The AICPU OS withholds SMT pairs from user dispatch to avoid intra-pair contention. |
| cpu_id 3..8 | user-schedulable (6) | device-side OCCUPY bits 3..8 set; matches `rtGetAiCpuCount=6` and `PF_CORE_NUM=6` |

The 9 → 6 gap on a5 is therefore **1 core reserved for the AICPU OS
(cpu_id 0) + 2 SMT-pair logical CPUs withheld from the user pool
(cpu_id 1, 2)**, not "AICPU-OS-reserved or PG fab-disabled" as the
earlier inference from host-side HAL data alone suggested. PG
fab-disable is ruled out on a5 because the host-side OCCUPY mask
contains both gap slots.
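
The ownership table can be reproduced mechanically from the three masks. The sketch below is an assumption-laden illustration: the function names and category labels are hypothetical, and the classification rule simply encodes this document's reading of the masks (OS_SCHED bit means OS scheduler, device-side OCCUPY bit means user pool, host-only bit means withheld, no bit anywhere means absent, as with the PG fab-disabled slot on a3). The a3 values are the ones quoted in this document.

```python
from collections import Counter


def classify(cpu_id: int, os_sched: int, host_occupy: int, device_occupy: int) -> str:
    """Label one logical cpu_id using this document's interpretation of the masks."""
    bit = 1 << cpu_id
    if os_sched & bit:
        return "os_scheduler"
    if device_occupy & bit:
        return "user"
    if host_occupy & bit:
        return "withheld"  # visible host-side, not in the device-side user pool
    return "absent"        # set in no mask at all (e.g. PG fab-disabled on a3)


def breakdown(n_logical: int, os_sched: int, host_occupy: int, device_occupy: int) -> Counter:
    """Count how many logical CPUs fall into each category."""
    return Counter(
        classify(i, os_sched, host_occupy, device_occupy) for i in range(n_logical)
    )


a5 = breakdown(9, os_sched=0x1, host_occupy=0x1FE, device_occupy=0x1F8)
a3 = breakdown(8, os_sched=0x1, host_occupy=0xFC, device_occupy=0xFC)

print(dict(a5))  # {'os_scheduler': 1, 'withheld': 2, 'user': 6}
print(dict(a3))  # {'os_scheduler': 1, 'absent': 1, 'user': 6}
```

Both generations end up with 6 user cores, but the composition of the gap differs: a5 has no `absent` slot, which is the mask-level form of the "no PG-disable" conclusion.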

### Key semantic differences from a3

| Observation | a3 (Ascend910_93xx) | a5 (Ascend950) |
| ----------- | ------------------- | -------------- |
| `halGetDeviceInfo(AICPU, CORE_NUM)` | 6 (matches user-visible) | **8** (does NOT match user-visible) |
| `halGetDeviceInfo(AICPU, OCCUPY)` | 8-bit `0xfc` | **9-bit `0x1fe`** |
| `halGetDeviceInfo(AICPU, CORE_NUM)` host-side | 6 (matches user-visible) | **8** (does NOT match user-visible) |
| `halGetDeviceInfo(AICPU, CORE_NUM)` device-side | 6 (succeeds) | **rc=3** (restricted) |
| `halGetDeviceInfo(AICPU, OCCUPY)` host-side | 8-bit `0xfc` | **9-bit `0x1fe`** |
| `halGetDeviceInfo(AICPU, OCCUPY)` device-side | `0xfc` (matches host) | **`0x1f8` (differs from host)** — AICPU OS withholds the SMT pair |
| `AICPU` gap composition (HAL → user) | 1 OS-reserved + 1 PG fab-disabled | **1 OS-reserved + 2 SMT-pair withheld** (no PG-disable) |
| Logical vs physical AICPU | no hyperthread evidence | **1 phy core hyperthreaded → 9 logical** |
| `halGetDeviceInfo(AICORE, DIE_NUM)` | fails (rc=3) | works, returns **2** |
| `halGetDeviceInfo(AICORE, CORE_NUM)` | 25 per die | **36 per device** (aggregates both dies) |
@@ -124,7 +144,7 @@ report what user code can address.
| Counting cores in a multi-die a5 device | **per-device** HAL CORE_NUM (= 2 × per-die) |
| Reasoning about hyperthreading on AICPU | **DSMI CPU_TOPO** (only it shows the hyperthread pair on cpu_id 1+2) |
| Writing code expected to also work on a3 | **ACL or CANN ini only** — HAL semantics differ |
| Debugging "I requested N AICPU, only 6 ran" | gap is the **AICPU OS + lowest-level reservation**, totalling 3 cores between physical 9 and user 6 |
| Debugging "I requested N AICPU, only 6 ran" | gap is **1 AICPU OS scheduler core (cpu_id 0) + 2 SMT-pair logical CPUs (cpu_id 1, 2) withheld by the AICPU OS**; the user-visible cap is 6 |

For cross-generation portable code: **always go through ACL or CANN
ini, never HAL**. HAL's CORE_NUM semantics shift between a3 and a5 in
2 changes: 1 addition & 1 deletion tools/README.md
@@ -69,6 +69,6 @@ resolves the "used in device" HAL queries (`AICPU + OS_SCHED`,
`AICPU + PF_*`, etc.) that always fail from host code. Uploads a small
inner SO via the same dispatcher bootstrap path the production runtime
uses; results come back through GM. Documents the resolution of the
a3 AICPU 8 → 6 split — see the tool's own
a3 AICPU 8 → 6 split and the a5 AICPU 9 → 6 split — see the tool's own
[README](./cann-examples/aicpu-device-query/README.md) for build/run
instructions and what it confirmed.
46 changes: 36 additions & 10 deletions tools/cann-examples/aicpu-device-query/README.md
@@ -30,6 +30,32 @@ the AICPU 8 → 6 gap OS-reservation or PG fab-disable?" question on a3:
The full writeup is in
[`src/a2a3/docs/hardware.md`](../../../src/a2a3/docs/hardware.md#device-side-probe-resolves-the-aicpu-question).

On a5 (`Ascend950PR_9599`), running `query_device_hal 0` returned:

```text
AICPU + OS_SCHED rc=0 val=0x1 ← AICPU OS owns cpu_id 0
AICPU + OCCUPY rc=0 val=0x1f8 ← cpu_id 3..8 are the 6 user cores
(DIFFERS from host-side 0x1fe)
AICPU + PF_OCCUPY rc=0 val=0x1f8 ← matches OCCUPY → no vNPU slicing
AICPU + PF_CORE_NUM rc=0 val=0x6 ← PF view = 6, confirms no virtualization
AICPU + CORE_NUM rc=3 ← restricted device-side on a5 (worked on a3)
CCPU + OCCUPY rc=0 val=0x1 ← CCPU has 1 core, occupied
DCPU/TSCPU queries fail (rc=3) — module-level access restricted device-side
```

The 2-bit host-vs-device OCCUPY divergence (host `0x1fe`, device `0x1f8`)
exactly matches DSMI CPU_TOPO's single SMT pair at cpu_id 1, 2 — the
AICPU OS keeps the SMT pair for itself rather than exposing it to user
kernel dispatch. So the a5 9 → 6 gap is:

- **cpu_id 0** = AICPU OS scheduler (`OS_SCHED` bit 0)
- **cpu_id 1, 2** = SMT pair on phy_cpu_id 1, AICPU-OS-reserved (present
in host-side OCCUPY so **not** PG fab-disabled — a5 does not have the
a3-style fab-disable pattern)
- **cpu_id 3..8** = 6 user-schedulable AICPU cores

Full writeup: [`src/a5/docs/hardware.md`](../../../src/a5/docs/hardware.md#device-side-probe-resolves-the-aicpu-question).

## Architecture

Three pieces, exact same wiring as the production runtime's AICPU upload
@@ -94,18 +120,18 @@ task-submit --device auto --device-num 1 \
--run "$REPO/tools/cann-examples/aicpu-device-query/host/build/query_device_hal \$TASK_DEVICE"
```

## Adapting to other arches
## Running on each arch

The same CMakeLists builds for a3 and a5 — only the dispatcher SO you
point at via `SIMPLER_DISPATCHER_SO` is per-arch:

a3 is the only arch this has been validated on. To run on a5:
- a3: `build/lib/a2a3/dispatcher/libsimpler_aicpu_dispatcher.so`
- a5: `build/lib/a5/dispatcher/libsimpler_aicpu_dispatcher.so`

1. Build the device SO with the same CMakeLists — `libascend_hal.so` is
under `${ASCEND_HOME_PATH}/${CMAKE_SYSTEM_PROCESSOR}-linux/devlib/...`
for both arches.
2. Make sure the dispatcher SO you point at via `SIMPLER_DISPATCHER_SO`
is the **a5** one (`build/lib/a5/dispatcher/libsimpler_aicpu_dispatcher.so`)
if running on a5 hardware. The dispatcher SO is per-arch.
3. The `AICPU + OS_SCHED` mask on a5 directly resolves the analogous
question in [`src/a5/docs/hardware.md`](../../../src/a5/docs/hardware.md).
`libascend_hal.so` lives at the same relative path
(`${ASCEND_HOME_PATH}/${CMAKE_SYSTEM_PROCESSOR}-linux/devlib/...`) on
both arches. Both `Ascend910_9392` (a3) and `Ascend950PR_9599` (a5)
have been validated with this tool — see "What it answered" above.

## Scope and limits
