[Bug] A2A3 `spmd_paged_attention_highperf` hardware run times out or produces partial zero output while `a2a3sim` passes #900

### Platform

a2a3 (Ascend 910B/C hardware)

### Runtime Variant

tensormap_and_ringbuffer

### Description

`spmd_paged_attention_highperf` does not behave correctly on A2A3 hardware, even though the same test passes in `a2a3sim` simulation. The original hardware run timed out in the runtime. After diagnostic changes that avoided unresolved relocations in the dynamically loaded AICore payload, the kernel completed but failed the golden comparison: some attention heads were computed correctly, while many others remained all zero.

This points to a hardware-only dynamic payload loading / SPMD context issue rather than a numerical error in the attention math itself. The relevant loader currently extracts only raw `.text` bytes from AIC/AIV ELF objects and does not apply `.rela.text` relocations. The PA kernel object contains relocations for out-of-line calls and block-local/global symbols, so those references can be invalid when the raw payload is called directly on device.
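One way to check whether a given kernel object is affected is to count the relocation entries that target its `.text` section. The sketch below is a hypothetical helper, not part of the repository: `count_text_relocations` is an illustrative name, and it assumes the AIC/AIV objects are little-endian ELF64 files whose code section is named exactly `.text`.

```python
import struct

SHT_RELA, SHT_REL = 4, 9  # relocation section types from the ELF specification


def count_text_relocations(elf: bytes) -> int:
    """Count relocation entries whose target section is `.text`.

    Assumption: `elf` holds a little-endian ELF64 object file.
    """
    ident = elf[:16]
    if ident[:4] != b"\x7fELF" or ident[4] != 2 or ident[5] != 1:
        raise ValueError("expected a little-endian ELF64 file")

    # ELF64 file header: pick out the section header table location and size.
    hdr = struct.unpack_from("<16sHHIQQQIHHHHHH", elf, 0)
    shoff, shentsize, shnum, shstrndx = hdr[6], hdr[11], hdr[12], hdr[13]

    # Each section header: name, type, flags, addr, offset, size,
    # link, info, addralign, entsize.
    sections = [
        struct.unpack_from("<IIQQQQIIQQ", elf, shoff + i * shentsize)
        for i in range(shnum)
    ]
    strtab_off = sections[shstrndx][4]

    def section_name(sec):
        start = strtab_off + sec[0]
        return elf[start:elf.index(b"\0", start)].decode()

    total = 0
    for sec in sections:
        sh_type, sh_size, sh_info, sh_entsize = sec[1], sec[5], sec[7], sec[9]
        # For relocation sections, sh_info is the index of the section
        # the relocations apply to.
        if sh_type in (SHT_RELA, SHT_REL) and sh_entsize:
            if section_name(sections[sh_info]) == ".text":
                total += sh_size // sh_entsize
    return total
```

Calling it on the bytes of a kernel object (for example `count_text_relocations(open(path, "rb").read())`, with `path` pointing at the object in question) and getting a nonzero result would mean the raw `.text` bytes still contain unpatched references, which is consistent with the loader behavior described above.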

### Steps to Reproduce

From PR https://github.com/hw-native-sys/simpler/pull/899 :

```bash
cd tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf
python test_spmd_paged_attention_highperf.py -p a2a3
```

Control comparison:

```bash
python test_spmd_paged_attention_highperf.py -p a2a3sim
```

### Expected Behavior

The A2A3 hardware run should match the golden output within the scene test tolerance, as the `a2a3sim` run does.

### Actual Behavior

The original hardware run failed with an AICPU/runtime timeout similar to:

```text
aclrtSynchronizeStreamWithTimeout (AICPU) failed: 507018
PTO2 runtime failed: orch_error_code=0 sched_error_code=100 runtime_status=-100
```

After diagnostic changes to avoid some payload relocations, the run completed but failed validation:

```text
FAILED: Golden mismatch on 'out': max_diff=0.60693359375, rtol=0.005, atol=0.02
```

A manual device-output dump showed that the device is computing some heads correctly and leaving many heads unwritten/all zero. Example stats from the diagnostic run:

```text
Device shape: (1, 32, 128), dtype: torch.float16
Device nonzero: 1280 / 4096
Golden nonzero: 4096 / 4096
Device min/max: -0.515625 / 0.41748046875
Golden min/max: -0.5966796875 / 0.60693359375
Max diff: 0.60693359375
Mean diff: 0.07562728971242905
Argmax diff: (batch=0, head=19, dim=2)
```

Top mismatches are device zeros where golden is nonzero, for example:

```text
head=19 dim=2:  device=0.0, golden=0.60693359375
head=1  dim=98: device=0.0, golden=-0.5966796875
head=17 dim=123: device=0.0, golden=0.5498046875
```

Computed heads show only tiny error (max diff of roughly `1e-4` to `2e-4`), while skipped heads are exactly zero. This suggests that work partitioning or SPMD context is not preserved correctly once the relocation-triggering code paths are avoided.
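For reference, with an output shape of `(1, 32, 128)`, the `1280 / 4096` nonzero count is consistent with 10 of the 32 heads being written (1280 = 10 × 128), assuming computed heads contain no exact-zero elements. A minimal sketch for listing which heads were skipped is below; `all_zero_heads` is a hypothetical helper that takes the output as nested lists shaped `[batch][head][dim]`.

```python
def all_zero_heads(out):
    """Return (batch, head) index pairs whose every element is exactly zero.

    `out` is a nested sequence shaped [batch][head][dim].
    """
    return [
        (b, h)
        for b, heads in enumerate(out)
        for h, row in enumerate(heads)
        if all(v == 0 for v in row)
    ]
```

For a torch tensor, `all_zero_heads(device_out.tolist())` would work, where `device_out` stands for the dumped device tensor. Comparing the skipped head indices against the expected per-block head assignment may show whether particular block indices are never executed or whether all blocks see the same block index.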

### Git Commit ID

e85e8aa59d47f54ed1ff611321cd4244581ae7cf

### CANN Version

9.0.0

### Driver Version

25.5.1

### Host Platform

Linux (aarch64)

### Additional Context

_No response_
