Platform
a2a3 (Ascend 910B/C hardware)
Runtime Variant
tensormap_and_ringbuffer
Description
spmd_paged_attention_highperf does not behave correctly on A2A3 hardware, even though the same test passes on the a2a3sim simulator. The original hardware run timed out in the runtime. After diagnostic changes that avoided unresolved relocations in the dynamically loaded AICore payload, the kernel completed but produced a golden mismatch: some attention heads were computed accurately, while many others remained all zero.
This points to a hardware-only dynamic payload loading / SPMD context issue rather than a numerical error in the attention math itself. The relevant loader currently extracts only raw .text bytes from AIC/AIV ELF objects and does not apply .rela.text relocations. The PA kernel object contains relocations for out-of-line calls and block-local/global symbols, so those references can be invalid when the raw payload is called directly on device.
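One way to confirm that a kernel object carries relocations against .text is to read its section header table. The following is a minimal sketch, not the project's loader: it assumes the AIC/AIV object is an ELF64 little-endian file (not verified here), and the function names read_sections and text_relocation_sections are hypothetical.

```python
import struct

SHT_RELA = 4  # relocation entries with explicit addends
SHT_REL = 9   # relocation entries without addends


def read_sections(elf: bytes):
    """Return [(name, sh_type, sh_size)] for an ELF64 little-endian image."""
    if elf[:4] != b"\x7fELF":
        raise ValueError("not an ELF image")
    if elf[4] != 2 or elf[5] != 1:
        # Assumption: only ELF64 (class 2), little-endian (data 1) is handled.
        raise ValueError("this sketch only handles ELF64 little-endian")

    # ELF64 header: e_shoff at 0x28; e_shentsize, e_shnum, e_shstrndx at 0x3A.
    (e_shoff,) = struct.unpack_from("<Q", elf, 0x28)
    e_shentsize, e_shnum, e_shstrndx = struct.unpack_from("<HHH", elf, 0x3A)
    if e_shnum == 0:
        return []

    headers = []
    for i in range(e_shnum):
        # Elf64_Shdr: name, type, flags, addr, offset, size, link, info, ...
        fields = struct.unpack_from("<IIQQQQIIQQ", elf, e_shoff + i * e_shentsize)
        sh_name, sh_type, _flags, _addr, sh_offset, sh_size = fields[:6]
        headers.append((sh_name, sh_type, sh_offset, sh_size))

    # Section names live in the string table section indexed by e_shstrndx.
    _, _, str_off, str_size = headers[e_shstrndx]
    strtab = elf[str_off:str_off + str_size]

    def name_at(offset):
        end = strtab.index(b"\x00", offset)
        return strtab[offset:end].decode("ascii", "replace")

    return [(name_at(n), t, size) for n, t, _off, size in headers]


def text_relocation_sections(elf: bytes):
    """Names of non-empty relocation sections that target a .text* section."""
    return [
        name
        for name, sh_type, size in read_sections(elf)
        if sh_type in (SHT_RELA, SHT_REL)
        and size > 0
        and name.startswith((".rela.text", ".rel.text"))
    ]
```

Calling text_relocation_sections on the bytes of the kernel object returns a non-empty list when copying raw .text alone would leave relocation targets unpatched. Running readelf -r on the same object should show the individual entries.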
Steps to Reproduce
From PR #899:
cd tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf
python test_spmd_paged_attention_highperf.py -p a2a3
Control comparison:
python test_spmd_paged_attention_highperf.py -p a2a3sim
Expected Behavior
The A2A3 hardware run should match the golden output within the scene test tolerance, as the a2a3sim run does.
Actual Behavior
The original hardware run failed with an AICPU/runtime timeout similar to:
aclrtSynchronizeStreamWithTimeout (AICPU) failed: 507018
PTO2 runtime failed: orch_error_code=0 sched_error_code=100 runtime_status=-100
After diagnostic changes to avoid some payload relocations, the run completed but failed validation:
FAILED: Golden mismatch on 'out': max_diff=0.60693359375, rtol=0.005, atol=0.02
A manual device-output dump showed that the device is computing some heads correctly and leaving many heads unwritten/all zero. Example stats from the diagnostic run:
Device shape: (1, 32, 128), dtype: torch.float16
Device nonzero: 1280 / 4096
Golden nonzero: 4096 / 4096
Device min/max: -0.515625 / 0.41748046875
Golden min/max: -0.5966796875 / 0.60693359375
Max diff: 0.60693359375
Mean diff: 0.07562728971242905
Argmax diff: (batch=0, head=19, dim=2)
Top mismatches are device zeros where golden is nonzero, for example:
head=19 dim=2: device=0.0, golden=0.60693359375
head=1 dim=98: device=0.0, golden=-0.5966796875
head=17 dim=123: device=0.0, golden=0.5498046875
Computed heads have tiny error (~1e-4 to 2e-4 max diff), while skipped heads are exactly zero. This suggests work partitioning/SPMD context is not preserved correctly after avoiding the relocation-triggering code paths.
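To see which heads were skipped, the device output can be split per head. The sketch below is illustrative only: classify_heads is a hypothetical helper, and it assumes the output for one batch has been converted to nested lists indexed [head][dim], for example with tensor.tolist().

```python
def classify_heads(device, golden, zero_tol=0.0):
    """Split heads into all-zero heads and computed heads.

    device, golden: nested lists indexed [head][dim] for a single batch.
    Returns (zero_heads, computed), where computed maps a head index to
    its maximum absolute difference from the golden output.
    """
    zero_heads = []
    computed = {}
    for head, (dev_row, gold_row) in enumerate(zip(device, golden)):
        if all(abs(v) <= zero_tol for v in dev_row):
            # Nothing was written for this head.
            zero_heads.append(head)
        else:
            computed[head] = max(abs(d - g) for d, g in zip(dev_row, gold_row))
    return zero_heads, computed
```

With the reported shape (1, 32, 128), 1280 nonzero elements equals 10 × 128, which is consistent with 10 of the 32 heads being fully written, provided the computed heads contain no exact zeros. Whether the zero heads form a contiguous block or a strided pattern may help indicate which SPMD blocks did no work.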
Git Commit ID
e85e8aa
CANN Version
9.0.0
Driver Version
25.5.1
Host Platform
Linux (aarch64)
Additional Context
No response