[Bug] A2A3 spmd_paged_attention_highperf hardware run times out or produces partial zero output while a2a3sim passes #900

@MirkoDeVita98

Platform

a2a3 (Ascend 910B/C hardware)

Runtime Variant

tensormap_and_ringbuffer

Description

spmd_paged_attention_highperf fails on A2A3 hardware, even though the same test passes on the a2a3sim simulator. The original hardware run timed out in the runtime. After diagnostic changes that avoided unresolved relocations in the dynamically loaded AICore payload, the kernel completed but produced a golden mismatch: a subset of attention heads was computed accurately, while many heads remained all zero.

This points to a hardware-only dynamic payload loading / SPMD context issue rather than a numerical error in the attention math itself. The relevant loader currently extracts only raw .text bytes from AIC/AIV ELF objects and does not apply .rela.text relocations. The PA kernel object contains relocations for out-of-line calls and block-local/global symbols, so those references can be invalid when the raw payload is called directly on device.
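To make the relocation concern concrete, here is a minimal sketch of decoding raw relocation entries, assuming the kernel objects are little-endian ELF64 and that the bytes of .rela.text have already been extracted by some other means. Both are assumptions, not something confirmed for the AIC/AIV toolchain. `parse_rela64` and the demo entry are hypothetical names and values for illustration; only the 24-byte `Elf64_Rela` layout (r_offset, r_info, r_addend) comes from the ELF specification.

```python
import struct

# Elf64_Rela layout: r_offset (u64), r_info (u64), r_addend (i64).
# Little-endian is an assumption about the kernel objects.
RELA_ENTRY = struct.Struct("<QQq")


def parse_rela64(section_bytes):
    """Decode raw Elf64_Rela entries into (offset, symbol_index, type, addend)."""
    if len(section_bytes) % RELA_ENTRY.size:
        raise ValueError("section size is not a multiple of 24 bytes")
    entries = []
    for r_offset, r_info, r_addend in RELA_ENTRY.iter_unpack(section_bytes):
        # ELF64 packs the symbol index in the high 32 bits of r_info
        # and the relocation type in the low 32 bits.
        entries.append((r_offset, r_info >> 32, r_info & 0xFFFFFFFF, r_addend))
    return entries


# Synthetic entry for illustration only: patch site 0x40, symbol 7, type 2, addend -4.
demo = RELA_ENTRY.pack(0x40, (7 << 32) | 2, -4)
print(parse_rela64(demo))
```

Each decoded entry names a location in .text that still needs patching. A loader that copies .text without processing these entries leaves those locations with whatever placeholder the compiler emitted.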

Steps to Reproduce

From PR #899:

cd tests/st/a2a3/tensormap_and_ringbuffer/spmd_paged_attention_highperf
python test_spmd_paged_attention_highperf.py -p a2a3

Control comparison:

python test_spmd_paged_attention_highperf.py -p a2a3sim

Expected Behavior

The A2A3 hardware run should match the golden output within the scene test tolerance, as the a2a3sim run does.

Actual Behavior

The original hardware run failed with an AICPU/runtime timeout similar to:

aclrtSynchronizeStreamWithTimeout (AICPU) failed: 507018
PTO2 runtime failed: orch_error_code=0 sched_error_code=100 runtime_status=-100

After diagnostic changes to avoid some payload relocations, the run completed but failed validation:

FAILED: Golden mismatch on 'out': max_diff=0.60693359375, rtol=0.005, atol=0.02
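For scale, the sketch below checks individual elements against the reported rtol and atol. It assumes the test uses the common allclose-style rule `|device - golden| <= atol + rtol * |golden|`; the actual comparison in the scene test framework may differ. `within_tolerance` is a hypothetical helper, and the second sample pair is made up to represent an error of about 2e-4.

```python
def within_tolerance(device, golden, rtol=0.005, atol=0.02):
    """Elementwise check using the assumed allclose-style rule."""
    return abs(device - golden) <= atol + rtol * abs(golden)


# An unwritten element reported in this issue: device is zero, golden is not.
print(within_tolerance(0.0, 0.60693359375))

# A made-up element with an error of about 2e-4.
print(within_tolerance(0.5002, 0.5))
```

Under that assumed rule, the allowed error for a golden value of 0.607 is about 0.023, so an element left at zero fails by a wide margin while an error on the order of 1e-4 passes easily.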

A manual device-output dump showed that the device is computing some heads correctly and leaving many heads unwritten/all zero. Example stats from the diagnostic run:

Device shape: (1, 32, 128), dtype: torch.float16
Device nonzero: 1280 / 4096
Golden nonzero: 4096 / 4096
Device min/max: -0.515625 / 0.41748046875
Golden min/max: -0.5966796875 / 0.60693359375
Max diff: 0.60693359375
Mean diff: 0.07562728971242905
Argmax diff: (batch=0, head=19, dim=2)

Top mismatches are device zeros where golden is nonzero, for example:

head=19 dim=2:  device=0.0, golden=0.60693359375
head=1  dim=98: device=0.0, golden=-0.5966796875
head=17 dim=123: device=0.0, golden=0.5498046875

Computed heads have tiny error (~1e-4 to 2e-4 max diff), while skipped heads are exactly zero. This suggests work partitioning/SPMD context is not preserved correctly after avoiding the relocation-triggering code paths.
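A per-head breakdown makes this pattern easier to see. If the nonzero elements fill whole heads, 1280 nonzero values at 128 per head would correspond to 10 of the 32 heads being written, though the dump above does not confirm that directly. The sketch below is a hypothetical helper (`summarize_heads` is not part of the test suite) that works on nested lists shaped [batch][head][dim], such as the result of calling `.tolist()` on the device and golden tensors. The sample data is synthetic.

```python
def summarize_heads(device, golden):
    """Split heads into skipped (all zero on device) and computed (with max diff)."""
    skipped, computed = [], []
    for b, (dev_b, gold_b) in enumerate(zip(device, golden)):
        for h, (dev_h, gold_h) in enumerate(zip(dev_b, gold_b)):
            device_all_zero = all(v == 0.0 for v in dev_h)
            golden_has_data = any(g != 0.0 for g in gold_h)
            if device_all_zero and golden_has_data:
                skipped.append((b, h))
            else:
                max_diff = max(abs(d - g) for d, g in zip(dev_h, gold_h))
                computed.append((b, h, max_diff))
    return skipped, computed


# Synthetic data: 1 batch, 3 heads, 2 dims; head 1 was never written.
golden = [[[0.5, -0.25], [0.125, 0.75], [1.0, -1.0]]]
device = [[[0.5, -0.25], [0.0, 0.0], [1.0, -1.0]]]

skipped, computed = summarize_heads(device, golden)
print("skipped heads:", skipped)
print("computed heads:", computed)
```

Comparing the skipped head indices against the block-to-head mapping used by the kernel could show whether specific block indices are missing, which would support the SPMD context explanation.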

Git Commit ID

e85e8aa

CANN Version

9.0.0

Driver Version

25.5.1

Host Platform

Linux (aarch64)

Additional Context

No response
