[Bug] Dynamic Register/Unregister Instability In A2A3 Sim CI #884

@ccyywwen

Platform

a2a3 (Ascend 910B/C hardware)

Runtime Variant

tensormap_and_ringbuffer

Description

#839 introduced dynamic post-init callable register/unregister coverage under:

tests/st/a2a3/tensormap_and_ringbuffer/dynamic_register/test_dynamic_register.py

The dynamic register/unregister path is still unstable in PR861 CI. Two PR861
CI runs failed in the same dynamic-register ST file, in the same CI job
family:

  1. PR861 CI #2723

  2. PR861 CI #2749

The two runs do not share a failure signature: CI #2723 hit a segmentation
fault in the two-device parallel dynamic-register case, while CI #2749 passed
that case and then hung in the single-device unregister/re-register reuse case.
They should still be tracked together as one PR839 dynamic register/unregister
stability issue, because both failures occur in the same feature area and the
same ST file.

Steps to Reproduce

Run PR861 CI on the host-device_mapped-region branch with the standard CI
workflow and inspect the Ubuntu A2A3 simulation ST job:

CI / st-sim-a2a3 (ubuntu-latest, 3.10)

The relevant full CI invocations were:

PR861 CI #2723:
https://github.com/hw-native-sys/simpler/actions/runs/26559663566

PR861 CI #2749:
https://github.com/hw-native-sys/simpler/actions/runs/26575577396

For local focused reproduction, run the dynamic-register ST cases on A2A3 sim:

pytest tests/st/a2a3/tensormap_and_ringbuffer/dynamic_register/test_dynamic_register.py \
  --platform a2a3sim --device 0-1 -p no:xdist --pto-session-timeout 600

The two observed failing standalone cases can also be targeted directly:

pytest tests/st/a2a3/tensormap_and_ringbuffer/dynamic_register/test_dynamic_register.py::test_register_after_init_parallel_broadcast \
  --platform a2a3sim --device 0-1 -p no:xdist --pto-session-timeout 600

pytest tests/st/a2a3/tensormap_and_ringbuffer/dynamic_register/test_dynamic_register.py::test_register_unregister_register_runs_each_time \
  --platform a2a3sim --device 0 -p no:xdist --pto-session-timeout 600

Because the failures appear intermittent, a single local run may pass. Looping
these focused cases is likely needed to reproduce the instability.
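
A minimal sketch of such a loop, in case it helps with local reproduction. It reuses only the pytest flags from the commands above; the helper names (`build_cmd`, `classify`, `loop_case`), the iteration count, and the 660 s outer timeout are illustrative assumptions, not part of the repository. Treating exit code 124 as a session timeout mirrors what CI #2749 reported and is assumed to hold locally as well.

```python
import subprocess
from collections import Counter

ST_FILE = (
    "tests/st/a2a3/tensormap_and_ringbuffer/"
    "dynamic_register/test_dynamic_register.py"
)

# Case name -> --device value, as used in the focused commands above.
CASES = {
    "test_register_after_init_parallel_broadcast": "0-1",
    "test_register_unregister_register_runs_each_time": "0",
}


def build_cmd(case, device):
    """Build the focused pytest command for one case (flags from the issue)."""
    return [
        "pytest", f"{ST_FILE}::{case}",
        "--platform", "a2a3sim", "--device", device,
        "-p", "no:xdist", "--pto-session-timeout", "600",
    ]


def classify(returncode):
    """Map a return code (or None for an outer timeout) to a coarse label."""
    if returncode is None:
        return "hang"                      # outer subprocess timeout expired
    if returncode == 0:
        return "pass"
    if returncode == 124:
        return "timeout rc=124"            # assumption: same code as in CI #2749
    if returncode < 0:
        return f"signal {-returncode}"     # e.g. -11 means killed by signal 11
    return f"fail rc={returncode}"


def loop_case(case, device, iterations=20, timeout_s=660, run=subprocess.run):
    """Run one case repeatedly and count the outcomes."""
    outcomes = Counter()
    for _ in range(iterations):
        try:
            rc = run(build_cmd(case, device), timeout=timeout_s).returncode
        except subprocess.TimeoutExpired:
            rc = None
        outcomes[classify(rc)] += 1
    return outcomes
```

For example, `loop_case("test_register_unregister_register_runs_each_time", CASES["test_register_unregister_register_runs_each_time"])` returns a `Counter` of outcome labels. Note that `subprocess.run` only kills the direct child on timeout, so descendant processes from a hung run may need to be cleaned up separately.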

Expected Behavior

Dynamic post-init register and unregister should be deterministic and safe in
A2A3 simulation:

  • test_register_after_init_parallel_broadcast should successfully broadcast a
    post-init CTRL_REGISTER to both chip children, return only after each child
    has prepared the callable, and then run the dynamically registered cid on
    both chips without crashing.
  • test_register_unregister_register_runs_each_time should successfully run a
    dynamically registered cid, unregister it, reuse the freed cid slot on a
    subsequent register, and run the re-registered callable without hanging.
  • The full st-sim-a2a3 (ubuntu-latest, 3.10) CI job should complete without
    segfaults, hangs, or session-level timeouts.

Actual Behavior

Observed in PR861 CI #2723:

[scheduler] START standalone test_register_after_init_parallel_broadcast
(rt=tensormap_and_ringbuffer, dev=2) pid=8668 devices=[6, 10]

standalone test_register_after_init_parallel_broadcast
(rt=tensormap_and_ringbuffer, dev=2) [FAIL rc=-11 57.8s, devices=[6, 10]]

tests/st/a2a3/tensormap_and_ringbuffer/dynamic_register/
test_dynamic_register.py Fatal Python error: Segmentation fault

File ".../site-packages/simpler/worker.py", line 2369 in run
File ".../dynamic_register/test_dynamic_register.py", line 234
in test_register_after_init_parallel_broadcast

Process completed with exit code 1.

Observed in PR861 CI #2749:

[scheduler] START standalone test_register_after_init_parallel_broadcast
(rt=tensormap_and_ringbuffer, dev=2) pid=8857 devices=[6, 10]

standalone test_register_after_init_parallel_broadcast
(rt=tensormap_and_ringbuffer, dev=2) [PASS 21.4s, devices=[6, 10]]

[scheduler] START standalone test_register_unregister_register_runs_each_time
(rt=tensormap_and_ringbuffer, dev=1) pid=9389 devices=[8]

[pytest] TIMEOUT: session exceeded 600s (10min) limit

HUNG standalone test_register_unregister_register_runs_each_time
(rt=tensormap_and_ringbuffer, dev=1) pid=9389 devices=[8]
elapsed=490.1s descendants=[9565, 9566]

Process completed with exit code 124.

This indicates that the PR839 dynamic register/unregister path can fail in at
least two ways under CI load: a post-register worker.run(...) segfault in the
two-device broadcast case, and a hang in the unregister/re-register cid reuse
case.
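
To compare signatures across further CI runs, the scheduler result lines can be pulled out of a saved log. The sketch below is written against the log excerpts quoted above only; the regular expressions and the helper names (`describe_rc`, `summarize`) are assumptions and may need adjusting if the real log format differs. It also decodes a negative return code such as `rc=-11` into a signal name.

```python
import re
import signal

# Matches e.g. "standalone <case> (rt=<rt>, dev=2) [FAIL rc=-11 57.8s, ..."
RESULT_RE = re.compile(
    r"standalone (?P<case>\w+)\s+\(rt=(?P<rt>\w+), dev=(?P<dev>\d+)\)\s+"
    r"\[(?P<status>PASS|FAIL)(?: rc=(?P<rc>-?\d+))? (?P<secs>[\d.]+)s"
)
# Matches e.g. "HUNG standalone <case> (rt=..."
HUNG_RE = re.compile(r"HUNG standalone (?P<case>\w+)")


def describe_rc(rc):
    """Turn a return code into a readable label (negative means a signal)."""
    if rc < 0:
        try:
            return signal.Signals(-rc).name
        except ValueError:
            return f"signal {-rc}"
    return f"exit {rc}"


def summarize(log_text):
    """Return (case, status, detail) rows: PASS/FAIL rows first, then HUNG."""
    flat = " ".join(log_text.split())  # undo line wrapping in the log
    rows = []
    for m in RESULT_RE.finditer(flat):
        rc = m.group("rc")
        detail = describe_rc(int(rc)) if rc is not None else ""
        rows.append((m.group("case"), m.group("status"), detail))
    for m in HUNG_RE.finditer(flat):
        rows.append((m.group("case"), "HUNG", ""))
    return rows
```

Applied to the two excerpts above, this would report one FAIL row with a signal name for CI #2723, and one PASS row plus one HUNG row for CI #2749.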

Git Commit ID

825f0fd

CANN Version

No response

Driver Version

No response

Host Platform

Linux (x86_64)

Additional Context

No response
