Platform
All / Unknown
Runtime Variant
tensormap_and_ringbuffer
Description
The current fatal handling flow in `tensormap_and_ringbuffer` is inconsistent. As a result, "the orchestration has already entered fatal" and "the scheduler can observe the fatal condition and exit" are not guaranteed to stay in sync.
Confirmed problems include:
- Some runtime fatal paths only set the local `orch->fatal` flag and do not publish a non-zero shared-memory `orch_error_code`
- Some orchestration helpers still forward calls into runtime after fatal instead of short-circuiting consistently
- Some `alloc_tensors(...)` paths still go through `always_assert` when already fatal or when arguments are invalid, instead of following the same fatal behavior as other paths
This leads to two observable classes of failures:
- The orchestration thread has already decided the run cannot continue, but scheduler threads never receive a fatal broadcast and therefore do not follow the unified exit path
- Repeated helper calls after fatal do not behave consistently: some become no-ops, some still proceed, and some assert immediately
Concrete gaps visible in the current code include:
- Timeout paths in `get_tensor_data(...)` / `set_tensor_data(...)`
- The submit path under `require_sync_start` that is guaranteed to deadlock
- Invalid-argument and already-fatal handling in `alloc_tensors(...)`
Steps to Reproduce
1. Run an orchestration using the `a5` `tensormap_and_ringbuffer` runtime.
2. Trigger any confirmed fatal path, for example:
- Make `get_tensor_data(...)` or `set_tensor_data(...)` hit a timeout.
- Trigger the invalid configuration branch of `require_sync_start`.
- Call `alloc_tensors(...)` again after runtime is already fatal, or pass invalid arguments to it.
3. Observe the local orchestration fatal state, the shared-memory `orch_error_code`, and the behavior of subsequent helper calls.
Expected Behavior
- Once fatal is entered, the runtime should follow a single, consistent exit semantic.
- Every fatal path that is supposed to trigger system-level termination should make the scheduler observe a non-zero `orch_error_code`.
- Repeated orchestration helper calls after fatal should behave consistently and predictably, without diverging into continued forwarding or immediate asserts.
Actual Behavior
- Some paths only set the local `orch->fatal` flag and do not publish a shared-memory error code.
- Some helpers still call into runtime after fatal instead of short-circuiting consistently.
- Some `alloc_tensors(...)` paths hit `always_assert` directly instead of converging onto the fatal semantics.
- In practice, fatal handling in runtime does not form a single closed loop: scheduler exit visibility and API behavior can each diverge.
Git Commit ID
5f5a74281519451414d2090aad483ad202437707
CANN Version
N/A
Driver Version
N/A
Host Platform
Other (issue identified by code inspection; not host-specific)
Additional Context
- Relevant code areas include:
  - `src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.cpp`
  - `src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp`
  - `src/a5/runtime/tensormap_and_ringbuffer/orchestration/pto_orchestration_api.h`
- The issue is described from the current implementation behavior only. It intentionally does not include a proposed fix.