
[Bug] Fatal reporting is inconsistent across orchestration API and runtime paths #505

@jvjhfhg

Platform

All / Unknown

Runtime Variant

tensormap_and_ringbuffer

Description

The fatal-handling flow in the tensormap_and_ringbuffer runtime is currently inconsistent. As a result, two conditions that should always hold together — "the orchestration has already entered fatal" and "the scheduler can observe the fatal condition and exit" — are not guaranteed to stay in sync.

Confirmed problems include:

  • Some runtime fatal paths only set the local orch->fatal flag and never publish a non-zero orch_error_code to shared memory
  • Some orchestration helpers still forward calls into the runtime after fatal instead of short-circuiting consistently
  • Some alloc_tensors(...) paths still go through always_assert when the runtime is already fatal or the arguments are invalid, instead of following the same fatal behavior as other paths

This leads to two observable classes of failures:

  • The orchestration thread has already decided the run cannot continue, but scheduler threads never receive a fatal broadcast and therefore do not follow the unified exit path
  • Repeated helper calls after fatal do not behave consistently: some become no-ops, some still proceed, and some assert immediately

Concrete gaps visible in the current code include:

  • Timeout paths in get_tensor_data(...) / set_tensor_data(...)
  • The submit path under require_sync_start that is guaranteed to deadlock
  • Invalid-argument and already-fatal handling in alloc_tensors(...)

Steps to Reproduce

1. Run an orchestration using the `a5` `tensormap_and_ringbuffer` runtime.
2. Trigger any confirmed fatal path, for example:
   - Make `get_tensor_data(...)` or `set_tensor_data(...)` hit a timeout.
   - Trigger the invalid configuration branch of `require_sync_start`.
   - Call `alloc_tensors(...)` again after runtime is already fatal, or pass invalid arguments to it.
3. Observe the local orchestration fatal state, the shared-memory `orch_error_code`, and the behavior of subsequent helper calls.

Expected Behavior

  • Once fatal is entered, the runtime should follow a single, consistent exit semantic.
  • Every fatal path that is supposed to trigger system-level termination should make the scheduler observe a non-zero orch_error_code.
  • Repeated orchestration helper calls after fatal should behave consistently and predictably, without diverging into continued forwarding or immediate asserts.

Actual Behavior

  • Some paths only set local orch->fatal and do not publish a shared-memory error code.
  • Some helpers still call into runtime after fatal instead of short-circuiting consistently.
  • Some alloc_tensors(...) paths hit always_assert directly instead of converging onto the fatal semantics.
  • In practice, fatal handling in the runtime does not converge on a single unified path, so scheduler exit visibility and API behavior can diverge.

Git Commit ID

5f5a74281519451414d2090aad483ad202437707

CANN Version

N/A

Driver Version

N/A

Host Platform

Other (issue identified by code inspection; not host-specific)

Additional Context

  • Relevant code areas include:
    • src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.cpp
    • src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
    • src/a5/runtime/tensormap_and_ringbuffer/orchestration/pto_orchestration_api.h
  • The issue is described from the current implementation behavior only. It intentionally does not include a proposed fix.

Metadata

Labels: bug (Something isn't working)
Status: Done
Milestone: No milestone