
[Bug] Fatal reporting is inconsistent across orchestration API and runtime paths #505

@jvjhfhg

Platform

All / Unknown

Runtime Variant

tensormap_and_ringbuffer

Description

The fatal-handling flow in the tensormap_and_ringbuffer runtime is currently inconsistent. As a result, two conditions that should always hold together — "the orchestration has already entered fatal" and "the scheduler can observe the fatal condition and exit" — are not guaranteed to stay in sync.

Confirmed problems include:

  • Some runtime fatal paths only set the local orch->fatal flag and never publish a non-zero orch_error_code to shared memory
  • Some orchestration helpers still forward calls into the runtime after fatal instead of short-circuiting consistently
  • Some alloc_tensors(...) paths still go through always_assert when the runtime is already fatal or the arguments are invalid, instead of following the same fatal behavior as other paths

This leads to two observable classes of failures:

  • The orchestration thread has already decided the run cannot continue, but scheduler threads never receive a fatal broadcast and therefore do not follow the unified exit path
  • Repeated helper calls after fatal do not behave consistently: some become no-ops, some still proceed, and some assert immediately

Concrete gaps visible in the current code include:

  • Timeout paths in get_tensor_data(...) / set_tensor_data(...)
  • The submit path under require_sync_start that is guaranteed to deadlock
  • Invalid-argument and already-fatal handling in alloc_tensors(...)

Steps to Reproduce

1. Run an orchestration using the `a5` `tensormap_and_ringbuffer` runtime.
2. Trigger any confirmed fatal path, for example:
   - Make `get_tensor_data(...)` or `set_tensor_data(...)` hit a timeout.
   - Trigger the invalid configuration branch of `require_sync_start`.
   - Call `alloc_tensors(...)` again after runtime is already fatal, or pass invalid arguments to it.
3. Observe the local orchestration fatal state, the shared-memory `orch_error_code`, and the behavior of subsequent helper calls.

Expected Behavior

  • Once fatal is entered, the runtime should follow a single, consistent exit semantic.
  • Every fatal path that is supposed to trigger system-level termination should make the scheduler observe a non-zero orch_error_code.
  • Repeated orchestration helper calls after fatal should behave consistently and predictably, without diverging into continued forwarding or immediate asserts.

Actual Behavior

  • Some paths only set local orch->fatal and do not publish a shared-memory error code.
  • Some helpers still call into runtime after fatal instead of short-circuiting consistently.
  • Some alloc_tensors(...) paths hit always_assert directly instead of converging onto the fatal semantics.
  • In practice, fatal handling in the runtime does not converge on a single unified path, so scheduler exit visibility and API behavior can diverge.

Git Commit ID

5f5a74281519451414d2090aad483ad202437707

CANN Version

N/A

Driver Version

N/A

Host Platform

Other (issue identified by code inspection; not host-specific)

Additional Context

  • Relevant code areas include:
    • src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.cpp
    • src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
    • src/a5/runtime/tensormap_and_ringbuffer/orchestration/pto_orchestration_api.h
  • The issue is described from the current implementation behavior only. It intentionally does not include a proposed fix.

Metadata

Labels: bug (Something isn't working)
Status: Done
Milestone: No milestone