29 changes: 15 additions & 14 deletions docs/dfx/dep_gen.md
@@ -74,10 +74,10 @@ is `--enable-dep-gen`:

```bash
# Standalone
-python test_my_case.py --platform a2a3 --enable-dep-gen --enable-l2-swimlane
+python test_my_case.py --platform <a2a3|a5> --enable-dep-gen --enable-l2-swimlane

# Pytest
-pytest tests/st/... --platform a2a3 --enable-dep-gen --enable-l2-swimlane
+pytest tests/st/... --platform <a2a3|a5> --enable-dep-gen --enable-l2-swimlane
```

The `--enable-l2-swimlane` flag is independent but recommended in pair
@@ -295,10 +295,11 @@ underlying task-pair count.
| `task.fanout[]` (L2PerfRecord) | Successors known at producer-retire time | **Yes** — sealed when producer retires |
| `deps.json` (this feature) | Every consumer → producer reachable via tensormap / explicit_deps | No — replay sees every submit |

-`tests/st/a2a3/tensormap_and_ringbuffer/dep_gen_capture/test_dep_gen_capture.py`
-enforces `fanout ⊆ deps` as a validation gate: any edge in fanout that
-is missing from deps is a replay-side regression and fails the test.
-Cases where `deps - fanout ≠ ∅` are the dep_gen sweet spot — those are
+`tests/st/{a2a3,a5}/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen.py`
+enforces the 6-edge expectation against the `vector_example`
+orchestration as a validation gate: any edge missing from `deps.json`
+is a replay-side regression and fails the test. Cases where
+`deps - fanout ≠ ∅` are the dep_gen sweet spot — those are
exactly the race-window edges fanout dropped. The
`swimlane_converter.py` uses `deps.json` (when present) as the source
of flow events in the Perfetto trace, and flags any edge whose
@@ -350,13 +351,13 @@ list; only the dep_gen replay graph loses the tail.

| Layer | File | Role |
| ----- | ---- | ---- |
-| Shared-mem layout | `src/a2a3/platform/include/common/dep_gen.h` | `DepGenRecord` (2624 B base, cache-line aligned, ≤64 inline explicit_deps) + `DepGenOverflowRecord` chain view (≤326 deps per slot) + SPSC ring + per-thread ready queue |
-| AICPU writer | `src/a2a3/platform/{include,src}/aicpu/dep_gen_collector_aicpu.{h,cpp}` | Single-instance write path; weak-fallback exported to host build |
-| Host collector | `src/a2a3/platform/{include/host,src/host}/dep_gen_collector.{h,cpp}` | `ProfilerBase<DepGenCollector, DepGenModule>` — drains ring → `records_` vector |
-| Capture call site | `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp` `submit_task_common` | One conditional block that snapshots inputs into the ring when `is_dep_gen_enabled()`; fires for both `submit_task` and `submit_dummy_task`. Dep-only tasks land in the record stream with valid tensor/dep info but no kernel_id field (the schema does not carry kernel_id), so replay treats them as ordinary dep nodes — viewers do not currently distinguish dummy from real tasks. |
-| Replay | `src/a2a3/runtime/tensormap_and_ringbuffer/host/dep_gen_replay.{h,cpp}` | Pure CPU; runs dual-pass differential replay — `compute_task_fanin` (oracle) + inlined STEP A/B mirror (annotated) against two `PTO2TensorMap` instances. Emits `deps.json` when both passes agree per record. |
-| Device-runner hookup | `src/a2a3/platform/{onboard,sim}/host/device_runner.cpp` | post-`reconcile_counters` calls `dep_gen_replay_emit_deps_json(records.data(), records.size(), deps_path, nullptr)` |
+| Shared-mem layout | `src/{a2a3,a5}/platform/include/common/dep_gen.h` | `DepGenRecord` (2624 B base, cache-line aligned, ≤64 inline explicit_deps) + `DepGenOverflowRecord` chain view (≤326 deps per slot) + SPSC ring + per-thread ready queue. Byte-identical layout across platforms. |
+| AICPU writer | `src/{a2a3,a5}/platform/{include,src}/aicpu/dep_gen_collector_aicpu.{h,cpp}` | Single-instance write path; weak-fallback exported to host build. a5 reuses the a2a3 source verbatim — the writer accesses its own device-side view of shared memory, independent of how host↔device transport is implemented. |
+| Host collector | `src/{a2a3,a5}/platform/{include/host,src/host}/dep_gen_collector.{h,cpp}` | `ProfilerBase<DepGenCollector, DepGenModule>` — drains ring → `records_` vector. a5 specializes the cpp for "no SVM": `alloc_single_buffer` malloc's a host shadow + `profiling_copy_to_device`, `reconcile_counters` explicitly `copy_from_device`'s the BufferState before reading, and `finalize` lets `BufferPoolManager::clear_mappings()` release all shadows as the single source of truth. |
+| Capture call site | `src/{a2a3,a5}/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp` `submit_task_common` | One conditional block that snapshots inputs into the ring when `is_dep_gen_enabled()`; fires for both `submit_task` and `submit_dummy_task`. Dep-only tasks land in the record stream with valid tensor/dep info but no kernel_id field (the schema does not carry kernel_id), so replay treats them as ordinary dep nodes — viewers do not currently distinguish dummy from real tasks. |
+| Replay | `src/{a2a3,a5}/runtime/tensormap_and_ringbuffer/host/dep_gen_replay.{h,cpp}` | Pure CPU; runs dual-pass differential replay — `compute_task_fanin` (oracle) + inlined STEP A/B mirror (annotated) against two `PTO2TensorMap` instances. Emits `deps.json` when both passes agree per record. Platform-agnostic — a5 reuses the a2a3 source verbatim. |
+| Device-runner hookup | `src/{a2a3,a5}/platform/{onboard,sim}/host/device_runner.cpp` | post-`reconcile_counters` calls `dep_gen_replay_emit_deps_json(records.data(), records.size(), deps_path)` |
| Viewer | `simpler_setup/tools/deps_to_graph.py` | `deps.json` → pan/zoom HTML |
-| Test | `tests/st/a2a3/tensormap_and_ringbuffer/dep_gen_capture/test_dep_gen_capture.py` | Smoke test + `fanout ⊆ deps` validation gate |
+| Test | `tests/st/{a2a3,a5}/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen.py` + `test_dep_gen_chain.py` | Smoke test + 6-edge validation against `vector_example` orchestration (both platforms share byte-identical orchestration code). |

-Currently a2a3 only; an a5 port is planned.
+Supported on both a2a3 and a5. The a5 host collector differs from a2a3 only in its host↔device transport path (a5 has no SVM, so all transfers go through `profiling_copy_to_device` / `profiling_copy_from_device` instead of relying on `halHostRegister`'s shared mapping); the AICPU writer, shared-memory ABI, runtime call site, and replay are platform-agnostic.
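
The relationship between `task.fanout[]` and `deps.json` described in this file can be pictured as a set comparison. The sketch below is illustrative only: it assumes edges are already available as `(producer, consumer)` pairs of task ids, which is not the on-disk `deps.json` schema, and `compare_edges` is a hypothetical helper rather than part of the test suite.

```python
def compare_edges(fanout_edges, deps_edges):
    """Split edges into race-window extras and replay-side omissions.

    Both arguments are iterables of (producer_id, consumer_id) pairs.
    This pair format is an assumption made for illustration.
    """
    fanout = set(fanout_edges)
    deps = set(deps_edges)
    return {
        # Edges replay found that fanout dropped (the race-window case).
        "race_window": sorted(deps - fanout),
        # Edges fanout recorded that replay missed (a replay-side regression).
        "missing_from_deps": sorted(fanout - deps),
    }


if __name__ == "__main__":
    fanout = [(0, 1), (0, 2)]
    deps = [(0, 1), (0, 2), (1, 3)]
    print(compare_edges(fanout, deps))
```

A non-empty `missing_from_deps` list corresponds to the failure case the doc describes, while a non-empty `race_window` list is the expected benefit of replay.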
118 changes: 118 additions & 0 deletions src/a5/platform/include/aicpu/dep_gen_collector_aicpu.h
@@ -0,0 +1,118 @@
/*
* Copyright (c) PyPTO Contributors.
* This program is free software, you can redistribute it and/or modify it under the terms and conditions of
* CANN Open Software License Agreement Version 2.0 (the "License").
* Please refer to the License for details. You may not use this file except in compliance with the License.
* THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
* INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
* See LICENSE in the root of the software repository for the full text of the License.
* -----------------------------------------------------------------------------------------------------------
*/

/**
* @file dep_gen_collector_aicpu.h
* @brief AICPU-side dep_gen (SubmitTrace) capture interface
*
* Lifecycle (called from aicpu_executor.cpp + pto_orchestrator.cpp):
*   dep_gen_aicpu_set_orch_thread_idx() — record which AICPU thread runs the
*                                         orchestrator (used to select the
*                                         per-thread ready_queue on flush).
*   dep_gen_aicpu_init()                — pop the initial DepGenBuffer from
*                                         the (single) instance's free_queue.
*   [submit_task loop]
*     dep_gen_aicpu_record_submit()     — append one DepGenRecord; rotate
*                                         buffer when full.
*   dep_gen_aicpu_flush()               — push current buffer (if non-empty)
*                                         to ready_queue.
*   dep_gen_aicpu_finalize()            — clear bookkeeping.
*
* All-primitive interface (no runtime types in platform header):
* - task_id passed as raw uint64 (PTO2TaskId::raw)
* - tensor data passed via opaque void* pointers (memcpy'd into the
* DEP_GEN_TENSOR_SIZE-byte slot; static_asserted against sizeof(Tensor)
* in the .cpp)
* - explicit_deps passed as uint64*
*
* No-op when dep_gen is disabled (is_dep_gen_enabled() returns false).
*/

#ifndef PLATFORM_AICPU_DEP_GEN_COLLECTOR_AICPU_H_
#define PLATFORM_AICPU_DEP_GEN_COLLECTOR_AICPU_H_

#include <cstdint>

#include "common/dep_gen.h"

extern "C" void set_platform_dep_gen_base(uint64_t dep_gen_data_base);
extern "C" uint64_t get_platform_dep_gen_base();
extern "C" void set_dep_gen_enabled(bool enable);
extern "C" bool is_dep_gen_enabled();

/**
* Register the AICPU thread index that hosts the orchestrator. Used to select
* the per-thread ready_queue when buffers fill or on flush. Must be called by
* aicpu_executor.cpp before any dep_gen_aicpu_record_submit() can fire.
*
* Mirrors l2_perf_aicpu_set_orch_thread_idx().
*/
void dep_gen_aicpu_set_orch_thread_idx(int thread_idx);

/**
* Initialize dep_gen capture: pop the initial DepGenBuffer from the (single)
* orchestrator instance's free_queue and stash it as the current buffer.
*
* Pre-conditions:
* - Host has set the data base via set_platform_dep_gen_base()
* - dep_gen is enabled via set_dep_gen_enabled(true)
* - dep_gen_aicpu_set_orch_thread_idx() has been called
*
* If the free_queue is empty at init (host bug), the function leaves the
* current buffer as null and subsequent record_submit calls will bump
* dropped_record_count.
*/
void dep_gen_aicpu_init();

/**
* Append a base DepGenRecord (and zero or more DepGenOverflowRecord chain
* records) for a completed submit_task call. Switches buffer via the SPSC
* free_queue / ready_queue protocol when the current buffer cannot hold the
* full chain. No-op if dep_gen is disabled.
*
* Tensor handling: for slot i, if tensor_ptrs[i] is non-null, its first
* DEP_GEN_TENSOR_SIZE bytes are memcpy'd into record.tensors[i]. If null
* (e.g. arg_types[i] == OUTPUT, where the Tensor is materialized later by
* the runtime), the slot is left zeroed. Replay decides what to do with
* each slot based on arg_types[i].
*
* Dep handling: the first DEP_GEN_MAX_EXPLICIT_DEPS deps land in the base
* record; any excess spills into a chain of DepGenOverflowRecord slots. A
* submit whose chain would exceed the buffer's remaining capacity (even
* after switch) is truncated to fit; the dropped tail is logged.
*
* @param task_id_raw PTO2TaskId::raw (the assigned task_id for this submit)
* @param in_manual_scope true iff the submit happened inside a manual scope
* @param tensor_count Number of slots in tensor_ptrs / arg_types (≤ CORE_MAX_TENSOR_ARGS)
* @param tensor_ptrs Per-slot Tensor pointer (nullptr to skip the slot)
* @param arg_types Per-slot TensorArgType (interpreted as raw byte)
* @param explicit_dep_count Number of explicit_deps — no static cap; truncated only when the
* chain would not fit in a single DepGenBuffer
* @param explicit_deps_raw Per-dep PTO2TaskId::raw (length = explicit_dep_count)
*/
void dep_gen_aicpu_record_submit(
    uint64_t task_id_raw, bool in_manual_scope, int tensor_count, const void *const *tensor_ptrs,
    const uint8_t *arg_types, int explicit_dep_count, const uint64_t *explicit_deps_raw
);

/**
* Push the current (partially-filled) DepGenBuffer to the orchestrator
* thread's ready_queue so the host can pick it up. Called once at end of
* run, after the orchestrator's last submit.
*/
void dep_gen_aicpu_flush();

/**
* Clear file-local bookkeeping (current_buf cache, etc.). Called at shutdown.
*/
void dep_gen_aicpu_finalize();

#endif // PLATFORM_AICPU_DEP_GEN_COLLECTOR_AICPU_H_
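
The dep-handling rule documented for `dep_gen_aicpu_record_submit()` comes down to simple slot arithmetic. The following is a minimal Python sketch, assuming the limits quoted in `docs/dfx/dep_gen.md` (64 inline deps in the base record, up to 326 deps per overflow record) and assuming each record occupies exactly one buffer slot. The names `chain_slots` and `deps_kept` and the `buffer_slots` parameter are hypothetical, and the real writer's truncation logic may differ in detail.

```python
INLINE_DEPS = 64          # from the layout table: <=64 inline explicit_deps
DEPS_PER_OVERFLOW = 326   # from the layout table: <=326 deps per overflow slot


def chain_slots(explicit_dep_count):
    """Slots used by one submit: one base record plus overflow records."""
    if explicit_dep_count < 0:
        raise ValueError("explicit_dep_count must be non-negative")
    spill = max(0, explicit_dep_count - INLINE_DEPS)
    overflow_records = -(-spill // DEPS_PER_OVERFLOW)  # ceiling division
    return 1 + overflow_records


def deps_kept(explicit_dep_count, buffer_slots):
    """Deps that survive truncation to an empty buffer of `buffer_slots` slots.

    `buffer_slots` is a hypothetical capacity; the header only says the
    chain is truncated when it would not fit in a single DepGenBuffer.
    """
    if buffer_slots < 1:
        return 0
    capacity = INLINE_DEPS + (buffer_slots - 1) * DEPS_PER_OVERFLOW
    return min(explicit_dep_count, capacity)


if __name__ == "__main__":
    for n in (0, 64, 65, 390, 391):
        print(n, chain_slots(n))
```

Under these assumptions, a submit with 65 explicit deps needs one base record plus one overflow record, and a third slot is only needed once the count exceeds 64 + 326 = 390.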