hw-native-sys · indigo1973 · May 28, 2026
diff --git a/docs/dfx/dep_gen.md b/docs/dfx/dep_gen.md
@@ -74,10 +74,10 @@ is `--enable-dep-gen`:
 
 ```bash
 # Standalone
-python test_my_case.py --platform a2a3 --enable-dep-gen --enable-l2-swimlane
+python test_my_case.py --platform <a2a3|a5> --enable-dep-gen --enable-l2-swimlane
 
 # Pytest
-pytest tests/st/... --platform a2a3 --enable-dep-gen --enable-l2-swimlane
+pytest tests/st/... --platform <a2a3|a5> --enable-dep-gen --enable-l2-swimlane
 ```
 
 The `--enable-l2-swimlane` flag is independent but recommended in pair
@@ -295,10 +295,11 @@ underlying task-pair count.
 | `task.fanout[]` (L2PerfRecord) | Successors known at producer-retire time | **Yes** — sealed when producer retires |
 | `deps.json` (this feature) | Every consumer → producer reachable via tensormap / explicit_deps | No — replay sees every submit |
 
-`tests/st/a2a3/tensormap_and_ringbuffer/dep_gen_capture/test_dep_gen_capture.py`
-enforces `fanout ⊆ deps` as a validation gate: any edge in fanout that
-is missing from deps is a replay-side regression and fails the test.
-Cases where `deps - fanout ≠ ∅` are the dep_gen sweet spot — those are
+`tests/st/{a2a3,a5}/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen.py`
+enforces the 6-edge expectation against the `vector_example`
+orchestration as a validation gate: any edge missing from `deps.json`
+is a replay-side regression and fails the test. Cases where
+`deps - fanout ≠ ∅` are the dep_gen sweet spot — those are
 exactly the race-window edges fanout dropped. The
 `swimlane_converter.py` uses `deps.json` (when present) as the source
 of flow events in the Perfetto trace, and flags any edge whose
@@ -350,13 +351,13 @@ list; only the dep_gen replay graph loses the tail.
 
 | Layer | File | Role |
 | ----- | ---- | ---- |
-| Shared-mem layout | `src/a2a3/platform/include/common/dep_gen.h` | `DepGenRecord` (2624 B base, cache-line aligned, ≤64 inline explicit_deps) + `DepGenOverflowRecord` chain view (≤326 deps per slot) + SPSC ring + per-thread ready queue |
-| AICPU writer | `src/a2a3/platform/{include,src}/aicpu/dep_gen_collector_aicpu.{h,cpp}` | Single-instance write path; weak-fallback exported to host build |
-| Host collector | `src/a2a3/platform/{include/host,src/host}/dep_gen_collector.{h,cpp}` | `ProfilerBase<DepGenCollector, DepGenModule>` — drains ring → `records_` vector |
-| Capture call site | `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp` `submit_task_common` | One conditional block that snapshots inputs into the ring when `is_dep_gen_enabled()`; fires for both `submit_task` and `submit_dummy_task`. Dep-only tasks land in the record stream with valid tensor/dep info but no kernel_id field (the schema does not carry kernel_id), so replay treats them as ordinary dep nodes — viewers do not currently distinguish dummy from real tasks. |
-| Replay | `src/a2a3/runtime/tensormap_and_ringbuffer/host/dep_gen_replay.{h,cpp}` | Pure CPU; runs dual-pass differential replay — `compute_task_fanin` (oracle) + inlined STEP A/B mirror (annotated) against two `PTO2TensorMap` instances. Emits `deps.json` when both passes agree per record. |
-| Device-runner hookup | `src/a2a3/platform/{onboard,sim}/host/device_runner.cpp` | post-`reconcile_counters` calls `dep_gen_replay_emit_deps_json(records.data(), records.size(), deps_path, nullptr)` |
+| Shared-mem layout | `src/{a2a3,a5}/platform/include/common/dep_gen.h` | `DepGenRecord` (2624 B base, cache-line aligned, ≤64 inline explicit_deps) + `DepGenOverflowRecord` chain view (≤326 deps per slot) + SPSC ring + per-thread ready queue. Byte-identical layout across platforms. |
+| AICPU writer | `src/{a2a3,a5}/platform/{include,src}/aicpu/dep_gen_collector_aicpu.{h,cpp}` | Single-instance write path; weak-fallback exported to host build. a5 reuses the a2a3 source verbatim — the writer accesses its own device-side view of shared memory, independent of how host↔device transport is implemented. |
+| Host collector | `src/{a2a3,a5}/platform/{include/host,src/host}/dep_gen_collector.{h,cpp}` | `ProfilerBase<DepGenCollector, DepGenModule>`: drains ring → `records_` vector. a5 specializes the `.cpp` for the no-SVM case: `alloc_single_buffer` mallocs a host shadow and calls `profiling_copy_to_device`; `reconcile_counters` explicitly calls `copy_from_device` on the BufferState before reading it; and `finalize` leaves shadow release to `BufferPoolManager::clear_mappings()`, which is the single source of truth. |
+| Capture call site | `src/{a2a3,a5}/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp` `submit_task_common` | One conditional block that snapshots inputs into the ring when `is_dep_gen_enabled()`; fires for both `submit_task` and `submit_dummy_task`. Dep-only tasks land in the record stream with valid tensor/dep info but no kernel_id field (the schema does not carry kernel_id), so replay treats them as ordinary dep nodes — viewers do not currently distinguish dummy from real tasks. |
+| Replay | `src/{a2a3,a5}/runtime/tensormap_and_ringbuffer/host/dep_gen_replay.{h,cpp}` | Pure CPU; runs dual-pass differential replay — `compute_task_fanin` (oracle) + inlined STEP A/B mirror (annotated) against two `PTO2TensorMap` instances. Emits `deps.json` when both passes agree per record. Platform-agnostic — a5 reuses the a2a3 source verbatim. |
+| Device-runner hookup | `src/{a2a3,a5}/platform/{onboard,sim}/host/device_runner.cpp` | post-`reconcile_counters` calls `dep_gen_replay_emit_deps_json(records.data(), records.size(), deps_path)` |
 | Viewer | `simpler_setup/tools/deps_to_graph.py` | `deps.json` → pan/zoom HTML |
-| Test | `tests/st/a2a3/tensormap_and_ringbuffer/dep_gen_capture/test_dep_gen_capture.py` | Smoke test + `fanout ⊆ deps` validation gate |
+| Test | `tests/st/{a2a3,a5}/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen.py` + `test_dep_gen_chain.py` | Smoke test + 6-edge validation against `vector_example` orchestration (both platforms share byte-identical orchestration code). |
 
-Currently a2a3 only; an a5 port is planned.
+Supported on both a2a3 and a5. The a5 host collector differs from a2a3 only in its host↔device transport path (a5 has no SVM, so all transfers go through `profiling_copy_to_device` / `profiling_copy_from_device` instead of relying on `halHostRegister`'s shared mapping); the AICPU writer, shared-memory ABI, runtime call site, and replay are platform-agnostic.
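
The record-chain sizing in the shared-mem layout row (up to 64 inline explicit_deps in the base `DepGenRecord`, up to 326 per `DepGenOverflowRecord` slot) can be illustrated with a small calculation. This is only a sketch: the constant and function names are hypothetical and are not taken from `dep_gen.h`, and it ignores the truncation that applies when a chain does not fit in one buffer.

```cpp
#include <cassert>
#include <cstdint>

// Limits quoted from the layout row; the names are illustrative only.
constexpr uint32_t kInlineDeps = 64;        // explicit_deps stored in the base record
constexpr uint32_t kDepsPerOverflow = 326;  // explicit_deps per overflow slot

// Overflow slots needed for `dep_count` explicit deps, before any truncation.
constexpr uint32_t overflow_slots_needed(uint32_t dep_count) {
    if (dep_count <= kInlineDeps) {
        return 0;  // everything fits inline
    }
    const uint32_t spill = dep_count - kInlineDeps;
    return (spill + kDepsPerOverflow - 1) / kDepsPerOverflow;  // ceiling division
}

// Runtime spot checks at the boundaries of the inline and first overflow slot.
inline void check_overflow_examples() {
    assert(overflow_slots_needed(64) == 0);
    assert(overflow_slots_needed(65) == 1);
    assert(overflow_slots_needed(64 + 326) == 1);
    assert(overflow_slots_needed(64 + 326 + 1) == 2);
}
```

Under these assumptions a submit with 391 explicit deps occupies one base record plus two overflow slots.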
diff --git a/src/a5/platform/include/aicpu/dep_gen_collector_aicpu.h b/src/a5/platform/include/aicpu/dep_gen_collector_aicpu.h
@@ -0,0 +1,118 @@
+/*
+ * Copyright (c) PyPTO Contributors.
+ * This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+ * CANN Open Software License Agreement Version 2.0 (the "License").
+ * Please refer to the License for details. You may not use this file except in compliance with the License.
+ * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+ * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+ * See LICENSE in the root of the software repository for the full text of the License.
+ * -----------------------------------------------------------------------------------------------------------
+ */
+
+/**
+ * @file dep_gen_collector_aicpu.h
+ * @brief AICPU-side dep_gen (SubmitTrace) capture interface
+ *
+ * Lifecycle (called from aicpu_executor.cpp + pto_orchestrator.cpp):
+ *   dep_gen_aicpu_set_orch_thread_idx() — record which AICPU thread runs the
+ *                                         orchestrator (used to select the
+ *                                         per-thread ready_queue on flush).
+ *   dep_gen_aicpu_init()                — pop the initial DepGenBuffer from
+ *                                         the (single) instance's free_queue.
+ *   [submit_task loop]
+ *     dep_gen_aicpu_record_submit()     — append one DepGenRecord; rotate
+ *                                         buffer when full.
+ *   dep_gen_aicpu_flush()               — push current buffer (if non-empty)
+ *                                         to ready_queue.
+ *   dep_gen_aicpu_finalize()            — clear bookkeeping.
+ *
+ * All-primitive interface (no runtime types in platform header):
+ *   - task_id passed as raw uint64 (PTO2TaskId::raw)
+ *   - tensor data passed via opaque void* pointers (memcpy'd into the
+ *     DEP_GEN_TENSOR_SIZE-byte slot; static_asserted against sizeof(Tensor)
+ *     in the .cpp)
+ *   - explicit_deps passed as uint64*
+ *
+ * No-op when dep_gen is disabled (is_dep_gen_enabled() returns false).
+ */
+
+#ifndef PLATFORM_AICPU_DEP_GEN_COLLECTOR_AICPU_H_
+#define PLATFORM_AICPU_DEP_GEN_COLLECTOR_AICPU_H_
+
+#include <cstdint>
+
+#include "common/dep_gen.h"
+
+extern "C" void set_platform_dep_gen_base(uint64_t dep_gen_data_base);
+extern "C" uint64_t get_platform_dep_gen_base();
+extern "C" void set_dep_gen_enabled(bool enable);
+extern "C" bool is_dep_gen_enabled();
+
+/**
+ * Register the AICPU thread index that hosts the orchestrator. Used to select
+ * the per-thread ready_queue when buffers fill or on flush. Must be called by
+ * aicpu_executor.cpp before any dep_gen_aicpu_record_submit() can fire.
+ *
+ * Mirrors l2_perf_aicpu_set_orch_thread_idx().
+ */
+void dep_gen_aicpu_set_orch_thread_idx(int thread_idx);
+
+/**
+ * Initialize dep_gen capture: pop the initial DepGenBuffer from the (single)
+ * orchestrator instance's free_queue and stash it as the current buffer.
+ *
+ * Pre-conditions:
+ *   - Host has set the data base via set_platform_dep_gen_base()
+ *   - dep_gen is enabled via set_dep_gen_enabled(true)
+ *   - dep_gen_aicpu_set_orch_thread_idx() has been called
+ *
+ * If the free_queue is empty at init (host bug), the function leaves the
+ * current buffer as null and subsequent record_submit calls will bump
+ * dropped_record_count.
+ */
+void dep_gen_aicpu_init();
+
+/**
+ * Append a base DepGenRecord (and zero or more DepGenOverflowRecord chain
+ * records) for a completed submit_task call. Switches buffer via the SPSC
+ * free_queue / ready_queue protocol when the current buffer cannot hold the
+ * full chain. No-op if dep_gen is disabled.
+ *
+ * Tensor handling: for slot i, if tensor_ptrs[i] is non-null, its first
+ * DEP_GEN_TENSOR_SIZE bytes are memcpy'd into record.tensors[i]. If null
+ * (e.g. arg_types[i] == OUTPUT, where the Tensor is materialized later by
+ * the runtime), the slot is left zeroed. Replay decides what to do with
+ * each slot based on arg_types[i].
+ *
+ * Dep handling: the first DEP_GEN_MAX_EXPLICIT_DEPS deps land in the base
+ * record; any excess spills into a chain of DepGenOverflowRecord slots. A
+ * submit whose chain would exceed the buffer's remaining capacity (even
+ * after switching buffers) is truncated to fit; the dropped tail is logged.
+ *
+ * @param task_id_raw         PTO2TaskId::raw (the assigned task_id for this submit)
+ * @param in_manual_scope     true iff the submit happened inside a manual scope
+ * @param tensor_count        Number of slots in tensor_ptrs / arg_types (≤ CORE_MAX_TENSOR_ARGS)
+ * @param tensor_ptrs         Per-slot Tensor pointer (nullptr to skip the slot)
+ * @param arg_types           Per-slot TensorArgType (interpreted as raw byte)
+ * @param explicit_dep_count  Number of explicit_deps — no static cap; truncated only when the
+ *                            chain would not fit in a single DepGenBuffer
+ * @param explicit_deps_raw   Per-dep PTO2TaskId::raw (length = explicit_dep_count)
+ */
+void dep_gen_aicpu_record_submit(
+    uint64_t task_id_raw, bool in_manual_scope, int tensor_count, const void *const *tensor_ptrs,
+    const uint8_t *arg_types, int explicit_dep_count, const uint64_t *explicit_deps_raw
+);
+
+/**
+ * Push the current (partially-filled) DepGenBuffer to the orchestrator
+ * thread's ready_queue so the host can pick it up. Called once at end of
+ * run, after the orchestrator's last submit.
+ */
+void dep_gen_aicpu_flush();
+
+/**
+ * Clear file-local bookkeeping (current_buf cache, etc.). Called at shutdown.
+ */
+void dep_gen_aicpu_finalize();
+
+#endif  // PLATFORM_AICPU_DEP_GEN_COLLECTOR_AICPU_H_
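
The tensor handling documented for `dep_gen_aicpu_record_submit()` (copy a fixed number of bytes when the slot pointer is non-null, leave the slot zeroed when it is null) can be sketched as below. This is not the collector's implementation: `SketchRecord`, `make_sketch_record`, and both constants are hypothetical stand-ins, and the values chosen for `DEP_GEN_TENSOR_SIZE` and `CORE_MAX_TENSOR_ARGS` are assumptions made only so the example compiles on its own.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>

// Assumed values; the real ones come from common/dep_gen.h and the runtime.
constexpr int kSketchTensorSize = 128;    // stand-in for DEP_GEN_TENSOR_SIZE
constexpr int kSketchMaxTensorArgs = 16;  // stand-in for CORE_MAX_TENSOR_ARGS

// Hypothetical, simplified record; the real DepGenRecord has more fields.
struct SketchRecord {
    uint64_t task_id_raw;
    bool in_manual_scope;
    int tensor_count;
    uint8_t arg_types[kSketchMaxTensorArgs];
    uint8_t tensors[kSketchMaxTensorArgs][kSketchTensorSize];
};

// Builds one record following the documented per-slot rule:
// non-null pointer -> memcpy the fixed-size blob, null pointer -> slot stays zeroed.
SketchRecord make_sketch_record(
    uint64_t task_id_raw, bool in_manual_scope, int tensor_count, const void *const *tensor_ptrs,
    const uint8_t *arg_types
) {
    SketchRecord rec{};  // value-initialization zeroes every slot up front
    rec.task_id_raw = task_id_raw;
    rec.in_manual_scope = in_manual_scope;
    rec.tensor_count = std::clamp(tensor_count, 0, kSketchMaxTensorArgs);
    for (int i = 0; i < rec.tensor_count; ++i) {
        rec.arg_types[i] = arg_types[i];
        if (tensor_ptrs[i] != nullptr) {
            std::memcpy(rec.tensors[i], tensor_ptrs[i], kSketchTensorSize);
        }
    }
    return rec;
}
```

A caller would pass `nullptr` for slots whose Tensor is materialized later (the OUTPUT case in the header comment) and leave the interpretation of each slot to replay, which keys off `arg_types[i]`.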