[WIP] Add: dep_gen capture+replay support on a5#886

Open
indigo1973 wants to merge 1 commit into hw-native-sys:main from indigo1973:dep_0527

Conversation

@indigo1973
Contributor

Port the dep_gen (SubmitTrace) feature from a2a3 to a5 so the tensormap_and_ringbuffer runtime on a5 can produce deps.json and feed flow events into swimlane_converter.py. Without this, --enable-dep-gen was a no-op on a5 and merged_swimlane_*.json had no dependency arrows.

Reused from a2a3 verbatim (byte-identical):

  • Shared-memory ABI: common/dep_gen.h (DepGenRecord 2624 B, overflow chain, SPSC free_queue, per-thread ready_queue)
  • AICPU writer: aicpu/dep_gen_collector_aicpu.{h,cpp}
  • Runtime replay: runtime/tensormap_and_ringbuffer/host/dep_gen_replay
  • Orchestrator capture point + aicpu_executor lifecycle hooks
  • 5 platform_config constants + PROFILING_FLAG_DEP_GEN bit

Specialized for a5 (no SVM, see profiling_common diff vs a2a3):

  • dep_gen_collector.cpp uses alloc_single_buffer (malloc shadow + profiling_copy_to_device) instead of identity-mapping when register_cb is null — matches a5's PMU/L2Perf/Dump collectors.
  • Two-phase set_memory_context: callbacks first, then shm pointers once the region is committed, so start(tf) gates correctly.
  • reconcile_counters explicitly copies the BufferState and current_buf back from the device before reading them (the mgmt thread is stopped by then).
  • finalize lets BufferPoolManager::clear_mappings() be the single source of truth for host-shadow lifetime — no per-collector dedup.

Sim path: dlsym set_platform_dep_gen_base / set_dep_gen_enabled out of the AICPU .so and forward kernel_args.dep_gen_data_base + enable flag at boot, mirroring the existing pmu / dump / l2_perf setters.

Onboard kernel.cpp adds two lines to forward dep_gen_data_base + PROFILING_FLAG_DEP_GEN into the AICPU writer's globals, mirroring the existing PMU / L2 / Dump setters.

c_api: run_prepared's enable_dep_gen parameter is no longer ignored — wired to runner->set_dep_gen_enabled() on both onboard and sim.

Tests:

  • tests/st/a5/.../dfx/dep_gen/test_dep_gen.py: 6-edge validation against vector_example orchestration (byte-identical to a2a3 — same expected edge set).
  • tests/st/a5/.../dfx/dep_gen/test_dep_gen_chain.py: overflow chain regression for >64 explicit deps.

Docs:

  - docs/dfx/dep_gen.md: §8 Architecture Touchpoints now lists both
    platforms; "Currently a2a3 only" line removed.
  - src/a5/runtime/.../docs/profiling_levels.md: Code Locations point
    at src/a5/ (was stale src/a2a3/ refs from PR hw-native-sys#777 cleanup) and
    add a dep_gen entry.
@coderabbitai

coderabbitai Bot commented May 28, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

This PR introduces a complete dependency-generation (DepGen) capture and replay system for the a5 platform. It enables offline analysis of orchestrator task submission graphs by capturing per-submit metadata (task IDs, tensor references, explicit dependencies) into device-resident buffers, transferring completed buffers to the host, and replaying them to generate a deps.json artifact validated against the tensormap engine.

Changes

DepGen Dependency Capture and Replay

Layer / File(s) Summary
Data contracts and platform configuration
src/a5/platform/include/common/dep_gen.h, src/a5/platform/include/common/kernel_args.h, src/a5/platform/include/common/platform_config.h
Defines shared-memory record layout (DepGenRecord, overflow chains, free/ready queues), buffer state, and profiling flags; adds dep_gen_data_base to KernelArgs and platform-sizing constants.
AICPU capture interface and implementation
src/a5/platform/include/aicpu/dep_gen_collector_aicpu.h, src/a5/platform/src/aicpu/dep_gen_collector_aicpu.cpp, src/a5/platform/onboard/aicpu/kernel.cpp
C-ABI capture interface; device-side lifecycle (init, record_submit with overflow handling, flush, finalize); wires base pointer and enable flag in kernel execution.
Orchestrator submission capture
src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp, src/a5/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
Snapshot task identity/tensors/deps before tensormap lookup; orchestrator init/flush/finalize lifecycle; fallback stubs for host builds.
Host-side buffer collection and state reconciliation
src/a5/platform/include/host/dep_gen_collector.h, src/a5/platform/src/host/dep_gen_collector.cpp, src/a5/platform/onboard/host/CMakeLists.txt, src/a5/platform/sim/host/CMakeLists.txt
ProfilerBase<DepGenCollector>-derived class allocating device/host buffer pairs; accumulates in-memory records; reconciles device counters for consistency checking.
Replay and deps.json generation
src/a5/runtime/tensormap_and_ringbuffer/host/dep_gen_replay.h, src/a5/runtime/tensormap_and_ringbuffer/host/dep_gen_replay.cpp
Consumes captured records; builds oracle and annotated tensormap instances in parallel; validates edge producer-id consistency; serializes JSON with tensor metadata and overlap flags.
Onboard device runner integration
src/a5/platform/onboard/host/device_runner.h, src/a5/platform/onboard/host/device_runner.cpp, src/a5/platform/onboard/host/pto_runtime_c_api.cpp
Enables dep-gen via C API flag; initializes collector and wires base pointer into kernel_args; starts/stops collection; triggers replay after reconciliation.
Simulation device runner integration
src/a5/platform/sim/host/device_runner.h, src/a5/platform/sim/host/device_runner.cpp, src/a5/platform/sim/host/pto_runtime_c_api.cpp
Dynamically resolves AICPU dep-gen control functions; conditionally enables profiling flag and collector; reconciles counters and replays with same lifecycle.
Test cases and orchestration kernels
tests/st/a5/tensormap_and_ringbuffer/dfx/dep_gen/kernels/orchestration/chain_barrier_orch.cpp, tests/st/a5/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen.py, tests/st/a5/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen_chain.py
Vector example validating 6-edge baseline; chain_barrier kernel and test validating overflow chains for 64–391 producers; schema sanity checks and deps_to_graph smoke tests.
Documentation updates
docs/dfx/dep_gen.md, src/a5/runtime/tensormap_and_ringbuffer/docs/profiling_levels.md
Enablement examples for a2a3|a5; architecture touchpoints mapping capture/replay/collection components; validation-gate expectations updated.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 A rabbit captured the task flows so true,
Device to host, dependencies brew,
With overflow chains and tensormap sight,
Dependencies written in JSON light,
SubmitTrace magic—the graph takes flight! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 41.49% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name | Status | Explanation
Description check | ✅ Passed | The description is directly related to the changeset, explaining the purpose (porting dep_gen from a2a3 to a5), what was reused, a5-specific adaptations, integration changes, tests, and documentation updates.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Title check | ✅ Passed | The title '[WIP] Add: dep_gen capture+replay support on a5' accurately summarizes the main objective of the changeset: implementing dep_gen capture and replay functionality on the a5 platform.


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request ports the dep_gen (SubmitTrace) capture and offline replay infrastructure to the a5 platform, mirroring the a2a3 implementation. It introduces the shared-memory structures, the AICPU writer, the host collector, and the host-side replay mechanism that performs a differential check to emit deps.json, along with corresponding integration tests. The review feedback highlights critical concurrency and memory management issues that must be addressed: several incorrect or missing memory barriers (wmb() and rmb()) in the AICPU writer could lead to stale reads on weakly-ordered architectures, a memory leak exists where std::malloc'd host shadows are not freed during finalization, and a data race on total_collected_ requires it to be declared as atomic.

Comment thread src/a5/platform/src/aicpu/dep_gen_collector_aicpu.cpp
Comment thread src/a5/platform/src/aicpu/dep_gen_collector_aicpu.cpp
Comment on lines +61 to +71
} else {
host_ptr = std::malloc(size);
if (host_ptr == nullptr) {
LOG_ERROR("DepGenCollector: host shadow alloc failed for %zu bytes", size);
free_cb_(dev_ptr);
if (host_ptr_out) *host_ptr_out = nullptr;
return nullptr;
}
std::memset(host_ptr, 0, size);
profiling_copy_to_device(dev_ptr, host_ptr, size);
}

high

When register_cb_ is null (which is always the case on the a5 platform), host shadows are allocated via std::malloc in alloc_single_buffer. However, in finalize(), manager_.clear_mappings() is called to clear the mappings, but it does not free the std::malloc'd host pointers because it cannot unconditionally call std::free without crashing registered mappings. This causes a silent and major memory leak of all host shadows and the SHM host region on every run. We should track the std::malloc'd host pointers in a private vector malloced_host_ptrs_ and free them in finalize().

Suggested change
     } else {
         host_ptr = std::malloc(size);
         if (host_ptr == nullptr) {
             LOG_ERROR("DepGenCollector: host shadow alloc failed for %zu bytes", size);
             free_cb_(dev_ptr);
             if (host_ptr_out) *host_ptr_out = nullptr;
             return nullptr;
         }
         std::memset(host_ptr, 0, size);
         profiling_copy_to_device(dev_ptr, host_ptr, size);
+        malloced_host_ptrs_.push_back(host_ptr);
     }

Comment on lines +313 to +314
// Free remaining host shadows (per-state buffers + shm region).
manager_.clear_mappings();

high

To prevent memory leaks of the host shadows allocated via std::malloc when register_cb_ == nullptr, we must explicitly free them in finalize() after clearing the mappings.

Suggested change
     // Free remaining host shadows (per-state buffers + shm region).
     manager_.clear_mappings();
+    for (void* ptr : malloced_host_ptrs_) {
+        std::free(ptr);
+    }
+    malloced_host_ptrs_.clear();
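
As an alternative to a raw pointer vector plus a manual free loop, ownership of the malloc'd shadows could be tied to a small RAII container. The following is a minimal sketch only: `HostShadowPool`, `FreeDeleter`, `alloc_zeroed`, and `release_all` are hypothetical names that do not appear in the PR, and the sketch ignores the device-side copy entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical helper, not part of the PR: owns malloc'd host shadows so
// each one is released exactly once, including on early-return paths.
struct FreeDeleter {
    void operator()(void *p) const noexcept { std::free(p); }
};

class HostShadowPool {
public:
    // Returns a zeroed buffer owned by the pool, or nullptr on failure.
    void *alloc_zeroed(std::size_t size) {
        assert(size > 0);
        std::unique_ptr<void, FreeDeleter> holder(std::malloc(size));
        if (!holder) return nullptr;
        std::memset(holder.get(), 0, size);
        void *raw = holder.get();
        owned_.push_back(std::move(holder));
        return raw;
    }

    // Number of shadows currently owned by the pool.
    std::size_t live() const { return owned_.size(); }

    // Frees every shadow; calling it again is a no-op.
    void release_all() { owned_.clear(); }

private:
    std::vector<std::unique_ptr<void, FreeDeleter>> owned_;
};
```

With this shape, finalize() would call `release_all()` after clearing the mappings, and the pool's destructor covers any path that skips finalize().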

Comment thread src/a5/platform/src/aicpu/dep_gen_collector_aicpu.cpp
Comment thread src/a5/platform/src/aicpu/dep_gen_collector_aicpu.cpp
Comment on lines +142 to +147
rmb();
uint32_t head = s_dep_gen_state->free_queue.head;
uint32_t tail = s_dep_gen_state->free_queue.tail;

if (head != tail) {
uint64_t buf_ptr = s_dep_gen_state->free_queue.buffer_ptrs[head % PLATFORM_DEP_GEN_SLOT_COUNT];

high

The read memory barrier rmb() must be placed after reading tail and before reading buffer_ptrs to prevent speculative loads of stale data on weakly-ordered architectures.

Suggested change
-    rmb();
     uint32_t head = s_dep_gen_state->free_queue.head;
     uint32_t tail = s_dep_gen_state->free_queue.tail;
     if (head != tail) {
+        rmb();
         uint64_t buf_ptr = s_dep_gen_state->free_queue.buffer_ptrs[head % PLATFORM_DEP_GEN_SLOT_COUNT];
References
  1. On weakly-ordered architectures, ensure that a read memory barrier is explicitly placed between an MMIO read and a subsequent Normal memory read when there is no data or address dependency.
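
The ordering rule behind this comment (publish the index only after the payload is written, and read the payload only after observing the index) can be illustrated in portable C++ with acquire/release atomics. This is a sketch under stated assumptions, not the PR's code: `SpscQueue`, `kSlots`, `push`, and `pop` are hypothetical, and the real free_queue lives in host/device shared memory where explicit `rmb()`/`wmb()` barriers are used instead of `std::atomic`.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical single-producer/single-consumer ring. The slot count is
// illustrative and must be a power of two so index wraparound stays consistent.
constexpr uint32_t kSlots = 4;

struct SpscQueue {
    std::array<uint64_t, kSlots> slots{};
    std::atomic<uint32_t> head{0};  // advanced by the consumer only
    std::atomic<uint32_t> tail{0};  // advanced by the producer only

    // Producer: write the payload first, then publish it through tail.
    bool push(uint64_t value) {
        uint32_t t = tail.load(std::memory_order_relaxed);
        uint32_t h = head.load(std::memory_order_acquire);
        if (t - h == kSlots) return false;  // full
        slots[t % kSlots] = value;
        tail.store(t + 1, std::memory_order_release);  // plays the role of wmb() + store
        return true;
    }

    // Consumer: observe tail first, then read the payload it guards.
    bool pop(uint64_t *out) {
        assert(out != nullptr);
        uint32_t h = head.load(std::memory_order_relaxed);
        uint32_t t = tail.load(std::memory_order_acquire);  // plays the role of load + rmb()
        if (h == t) return false;  // empty
        *out = slots[h % kSlots];
        head.store(h + 1, std::memory_order_release);
        return true;
    }
};
```

The acquire load of `tail` sits between the index read and the slot read, which is the position the suggested change gives to `rmb()`.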

// Running total of records appended. Equal to ``records_.size()`` after
// every append; kept separately for the reconcile_counters cross-check
// even when records_ may be inspected concurrently.
uint64_t total_collected_ = 0;

medium

The member variable total_collected_ is updated on the background management thread under records_mutex_ but read concurrently on other threads via the public getter total_collected() without any synchronization. This constitutes a data race under the C++ memory model and can lead to torn reads on 32-bit platforms or undefined behavior. Declaring total_collected_ as std::atomic<uint64_t> resolves this safely.

Suggested change
-    uint64_t total_collected_ = 0;
+    std::atomic<uint64_t> total_collected_{0};
+    std::vector<void*> malloced_host_ptrs_;
References
  1. When a component is accessed by multiple threads, use std::atomic with release-store and acquire-load semantics to establish a happens-before relationship.
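
A minimal sketch of the pattern this reference describes, assuming a writer that appends under a mutex and readers that poll the running total without taking it. `RecordSink`, `append`, and `total` are hypothetical stand-ins for the collector and its getter; only the counter handling is meant to carry over.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical stand-in for the collector; only the counter pattern matters.
class RecordSink {
public:
    // Writer side: the vector is protected by the mutex, the counter is
    // published with a release so lock-free readers see a consistent value.
    void append(int record) {
        std::lock_guard<std::mutex> lock(mutex_);
        records_.push_back(record);
        total_.fetch_add(1, std::memory_order_release);
    }

    // Reader side: safe to call from any thread without holding the mutex.
    uint64_t total() const { return total_.load(std::memory_order_acquire); }

private:
    std::mutex mutex_;
    std::vector<int> records_;
    std::atomic<uint64_t> total_{0};
};
```

Brace initialization (`total_{0}`) is used because copy-initializing a `std::atomic` member with `= 0` is only well-formed from C++17 onward.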


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 8

🧹 Nitpick comments (2)
tests/st/a5/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen_chain.py (1)

200-206: ⚡ Quick win

Tighten explicit-edge validation to reduce false positives.

Add a strict explicit-edge count (n + 1) and validate the barrier’s single outgoing explicit edge is not self-looping or targeting a producer.

Proposed fix
         # All N producer→barrier edges must be present. This is the chain
         # round-trip assertion: pre-chain code drops anything past index 63.
         assert len(barrier_preds) == n, f"barrier has {len(barrier_preds)} preds, expected {n}"
+        assert len(explicit_edges) == n + 1, (
+            f"expected exactly {n + 1} explicit edges (N producer->barrier + 1 barrier->consumer), "
+            f"got {len(explicit_edges)}"
+        )
 
         # Consumer must explicit-depend on the barrier — exactly one outgoing
         # explicit edge from the barrier.
         outgoing_explicit_from_barrier = {succ for pred, succ in explicit_edges if pred == barrier_id}
         assert len(outgoing_explicit_from_barrier) == 1, (
             f"barrier {barrier_id} has {len(outgoing_explicit_from_barrier)} outgoing explicit edges, "
             f"expected 1 (the consumer)"
         )
+        consumer_id = next(iter(outgoing_explicit_from_barrier))
+        assert consumer_id != barrier_id and consumer_id not in barrier_preds, (
+            f"barrier {barrier_id} outgoing explicit edge points to invalid consumer candidate {consumer_id}"
+        )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/st/a5/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen_chain.py`
around lines 200 - 206, Tighten the barrier explicit-edge validation: assert
that len(explicit_edges) equals n + 1 (N producer->barrier edges plus one
barrier->consumer edge), keep the assertion that outgoing_explicit_from_barrier
(computed from explicit_edges as shown) has exactly one element, then extract
that single consumer target and ensure it is not a self-loop (target !=
barrier_id) and not in barrier_preds; update the assertion messages to include
the expected count and offending targets using outgoing_explicit_from_barrier,
barrier_id, and barrier_preds for diagnostics.
src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp (1)

42-50: ⚡ Quick win

Add a compile-time guard for arg-slot-count parity.

This block asserts sizeof(Tensor), but the capture path also assumes MAX_TENSOR_ARGS == CORE_MAX_TENSOR_ARGS. If those ever drift, dep_gen will silently truncate/mis-shape records. A sibling static_assert would catch that at build time.

Suggested guard
 static_assert(sizeof(Tensor) == DEP_GEN_TENSOR_SIZE, "DepGenRecord::tensors slot size out of sync with sizeof(Tensor)");
+static_assert(MAX_TENSOR_ARGS == CORE_MAX_TENSOR_ARGS, "DepGen arg slot count out of sync with shared-memory ABI");
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp` around
lines 42 - 50, Add a compile-time check that ensures the capture path's assumed
arg-slot parity by static_assert-ing MAX_TENSOR_ARGS == CORE_MAX_TENSOR_ARGS so
drift is caught at build time; locate the existing Tensor size static_assert
near DepGenRecord/tensors in pto_orchestrator.cpp and add a sibling
static_assert referencing the macros MAX_TENSOR_ARGS and CORE_MAX_TENSOR_ARGS
with a clear error message like "tensor arg slot count mismatch: MAX_TENSOR_ARGS
!= CORE_MAX_TENSOR_ARGS".
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/a5/platform/onboard/host/device_runner.cpp`:
- Around line 753-761: The code currently only logs when
dep_gen_replay_emit_deps_json fails; instead treat that as a fatal error: after
calling dep_gen_replay_emit_deps_json (inside the enable_dep_gen_ block where
dep_gen_collector_.reconcile_counters() is true), check replay_rc and on
non-zero either return a non-zero error code from this function (or set the
overall run/exit status variable and bail out) so the caller sees failure;
update the branch that calls dep_gen_replay_emit_deps_json (symbols:
enable_dep_gen_, dep_gen_collector_.reconcile_counters(),
dep_gen_collector_.records(), make_deps_json_path(),
dep_gen_replay_emit_deps_json) to propagate the error instead of only LOG_ERROR.

In `@src/a5/platform/sim/host/device_runner.cpp`:
- Around line 709-717: The dep-gen path currently logs an error but does not
stop the simulation when dep_gen_replay_emit_deps_json fails; update the block
in device_runner.cpp (around enable_dep_gen_, dep_gen_collector_.stop(),
dep_gen_collector_.reconcile_counters(), make_deps_json_path,
dep_gen_replay_emit_deps_json) so that if replay_rc != 0 you propagate failure
instead of only logging: e.g., set the function/result state to indicate failure
(return an error code or throw an exception / call the routine that aborts the
run used by this module) so the caller sees the failure and the sim run
terminates when deps.json emission fails.

In `@src/a5/platform/src/aicpu/dep_gen_collector_aicpu.cpp`:
- Around line 72-75: The host-visible fields are being published before the
payload/reset is visible; move the publish points so that the payload and reset
are fully committed before updating host-visible pointers: ensure the per-queue
entry fields (s_dep_gen_header->queues[q][current_tail].instance_index,
.buffer_ptr, .buffer_seq) and buffer reset (current_buf_ptr, buf->count) are
written, then execute the memory barrier (wmb()) and only after that assign the
host-visible tail pointer (s_dep_gen_header->queue_tails[q]) and any other
host-visible pointer updates; apply the same reorder to the other occurrences
involving s_dep_gen_header, queue_tails, current_buf_ptr and buf->count (also at
the other noted spots).

In `@src/a5/platform/src/host/dep_gen_collector.cpp`:
- Line 38: DepGenCollector::init() must roll back any successfully created
buffers/mappings if a later alloc_single_buffer() fails: on error, iterate the
list of already-allocated entries (the same containers used in init()),
unmap/free each buffer, close any fds, and remove/clear those entries so no
partial state remains; ensure initialized_ stays false and any temporary
resources are released (or alternatively set initialized_ = true and call
finalize() only after the partial state has been made consistent) so the
destructor/stop() path won't leak. Reference symbols: DepGenCollector::init(),
alloc_single_buffer(), finalize(), DepGenCollector::~DepGenCollector(),
initialized_, stop().

In `@src/a5/runtime/tensormap_and_ringbuffer/host/dep_gen_replay.cpp`:
- Around line 568-607: The code currently logs orphan/unterminated overflow
chains inside the loop (checking DEP_GEN_FLAG_OVERFLOW and matching task_id) and
then continues, allowing a truncated deps list to be used; instead, when
encountering these malformed chains (or when chain_complete is false after
scanning overflow records) abort the replay immediately rather than proceeding:
in the block handling orphan overflow and the block handling unterminated
chains, replace the LOG_ERROR-only behavior with a hard failure (e.g., return an
error status or throw an exception) from the enclosing function so that
DepGenRecord/DepGenOverflowRecord chains that are malformed do not lead to using
full_deps_buf/deps_data; ensure the failure path prevents setting deps_data and
propagates a clear error for the caller to detect.
- Around line 540-543: The loop currently reinterprets raw bytes in
DepGenRecord::tensors as const Tensor* which risks object-lifetime UB; instead
allocate a temporary array/vector of real Tensor objects (e.g.
std::vector<Tensor> replay_tensors(tc) or an aligned buffer of Tensor) and for
each i do a memcpy(&replay_tensors[i], &rec.tensors[i][0], sizeof(Tensor)) to
materialize a real aligned Tensor object, then set tref_buf[i].ptr =
&replay_tensors[i] and atype_buf[i] as before (ensure replay_tensors lives long
enough for the replay usage).
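
The memcpy-based materialization described in the item above could look roughly like the following. This is a sketch under assumptions: the `Tensor` fields, `RawRecord`, `kMaxTensors`, and `materialize_tensors` are hypothetical, since the real Tensor and DepGenRecord layouts are defined by the runtime headers and are not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical stand-ins for the runtime's Tensor and the captured record.
struct Tensor {
    unsigned long long addr;
    unsigned int shape[4];
    unsigned int dtype;
};
constexpr std::size_t kMaxTensors = 4;

struct RawRecord {
    unsigned int tensor_count;
    unsigned char tensors[kMaxTensors][sizeof(Tensor)];  // raw bytes, no Tensor lifetime
};

static_assert(std::is_trivially_copyable<Tensor>::value,
              "memcpy materialization requires a trivially copyable Tensor");

// Copies each raw slot into a real, properly aligned Tensor object instead of
// reinterpreting the byte array in place.
std::vector<Tensor> materialize_tensors(const RawRecord &rec) {
    assert(rec.tensor_count <= kMaxTensors);
    std::vector<Tensor> out(rec.tensor_count);
    for (std::size_t i = 0; i < out.size(); ++i) {
        std::memcpy(&out[i], rec.tensors[i], sizeof(Tensor));
    }
    return out;
}
```

The returned vector has to outlive any pointers handed to the replay code, which matches the lifetime caveat in the item above.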

In `@src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp`:
- Around line 488-512: The code currently records the raw
args.explicit_deps_data()/args.explicit_dep_count() into
dep_gen_aicpu_record_submit, but you must record the runtime-filtered
explicit-deps list (the array and count produced after invalid/already-dead deps
are dropped) so replay matches actual enforced edges; update the
dep_gen_aicpu_record_submit calls (the one shown and the similar call around the
534-551 region) to pass the filtered deps buffer and its filtered count instead
of args.explicit_deps_data() and args.explicit_dep_count(), using the same
filtered-deps variable(s) produced by the runtime’s dep-filtering code path.

In `@tests/st/a5/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen.py`:
- Around line 139-142: The code currently returns silently when deps_path (the
expected deps.json) is missing, which hides failures when dep_gen is enabled;
replace the silent return with a fail-fast check (e.g. assert
deps_path.exists(), f"dep_gen enabled but deps.json missing at {deps_path}" or
pytest.fail(...) ) so the test fails with a clear message; ensure pytest is
imported if you use pytest.fail and reference the symbols deps_path, deps.json
and dep_gen in the failure message to aid debugging.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8b369d14-d666-4b3b-98ae-5e9b838f7773

📥 Commits

Reviewing files that changed from the base of the PR and between 61ba501 and 6ba5fa6.

📒 Files selected for processing (25)
  • docs/dfx/dep_gen.md
  • src/a5/platform/include/aicpu/dep_gen_collector_aicpu.h
  • src/a5/platform/include/common/dep_gen.h
  • src/a5/platform/include/common/kernel_args.h
  • src/a5/platform/include/common/platform_config.h
  • src/a5/platform/include/host/dep_gen_collector.h
  • src/a5/platform/onboard/aicpu/kernel.cpp
  • src/a5/platform/onboard/host/CMakeLists.txt
  • src/a5/platform/onboard/host/device_runner.cpp
  • src/a5/platform/onboard/host/device_runner.h
  • src/a5/platform/onboard/host/pto_runtime_c_api.cpp
  • src/a5/platform/sim/host/CMakeLists.txt
  • src/a5/platform/sim/host/device_runner.cpp
  • src/a5/platform/sim/host/device_runner.h
  • src/a5/platform/sim/host/pto_runtime_c_api.cpp
  • src/a5/platform/src/aicpu/dep_gen_collector_aicpu.cpp
  • src/a5/platform/src/host/dep_gen_collector.cpp
  • src/a5/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
  • src/a5/runtime/tensormap_and_ringbuffer/docs/profiling_levels.md
  • src/a5/runtime/tensormap_and_ringbuffer/host/dep_gen_replay.cpp
  • src/a5/runtime/tensormap_and_ringbuffer/host/dep_gen_replay.h
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
  • tests/st/a5/tensormap_and_ringbuffer/dfx/dep_gen/kernels/orchestration/chain_barrier_orch.cpp
  • tests/st/a5/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen.py
  • tests/st/a5/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen_chain.py

Comment on lines +753 to +761
if (enable_dep_gen_) {
dep_gen_collector_.stop();
if (dep_gen_collector_.reconcile_counters()) {
const auto &records = dep_gen_collector_.records();
const std::string deps = make_deps_json_path(output_prefix_);
int replay_rc = dep_gen_replay_emit_deps_json(records.data(), records.size(), deps.c_str());
if (replay_rc != 0) {
LOG_ERROR("dep_gen replay failed (%d) — deps.json not produced", replay_rc);
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail the run when dep-gen emission fails.

When enable_dep_gen_ is on, missing deps.json is a feature failure, not just a log message. Returning success here hides the regression from callers, and the current dep-gen tests only inspect deps.json when it exists.

Suggested fix
         int replay_rc = dep_gen_replay_emit_deps_json(records.data(), records.size(), deps.c_str());
         if (replay_rc != 0) {
             LOG_ERROR("dep_gen replay failed (%d) — deps.json not produced", replay_rc);
+            return replay_rc;
         }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
     if (enable_dep_gen_) {
         dep_gen_collector_.stop();
         if (dep_gen_collector_.reconcile_counters()) {
             const auto &records = dep_gen_collector_.records();
             const std::string deps = make_deps_json_path(output_prefix_);
             int replay_rc = dep_gen_replay_emit_deps_json(records.data(), records.size(), deps.c_str());
             if (replay_rc != 0) {
                 LOG_ERROR("dep_gen replay failed (%d) — deps.json not produced", replay_rc);
+                return replay_rc;
             }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a5/platform/onboard/host/device_runner.cpp` around lines 753 - 761, The
code currently only logs when dep_gen_replay_emit_deps_json fails; instead treat
that as a fatal error: after calling dep_gen_replay_emit_deps_json (inside the
enable_dep_gen_ block where dep_gen_collector_.reconcile_counters() is true),
check replay_rc and on non-zero either return a non-zero error code from this
function (or set the overall run/exit status variable and bail out) so the
caller sees failure; update the branch that calls dep_gen_replay_emit_deps_json
(symbols: enable_dep_gen_, dep_gen_collector_.reconcile_counters(),
dep_gen_collector_.records(), make_deps_json_path(),
dep_gen_replay_emit_deps_json) to propagate the error instead of only LOG_ERROR.

Comment on lines +709 to +717
if (enable_dep_gen_) {
dep_gen_collector_.stop();
if (dep_gen_collector_.reconcile_counters()) {
const auto &records = dep_gen_collector_.records();
const std::string deps = make_deps_json_path(output_prefix_);
int replay_rc = dep_gen_replay_emit_deps_json(records.data(), records.size(), deps.c_str());
if (replay_rc != 0) {
LOG_ERROR("dep_gen replay failed (%d) — deps.json not produced", replay_rc);
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail the sim run when dep-gen emission fails.

This has the same silent-failure problem as the onboard path: a run with enable_dep_gen_ set can look successful to the caller even though deps.json was never written. That makes dep-gen regressions easy to miss.

Suggested fix
         int replay_rc = dep_gen_replay_emit_deps_json(records.data(), records.size(), deps.c_str());
         if (replay_rc != 0) {
             LOG_ERROR("dep_gen replay failed (%d) — deps.json not produced", replay_rc);
+            return replay_rc;
         }
🤖 Prompt for AI Agents

In `@src/a5/platform/sim/host/device_runner.cpp` around lines 709 - 717, The
dep-gen path currently logs an error but does not stop the simulation when
dep_gen_replay_emit_deps_json fails; update the block in device_runner.cpp
(around enable_dep_gen_, dep_gen_collector_.stop(),
dep_gen_collector_.reconcile_counters(), make_deps_json_path,
dep_gen_replay_emit_deps_json) so that if replay_rc != 0 you propagate failure
instead of only logging: e.g., set the function/result state to indicate failure
(return an error code or throw an exception / call the routine that aborts the
run used by this module) so the caller sees the failure and the sim run
terminates when deps.json emission fails.

Comment on lines +72 to +75
s_dep_gen_header->queues[q][current_tail].instance_index = 0;
s_dep_gen_header->queues[q][current_tail].buffer_ptr = buffer_ptr;
s_dep_gen_header->queues[q][current_tail].buffer_seq = buffer_seq;
s_dep_gen_header->queue_tails[q] = next_tail;

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Publish consumer-visible state only after the payload/reset is visible.

queue_tails[q], current_buf_ptr, and buf->count are the host-visible publish points here, but each is stored before the wmb(). That lets the host observe a ready entry before its fields land, or see a recycled/current buffer before count has been reset, or copy records after count increases but before the payload is fully committed.

Suggested ordering fix
     s_dep_gen_header->queues[q][current_tail].instance_index = 0;
     s_dep_gen_header->queues[q][current_tail].buffer_ptr = buffer_ptr;
     s_dep_gen_header->queues[q][current_tail].buffer_seq = buffer_seq;
+    wmb();
     s_dep_gen_header->queue_tails[q] = next_tail;
@@
-    s_dep_gen_state->current_buf_ptr = new_buf_ptr;
-    s_dep_gen_state->current_buf_seq = seq + 1;
-    wmb();
-
     DepGenBuffer *new_buf = reinterpret_cast<DepGenBuffer *>(new_buf_ptr);
     new_buf->count = 0;
+    wmb();
+    s_dep_gen_state->current_buf_ptr = new_buf_ptr;
+    s_dep_gen_state->current_buf_seq = seq + 1;
@@
-        s_dep_gen_state->current_buf_ptr = buf_ptr;
-        s_dep_gen_state->current_buf_seq = 0;
-        wmb();
         DepGenBuffer *buf = reinterpret_cast<DepGenBuffer *>(buf_ptr);
         buf->count = 0;
+        wmb();
+        s_dep_gen_state->current_buf_ptr = buf_ptr;
+        s_dep_gen_state->current_buf_seq = 0;
@@
-    buf->count = idx + static_cast<uint32_t>(needed);
-    wmb();
+    wmb();
+    buf->count = idx + static_cast<uint32_t>(needed);

Also applies to: 121-126, 149-154, 321-322
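
The ordering rule (fill the payload, then a barrier, then the publish store) can be illustrated with a portable in-process analogue. This is a sketch under stated assumptions: `ReadyEntry`, `ReadyQueue`, `publish`, and `consume` are hypothetical and much simpler than the real dep_gen.h layout, and `std::atomic` release/acquire stands in for the `wmb()` used across the AICPU/host shared-memory boundary.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical, simplified ready-queue entry (not the real dep_gen.h layout).
struct ReadyEntry {
    uint64_t buffer_ptr = 0;
    uint32_t buffer_seq = 0;
};

constexpr uint32_t kSlots = 4;  // ring of 4 slots holds at most 3 entries

struct ReadyQueue {
    ReadyEntry slots[kSlots];
    std::atomic<uint32_t> head{0};  // advanced by the consumer
    std::atomic<uint32_t> tail{0};  // publish point, advanced by the producer
};

// Producer: write the slot fields first, then publish the tail with release
// semantics so the slot contents are visible before the new tail is.
bool publish(ReadyQueue &q, uint64_t buffer_ptr, uint32_t buffer_seq) {
    const uint32_t tail = q.tail.load(std::memory_order_relaxed);
    const uint32_t next = (tail + 1) % kSlots;
    if (next == q.head.load(std::memory_order_acquire)) {
        return false;  // queue full
    }
    q.slots[tail].buffer_ptr = buffer_ptr;
    q.slots[tail].buffer_seq = buffer_seq;
    q.tail.store(next, std::memory_order_release);  // publish last
    return true;
}

// Consumer: the acquire load of tail pairs with the producer's release store,
// so a visible tail implies the slot fields are visible too.
bool consume(ReadyQueue &q, ReadyEntry &out) {
    const uint32_t head = q.head.load(std::memory_order_relaxed);
    if (head == q.tail.load(std::memory_order_acquire)) {
        return false;  // queue empty
    }
    out = q.slots[head];
    q.head.store((head + 1) % kSlots, std::memory_order_release);
    return true;
}
```

If the tail store were moved above the slot writes, the consumer could read a slot whose fields have not landed yet, which is the hazard described above.
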

🤖 Prompt for AI Agents

In `@src/a5/platform/src/aicpu/dep_gen_collector_aicpu.cpp` around lines 72 - 75,
The host-visible publish points are stored before the payload/reset they guard
is visible. Reorder each site so that the guarded writes come first, then the
memory barrier (wmb()), then the publish store: write the per-queue entry fields
(s_dep_gen_header->queues[q][current_tail].instance_index, .buffer_ptr,
.buffer_seq) before s_dep_gen_header->queue_tails[q]; reset the new buffer's
count before s_dep_gen_state->current_buf_ptr and current_buf_seq; and commit
the record payload before updating buf->count. Apply the same reorder at the
other noted spots (121-126, 149-154, 321-322).

#include "common/unified_log.h"
#include "host/profiling_copy.h"

DepGenCollector::~DepGenCollector() { stop(); }

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Rollback partial init() allocations before returning.

If any alloc_single_buffer() call fails after earlier allocations succeeded, init() returns immediately and leaves those buffers/mappings behind. Because initialized_ is still false, finalize() exits at Line 265 and the destructor only calls stop(), so the partial state is never unwound.

Also applies to: 111-130, 264-266
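
One way to get the rollback is to collect allocations in a local list and commit them to the member only after every allocation has succeeded. The sketch below is an illustration, not the DepGenCollector code: `Collector`, its `fail_at` test knob, and the use of `std::malloc` in place of alloc_single_buffer() are all assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <utility>
#include <vector>

// Hypothetical stand-in for a collector that needs several buffers.
class Collector {
public:
    // fail_at: index of the allocation forced to fail (demo knob only); -1 means none.
    bool init(size_t count, size_t bytes, int fail_at = -1) {
        if (initialized_) {
            return false;  // refuse double init rather than leak the first set
        }
        std::vector<void *> acquired;
        for (size_t i = 0; i < count; i++) {
            void *p = (static_cast<int>(i) == fail_at) ? nullptr : std::malloc(bytes);
            if (p == nullptr) {
                for (void *q : acquired) {
                    std::free(q);  // unwind everything allocated so far
                }
                return false;  // buffers_ untouched, initialized_ stays false
            }
            acquired.push_back(p);
        }
        buffers_ = std::move(acquired);  // commit only after full success
        initialized_ = true;
        return true;
    }

    void finalize() {
        if (!initialized_) {
            return;  // safe: a failed init() left nothing behind
        }
        for (void *p : buffers_) {
            std::free(p);
        }
        buffers_.clear();
        initialized_ = false;
    }

    ~Collector() { finalize(); }

    size_t live_buffers() const { return buffers_.size(); }
    bool initialized() const { return initialized_; }

private:
    std::vector<void *> buffers_;
    bool initialized_ = false;
};
```

Because a failed init() frees its own partial state, the early return in finalize() when initialized_ is false no longer leaks anything.
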

🤖 Prompt for AI Agents

In `@src/a5/platform/src/host/dep_gen_collector.cpp` at line 38,
DepGenCollector::init() must roll back any successfully created buffers/mappings
if a later alloc_single_buffer() fails: on error, iterate the list of
already-allocated entries (the same containers used in init()), unmap/free each
buffer, close any fds, and remove/clear those entries so no partial state
remains; ensure initialized_ stays false and any temporary resources are
released (or alternatively set initialized_ = true and call finalize() only
after the partial state has been made consistent) so the destructor/stop() path
won't leak. Reference symbols: DepGenCollector::init(), alloc_single_buffer(),
finalize(), DepGenCollector::~DepGenCollector(), initialized_, stop().

Comment on lines +540 to +543
for (int32_t i = 0; i < tc; i++) {
tref_buf[i].ptr = reinterpret_cast<const Tensor *>(&rec.tensors[i][0]);
atype_buf[i] = static_cast<TensorArgType>(rec.arg_types[i]);
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

❓ Verification inconclusive

🏁 Script executed:

#!/bin/bash
# Verify whether DepGenRecord::tensors is typed/aligned storage or raw bytes.
fd -i 'dep_gen\.h|tensor\.h' src | xargs -r rg -n -C3 'struct DepGenRecord|tensors\[|alignas\(Tensor\)|struct Tensor|class Tensor'

Repository: hw-native-sys/simpler

Length of output: 7792



Avoid object-lifetime UB when replaying Tensor from DepGenRecord::tensors

dep_gen_replay.cpp reinterprets the DepGenRecord::tensors byte blobs (uint8_t tensors[...][DEP_GEN_TENSOR_SIZE]) as const Tensor* (reinterpret_cast<const Tensor *>(&rec.tensors[i][0])) and then dereferences them to read Tensor fields. The capture path fills these blobs via memcpy(rec->tensors[i], ..., DEP_GEN_TENSOR_SIZE) (or zeros for null slots), and the intended layout/alignment is enforced (DepGenRecord aligns tensors[] to 64B; Tensor is alignas(64) with static_assert(sizeof(Tensor) == DEP_GEN_TENSOR_SIZE)). The remaining risk is C++ object-lifetime/implicit-lifetime legality for turning memcpy-written bytes into an active Tensor object; if the build mode doesn’t make this pattern language-safe, replay should materialize into a real aligned Tensor object before dereference.
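
Materializing through memcpy can be sketched as follows. `FakeTensor`, `kBlobSize`, and `materialize` are hypothetical stand-ins, and the sketch assumes the real Tensor is trivially copyable, which the thread has not confirmed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical stand-in for the runtime's Tensor; only size and alignment matter here.
struct alignas(64) FakeTensor {
    uint64_t addr;
    uint32_t shape[4];
    uint32_t ndim;
};

// Plays the role of DEP_GEN_TENSOR_SIZE in this sketch.
constexpr size_t kBlobSize = sizeof(FakeTensor);

// Copying bytes into an object is only well-defined for trivially copyable types.
static_assert(std::is_trivially_copyable<FakeTensor>::value,
              "memcpy round-trip requires a trivially copyable type");

// Build a real, properly aligned object from the raw record bytes instead of
// reinterpreting the byte array in place.
FakeTensor materialize(const uint8_t *blob) {
    FakeTensor t;
    std::memcpy(&t, blob, sizeof(FakeTensor));
    return t;
}
```

The copy costs one tensor-sized memcpy per argument, and the resulting objects must outlive any pointer stored in tref_buf.
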

🤖 Prompt for AI Agents

In `@src/a5/runtime/tensormap_and_ringbuffer/host/dep_gen_replay.cpp` around lines
540 - 543, The loop currently reinterprets raw bytes in DepGenRecord::tensors as
const Tensor* which risks object-lifetime UB; instead allocate a temporary
array/vector of real Tensor objects (e.g. std::vector<Tensor> replay_tensors(tc)
or an aligned buffer of Tensor) and for each i do a memcpy(&replay_tensors[i],
&rec.tensors[i][0], sizeof(Tensor)) to materialize a real aligned Tensor object,
then set tref_buf[i].ptr = &replay_tensors[i] and atype_buf[i] as before (ensure
replay_tensors lives long enough for the replay usage).

Comment on lines +568 to +607
for (size_t j = rec_i + 1; j < num_records; j++) {
const DepGenRecord &maybe = records[j];
if (!(maybe.flags & DEP_GEN_FLAG_OVERFLOW)) {
LOG_ERROR(
"dep_gen replay: unterminated overflow chain at rec_idx=%zu (task_id=%" PRIu64 ")", rec_i,
rec.task_id
);
break;
}
if (maybe.task_id != rec.task_id) {
LOG_ERROR(
"dep_gen replay: orphan overflow at rec_idx=%zu (expected task_id=%" PRIu64 ", found %" PRIu64
")",
j, rec.task_id, maybe.task_id
);
break;
}
const auto *over = reinterpret_cast<const DepGenOverflowRecord *>(&maybe);
uint16_t over_dc = over->dep_count;
if (over_dc > DEP_GEN_OVERFLOW_DEPS_PER_RECORD) {
LOG_ERROR(
"dep_gen replay: clamping overflow dep_count %u > %d at rec_idx=%zu (task_id=%" PRIu64 ")",
over_dc, DEP_GEN_OVERFLOW_DEPS_PER_RECORD, j, rec.task_id
);
over_dc = DEP_GEN_OVERFLOW_DEPS_PER_RECORD;
}
full_deps_buf.insert(full_deps_buf.end(), over->deps, over->deps + over_dc);
if (over->flags & DEP_GEN_FLAG_LAST_OVERFLOW) {
chain_complete = true;
break;
}
}
if (!chain_complete) {
LOG_ERROR(
"dep_gen replay: chain for task_id=%" PRIu64 " missing LAST_OVERFLOW marker — "
"using partial dep list (%zu deps)",
rec.task_id, full_deps_buf.size()
);
}
deps_data = full_deps_buf.data();

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail replay on malformed overflow chains instead of writing partial deps.

The orphan/unterminated-chain paths only log and keep going with a truncated explicit-dependency list. That still allows a "successful" deps.json with missing arrows, which is worse than failing the replay the way the oracle-divergence path already does.

Suggested fix
             bool chain_complete = false;
+            bool malformed_chain = false;
             for (size_t j = rec_i + 1; j < num_records; j++) {
                 const DepGenRecord &maybe = records[j];
                 if (!(maybe.flags & DEP_GEN_FLAG_OVERFLOW)) {
                     LOG_ERROR(
                         "dep_gen replay: unterminated overflow chain at rec_idx=%zu (task_id=%" PRIu64 ")", rec_i,
                         rec.task_id
                     );
+                    malformed_chain = true;
                     break;
                 }
                 if (maybe.task_id != rec.task_id) {
                     LOG_ERROR(
                         "dep_gen replay: orphan overflow at rec_idx=%zu (expected task_id=%" PRIu64 ", found %" PRIu64
                         ")",
                         j, rec.task_id, maybe.task_id
                     );
+                    malformed_chain = true;
                     break;
                 }
                 const auto *over = reinterpret_cast<const DepGenOverflowRecord *>(&maybe);
                 uint16_t over_dc = over->dep_count;
                 if (over_dc > DEP_GEN_OVERFLOW_DEPS_PER_RECORD) {
@@
                 if (over->flags & DEP_GEN_FLAG_LAST_OVERFLOW) {
                     chain_complete = true;
                     break;
                 }
             }
-            if (!chain_complete) {
+            if (malformed_chain || !chain_complete) {
                 LOG_ERROR(
-                    "dep_gen replay: chain for task_id=%" PRIu64 " missing LAST_OVERFLOW marker — "
-                    "using partial dep list (%zu deps)",
+                    "dep_gen replay: invalid overflow chain for task_id=%" PRIu64,
-                    rec.task_id, full_deps_buf.size()
+                    rec.task_id
                 );
+                tm_oracle.destroy();
+                tm_annot.destroy();
+                return -7;
             }
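
The fail-fast chain walk can also be factored into a helper whose return value says whether the chain was well-formed. This is a sketch with hypothetical names: `Rec`, `kFlagOverflow`, `kFlagLastOverflow`, and `collect_overflow` are illustrative and do not match the real DepGenRecord ABI or flag values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative flag bits, not the real DEP_GEN_FLAG_* values.
constexpr uint32_t kFlagOverflow = 1u << 0;
constexpr uint32_t kFlagLastOverflow = 1u << 1;

// Hypothetical, simplified record.
struct Rec {
    uint64_t task_id;
    uint32_t flags;
    std::vector<uint64_t> deps;
};

// Appends the overflow deps that follow records[first] to out.
// Returns false for any malformed chain, so the caller can abort the replay
// instead of continuing with a truncated dependency list.
bool collect_overflow(const std::vector<Rec> &records, size_t first, std::vector<uint64_t> &out) {
    const uint64_t task_id = records[first].task_id;
    for (size_t j = first + 1; j < records.size(); j++) {
        const Rec &r = records[j];
        if (!(r.flags & kFlagOverflow)) {
            return false;  // chain ended without a LAST marker
        }
        if (r.task_id != task_id) {
            return false;  // overflow record belongs to another task
        }
        out.insert(out.end(), r.deps.begin(), r.deps.end());
        if (r.flags & kFlagLastOverflow) {
            return true;  // well-formed chain
        }
    }
    return false;  // ran off the end of the record stream
}
```
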
🤖 Prompt for AI Agents

In `@src/a5/runtime/tensormap_and_ringbuffer/host/dep_gen_replay.cpp` around lines
568 - 607, The code currently logs orphan/unterminated overflow chains inside
the loop (checking DEP_GEN_FLAG_OVERFLOW and matching task_id) and then
continues, allowing a truncated deps list to be used; instead, when encountering
these malformed chains (or when chain_complete is false after scanning overflow
records) abort the replay immediately rather than proceeding: in the block
handling orphan overflow and the block handling unterminated chains, replace the
LOG_ERROR-only behavior with a hard failure (e.g., return an error status or
throw an exception) from the enclosing function so that
DepGenRecord/DepGenOverflowRecord chains that are malformed do not lead to using
full_deps_buf/deps_data; ensure the failure path prevents setting deps_data and
propagates a clear error for the caller to detect.

Comment on lines +488 to +512
    if (is_dep_gen_enabled()) {
        const void *tensor_ptrs[MAX_TENSOR_ARGS];
        // TensorArgType is `enum class : int32_t` (4 bytes); the on-disk record
        // packs arg_types as uint8_t[16] (5-value enum fits in a byte). Narrow
        // each tag here rather than letting the AICPU writer reinterpret a
        // 4×-wider array as bytes — that path silently lost two of every three
        // tags on little-endian and synthesized phantom self-edges in replay.
        uint8_t arg_types_u8[MAX_TENSOR_ARGS];
        // Clamp to MAX_TENSOR_ARGS even though the Arg builder caps adds at
        // MAX_TENSOR_ARGS: defensive against any future builder bypass /
        // shared-memory bit-flip that could otherwise overrun the two
        // MAX_TENSOR_ARGS-sized stack buffers above.
        const int tc_raw = args.tensor_count();
        const int tc = tc_raw > MAX_TENSOR_ARGS ? MAX_TENSOR_ARGS : tc_raw;
        for (int i = 0; i < tc; i++) {
            // OUTPUT slots carry create_info (not yet a Tensor); skip them —
            // they have no producer to look up and replay's per-tensor loop
            // also skips OUTPUT.
            tensor_ptrs[i] = (args.tag(i) == TensorArgType::OUTPUT) ? nullptr : args.tensor(i).ptr;
            arg_types_u8[i] = static_cast<uint8_t>(args.tag(i));
        }
        dep_gen_aicpu_record_submit(
            task_id.raw, orch->in_manual_scope(), tc, tensor_ptrs, arg_types_u8,
            static_cast<int>(args.explicit_dep_count()), reinterpret_cast<const uint64_t *>(args.explicit_deps_data())
        );
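
To illustrate why the snippet above narrows each tag instead of handing the wide array to a byte-oriented reader, here is a small standalone sketch. `SketchArgType`, `kSketchMaxArgs`, `narrow_tags`, and `reinterpret_tags` are hypothetical names; only the `int32_t` underlying type, the `OUTPUT` enumerator, and the 16-slot byte array come from the comment above, and the other enumerators are invented for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical mirror of the tag enum: 4 bytes wide, small values only.
enum class SketchArgType : int32_t { INPUT = 0, OUTPUT = 1, INOUT = 2 };

constexpr int kSketchMaxArgs = 16;  // assumption: matches the uint8_t[16] on-disk field

// Correct path: convert element by element, clamped to the destination size.
inline void narrow_tags(const SketchArgType *wide, int count, uint8_t *narrow) {
    const int n = count < 0 ? 0 : (count > kSketchMaxArgs ? kSketchMaxArgs : count);
    for (int i = 0; i < n; i++) {
        narrow[i] = static_cast<uint8_t>(wide[i]);
    }
}

// Buggy path, shown for contrast: treats the int32_t array as if it were
// already a byte array, so most byte slots hold padding bytes of a wider tag.
inline void reinterpret_tags(const SketchArgType *wide, int count, uint8_t *narrow) {
    std::memcpy(narrow, wide, static_cast<size_t>(count));
}
```

With tags `{INOUT, OUTPUT, ...}`, `narrow_tags` writes 1 into slot 1, whereas `reinterpret_tags` leaves slot 1 holding a zero byte taken from the middle of the first 4-byte value, on either byte order.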

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Record the runtime-filtered explicit deps, not the raw input list.

This capture writes args.explicit_deps_data() verbatim, but the runtime immediately below drops invalid/already-dead deps before they ever participate in fanin. Replay can therefore emit explicit edges that the runtime never enforced, which makes deps.json drift from the real dependency graph.

Also applies to: 534-551

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp` around
lines 488 - 512, the code passes the raw
args.explicit_deps_data()/args.explicit_dep_count() to
dep_gen_aicpu_record_submit. Record the runtime-filtered explicit-deps list
instead, i.e. the array and count that remain after invalid and already-dead
deps are dropped, so that replay matches the edges the runtime actually
enforced. Update both dep_gen_aicpu_record_submit calls (the one shown and the
similar call around the 534-551 region) to pass the filtered deps buffer and its
filtered count, using the same filtered-deps variable(s) produced by the
runtime's dep-filtering code path.

Comment on lines +139 to +142
        if not deps_path.exists():
            # Output dir exists but no deps.json — another diagnostic flag was
            # on (e.g. just --enable-l2-swimlane) but not --enable-dep-gen.
            return

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast when dep_gen is enabled but deps.json is missing.

Line 139 currently returns silently, which can mask the exact regression this test is supposed to catch when dep_gen is effectively enabled.

Proposed fix
         deps_path = out_dir / "deps.json"
-        if not deps_path.exists():
-            # Output dir exists but no deps.json — another diagnostic flag was
-            # on (e.g. just --enable-l2-swimlane) but not --enable-dep-gen.
-            return
+        assert deps_path.exists(), (
+            f"dep_gen was enabled but {deps_path} is missing. "
+            "Likely cause: dep_gen capture/replay did not emit the artifact."
+        )
         with deps_path.open() as f:
             deps = json.load(f)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/st/a5/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen.py` around
lines 139 - 142, the test returns silently when deps_path (the expected
deps.json) is missing, which hides failures when dep_gen is enabled. Replace the
silent return with a fail-fast check, e.g. assert deps_path.exists(),
f"dep_gen enabled but deps.json missing at {deps_path}" or pytest.fail(...), so
the test fails with a clear message. Import pytest if you use pytest.fail, and
name deps_path, deps.json and dep_gen in the failure message to aid debugging.

@indigo1973 changed the title from "Add: dep_gen capture+replay support on a5" to "[WIP] Add: dep_gen capture+replay support on a5" on May 28, 2026