[WIP] Add: dep_gen capture+replay support on a5 #886
Port the dep_gen (SubmitTrace) feature from a2a3 to a5 so the
tensormap_and_ringbuffer runtime on a5 can produce deps.json and feed
flow events into swimlane_converter.py. Without this, --enable-dep-gen
was a no-op on a5 and merged_swimlane_*.json had no dependency arrows.
Reused from a2a3 verbatim (byte-identical):
- Shared-memory ABI: common/dep_gen.h (DepGenRecord 2624 B, overflow
chain, SPSC free_queue, per-thread ready_queue)
- AICPU writer: aicpu/dep_gen_collector_aicpu.{h,cpp}
- Runtime replay: runtime/tensormap_and_ringbuffer/host/dep_gen_replay
- Orchestrator capture point + aicpu_executor lifecycle hooks
- 5 platform_config constants + PROFILING_FLAG_DEP_GEN bit
Specialized for a5 (no SVM, see profiling_common diff vs a2a3):
- dep_gen_collector.cpp uses alloc_single_buffer (malloc shadow +
profiling_copy_to_device) instead of identity-mapping when
register_cb is null — matches a5's PMU/L2Perf/Dump collectors.
- Two-phase set_memory_context: callbacks first, then shm pointers
once the region is committed, so start(tf) gates correctly.
- reconcile_counters explicitly copy_from_device's the BufferState +
current_buf before reading (mgmt thread is stopped by then).
- finalize lets BufferPoolManager::clear_mappings() be the single
source of truth for host-shadow lifetime — no per-collector dedup.
Sim path: dlsym set_platform_dep_gen_base / set_dep_gen_enabled out of
the AICPU .so and forward kernel_args.dep_gen_data_base + enable flag
at boot, mirroring the existing pmu / dump / l2_perf setters.
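The sim-path forwarding described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the symbol names come from the description, but the function-pointer signatures, the `DepGenSetters` struct, and both helper functions are assumptions made for the example.

```cpp
#include <dlfcn.h>

#include <cassert>
#include <cstdint>

// Hypothetical prototypes: the PR text names the symbols but not their signatures.
using SetDepGenBaseFn = void (*)(uint64_t);
using SetDepGenEnabledFn = void (*)(bool);

struct DepGenSetters {
    SetDepGenBaseFn set_base = nullptr;
    SetDepGenEnabledFn set_enabled = nullptr;
};

// Look up both setters in an already dlopen'ed AICPU .so handle.
// A missing symbol simply leaves the corresponding pointer null.
inline DepGenSetters resolve_dep_gen_setters(void *so_handle) {
    DepGenSetters s;
    s.set_base = reinterpret_cast<SetDepGenBaseFn>(dlsym(so_handle, "set_platform_dep_gen_base"));
    s.set_enabled = reinterpret_cast<SetDepGenEnabledFn>(dlsym(so_handle, "set_dep_gen_enabled"));
    return s;
}

// Forward the data base and enable flag; report failure if either setter is absent
// so the caller can decide whether dep_gen should be treated as unavailable.
inline bool forward_dep_gen_config(const DepGenSetters &s, uint64_t data_base, bool enabled) {
    if (s.set_base == nullptr || s.set_enabled == nullptr) return false;
    s.set_base(data_base);
    s.set_enabled(enabled);
    return true;
}
```

Returning `false` when a symbol is missing keeps an older AICPU .so from being called through a null pointer; whether the real runner treats that as fatal is not stated in the description.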
Onboard kernel.cpp adds two lines to forward dep_gen_data_base +
PROFILING_FLAG_DEP_GEN into the AICPU writer's globals, mirroring the
existing PMU / L2 / Dump setters.
c_api: run_prepared's enable_dep_gen parameter is no longer ignored —
wired to runner->set_dep_gen_enabled() on both onboard and sim.
Tests:
- tests/st/a5/.../dfx/dep_gen/test_dep_gen.py: 6-edge validation
against vector_example orchestration (byte-identical to a2a3 — same
expected edge set).
- tests/st/a5/.../dfx/dep_gen/test_dep_gen_chain.py: overflow chain
regression for >64 explicit deps.
Docs:
- docs/dfx/dep_gen.md: §8 Architecture Touchpoints now lists both
platforms; "Currently a2a3 only" line removed.
- src/a5/runtime/.../docs/profiling_levels.md: Code Locations point
at src/a5/ (was stale src/a2a3/ refs from PR hw-native-sys#777 cleanup) and
add a dep_gen entry.
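To make the overflow-chain idea concrete, here is a rough sketch of splitting a long explicit-dependency list across a head record plus continuation records, and of reassembling it on the host. Everything here is hypothetical: `SketchRecord`, its fields, and both functions are illustrative stand-ins, not the `DepGenRecord` layout from `common/dep_gen.h`. The only detail taken from the description is the inline capacity of 64 entries.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kInlineDepCapacity = 64;  // from the ">64 explicit deps" regression test

struct SketchRecord {
    uint64_t task_id = 0;
    bool is_overflow = false;    // true for continuation records
    bool has_more = false;       // true if another record of the chain follows
    std::vector<uint64_t> deps;  // at most kInlineDepCapacity entries
};

// Split one submit's deps into a head record followed by overflow records.
inline std::vector<SketchRecord> split_into_chain(uint64_t task_id, const std::vector<uint64_t> &deps) {
    std::vector<SketchRecord> chain;
    std::size_t offset = 0;
    do {
        const std::size_t n = std::min(kInlineDepCapacity, deps.size() - offset);
        SketchRecord rec;
        rec.task_id = task_id;
        rec.is_overflow = !chain.empty();
        const auto first = deps.begin() + static_cast<std::ptrdiff_t>(offset);
        rec.deps.assign(first, first + static_cast<std::ptrdiff_t>(n));
        offset += n;
        rec.has_more = offset < deps.size();
        chain.push_back(rec);
    } while (offset < deps.size());
    return chain;
}

// Rebuild the full dep list. Returns false for an orphan overflow record,
// a task_id mismatch, or a chain that never terminates.
inline bool reassemble_chain(const std::vector<SketchRecord> &records, std::vector<uint64_t> &out) {
    out.clear();
    if (records.empty() || records.front().is_overflow) return false;
    const uint64_t task_id = records.front().task_id;
    for (std::size_t i = 0; i < records.size(); ++i) {
        const SketchRecord &rec = records[i];
        if (rec.task_id != task_id || (i > 0 && !rec.is_overflow)) return false;
        out.insert(out.end(), rec.deps.begin(), rec.deps.end());
        if (!rec.has_more) return i + 1 == records.size();
    }
    return false;  // ran out of records while has_more was still set
}
```

The reassembly step reports malformed chains instead of returning a truncated list, which is the behavior a replay consumer would want if a partial dependency set must never be mistaken for a complete one.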
📝 Walkthrough
This PR introduces a complete dependency-generation (DepGen) capture and replay system for the a5 platform. It enables offline analysis of orchestrator task submission graphs by capturing per-submit metadata (task IDs, tensor references, explicit dependencies) into device-resident buffers, transferring completed buffers to the host, and replaying them to generate a
Changes
DepGen Dependency Capture and Replay
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Code Review
This pull request ports the dep_gen (SubmitTrace) capture and offline replay infrastructure to the a5 platform, mirroring the a2a3 implementation. It introduces the shared-memory structures, the AICPU writer, the host collector, and the host-side replay mechanism that performs a differential check to emit deps.json, along with corresponding integration tests. The review feedback highlights critical concurrency and memory management issues that must be addressed: several incorrect or missing memory barriers (wmb() and rmb()) in the AICPU writer could lead to stale reads on weakly-ordered architectures, a memory leak exists where std::malloc'd host shadows are not freed during finalization, and a data race on total_collected_ requires it to be declared as atomic.
    } else {
        host_ptr = std::malloc(size);
        if (host_ptr == nullptr) {
            LOG_ERROR("DepGenCollector: host shadow alloc failed for %zu bytes", size);
            free_cb_(dev_ptr);
            if (host_ptr_out) *host_ptr_out = nullptr;
            return nullptr;
        }
        std::memset(host_ptr, 0, size);
        profiling_copy_to_device(dev_ptr, host_ptr, size);
    }
When register_cb_ is null (always the case on the a5 platform), alloc_single_buffer allocates host shadows with std::malloc. finalize() then calls manager_.clear_mappings(), which drops the mappings but does not free those pointers: it cannot call std::free unconditionally, because doing so on registered mappings would crash. As a result, every run silently leaks all host shadows and the SHM host region. Track the std::malloc'd host pointers in a private vector, malloced_host_ptrs_, and free them in finalize().
Suggested change
     } else {
         host_ptr = std::malloc(size);
         if (host_ptr == nullptr) {
             LOG_ERROR("DepGenCollector: host shadow alloc failed for %zu bytes", size);
             free_cb_(dev_ptr);
             if (host_ptr_out) *host_ptr_out = nullptr;
             return nullptr;
         }
         std::memset(host_ptr, 0, size);
         profiling_copy_to_device(dev_ptr, host_ptr, size);
+        malloced_host_ptrs_.push_back(host_ptr);
     }
    // Free remaining host shadows (per-state buffers + shm region).
    manager_.clear_mappings();
To prevent memory leaks of the host shadows allocated via std::malloc when register_cb_ == nullptr, we must explicitly free them in finalize() after clearing the mappings.
Suggested change
     // Free remaining host shadows (per-state buffers + shm region).
     manager_.clear_mappings();
+    for (void* ptr : malloced_host_ptrs_) {
+        std::free(ptr);
+    }
+    malloced_host_ptrs_.clear();
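An alternative to a raw `std::vector<void*>` is to let ownership do the bookkeeping. The sketch below is one possible shape, assuming the shadows are plain `std::malloc`/`std::calloc` allocations; `ShadowOwner` and `FreeDeleter` are hypothetical names and are not part of DepGenCollector.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <utility>
#include <vector>

// Deleter that releases memory obtained from std::malloc / std::calloc.
struct FreeDeleter {
    void operator()(void *p) const noexcept { std::free(p); }
};
using HostShadow = std::unique_ptr<void, FreeDeleter>;

// Hypothetical owner of host shadow buffers: every allocation is recorded,
// so release_all() (or the destructor) frees each one exactly once.
class ShadowOwner {
public:
    // Returns a zero-filled buffer, or nullptr if allocation fails.
    void *alloc_zeroed(std::size_t size) {
        HostShadow owned(std::calloc(1, size));
        if (!owned) return nullptr;
        void *raw = owned.get();
        shadows_.push_back(std::move(owned));
        return raw;
    }

    std::size_t live() const { return shadows_.size(); }

    // Frees every tracked shadow; safe to call more than once.
    void release_all() { shadows_.clear(); }

private:
    std::vector<HostShadow> shadows_;
};
```

With this shape, a finalize-style function only needs to call `release_all()` after the mappings are cleared, and a collector that is destroyed without finalizing still frees its shadows.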
    rmb();
    uint32_t head = s_dep_gen_state->free_queue.head;
    uint32_t tail = s_dep_gen_state->free_queue.tail;

    if (head != tail) {
        uint64_t buf_ptr = s_dep_gen_state->free_queue.buffer_ptrs[head % PLATFORM_DEP_GEN_SLOT_COUNT];
The read memory barrier rmb() must be placed after reading tail and before reading buffer_ptrs to prevent speculative loads of stale data on weakly-ordered architectures.
Suggested change
-    rmb();
     uint32_t head = s_dep_gen_state->free_queue.head;
     uint32_t tail = s_dep_gen_state->free_queue.tail;
     if (head != tail) {
+        rmb();
         uint64_t buf_ptr = s_dep_gen_state->free_queue.buffer_ptrs[head % PLATFORM_DEP_GEN_SLOT_COUNT];
References
- On weakly-ordered architectures, ensure that a read memory barrier is explicitly placed between an MMIO read and a subsequent Normal memory read when there is no data or address dependency.
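The ordering requirement can be illustrated with a small single-producer/single-consumer queue written with C++ atomics instead of rmb()/wmb(). This is only a sketch: `FreeQueueSketch`, `kSlots`, `push`, and `pop` are hypothetical, and the acquire load of `tail` stands in for "read tail, then rmb()" in the AICPU code.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr uint32_t kSlots = 4;  // hypothetical slot count (power of two)

struct FreeQueueSketch {
    std::atomic<uint32_t> head{0};  // advanced only by the consumer
    std::atomic<uint32_t> tail{0};  // advanced only by the producer
    uint64_t buffer_ptrs[kSlots] = {};
};

// Producer: write the slot first, then publish it by advancing tail.
inline bool push(FreeQueueSketch &q, uint64_t ptr) {
    const uint32_t tail = q.tail.load(std::memory_order_relaxed);
    const uint32_t head = q.head.load(std::memory_order_acquire);
    if (tail - head == kSlots) return false;  // queue full
    q.buffer_ptrs[tail % kSlots] = ptr;
    q.tail.store(tail + 1, std::memory_order_release);
    return true;
}

// Consumer: the acquire load of tail orders the later slot read after it,
// so the slot contents written before the producer's release store are visible.
inline bool pop(FreeQueueSketch &q, uint64_t &out) {
    const uint32_t head = q.head.load(std::memory_order_relaxed);
    const uint32_t tail = q.tail.load(std::memory_order_acquire);
    if (head == tail) return false;  // queue empty
    out = q.buffer_ptrs[head % kSlots];
    q.head.store(head + 1, std::memory_order_release);
    return true;
}
```

A barrier placed before the index reads, as in the original snippet, does not order the slot read against the `tail` read, which is the gap the review comment points at.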
    // Running total of records appended. Equal to ``records_.size()`` after
    // every append; kept separately for the reconcile_counters cross-check
    // even when records_ may be inspected concurrently.
    uint64_t total_collected_ = 0;
The member variable total_collected_ is updated on the background management thread under records_mutex_ but read concurrently on other threads via the public getter total_collected() without any synchronization. This constitutes a data race under the C++ memory model and can lead to torn reads on 32-bit platforms or undefined behavior. Declaring total_collected_ as std::atomic<uint64_t> resolves this safely.
Suggested change
-    uint64_t total_collected_ = 0;
+    std::atomic<uint64_t> total_collected_ = 0;
+    std::vector<void*> malloced_host_ptrs_;
References
- When a component is accessed by multiple threads, use std::atomic with release-store and acquire-load semantics to establish a happens-before relationship.
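A minimal sketch of the pattern, assuming the writer appends under a mutex and readers only need the count: `CollectorSketch` is a hypothetical stand-in for the collector, keeping the member names from the snippet above.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>
#include <vector>

class CollectorSketch {
public:
    // Writer side (management thread): append under the mutex, then
    // publish the new total with a release store.
    void append(uint64_t record) {
        std::lock_guard<std::mutex> lock(records_mutex_);
        records_.push_back(record);
        total_collected_.store(records_.size(), std::memory_order_release);
    }

    // Reader side: lock-free, race-free read of the running total.
    uint64_t total_collected() const { return total_collected_.load(std::memory_order_acquire); }

private:
    std::mutex records_mutex_;
    std::vector<uint64_t> records_;
    std::atomic<uint64_t> total_collected_{0};
};
```

Brace initialization (`total_collected_{0}`) is used here because it compiles under every standard from C++11 on, whereas copy-initializing a `std::atomic` from `0` relies on C++17 guaranteed elision.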
Actionable comments posted: 8
🧹 Nitpick comments (2)
tests/st/a5/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen_chain.py (1)
200-206: ⚡ Quick win
Tighten explicit-edge validation to reduce false positives.
Add a strict explicit-edge count (n + 1) and validate that the barrier's single outgoing explicit edge is not self-looping or targeting a producer.
Proposed fix
     # All N producer→barrier edges must be present. This is the chain
     # round-trip assertion: pre-chain code drops anything past index 63.
     assert len(barrier_preds) == n, f"barrier has {len(barrier_preds)} preds, expected {n}"
+    assert len(explicit_edges) == n + 1, (
+        f"expected exactly {n + 1} explicit edges (N producer->barrier + 1 barrier->consumer), "
+        f"got {len(explicit_edges)}"
+    )
     # Consumer must explicit-depend on the barrier — exactly one outgoing
     # explicit edge from the barrier.
     outgoing_explicit_from_barrier = {succ for pred, succ in explicit_edges if pred == barrier_id}
     assert len(outgoing_explicit_from_barrier) == 1, (
         f"barrier {barrier_id} has {len(outgoing_explicit_from_barrier)} outgoing explicit edges, "
         f"expected 1 (the consumer)"
     )
+    consumer_id = next(iter(outgoing_explicit_from_barrier))
+    assert consumer_id != barrier_id and consumer_id not in barrier_preds, (
+        f"barrier {barrier_id} outgoing explicit edge points to invalid consumer candidate {consumer_id}"
+    )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/st/a5/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen_chain.py` around lines 200 - 206, tighten the barrier explicit-edge validation: assert that explicit_edges contains exactly n + 1 edges (N producer->barrier plus 1 barrier->consumer), keep the assertion that outgoing_explicit_from_barrier (computed from explicit_edges as shown) has exactly one element, then extract that consumer target and ensure it is not a self-loop (target != barrier_id) and not in barrier_preds; include the expected count and the offending target in the assertion messages for diagnostics.
src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp (1)
42-50: ⚡ Quick win
Add a compile-time guard for arg-slot-count parity.
This block asserts sizeof(Tensor), but the capture path also assumes MAX_TENSOR_ARGS == CORE_MAX_TENSOR_ARGS. If those ever drift, dep_gen will silently truncate/mis-shape records. A sibling static_assert would catch that at build time.
Suggested guard
 static_assert(sizeof(Tensor) == DEP_GEN_TENSOR_SIZE, "DepGenRecord::tensors slot size out of sync with sizeof(Tensor)");
+static_assert(MAX_TENSOR_ARGS == CORE_MAX_TENSOR_ARGS, "DepGen arg slot count out of sync with shared-memory ABI");
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp` around lines 42 - 50, Add a compile-time check that ensures the capture path's assumed arg-slot parity by static_assert-ing MAX_TENSOR_ARGS == CORE_MAX_TENSOR_ARGS so drift is caught at build time; locate the existing Tensor size static_assert near DepGenRecord/tensors in pto_orchestrator.cpp and add a sibling static_assert referencing the macros MAX_TENSOR_ARGS and CORE_MAX_TENSOR_ARGS with a clear error message like "tensor arg slot count mismatch: MAX_TENSOR_ARGS != CORE_MAX_TENSOR_ARGS".
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/a5/platform/onboard/host/device_runner.cpp`:
- Around line 753-761: The code currently only logs when
dep_gen_replay_emit_deps_json fails; instead treat that as a fatal error: after
calling dep_gen_replay_emit_deps_json (inside the enable_dep_gen_ block where
dep_gen_collector_.reconcile_counters() is true), check replay_rc and on
non-zero either return a non-zero error code from this function (or set the
overall run/exit status variable and bail out) so the caller sees failure;
update the branch that calls dep_gen_replay_emit_deps_json (symbols:
enable_dep_gen_, dep_gen_collector_.reconcile_counters(),
dep_gen_collector_.records(), make_deps_json_path(),
dep_gen_replay_emit_deps_json) to propagate the error instead of only LOG_ERROR.
In `@src/a5/platform/sim/host/device_runner.cpp`:
- Around line 709-717: The dep-gen path currently logs an error but does not
stop the simulation when dep_gen_replay_emit_deps_json fails; update the block
in device_runner.cpp (around enable_dep_gen_, dep_gen_collector_.stop(),
dep_gen_collector_.reconcile_counters(), make_deps_json_path,
dep_gen_replay_emit_deps_json) so that if replay_rc != 0 you propagate failure
instead of only logging: e.g., set the function/result state to indicate failure
(return an error code or throw an exception / call the routine that aborts the
run used by this module) so the caller sees the failure and the sim run
terminates when deps.json emission fails.
In `@src/a5/platform/src/aicpu/dep_gen_collector_aicpu.cpp`:
- Around line 72-75: The host-visible fields are being published before the
payload/reset is visible; move the publish points so that the payload and reset
are fully committed before updating host-visible pointers: ensure the per-queue
entry fields (s_dep_gen_header->queues[q][current_tail].instance_index,
.buffer_ptr, .buffer_seq) and buffer reset (current_buf_ptr, buf->count) are
written, then execute the memory barrier (wmb()) and only after that assign the
host-visible tail pointer (s_dep_gen_header->queue_tails[q]) and any other
host-visible pointer updates; apply the same reorder to the other occurrences
involving s_dep_gen_header, queue_tails, current_buf_ptr and buf->count (also at
the other noted spots).
In `@src/a5/platform/src/host/dep_gen_collector.cpp`:
- Line 38: DepGenCollector::init() must roll back any successfully created
buffers/mappings if a later alloc_single_buffer() fails: on error, iterate the
list of already-allocated entries (the same containers used in init()),
unmap/free each buffer, close any fds, and remove/clear those entries so no
partial state remains; ensure initialized_ stays false and any temporary
resources are released (or alternatively set initialized_ = true and call
finalize() only after the partial state has been made consistent) so the
destructor/stop() path won't leak. Reference symbols: DepGenCollector::init(),
alloc_single_buffer(), finalize(), DepGenCollector::~DepGenCollector(),
initialized_, stop().
In `@src/a5/runtime/tensormap_and_ringbuffer/host/dep_gen_replay.cpp`:
- Around line 568-607: The code currently logs orphan/unterminated overflow
chains inside the loop (checking DEP_GEN_FLAG_OVERFLOW and matching task_id) and
then continues, allowing a truncated deps list to be used; instead, when
encountering these malformed chains (or when chain_complete is false after
scanning overflow records) abort the replay immediately rather than proceeding:
in the block handling orphan overflow and the block handling unterminated
chains, replace the LOG_ERROR-only behavior with a hard failure (e.g., return an
error status or throw an exception) from the enclosing function so that
DepGenRecord/DepGenOverflowRecord chains that are malformed do not lead to using
full_deps_buf/deps_data; ensure the failure path prevents setting deps_data and
propagates a clear error for the caller to detect.
- Around line 540-543: The loop currently reinterprets raw bytes in
DepGenRecord::tensors as const Tensor* which risks object-lifetime UB; instead
allocate a temporary array/vector of real Tensor objects (e.g.
std::vector<Tensor> replay_tensors(tc) or an aligned buffer of Tensor) and for
each i do a memcpy(&replay_tensors[i], &rec.tensors[i][0], sizeof(Tensor)) to
materialize a real aligned Tensor object, then set tref_buf[i].ptr =
&replay_tensors[i] and atype_buf[i] as before (ensure replay_tensors lives long
enough for the replay usage).
In `@src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp`:
- Around line 488-512: The code currently records the raw
args.explicit_deps_data()/args.explicit_dep_count() into
dep_gen_aicpu_record_submit, but you must record the runtime-filtered
explicit-deps list (the array and count produced after invalid/already-dead deps
are dropped) so replay matches actual enforced edges; update the
dep_gen_aicpu_record_submit calls (the one shown and the similar call around the
534-551 region) to pass the filtered deps buffer and its filtered count instead
of args.explicit_deps_data() and args.explicit_dep_count(), using the same
filtered-deps variable(s) produced by the runtime’s dep-filtering code path.
In `@tests/st/a5/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen.py`:
- Around line 139-142: The code currently returns silently when deps_path (the
expected deps.json) is missing, which hides failures when dep_gen is enabled;
replace the silent return with a fail-fast check (e.g. assert
deps_path.exists(), f"dep_gen enabled but deps.json missing at {deps_path}" or
pytest.fail(...) ) so the test fails with a clear message; ensure pytest is
imported if you use pytest.fail and reference the symbols deps_path, deps.json
and dep_gen in the failure message to aid debugging.
📒 Files selected for processing (25)
- docs/dfx/dep_gen.md
- src/a5/platform/include/aicpu/dep_gen_collector_aicpu.h
- src/a5/platform/include/common/dep_gen.h
- src/a5/platform/include/common/kernel_args.h
- src/a5/platform/include/common/platform_config.h
- src/a5/platform/include/host/dep_gen_collector.h
- src/a5/platform/onboard/aicpu/kernel.cpp
- src/a5/platform/onboard/host/CMakeLists.txt
- src/a5/platform/onboard/host/device_runner.cpp
- src/a5/platform/onboard/host/device_runner.h
- src/a5/platform/onboard/host/pto_runtime_c_api.cpp
- src/a5/platform/sim/host/CMakeLists.txt
- src/a5/platform/sim/host/device_runner.cpp
- src/a5/platform/sim/host/device_runner.h
- src/a5/platform/sim/host/pto_runtime_c_api.cpp
- src/a5/platform/src/aicpu/dep_gen_collector_aicpu.cpp
- src/a5/platform/src/host/dep_gen_collector.cpp
- src/a5/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
- src/a5/runtime/tensormap_and_ringbuffer/docs/profiling_levels.md
- src/a5/runtime/tensormap_and_ringbuffer/host/dep_gen_replay.cpp
- src/a5/runtime/tensormap_and_ringbuffer/host/dep_gen_replay.h
- src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
- tests/st/a5/tensormap_and_ringbuffer/dfx/dep_gen/kernels/orchestration/chain_barrier_orch.cpp
- tests/st/a5/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen.py
- tests/st/a5/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen_chain.py
    if (enable_dep_gen_) {
        dep_gen_collector_.stop();
        if (dep_gen_collector_.reconcile_counters()) {
            const auto &records = dep_gen_collector_.records();
            const std::string deps = make_deps_json_path(output_prefix_);
            int replay_rc = dep_gen_replay_emit_deps_json(records.data(), records.size(), deps.c_str());
            if (replay_rc != 0) {
                LOG_ERROR("dep_gen replay failed (%d) — deps.json not produced", replay_rc);
            }
Fail the run when dep-gen emission fails.
When enable_dep_gen_ is on, missing deps.json is a feature failure, not just a log message. Returning success here hides the regression from callers, and the current dep-gen tests only inspect deps.json when it exists.
Suggested fix
 int replay_rc = dep_gen_replay_emit_deps_json(records.data(), records.size(), deps.c_str());
 if (replay_rc != 0) {
     LOG_ERROR("dep_gen replay failed (%d) — deps.json not produced", replay_rc);
+    return replay_rc;
 }
    if (enable_dep_gen_) {
        dep_gen_collector_.stop();
        if (dep_gen_collector_.reconcile_counters()) {
            const auto &records = dep_gen_collector_.records();
            const std::string deps = make_deps_json_path(output_prefix_);
            int replay_rc = dep_gen_replay_emit_deps_json(records.data(), records.size(), deps.c_str());
            if (replay_rc != 0) {
                LOG_ERROR("dep_gen replay failed (%d) — deps.json not produced", replay_rc);
            }
Fail the sim run when dep-gen emission fails.
This has the same silent-failure problem as the onboard path: enable_dep_gen_ can succeed from the caller’s point of view even though deps.json was never written. That makes dep-gen regressions easy to miss.
Suggested fix
 int replay_rc = dep_gen_replay_emit_deps_json(records.data(), records.size(), deps.c_str());
 if (replay_rc != 0) {
     LOG_ERROR("dep_gen replay failed (%d) — deps.json not produced", replay_rc);
+    return replay_rc;
 }
    s_dep_gen_header->queues[q][current_tail].instance_index = 0;
    s_dep_gen_header->queues[q][current_tail].buffer_ptr = buffer_ptr;
    s_dep_gen_header->queues[q][current_tail].buffer_seq = buffer_seq;
    s_dep_gen_header->queue_tails[q] = next_tail;
Publish consumer-visible state only after the payload/reset is visible.
queue_tails[q], current_buf_ptr, and buf->count are the host-visible publish points here, but each is stored before the wmb(). That lets the host observe a ready entry before its fields land, or see a recycled/current buffer before count has been reset, or copy records after count increases but before the payload is fully committed.
Suggested ordering fix
 s_dep_gen_header->queues[q][current_tail].instance_index = 0;
 s_dep_gen_header->queues[q][current_tail].buffer_ptr = buffer_ptr;
 s_dep_gen_header->queues[q][current_tail].buffer_seq = buffer_seq;
+wmb();
 s_dep_gen_header->queue_tails[q] = next_tail;

-s_dep_gen_state->current_buf_ptr = new_buf_ptr;
-s_dep_gen_state->current_buf_seq = seq + 1;
-wmb();
-
 DepGenBuffer *new_buf = reinterpret_cast<DepGenBuffer *>(new_buf_ptr);
 new_buf->count = 0;
+wmb();
+s_dep_gen_state->current_buf_ptr = new_buf_ptr;
+s_dep_gen_state->current_buf_seq = seq + 1;

-s_dep_gen_state->current_buf_ptr = buf_ptr;
-s_dep_gen_state->current_buf_seq = 0;
-wmb();
 DepGenBuffer *buf = reinterpret_cast<DepGenBuffer *>(buf_ptr);
 buf->count = 0;
+wmb();
+s_dep_gen_state->current_buf_ptr = buf_ptr;
+s_dep_gen_state->current_buf_seq = 0;

-buf->count = idx + static_cast<uint32_t>(needed);
-wmb();
+wmb();
+buf->count = idx + static_cast<uint32_t>(needed);
Also applies to: 121-126, 149-154, 321-322
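The "payload first, barrier, then publish" rule can be sketched with standard C++ fences, which play roughly the role of wmb() and rmb() here. `BufferSketch`, `kCapacity`, `append`, and `snapshot` are hypothetical names for illustration; the real DepGenBuffer layout is not shown in this thread.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr uint32_t kCapacity = 8;  // hypothetical record capacity

struct BufferSketch {
    std::atomic<uint32_t> count{0};  // publish point read by the host
    uint64_t records[kCapacity] = {};
};

// Writer: store the payload, fence, and only then bump the visible count.
inline bool append(BufferSketch &buf, uint64_t record) {
    const uint32_t idx = buf.count.load(std::memory_order_relaxed);
    if (idx >= kCapacity) return false;
    buf.records[idx] = record;                            // payload
    std::atomic_thread_fence(std::memory_order_release);  // wmb() analogue
    buf.count.store(idx + 1, std::memory_order_relaxed);  // publish
    return true;
}

// Reader: load the count, fence, and only then copy the payload it covers.
inline uint32_t snapshot(const BufferSketch &buf, uint64_t *out) {
    const uint32_t n = buf.count.load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_acquire);  // rmb() analogue
    for (uint32_t i = 0; i < n; ++i) out[i] = buf.records[i];
    return n;
}
```

If the count store were moved above the fence, a reader could observe the new count and copy a slot whose payload had not yet landed, which is the failure mode described in the comment above.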
    #include "common/unified_log.h"
    #include "host/profiling_copy.h"

    DepGenCollector::~DepGenCollector() { stop(); }
Rollback partial init() allocations before returning.
If any alloc_single_buffer() call fails after earlier allocations succeeded, init() returns immediately and leaves those buffers/mappings behind. Because initialized_ is still false, finalize() exits at Line 265 and the destructor only calls stop(), so the partial state is never unwound.
Also applies to: 111-130, 264-266
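One way to get all-or-nothing behavior is to stage allocations locally and commit them only after every one has succeeded. The sketch below assumes malloc-compatible buffers; `PoolSketch` and `AllocFn` are hypothetical, and the injected allocator exists only so a failure can be simulated.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <functional>
#include <utility>
#include <vector>

// Allocator hook; must return memory that std::free can release, or nullptr.
using AllocFn = std::function<void *(std::size_t)>;

class PoolSketch {
public:
    ~PoolSketch() { release(); }

    // Allocate `count` buffers of `size` bytes. On any failure, free the
    // buffers staged so far and leave the pool untouched.
    bool init(std::size_t count, std::size_t size, const AllocFn &alloc) {
        std::vector<void *> staged;
        staged.reserve(count);
        for (std::size_t i = 0; i < count; ++i) {
            void *p = alloc(size);
            if (p == nullptr) {
                for (void *q : staged) std::free(q);  // roll back partial work
                return false;
            }
            staged.push_back(p);
        }
        release();  // drop any previous state before committing
        buffers_ = std::move(staged);
        initialized_ = true;
        return true;
    }

    void release() {
        for (void *p : buffers_) std::free(p);
        buffers_.clear();
        initialized_ = false;
    }

    bool initialized() const { return initialized_; }
    std::size_t size() const { return buffers_.size(); }

private:
    std::vector<void *> buffers_;
    bool initialized_ = false;
};
```

Because nothing is committed until the loop finishes, a failed `init()` leaves no state for `finalize()` or the destructor to unwind.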
    for (int32_t i = 0; i < tc; i++) {
        tref_buf[i].ptr = reinterpret_cast<const Tensor *>(&rec.tensors[i][0]);
        atype_buf[i] = static_cast<TensorArgType>(rec.arg_types[i]);
    }
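The alternative to the cast above, copying each raw slot into a real object before taking its address, might look like the following. `TensorSketch` is a hypothetical stand-in for `Tensor`, and the approach assumes the real type is trivially copyable; `kSlotSize` and `materialize_tensors` are likewise illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for the runtime's Tensor type.
struct TensorSketch {
    uint64_t addr;
    uint32_t shape[4];
    uint32_t dtype;
};
static_assert(std::is_trivially_copyable<TensorSketch>::value,
              "memcpy materialization requires a trivially copyable type");

constexpr std::size_t kSlotSize = sizeof(TensorSketch);  // bytes per raw slot

// Copy `count` raw slots (laid out back to back) into properly aligned,
// properly typed objects. Pointers into the returned vector stay valid
// for as long as the vector itself is alive and unmodified.
inline std::vector<TensorSketch> materialize_tensors(const uint8_t *slots, std::size_t count) {
    std::vector<TensorSketch> out(count);
    for (std::size_t i = 0; i < count; ++i) {
        std::memcpy(&out[i], slots + i * kSlotSize, sizeof(TensorSketch));
    }
    return out;
}
```

The caller would then point each `tref_buf[i].ptr` at the corresponding vector element, keeping the vector in scope for the whole replay of that record.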
❓ Verification inconclusive
🏁 Script executed:
#!/bin/bash
# Verify whether DepGenRecord::tensors is typed/aligned storage or raw bytes.
fd -i 'dep_gen\.h|tensor\.h' src | xargs -r rg -n -C3 'struct DepGenRecord|tensors\[|alignas\(Tensor\)|struct Tensor|class Tensor'
Repository: hw-native-sys/simpler
Length of output: 7792
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="src/a5/runtime/tensormap_and_ringbuffer/host/dep_gen_replay.cpp"
HDR="src/a5/platform/include/common/dep_gen.h"
echo "=== dep_gen_replay.cpp: cast site (lines ~520-620) ==="
nl -ba "$FILE" | sed -n '500,620p'
echo "=== dep_gen_replay.cpp: where tref_buf.tptr is dereferenced/consumed (lines ~620-740) ==="
nl -ba "$FILE" | sed -n '620,740p'
echo "=== dep_gen_replay.cpp: find usages of tref_buf and '.ptr' ==="
rg -n "tref_buf|\.ptr\b" "$FILE"
echo "=== dep_gen.h: DepGenRecord::tensors definition ==="
nl -ba "$HDR" | sed -n '70,140p'
Repository: hw-native-sys/simpler
Length of output: 181
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="src/a5/runtime/tensormap_and_ringbuffer/host/dep_gen_replay.cpp"
HDR="src/a5/platform/include/common/dep_gen.h"
echo "=== dep_gen_replay.cpp: cast site (lines ~500-620) ==="
cat -n "$FILE" | sed -n '500,620p'
echo "=== dep_gen_replay.cpp: where tref_buf is consumed (lines ~620-740) ==="
cat -n "$FILE" | sed -n '620,740p'
echo "=== dep_gen_replay.cpp: find usages of tref_buf and '.ptr' ==="
rg -n "tref_buf|\.ptr\b" "$FILE"
echo "=== dep_gen.h: DepGenRecord::tensors definition ==="
cat -n "$HDR" | sed -n '70,140p'
Repository: hw-native-sys/simpler
Length of output: 18259
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Focused file locations
TENSOR_HDRS="$(fd -i 'tensor\.h' src | head -n 20)"
echo "Tensor headers (sample):"
echo "$TENSOR_HDRS"
echo "=== Find Tensor / TensorRef definitions ==="
fd -i 'tensor\.h' src | xargs -r rg -n --hidden --no-ignore-vcs 'struct Tensor\b|class Tensor\b|struct TensorRef\b|class TensorRef\b|using TensorRef\b|DEP_GEN_TENSOR_SIZE\b'
echo "=== dep_gen capture: where DepGenRecord::tensors gets written ==="
fd -i 'dep_gen_.*\.(cpp|cc|cxx|h|hpp|hxx)' src | xargs -r rg -n --hidden --no-ignore-vcs '&\s*rec\.tensors|dep_gen.*tensors\[|memcpy\([^,]*rec\.tensors|rec\.tensors\s*\[|DEP_GEN_TENSOR_SIZE'
# Also search by the field name more broadly (limited to relevant dirs via ripgrep)
rg -n --hidden --no-ignore-vcs '\bDepGenRecord\b|tensors\[\w*\]\[|DEP_GEN_TENSOR_SIZE' src/a5/platform/include/common/dep_gen.h src/a5/runtime/tensormap_and_ringbuffer/host src/a5/runtime/tensormap_and_ringbuffer/runtime || true
Repository: hw-native-sys/simpler
Length of output: 252
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# 1) Definitions: Tensor / TensorRef / TensorArgType / DEP_GEN_TENSOR_SIZE
echo "=== Tensor / TensorRef definitions (search) ==="
rg -n --hidden --no-ignore-vcs --glob 'src/**/tensor.h' \
'struct TensorRef\b|class TensorRef\b|using TensorRef\b|struct Tensor\b|class Tensor\b|enum class TensorArgType\b|enum TensorArgType\b|DEP_GEN_TENSOR_SIZE\b|TensorArgType\b' \
src/a5/runtime/tensormap_and_ringbuffer/runtime/tensor.h src/a2a3/runtime/tensormap_and_ringbuffer/runtime/tensor.h
echo
echo "=== dep_gen.h: constants around DEP_GEN_TENSOR_SIZE ==="
rg -n --hidden --no-ignore-vcs 'DEP_GEN_TENSOR_SIZE' src/a5/platform/include/common/dep_gen.h
cat -n src/a5/platform/include/common/dep_gen.h | sed -n '1,220p'
# 2) Where DepGenRecord::tensors gets written in capture path
echo
echo "=== Find writes/initialization to DepGenRecord::tensors ==="
rg -n --hidden --no-ignore-vcs \
'rec\.tensors\s*\[|\.tensors\s*\[|tensors\s*\[.*DEP_GEN_TENSOR_SIZE|DEP_GEN_TENSOR_SIZE' \
src/a5 | head -n 200
echo
echo "=== Search for serialization/copy into tensors blobs (memcpy / placement new / make_tensor_* ) ==="
rg -n --hidden --no-ignore-vcs \
'memcpy\([^;]*tensors|std::memcpy\([^;]*tensors|placement new|::new\s*\([^;]*tensors|reinterpret_cast<\s*Tensor\s*\*>\s*\(&rec\.tensors' \
src/a5 | head -n 200
# 3) Also check if dep_gen capture exists in other modules
echo
echo "=== Search for DepGenRecord usage outside a5 ==="
rg -n --hidden --no-ignore-vcs 'DepGenRecord\b' src | head -n 200
Repository: hw-native-sys/simpler
Length of output: 113
🏁 Script executed:
#!/bin/bash
set -euo pipefail
A5_TENSOR_HDR="src/a5/runtime/tensormap_and_ringbuffer/runtime/tensor.h"
A2A3_TENSOR_HDR="src/a2a3/runtime/tensormap_and_ringbuffer/runtime/tensor.h"
DEP_GEN_HDR="src/a5/platform/include/common/dep_gen.h"
DEP_GEN_CPP_DIR="src/a5"
echo "=== Tensor: find struct/class Tensor + TensorRef in a5 tensor.h ==="
rg -n --no-heading --hidden --no-ignore-vcs 'struct Tensor\b|class Tensor\b|TensorRef|struct TensorRef\b|alignas\(Tensor\)|make_tensor_' "$A5_TENSOR_HDR"
echo
echo "=== Tensor: show relevant sections in a5 tensor.h ==="
# show around TensorRef first, then Tensor
rg -n --no-heading --hidden --no-ignore-vcs 'TensorRef' "$A5_TENSOR_HDR" | head -n 20 | cut -d: -f1 | while read -r ln; do
start=$((ln-20)); end=$((ln+80));
echo "--- $A5_TENSOR_HDR:$start-$end ---"
sed -n "${start},${end}p" "$A5_TENSOR_HDR" | cat
done
echo
rg -n --no-heading --hidden --no-ignore-vcs 'struct Tensor\b|class Tensor\b' "$A5_TENSOR_HDR" | head -n 20 | cut -d: -f1 | while read -r ln; do
start=$((ln-20)); end=$((ln+120));
echo "--- $A5_TENSOR_HDR:$start-$end ---"
sed -n "${start},${end}p" "$A5_TENSOR_HDR" | cat
done
echo
echo "=== dep_gen.h: confirm DepGenRecord::tensors type ==="
rg -n --no-heading --hidden --no-ignore-vcs 'struct DepGenRecord\b|tensors\[' "$DEP_GEN_HDR"
sed -n '80,140p' "$DEP_GEN_HDR" | cat
echo
echo "=== Capture path: find writes into rec.tensors in src/a5 ==="
rg -n --hidden --no-ignore-vcs \
'rec\.tensors\s*\[|\.tensors\s*\[.*DEP_GEN_TENSOR_SIZE|tensors\s*\[.*\]\s*\=' \
"$DEP_GEN_CPP_DIR" | head -n 200
echo
echo "=== Capture path: find memcpy/placement-new into tensors blobs ==="
rg -n --hidden --no-ignore-vcs \
'memcpy\s*\([^;]*tensors|std::memcpy\s*\([^;]*tensors|::new\s*\([^;]*tensors|placement new|start_lifetime' \
"$DEP_GEN_CPP_DIR" | head -n 200
echo
echo "=== Capture path: find code that casts blob to Tensor (or TensorRef) ==="
rg -n --hidden --no-ignore-vcs \
'reinterpret_cast\s*<\s*(const\s+)?Tensor\s*\*>\s*\(&.*tensors|reinterpret_cast<\s*(const\s+)?Tensor\s*\*>' \
"$DEP_GEN_CPP_DIR" | head -n 200
Repository: hw-native-sys/simpler
Length of output: 276
🏁 Script executed:
#!/bin/bash
set -euo pipefail
A5_TENSOR_HDR="src/a5/runtime/tensormap_and_ringbuffer/runtime/tensor.h"
echo "=== Occurrences of TensorRef in a5 tensor.h ==="
rg -n --hidden --no-ignore-vcs 'TensorRef' "$A5_TENSOR_HDR" || true
echo
echo "=== Occurrences of 'struct Tensor' / 'class Tensor' in a5 tensor.h ==="
rg -n --hidden --no-ignore-vcs 'struct Tensor\b|class Tensor\b' "$A5_TENSOR_HDR" || true
echo
echo "=== Print tail/middle sections of a5 tensor.h (to locate definitions) ==="
echo "--- head ---"
sed -n '1,220p' "$A5_TENSOR_HDR" | cat
echo
echo "--- mid (220-420) ---"
sed -n '220,420p' "$A5_TENSOR_HDR" | cat
echo
echo "--- mid2 (420-620) ---"
sed -n '420,620p' "$A5_TENSOR_HDR" | cat
echo
echo "--- tail (620-900) ---"
sed -n '620,900p' "$A5_TENSOR_HDR" | cat
Repository: hw-native-sys/simpler
Length of output: 25484
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Locate TensorRef definition in repo ==="
rg -n --hidden --no-ignore-vcs '\bTensorRef\b' src | head -n 200
echo
echo "=== Find capture code that writes into DepGenRecord::tensors blobs ==="
rg -n --hidden --no-ignore-vcs '\.tensors\s*\[' src/a5 | head -n 200
echo
echo "=== Find where DepGenRecord is constructed/filled (a5 capture path) ==="
rg -n --hidden --no-ignore-vcs 'DepGenRecord' src/a5 | head -n 200
echo
echo "=== Search for memcpy of Tensor-sized blobs into tensors[][] ==="
rg -n --hidden --no-ignore-vcs '\bmemcpy\s*\([^;]*tensors' src/a5 | head -n 200
Repository: hw-native-sys/simpler
Length of output: 9724
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# 1) Look at the memcpy site that fills DepGenRecord::tensors
FILE1="src/a5/platform/src/aicpu/dep_gen_collector_aicpu.cpp"
echo "=== dep_gen_collector_aicpu.cpp around memcpy(rec->tensors[i], ...) ==="
cat -n "$FILE1" | sed -n '240,330p'
# 2) Look at static_assert in orchestrator about tensor blob size
FILE2="src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp"
echo
echo "=== pto_orchestrator.cpp static_assert sizeof(Tensor) == DEP_GEN_TENSOR_SIZE ==="
cat -n "$FILE2" | sed -n '1,120p'
# 3) Look at comment in dep_gen_collector_aicpu.h about DEP_GEN_TENSOR_SIZE bytes memcpy'd
FILE3="src/a5/platform/include/aicpu/dep_gen_collector_aicpu.h"
echo
echo "=== dep_gen_collector_aicpu.h comment about memcpy into record.tensors[i] ==="
cat -n "$FILE3" | sed -n '60,140p'
# 4) Look at TensorRef union definition to see how Tensor/metadata are accessed on host
FILE4="src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_types.h"
echo
echo "=== pto_types.h TensorRef union ==="
cat -n "$FILE4" | sed -n '120,220p'
Repository: hw-native-sys/simpler
Length of output: 19882
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Search for trivially copyable / implicit lifetime assertions
rg -n --hidden --no-ignore-vcs 'is_trivially_copyable<.*Tensor|is_trivially_copyable|implicit.?lifetime|std::launder|launder\(' \
src/a5/runtime/tensormap_and_ringbuffer/runtime/tensor.h src/a5/runtime/tensormap_and_ringbuffer/host/dep_gen_replay.cpp src/a5 | head -n 200
echo
echo "=== Search for any Tensor object reconstruction in dep_gen replay ==="
rg -n --hidden --no-ignore-vcs 'dep_gen_replay.*(memcpy|bit_cast|std::launder|placement new|start_lifetime)' \
src/a5/runtime/tensormap_and_ringbuffer/host/dep_gen_replay.cpp || true
echo
echo "=== Inspect DepGenRecord alignment/tensor size constants in dep_gen.h ==="
rg -n --hidden --no-ignore-vcs 'DEP_GEN_TENSOR_SIZE|CORE_MAX_TENSOR_ARGS|alignof\(DepGenRecord\)|__attribute__\(\(aligned|offsetof\(DepGenRecord, tensors\)' \
src/a5/platform/include/common/dep_gen.h
Repository: hw-native-sys/simpler
Length of output: 47
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# 1) Check whether Tensor is asserted as trivially copyable
echo "=== Search for trivially copyable assertions for Tensor ==="
rg -n --hidden --no-ignore-vcs \
'is_trivially_copyable.*Tensor|is_trivially_copyable_v.*Tensor|std::is_trivially_copyable.*Tensor|static_assert\([^)]*trivially_copyable[^)]*Tensor' \
src | head -n 200 || true
echo
echo "=== Search for std::start_lifetime_as / std::launder usage ==="
rg -n --hidden --no-ignore-vcs \
'start_lifetime_as<|std::start_lifetime_as|std::launder|__builtin_launder' \
src | head -n 200 || true
echo
echo "=== Find C++ standard flags (-std=c++*) in build files ==="
rg -n --hidden --no-ignore-vcs \
'-std=c\+\+[0-9]+\s*|/std:c\+\+[0-9]+' \
. | head -n 200 || true
echo
echo "=== Show alignas/lifetime relevant traits for Tensor (quick check) ==="
# show type declaration line containing alignas(64)
rg -n --hidden --no-ignore-vcs 'struct alignas\(64\) Tensor|class alignas\(64\) Tensor|struct alignas\(64\) Tensor' \
src/a5/runtime/tensormap_and_ringbuffer/runtime/tensor.h
Repository: hw-native-sys/simpler
Length of output: 1964
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# 1) Confirm DepGenRecord::tensors is raw byte storage
rg -n 'struct DepGenRecord|uint8_t\s+tensors\[' src/a5/platform/include/common/dep_gen.h
# 2) Confirm capture uses memcpy into rec->tensors[i]
rg -n 'memcpy\s*\(rec->tensors\[' src/a5/platform/src/aicpu/dep_gen_collector_aicpu.cpp
# 3) Confirm Tensor alignment/size
rg -n 'struct\s+alignas\(64\)\s+Tensor|static_assert\(sizeof\(Tensor\)' src/a5/runtime/tensormap_and_ringbuffer/runtime/tensor.h
# 4) Check whether Tensor is treated as implicit-lifetime / trivially copyable anywhere
rg -n --hidden --no-ignore-vcs 'is_trivially_copyable<.*Tensor|is_trivially_copyable_v<.*Tensor|implicit.?lifetime|start_lifetime_as|std::launder|__builtin_launder' src/a5/runtime/tensormap_and_ringbuffer/runtime/tensor.h src/a5/runtime/tensormap_and_ringbuffer/host/dep_gen_replay.cpp src/a5 | head -n 200
# 5) Find C++ standard flags used by this repo (implicit-lifetime rules depend on it)
rg -n --hidden --no-ignore-vcs --glob '*.{cmake,sh,txt,mk}' '-std=c\+\+[0-9]+' . | head -n 200
Repository: hw-native-sys/simpler
Length of output: 1964
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# C++ standard mode affects whether memcpy-written storage is valid for implicit-lifetime types.
echo "== C++ standard flags =="
rg -n --hidden --no-ignore-vcs --glob '*.{cmake,sh,txt,mk,cc,cpp,h,hpp}' '-std=c\+\+[0-9]+' . || true
echo
echo "== Tensor trivial/implicit-lifetime signals (look for traits/asserts) =="
rg -n --hidden --no-ignore-vcs 'is_trivially_copyable.*Tensor|is_trivially_copyable_v.*Tensor|implicit.?lifetime|start_lifetime_as|std::launder|__builtin_launder' \
src/a5/runtime/tensormap_and_ringbuffer/runtime/tensor.h src/a5/runtime/tensormap_and_ringbuffer/host/dep_gen_replay.cpp src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_types.h || true
echo
echo "== Any existing pattern for materializing tensors from byte blobs =="
rg -n --hidden --no-ignore-vcs 'start_lifetime_as<.*Tensor>|placement new.*Tensor|std::launder.*Tensor|bit_cast.*Tensor' src/a5 || true
Repository: hw-native-sys/simpler
Length of output: 1964
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# 1) Locate this repo's C++ standard flags (-std=c++*)
rg -n --hidden --no-ignore-vcs --glob '*.{cmake,sh,txt,mk,cc,cpp,h,hpp}' '-std=c\+\+[0-9]+' . || true
# 2) Confirm DepGenRecord tensor blob type is bytes
rg -n --hidden --no-ignore-vcs 'struct DepGenRecord|uint8_t\s+tensors\[' src/a5/platform/include/common/dep_gen.h
# 3) Confirm capture uses memcpy into rec->tensors[i]
rg -n --hidden --no-ignore-vcs 'memcpy\s*\(rec->tensors\[.*\],\s*tensor_ptrs\[' \
src/a5/platform/src/aicpu/dep_gen_collector_aicpu.cpp
# 4) Check for any traits/asserts about Tensor triviality/implicit-lifetime support
rg -n --hidden --no-ignore-vcs 'is_trivially_copyable.*Tensor|is_trivial.*Tensor|implicit.?lifetime|start_lifetime_as<|std::launder|__builtin_launder' \
src/a5/runtime/tensormap_and_ringbuffer/runtime/tensor.h \
src/a5/runtime/tensormap_and_ringbuffer/host/dep_gen_replay.cpp \
src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_types.h || true
Repository: hw-native-sys/simpler
Length of output: 1964
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# 1) C++ standard mode (object-lifetime rules depend on it)
rg -n --hidden --no-ignore-vcs --glob '*.{cmake,sh,txt,mk,cc,cpp,h,hpp}' '-std=c\+\+[0-9]+' . || true
# 2) Confirm tensors[] is raw bytes and capture uses memcpy into it
rg -n --hidden --no-ignore-vcs 'struct DepGenRecord|uint8_t\s+tensors\[' src/a5/platform/include/common/dep_gen.h
rg -n --hidden --no-ignore-vcs 'memcpy\s*\(rec->tensors\[|memcpy\s*\(rec\.tensors\[' src/a5/platform/src/aicpu/dep_gen_collector_aicpu.cpp
# 3) Look for any Tensor implicit-lifetime / triviality guarantees or existing lifetime-materialization patterns
rg -n --hidden --no-ignore-vcs 'is_trivially_copyable.*Tensor|std::is_trivially_copyable.*Tensor|implicit.?lifetime|start_lifetime_as|std::launder|__builtin_launder' \
src/a5/runtime/tensormap_and_ringbuffer/runtime/tensor.h \
src/a5/runtime/tensormap_and_ringbuffer/host/dep_gen_replay.cpp \
src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_types.h || true
Repository: hw-native-sys/simpler
Length of output: 1964
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# 1) Locate C++ standard flags (-std=c++*)
rg -n --hidden --no-ignore-vcs --glob '*.{cmake,sh,txt,mk,cc,cpp,h,hpp}' '-std=c\+\+[0-9]+' . || true
# 2) Reconfirm capture writes tensors as bytes
rg -n --hidden --no-ignore-vcs 'memcpy\s*\(rec->tensors\[|memset\s*\(rec->tensors\[|DEP_GEN_TENSOR_SIZE' \
src/a5/platform/src/aicpu/dep_gen_collector_aicpu.cpp src/a5/platform/include/common/dep_gen.h
# 3) Look for any established implicit-lifetime / object-materialization helpers
rg -n --hidden --no-ignore-vcs 'start_lifetime_as<|std::launder|__builtin_launder|placement new|memcpy\s*\(.*Tensor' \
src/a5/runtime/tensormap_and_ringbuffer/host/dep_gen_replay.cpp src/a5/runtime/tensormap_and_ringbuffer/runtime/tensor.h src/a5/runtime || true
Repository: hw-native-sys/simpler
Length of output: 1964
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Blob type/layout
rg -n 'uint8_t\s+tensors\[' src/a5/platform/include/common/dep_gen.h
rg -n '__attribute__\(\(aligned\(64\)\)\)|offsetof\(DepGenRecord,\s*tensors\)' src/a5/platform/include/common/dep_gen.h
# Capture writes bytes into blob
rg -n 'memcpy\s*\(\s*rec->tensors\[' src/a5/platform/src/aicpu/dep_gen_collector_aicpu.cpp
# Tensor alignment/size and (if present) trivial/implicit-lifetime indicators
rg -n 'struct\s+alignas\(64\)\s+Tensor|static_assert\(sizeof\(Tensor\)' src/a5/runtime/tensormap_and_ringbuffer/runtime/tensor.h
rg -n 'is_trivially_copyable|is_trivial|implicit.?lifetime|start_lifetime_as|std::launder|__builtin_launder' src/a5/runtime/tensormap_and_ringbuffer/runtime/tensor.h src/a5/runtime/tensormap_and_ringbuffer/host/dep_gen_replay.cpp || true
# C++ standard mode (affects implicit-lifetime rules)
rg -n --hidden --no-ignore-vcs --glob '*.{cmake,sh,txt,mk,cc,cpp,h,hpp}' '-std=c\+\+[0-9]+' . || true
Repository: hw-native-sys/simpler
Length of output: 1964
Avoid object-lifetime UB when replaying Tensor from DepGenRecord::tensors
dep_gen_replay.cpp reinterprets the DepGenRecord::tensors byte blobs (uint8_t tensors[...][DEP_GEN_TENSOR_SIZE]) as const Tensor* (reinterpret_cast<const Tensor *>(&rec.tensors[i][0])) and then dereferences them to read Tensor fields. The capture path fills these blobs via memcpy(rec->tensors[i], ..., DEP_GEN_TENSOR_SIZE) (or zeros for null slots), and the intended layout/alignment is enforced (DepGenRecord aligns tensors[] to 64B; Tensor is alignas(64) with static_assert(sizeof(Tensor) == DEP_GEN_TENSOR_SIZE)). Size and alignment are therefore not the concern. The remaining risk is C++ object lifetime: whether memcpy-written bytes may legally be read through a Tensor pointer depends on Tensor qualifying as an implicit-lifetime type under the language standard the build targets. If that cannot be guaranteed, replay should copy each blob into a real aligned Tensor object before dereferencing it.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/a5/runtime/tensormap_and_ringbuffer/host/dep_gen_replay.cpp` around lines
540 - 543, The loop currently reinterprets raw bytes in DepGenRecord::tensors as
const Tensor* which risks object-lifetime UB; instead allocate a temporary
array/vector of real Tensor objects (e.g. std::vector<Tensor> replay_tensors(tc)
or an aligned buffer of Tensor) and for each i do a memcpy(&replay_tensors[i],
&rec.tensors[i][0], sizeof(Tensor)) to materialize a real aligned Tensor object,
then set tref_buf[i].ptr = &replay_tensors[i] and atype_buf[i] as before (ensure
replay_tensors lives long enough for the replay usage).
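The materialization step suggested above can be sketched as follows. This is a minimal illustration, not the PR's code: `DemoTensor`, `DemoRecord`, `kTensorSize`, `kMaxTensors`, and `materialize_tensors` are hypothetical stand-ins for `Tensor`, `DepGenRecord`, `DEP_GEN_TENSOR_SIZE`, and the replay loop, and the field layout is invented. It assumes the real `Tensor` is trivially copyable, which the thread does not confirm.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical stand-ins; the real values and layout live in dep_gen.h / tensor.h.
constexpr std::size_t kTensorSize = 128;  // assumed, stands in for DEP_GEN_TENSOR_SIZE
constexpr int kMaxTensors = 4;            // assumed slot count

struct alignas(64) DemoTensor {
    uint64_t addr;
    uint32_t shape[4];
    uint8_t pad[kTensorSize - sizeof(uint64_t) - 4 * sizeof(uint32_t)];
};
static_assert(sizeof(DemoTensor) == kTensorSize, "blob and object sizes must match");
static_assert(std::is_trivially_copyable<DemoTensor>::value,
              "memcpy materialization is only valid for trivially copyable types");

struct DemoRecord {
    alignas(64) uint8_t tensors[kMaxTensors][kTensorSize];  // raw bytes, as in the capture ABI
};

// Copy each raw blob into a real DemoTensor object supplied by the caller, so
// later reads go through an object rather than through a casted byte pointer.
void materialize_tensors(const DemoRecord &rec, int count, DemoTensor *out) {
    assert(count >= 0 && count <= kMaxTensors);
    for (int i = 0; i < count; i++) {
        std::memcpy(&out[i], rec.tensors[i], sizeof(DemoTensor));
    }
}
```

The caller owns the output array, so its lifetime is explicit: it must outlive every pointer stored into `tref_buf`. The cost is one extra copy per tensor slot on the host replay path.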
| for (size_t j = rec_i + 1; j < num_records; j++) { | ||
| const DepGenRecord &maybe = records[j]; | ||
| if (!(maybe.flags & DEP_GEN_FLAG_OVERFLOW)) { | ||
| LOG_ERROR( | ||
| "dep_gen replay: unterminated overflow chain at rec_idx=%zu (task_id=%" PRIu64 ")", rec_i, | ||
| rec.task_id | ||
| ); | ||
| break; | ||
| } | ||
| if (maybe.task_id != rec.task_id) { | ||
| LOG_ERROR( | ||
| "dep_gen replay: orphan overflow at rec_idx=%zu (expected task_id=%" PRIu64 ", found %" PRIu64 | ||
| ")", | ||
| j, rec.task_id, maybe.task_id | ||
| ); | ||
| break; | ||
| } | ||
| const auto *over = reinterpret_cast<const DepGenOverflowRecord *>(&maybe); | ||
| uint16_t over_dc = over->dep_count; | ||
| if (over_dc > DEP_GEN_OVERFLOW_DEPS_PER_RECORD) { | ||
| LOG_ERROR( | ||
| "dep_gen replay: clamping overflow dep_count %u > %d at rec_idx=%zu (task_id=%" PRIu64 ")", | ||
| over_dc, DEP_GEN_OVERFLOW_DEPS_PER_RECORD, j, rec.task_id | ||
| ); | ||
| over_dc = DEP_GEN_OVERFLOW_DEPS_PER_RECORD; | ||
| } | ||
| full_deps_buf.insert(full_deps_buf.end(), over->deps, over->deps + over_dc); | ||
| if (over->flags & DEP_GEN_FLAG_LAST_OVERFLOW) { | ||
| chain_complete = true; | ||
| break; | ||
| } | ||
| } | ||
| if (!chain_complete) { | ||
| LOG_ERROR( | ||
| "dep_gen replay: chain for task_id=%" PRIu64 " missing LAST_OVERFLOW marker — " | ||
| "using partial dep list (%zu deps)", | ||
| rec.task_id, full_deps_buf.size() | ||
| ); | ||
| } | ||
| deps_data = full_deps_buf.data(); |
Fail replay on malformed overflow chains instead of writing partial deps.
The orphan/unterminated-chain paths only log and keep going with a truncated explicit-dependency list. That still allows a "successful" deps.json with missing arrows, which is worse than failing the replay the way the oracle-divergence path already does.
Suggested fix
bool chain_complete = false;
+ bool malformed_chain = false;
for (size_t j = rec_i + 1; j < num_records; j++) {
const DepGenRecord &maybe = records[j];
if (!(maybe.flags & DEP_GEN_FLAG_OVERFLOW)) {
LOG_ERROR(
"dep_gen replay: unterminated overflow chain at rec_idx=%zu (task_id=%" PRIu64 ")", rec_i,
rec.task_id
);
+ malformed_chain = true;
break;
}
if (maybe.task_id != rec.task_id) {
LOG_ERROR(
"dep_gen replay: orphan overflow at rec_idx=%zu (expected task_id=%" PRIu64 ", found %" PRIu64
")",
j, rec.task_id, maybe.task_id
);
+ malformed_chain = true;
break;
}
const auto *over = reinterpret_cast<const DepGenOverflowRecord *>(&maybe);
uint16_t over_dc = over->dep_count;
if (over_dc > DEP_GEN_OVERFLOW_DEPS_PER_RECORD) {
@@
if (over->flags & DEP_GEN_FLAG_LAST_OVERFLOW) {
chain_complete = true;
break;
}
}
- if (!chain_complete) {
+ if (malformed_chain || !chain_complete) {
LOG_ERROR(
- "dep_gen replay: chain for task_id=%" PRIu64 " missing LAST_OVERFLOW marker — "
- "using partial dep list (%zu deps)",
+ "dep_gen replay: invalid overflow chain for task_id=%" PRIu64 " (%zu deps collected)",
rec.task_id, full_deps_buf.size()
);
+ tm_oracle.destroy();
+ tm_annot.destroy();
+ return -7;
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/a5/runtime/tensormap_and_ringbuffer/host/dep_gen_replay.cpp` around lines
568 - 607, The code currently logs orphan/unterminated overflow chains inside
the loop (checking DEP_GEN_FLAG_OVERFLOW and matching task_id) and then
continues, allowing a truncated deps list to be used; instead, when encountering
these malformed chains (or when chain_complete is false after scanning overflow
records) abort the replay immediately rather than proceeding: in the block
handling orphan overflow and the block handling unterminated chains, replace the
LOG_ERROR-only behavior with a hard failure (e.g., return an error status or
throw an exception) from the enclosing function so that
DepGenRecord/DepGenOverflowRecord chains that are malformed do not lead to using
full_deps_buf/deps_data; ensure the failure path prevents setting deps_data and
propagates a clear error for the caller to detect.
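One way to structure the fail-fast behaviour requested above is to move the chain walk into a helper that reports malformed chains through its return value. The sketch below is illustrative only: `DemoRecord`, `ChainStatus`, `collect_overflow_chain`, and the flag bit values are hypothetical, and the real `DepGenRecord` / `DepGenOverflowRecord` layouts differ. Oversized `dep_count` values are clamped, as in the quoted code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical flag bits and record layout, for illustration only.
constexpr uint32_t kFlagOverflow = 1u << 0;      // stands in for DEP_GEN_FLAG_OVERFLOW
constexpr uint32_t kFlagLastOverflow = 1u << 1;  // stands in for DEP_GEN_FLAG_LAST_OVERFLOW
constexpr uint16_t kDepsPerRecord = 4;           // stands in for DEP_GEN_OVERFLOW_DEPS_PER_RECORD

struct DemoRecord {
    uint64_t task_id;
    uint32_t flags;
    uint16_t dep_count;
    uint64_t deps[kDepsPerRecord];
};

enum class ChainStatus { kOk, kUnterminated, kOrphan };

// Append the overflow deps that follow records[head] to `deps`. A malformed
// chain is reported to the caller, which can then abort the replay instead
// of continuing with a truncated dependency list.
ChainStatus collect_overflow_chain(const std::vector<DemoRecord> &records, std::size_t head,
                                   std::vector<uint64_t> &deps) {
    assert(head < records.size());
    const DemoRecord &rec = records[head];
    for (std::size_t j = head + 1; j < records.size(); j++) {
        const DemoRecord &over = records[j];
        if (!(over.flags & kFlagOverflow)) return ChainStatus::kUnterminated;
        if (over.task_id != rec.task_id) return ChainStatus::kOrphan;
        // Clamp a corrupt count so the read never leaves the deps array.
        const uint16_t n = over.dep_count > kDepsPerRecord ? kDepsPerRecord : over.dep_count;
        deps.insert(deps.end(), over.deps, over.deps + n);
        if (over.flags & kFlagLastOverflow) return ChainStatus::kOk;
    }
    return ChainStatus::kUnterminated;  // ran out of records before a LAST marker
}
```

A caller would treat any status other than `kOk` as a replay failure, release its resources, and return an error code, so that a truncated list never reaches `deps_data`.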
| if (is_dep_gen_enabled()) { | ||
| const void *tensor_ptrs[MAX_TENSOR_ARGS]; | ||
| // TensorArgType is `enum class : int32_t` (4 bytes); the on-disk record | ||
| // packs arg_types as uint8_t[16] (5-value enum fits in a byte). Narrow | ||
| // each tag here rather than letting the AICPU writer reinterpret a | ||
| // 4×-wider array as bytes — that path silently lost two of every three | ||
| // tags on little-endian and synthesized phantom self-edges in replay. | ||
| uint8_t arg_types_u8[MAX_TENSOR_ARGS]; | ||
| // Clamp to MAX_TENSOR_ARGS even though the Arg builder caps adds at | ||
| // MAX_TENSOR_ARGS: defensive against any future builder bypass / | ||
| // shared-memory bit-flip that could otherwise overrun the two | ||
| // MAX_TENSOR_ARGS-sized stack buffers above. | ||
| const int tc_raw = args.tensor_count(); | ||
| const int tc = tc_raw > MAX_TENSOR_ARGS ? MAX_TENSOR_ARGS : tc_raw; | ||
| for (int i = 0; i < tc; i++) { | ||
| // OUTPUT slots carry create_info (not yet a Tensor); skip them — | ||
| // they have no producer to look up and replay's per-tensor loop | ||
| // also skips OUTPUT. | ||
| tensor_ptrs[i] = (args.tag(i) == TensorArgType::OUTPUT) ? nullptr : args.tensor(i).ptr; | ||
| arg_types_u8[i] = static_cast<uint8_t>(args.tag(i)); | ||
| } | ||
| dep_gen_aicpu_record_submit( | ||
| task_id.raw, orch->in_manual_scope(), tc, tensor_ptrs, arg_types_u8, | ||
| static_cast<int>(args.explicit_dep_count()), reinterpret_cast<const uint64_t *>(args.explicit_deps_data()) | ||
| ); |
Record the runtime-filtered explicit deps, not the raw input list.
This capture writes args.explicit_deps_data() verbatim, but the runtime immediately below drops invalid/already-dead deps before they ever participate in fanin. Replay can therefore emit explicit edges that the runtime never enforced, which makes deps.json drift from the real dependency graph.
Also applies to: 534-551
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp` around
lines 488 - 512, The code currently records the raw
args.explicit_deps_data()/args.explicit_dep_count() into
dep_gen_aicpu_record_submit, but you must record the runtime-filtered
explicit-deps list (the array and count produced after invalid/already-dead deps
are dropped) so replay matches actual enforced edges; update the
dep_gen_aicpu_record_submit calls (the one shown and the similar call around the
534-551 region) to pass the filtered deps buffer and its filtered count instead
of args.explicit_deps_data() and args.explicit_dep_count(), using the same
filtered-deps variable(s) produced by the runtime’s dep-filtering code path.
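The idea of recording the filtered list can be sketched as below. The runtime's actual filtering code is not shown in this thread, so `filter_explicit_deps` and the `IsLiveFn` predicate are hypothetical; the sketch only shows the shape of "filter first, then record what survived".

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical liveness predicate; stands in for whatever check the runtime
// applies before a dep is allowed to participate in fanin.
using IsLiveFn = bool (*)(uint64_t task_id);

// Return only the explicit deps that pass the filter, preserving their order.
std::vector<uint64_t> filter_explicit_deps(const uint64_t *deps, int count, IsLiveFn is_live) {
    assert(count >= 0 && (count == 0 || deps != nullptr));
    std::vector<uint64_t> kept;
    kept.reserve(static_cast<std::size_t>(count));
    for (int i = 0; i < count; i++) {
        if (is_live(deps[i])) {
            kept.push_back(deps[i]);
        }
    }
    return kept;
}
```

The capture call would then receive `kept.data()` and `kept.size()` in place of the raw list. If the runtime already builds such a buffer while filtering, reusing that buffer avoids applying the predicate twice and keeps capture and enforcement from diverging.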
| if not deps_path.exists(): | ||
| # Output dir exists but no deps.json — another diagnostic flag was | ||
| # on (e.g. just --enable-l2-swimlane) but not --enable-dep-gen. | ||
| return |
Fail fast when dep_gen is enabled but deps.json is missing.
Line 139 currently returns silently, which can mask the exact regression this test is supposed to catch when dep_gen is effectively enabled.
Proposed fix
deps_path = out_dir / "deps.json"
- if not deps_path.exists():
- # Output dir exists but no deps.json — another diagnostic flag was
- # on (e.g. just --enable-l2-swimlane) but not --enable-dep-gen.
- return
+ assert deps_path.exists(), (
+ f"dep_gen was enabled but {deps_path} is missing. "
+ "Likely cause: dep_gen capture/replay did not emit the artifact."
+ )
with deps_path.open() as f:
deps = json.load(f)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/st/a5/tensormap_and_ringbuffer/dfx/dep_gen/test_dep_gen.py` around
lines 139 - 142, The code currently returns silently when deps_path (the
expected deps.json) is missing, which hides failures when dep_gen is enabled;
replace the silent return with a fail-fast check (e.g. assert
deps_path.exists(), f"dep_gen enabled but deps.json missing at {deps_path}" or
pytest.fail(...) ) so the test fails with a clear message; ensure pytest is
imported if you use pytest.fail and reference the symbols deps_path, deps.json
and dep_gen in the failure message to aid debugging.