Feature: Add unified tensor and args dump support#792
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a feature to dump orchestrator-level arguments (tensors and scalars) into an args_dump.json file for the a2a3 and a5 platforms. Key changes include updating the CallConfig structure, Python bindings, and the TensorDumpCollector to handle a new ARGS_DUMP stage. Feedback identifies several critical issues: the collector currently records arena offsets instead of actual scalar values, and the JSON export is missing essential tensor metadata such as shapes and data pointers. Additionally, several leftover debug logs need to be removed, an indentation error in the a5 collector should be corrected, and the flush logic in the a2a3 executor should be updated to a loop for consistency across platforms.
| int orch_thread = thread_num_ - 1; | ||
| if (orch_thread >= sched_thread_num_ && orch_thread < thread_num_) { | ||
| dump_tensor_flush(orch_thread); | ||
| } |
There was a problem hiding this comment.
The logic here only flushes the last thread (thread_num_ - 1), but the comment above (line 757) says it should flush all threads from sched_thread_num_ to thread_num_ - 1. If there are multiple orchestrator threads, some buffers will not be flushed. This should be a loop to maintain consistency with the implementation in other runtimes like a5.
for (int t = sched_thread_num_; t < thread_num_; t++) {
dump_tensor_flush(t);
}References
- For files shared across different runtimes, maintain identical code for consistency.
4da037c to
362c571
Compare
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis PR extends the tensor dump infrastructure to capture kernel-argument metadata (tensors, scalars, context pointers) at dispatch time, enabling per-dispatch visibility into kernel call signatures. It introduces a new ChangesKernel Arguments Dump Feature
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Warning Review ran into problems🔥 ProblemsGit: Failed to clone repository. Please run the Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 10
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/dfx/tensor-dump.md`:
- Around line 38-40: The docs file contains duplicated CLI example lines (e.g.,
"python tests/st/<case>/test_<name>.py -p a5sim --dump-tensor" and "pytest
tests/st/<case> --platform a5sim --dump-tensor") in the code blocks around the
shown diffs and again at lines 47-55; remove the repeated identical lines so
each example appears only once, keeping one instance of each unique command
(e.g., keep a single "python tests/st/<case>/test_<name>.py -p a5sim
--dump-tensor" and a single "pytest tests/st/<case> --platform a5sim
--dump-tensor") and ensure the surrounding code block fences remain valid.
In `@docs/testing.md`:
- Line 113: Update the `--dump-tensor` option row in the table so its
description matches the updated behavior: mention that it dumps tensors and
per-dispatch kernel arguments during runtime execution (e.g., change the
description for `--dump-tensor` to "Dump tensors and per-dispatch kernel args
during runtime execution"). Locate the table entry referencing `--dump-tensor`
and replace the old single-word "Dump tensors" text with the expanded
description to align with nearby sections.
- Line 328: The artifact list item for outputs/<case>_<ts>/tensor_dump/ repeats
the same flag token; remove the duplicate so the parenthetical shows the flag
only once (e.g., change "(--dump-tensor, --dump-tensor)" to "(`--dump-tensor`)"
or to the correct pair if a different second flag was intended), updating the
text that contains the token `--dump-tensor` to a single, accurate mention.
In `@src/a2a3/platform/onboard/aicpu/kernel.cpp`:
- Around line 116-117: The device-side tensor-dump gate is only driven by
PROFILING_FLAG_DUMP_TENSOR via GET_PROFILING_FLAG and set_dump_tensor_enabled in
kernel.cpp, so CLI flag --dump-args (dump_args) never enables the dispatch
capture path; update the plumbing so dump_args toggles the same gate: either map
dump_args to set PROFILING_FLAG_DUMP_TENSOR at runtime or extend the gate to
check the dump_args state (e.g., read dump_args and call set_dump_tensor_enabled
or augment is_dump_tensor_enabled to OR in dump_args). Locate and modify the
call sites around set_dump_tensor_enabled, GET_PROFILING_FLAG,
is_dump_tensor_enabled, and the runtime flag handling for enable_dump_tensor_ /
dump_args to ensure --dump-args independently enables the dump gate.
In `@src/a2a3/platform/sim/host/device_runner.cpp`:
- Around line 112-117: create_temp_so_file() currently hard-fails (unlinks and
returns false) if the system call "patchelf --set-rpath
'/data/software/gcc-15/lib64' <path_buf>" fails, which makes startup
environment-dependent and prevents any subsequent dlopen attempts in
ensure_binaries_loaded(); change this to a best-effort behavior: remove the
fixed constant rpath usage (do not assume /data/software/gcc-15/lib64), attempt
patchelf only if an environment-derived runtime path is available, and if the
std::system call fails, log a non-fatal warning and continue (do not unlink
path_buf or return false) so callers like ensure_binaries_loaded() can proceed
to try dlopen; use the existing identifiers create_temp_so_file(), path_buf,
gcc_runtime and ensure_binaries_loaded() to locate and update the logic.
In `@src/a2a3/platform/src/host/tensor_dump_collector.cpp`:
- Around line 186-189: Clamp rec.ndims to PLATFORM_DUMP_MAX_DIMS before storing
and copying shapes: set entry.ndims = std::min(rec.ndims,
PLATFORM_DUMP_MAX_DIMS) (or equivalent) and then copy shapes only for d <
entry.ndims into entry.shapes; apply the same clamp-and-copy fix to the other
occurrence referenced (lines ~533-536) so serialization cannot read past
entry.shapes.
- Line 719: The vector kernel_args_entries_ is only cleared on normal export and
can leak stale entries when the finalize/reset path is taken; update the
finalize and reset code paths (e.g. the finalize() and/or reset() blocks
surrounding the existing kernel_args_entries_.clear() call) to also call
kernel_args_entries_.clear() (and any associated cleanup like freeing entries if
necessary) so that kernel_args_entries_ is emptied regardless of successful
export.
In `@src/a5/platform/src/host/tensor_dump_collector.cpp`:
- Around line 253-256: Clamp rec.ndims to PLATFORM_DUMP_MAX_DIMS before
assigning to entry.ndims and before copying into entry.shapes to prevent
writing/reading past the fixed-size buffer; specifically, replace the direct
assignment entry.ndims = rec.ndims with something like entry.ndims =
std::min(rec.ndims, PLATFORM_DUMP_MAX_DIMS) and use that clamped value in the
loop that copies rec.shapes into entry.shapes (the same change should be applied
to the other occurrence handling ndims/shapes around the rec use at the second
block mentioned).
- Line 763: In the TensorDumpCollector reset path, ensure kernel_args_entries_
is cleared alongside collected_: inside the same finalize/reset method where
collected_ is cleared (e.g., TensorDumpCollector::FinalizeReset or the
reset/Finalize method that currently calls collected_.clear()), add
kernel_args_entries_.clear(); so kernel argument entries cannot persist between
runs; keep placement and locking consistent with the existing collected_
clearing to preserve thread-safety.
In `@src/a5/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp`:
- Around line 710-714: The flush loop starts at sched_thread_num_ and so omits
scheduler threads (they use thread_idx < sched_thread_num_); change the loop in
the is_dump_tensor_enabled() block to iterate over all thread indices that may
emit dumps (e.g., start at 0 and loop to aicpu_thread_num_) so
dump_tensor_flush(t) is called for scheduler-thread indices as well; update the
loop that currently uses sched_thread_num_ as the start to cover the full range
of 0..aicpu_thread_num_-1.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 319fdfc4-6541-418c-ae9e-6c9048793732
📒 Files selected for processing (27)
docs/dfx/tensor-dump.mddocs/testing.mdsimpler_setup/tools/dump_viewer.pysrc/a2a3/platform/include/aicpu/tensor_dump_aicpu.hsrc/a2a3/platform/include/common/tensor_dump.hsrc/a2a3/platform/include/host/tensor_dump_collector.hsrc/a2a3/platform/onboard/aicpu/kernel.cppsrc/a2a3/platform/onboard/host/device_runner.cppsrc/a2a3/platform/sim/host/device_runner.cppsrc/a2a3/platform/src/aicpu/tensor_dump_aicpu.cppsrc/a2a3/platform/src/host/tensor_dump_collector.cppsrc/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cppsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cppsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_context.hsrc/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cppsrc/a5/platform/include/aicpu/tensor_dump_aicpu.hsrc/a5/platform/include/common/tensor_dump.hsrc/a5/platform/include/host/tensor_dump_collector.hsrc/a5/platform/onboard/aicpu/kernel.cppsrc/a5/platform/sim/host/device_runner.cppsrc/a5/platform/src/aicpu/tensor_dump_aicpu.cppsrc/a5/platform/src/host/tensor_dump_collector.cppsrc/a5/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cppsrc/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cppsrc/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_context.hsrc/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_dispatch.cpptests/st/a2a3/tensormap_and_ringbuffer/dfx/tensor_dump/test_tensor_dump.py
| const char *gcc_runtime = "/data/software/gcc-15/lib64"; | ||
| int rc = std::system((std::string("patchelf --set-rpath '") + gcc_runtime + "' '" + path_buf.data() + "'").c_str()); | ||
| if (rc != 0) { | ||
| unlink(path_buf.data()); | ||
| return false; | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Verify hard dependency points in source
rg -n "patchelf --set-rpath|/data/software/gcc-15/lib64" --type=cpp
# Verify host/toolchain assumptions in current environment
if command -v patchelf >/dev/null 2>&1; then
patchelf --version
else
echo "patchelf not found in PATH"
fi
if [ -d /data/software/gcc-15/lib64 ]; then
ls -ld /data/software/gcc-15/lib64
else
echo "/data/software/gcc-15/lib64 does not exist"
fiRepository: hw-native-sys/simpler
Length of output: 415
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="src/a2a3/platform/sim/host/device_runner.cpp"
# Find the function(s) that contain this snippet and where create_temp_so_file is defined/used
rg -n "create_temp_so_file" "$FILE"
rg -n "unlink\(path_buf\.data\(\)\)|patchelf --set-rpath|gcc_runtime" "$FILE"
# Show the surrounding code for the rpath block (lines ~90-140) and the full create_temp_so_file function
sed -n '90,150p' "$FILE"
# Extract create_temp_so_file body by printing from its signature until the next matching brace depth is too hard;
# instead, print a broader window around the first occurrence.
# This keeps output small while still giving context for return handling.
line=$(rg -n "create_temp_so_file" "$FILE" | head -n1 | cut -d: -f1)
start=$((line-60))
end=$((line+140))
sed -n "${start},${end}p" "$FILE" | sed -n '1,240p'Repository: hw-native-sys/simpler
Length of output: 9981
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="src/a2a3/platform/sim/host/device_runner.cpp"
rg -n "LOG_WARN|LOG_INFO|LOG_DEBUG|LOG_ERROR" "$FILE" --type=cppRepository: hw-native-sys/simpler
Length of output: 5541
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="src/a2a3/platform/sim/host/device_runner.cpp"
sed -n '190,245p' "$FILE"
sed -n '295,345p' "$FILE"
sed -n '1080,1145p' "$FILE"Repository: hw-native-sys/simpler
Length of output: 7519
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "LOG_WARN(_|\\b)|LOG_WARNING(_|\\b)|WARN\\(" -S . --glob '!**/node_modules/**' --glob '!**/dist/**' --glob '!**/build/**' --glob '!**/.git/**' --glob '!**/vendor/**'Repository: hw-native-sys/simpler
Length of output: 11638
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="src/a2a3/platform/sim/host/device_runner.cpp"
# Show the include section and any logging headers used
sed -n '1,120p' "$FILE"Repository: hw-native-sys/simpler
Length of output: 4682
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="src/a2a3/platform/sim/host/device_runner.h"
ls -l "$FILE" || true
sed -n '1,200p' "$FILE" 2>/dev/null || true
rg -n "LOG_WARN|unified_log|LOG_ERROR|LOG_DEBUG" "$FILE" --type=hRepository: hw-native-sys/simpler
Length of output: 7852
Hard-failing patchelf + fixed /data/software/gcc-15/lib64 makes sim startup environment-dependent (blocks dlopen)
create_temp_so_file() unlinks the temp .so and returns false when std::system("patchelf --set-rpath ...") fails (lines ~112-117), and all callers treat that as fatal—so ensure_binaries_loaded() aborts before any dlopen() attempt.
💡 Suggested fix (best-effort patching instead of hard fail)
- const char *gcc_runtime = "/data/software/gcc-15/lib64";
- int rc = std::system((std::string("patchelf --set-rpath '") + gcc_runtime + "' '" + path_buf.data() + "'").c_str());
- if (rc != 0) {
- unlink(path_buf.data());
- return false;
- }
+ const char *gcc_runtime = "/data/software/gcc-15/lib64";
+ int rc = std::system((std::string("patchelf --set-rpath '") + gcc_runtime + "' '" + path_buf.data() + "'").c_str());
+ if (rc != 0) {
+ LOG_WARN("patchelf --set-rpath failed (rc=%d), continuing without rpath patch", rc);
+ }📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| const char *gcc_runtime = "/data/software/gcc-15/lib64"; | |
| int rc = std::system((std::string("patchelf --set-rpath '") + gcc_runtime + "' '" + path_buf.data() + "'").c_str()); | |
| if (rc != 0) { | |
| unlink(path_buf.data()); | |
| return false; | |
| } | |
| const char *gcc_runtime = "/data/software/gcc-15/lib64"; | |
| int rc = std::system((std::string("patchelf --set-rpath '") + gcc_runtime + "' '" + path_buf.data() + "'").c_str()); | |
| if (rc != 0) { | |
| LOG_WARN("patchelf --set-rpath failed (rc=%d), continuing without rpath patch", rc); | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/a2a3/platform/sim/host/device_runner.cpp` around lines 112 - 117,
create_temp_so_file() currently hard-fails (unlinks and returns false) if the
system call "patchelf --set-rpath '/data/software/gcc-15/lib64' <path_buf>"
fails, which makes startup environment-dependent and prevents any subsequent
dlopen attempts in ensure_binaries_loaded(); change this to a best-effort
behavior: remove the fixed constant rpath usage (do not assume
/data/software/gcc-15/lib64), attempt patchelf only if an environment-derived
runtime path is available, and if the std::system call fails, log a non-fatal
warning and continue (do not unlink path_buf or return false) so callers like
ensure_binaries_loaded() can proceed to try dlopen; use the existing identifiers
create_temp_so_file(), path_buf, gcc_runtime and ensure_binaries_loaded() to
locate and update the logic.
bd7fd6d to
ec69d20
Compare
ChaoZheng109
left a comment
There was a problem hiding this comment.
逐点见行内评论。总方向:删掉独立 kernel_args_dump.json 链路,把 scalar 折叠进现有 tensor_dump.json(BEFORE_DISPATCH 不再跳过 scalar + 给每条 entry 加 kind 字段);格式保持「信息全」,由 PR #821 改 parser 来适配我们,而不是把我们的格式改复杂去迁就它。另外 patchelf / 缩进 / no-op 三处脏改动,无论方案如何都要删。
| return false; | ||
| } | ||
|
|
||
| const char *gcc_runtime = "/data/software/gcc-15/lib64"; |
There was a problem hiding this comment.
硬编码本机绝对路径 /data/software/gcc-15/lib64 + std::system(patchelf ...),换台服务器 / CI 直接失效,而且和本特性毫无关系——这是调试残留,请删除整段。顺带这处还删掉了上面 fchmod 那行有用的注释,一并恢复。
| if (aicpu_execute_func_ == nullptr || aicore_execute_func_ == nullptr || set_platform_regs_func_ == nullptr || | ||
| set_platform_dump_base_func_ == nullptr || set_dump_tensor_enabled_func_ == nullptr || | ||
| set_platform_pmu_base_func_ == nullptr || set_pmu_enabled_func_ == nullptr) { | ||
| set_platform_pmu_base_func_ == nullptr || |
There was a problem hiding this comment.
clang-format 失手:这行被顶到了第 0 列,缩进错乱。clang-format -i 修一下即可。
| set_platform_regs(k_args->regs); | ||
| set_platform_dump_base(k_args->dump_data_base); | ||
| set_dump_tensor_enabled(GET_PROFILING_FLAG(k_args->enable_profiling_flag, PROFILING_FLAG_DUMP_TENSOR)); | ||
| bool dump_enabled = GET_PROFILING_FLAG(k_args->enable_profiling_flag, PROFILING_FLAG_DUMP_TENSOR); |
There was a problem hiding this comment.
no-op 改动:只是把原来一行 set_dump_tensor_enabled(GET_PROFILING_FLAG(...)) 拆成先存 bool dump_enabled 再传入,行为完全等价。请还原(a2a3 同处一致)。
| for (int32_t sig_idx = 0; sig_idx < callable.sig_count(); sig_idx++) { | ||
| ArgDirection dir = callable.sig(sig_idx); | ||
| if (dir == ArgDirection::SCALAR) { | ||
| if (stage == TensorDumpStage::KERNEL_ARGS_DUMP) { |
There was a problem hiding this comment.
关键点:现在是新开 KERNEL_ARGS_DUMP 分支来采 scalar。建议改为——在现有 BEFORE_DISPATCH 流程里别再 continue 跳过 SCALAR 槽,直接产一条 record(kind=scalar、value=pl.scalars[scalar_index]、不进 .bin)。这样无需新 stage / 新调用点 / dispatch_id,产物统一进 tensor_dump.json。scalar 本来也有 before/after,它只是 dump 内容的扩展,不是新的产物。
a05805d to
80e5935
Compare
2fa9b77 to
6ee29a2
Compare
9a223cd to
5d0d3a9
Compare
…func_id/subtask_id Scalar dump now reads directly from payload scalar_count/scalars[] rather than iterating signature for D.SCALAR declarations. This makes scalar capture work for any kernel without requiring D.SCALAR in the callable signature. TensorDumpRecord/TensorDumpInfo/DumpedTensor drop func_id and subtask_id fields (unreliable in mix tasks). static_assert updated to 192 bytes. dump_viewer and docs updated to match the reduced schema.
Summary
This PR adds scalar value dump support to the unified tensor dump pipeline.
--dump-tensornow captures scalar arguments at theBEFORE_DISPATCHstage alongside tensor metadata, using a newkindfield to distinguishtensorfromscalarrecords. Thekernel_args_dump.jsonpath is removed; all dump output goes totensor_dump.json+tensor_dump.bin. Supports A2A3 and A5 sim paths.Design
dump_tensors_for_tasktraversal, no new stage needed.kindfield —TensorDumpRecordand JSON manifest usekind: "tensor" | "scalar"to distinguish record types.scalar_valuefield stores uint64 raw value;bin_size = 0indicates no binary payload.tensor_dump.json+tensor_dump.bin, no separate kernel_args_dump.json.Scope
~14 files across A2A3 and A5 device/runtime/platform layers.
Verification
PagedAttention SmallCase validation: