Triangular Inverse Kernel (continuation of Zouzias' impl)#830
Triangular Inverse Kernel (continuation of Zouzias' impl)#830MirkoDeVita98 wants to merge 3 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a new triangular matrix inverse example, including a recursive unrolled AICore kernel, orchestration logic, and a Python test suite. It also updates the benchmark_bgemm test configuration and adds debug logging to the scene test framework. Review feedback suggests removing leftover debug print statements, correcting documentation comments about the configuration tensor layout, and simplifying the kernel dispatch function by removing redundant template parameters.
| print("actual: ", actual[:10]) | ||
| print("diff: ", (actual - expected).abs().mean().item()) |
| num_matrices = params["num_matrices"] | ||
| M = args.M.reshape(num_matrices, 1, n, n).to(torch.float16) | ||
| M_inv = linalg_inv(M) | ||
| print("M_inv (golden):", M_inv.flatten()[:10]) |
There was a problem hiding this comment.
Let drop the print in L123.
| * args[0] = M (INPUT) fp16 triangular matrices [num_matrices, N, N] | ||
| * args[1] = I_neg (INPUT) fp16 negative identity [N, N] | ||
| * args[2] = M_inv (OUTPUT) fp16 result [num_matrices, N, N] | ||
| * args[3] = config (INPUT) int64[3]: [matrix_size, num_matrices, is_lower] |
There was a problem hiding this comment.
The comment incorrectly states that the config tensor has 3 elements. The implementation (lines 832-835) and the test case (line 109) use 4 elements: [matrix_size, num_matrices, is_lower, block_dim]. Please update the comment to reflect the actual layout.
* args[3] = config (INPUT) int64[4]: [matrix_size, num_matrices, is_lower, block_dim]References
- Ensure that comments accurately reflect the code's behavior and design.
| * tensor(0) = M (INPUT) fp16 triangular matrices [num_matrices * N * N] | ||
| * tensor(1) = I_neg (INPUT) fp16 negative identity [N * N] | ||
| * tensor(2) = M_inv (OUTPUT) fp16 result [num_matrices * N * N] | ||
| * tensor(3) = config (INPUT) int64[3]: [matrix_size, num_matrices, is_lower, block_dim] |
There was a problem hiding this comment.
The comment describes the config tensor as int64[3], but then lists 4 parameters. It should be updated to int64[4] to match the implementation and the Python test.
| * tensor(3) = config (INPUT) int64[3]: [matrix_size, num_matrices, is_lower, block_dim] | |
| * tensor(3) = config (INPUT) int64[4]: [matrix_size, num_matrices, is_lower, block_dim] |
References
- Ensure that comments accurately reflect the code's behavior and design.
| template <typename InputT, typename OutputT, uint32_t NumTilesPerCubeIter, bool IsBSND> | ||
| AICORE void run_tri_inv_rec_unroll_per_num_matrices( |
There was a problem hiding this comment.
The template parameters NumTilesPerCubeIter and IsBSND in run_tri_inv_rec_unroll_per_num_matrices are redundant because the function body ignores them and performs its own dispatch using literal values (1, 2, 4 for NumTilesPerCubeIter and true/false for IsBSND). These parameters should be removed from the template signature to avoid confusion and improve maintainability.
References
- Remove unused parameters from functions or APIs to improve code clarity and avoid misleading users.
| @@ -0,0 +1,8 @@ | |||
|
|
|||
There was a problem hiding this comment.
Let's remove this Makefile. I used it for testing purposes (avoids excessive typing :-) )
| for name in output_names: | ||
| actual = getattr(test_args, name) | ||
| expected = getattr(golden_args, name) | ||
| print("actual: ", actual[:10]) |
There was a problem hiding this comment.
Let's remove this diff (it was added for debugging)
| num_matrices = params["num_matrices"] | ||
| M = args.M.reshape(num_matrices, 1, n, n).to(torch.float16) | ||
| M_inv = linalg_inv(M) | ||
| print("M_inv (golden):", M_inv.flatten()[:10]) |
There was a problem hiding this comment.
Let drop the print in L123.
| class TestBenchmarkBgemm(SceneTestCase): | ||
| RTOL = 1e-3 | ||
| ATOL = 1e-3 | ||
| RTOL = 1e-5 |
There was a problem hiding this comment.
Let revert the changes in this file to keep the MR only about the triangular inverse.
fb54569 to
9bbf06e
Compare
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis PR introduces a complete triangular matrix inversion feature: an AICORE kernel implementing a recursive unrolled algorithm for inverting upper/lower-triangular fp16/bf16 matrices, paired with orchestration dispatch and a test suite for validation. ChangesTriangular Inverse Feature
Estimated Code Review Effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
a108100 to
d9154fd
Compare
d9154fd to
244882d
Compare
There was a problem hiding this comment.
Actionable comments posted: 5
🧹 Nitpick comments (1)
examples/a2a3/tensormap_and_ringbuffer/triangular_inverse_example/test_triangular_inverse.py (1)
91-93: ⚡ Quick winFix the stale matrix-construction comment.
Line 91–Line 93 say the diagonal is set to
[0.5, 1.5], but this code path does not do that. Please update the comment to match the actual behavior (strictly triangular input, unit diagonal handled via+ Iin reference inversion).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/a2a3/tensormap_and_ringbuffer/triangular_inverse_example/test_triangular_inverse.py` around lines 91 - 93, Update the stale comment describing matrix construction to reflect the actual behavior: note that the code builds strictly triangular fp16 matrices (zeros out the off-triangle and does not set diagonal values to [0.5,1.5]) and that the unit diagonal is introduced only when computing the reference inverse via "+ I" (i.e., the test uses strictly triangular input and adds the identity in the reference inversion). Mention the fp16/strictly-triangular inputs and the "+ I" reference inversion so future readers understand how invertibility is handled.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@examples/a2a3/tensormap_and_ringbuffer/triangular_inverse_example/kernels/aic/kernel_tri_inv_rec_unroll.cpp`:
- Around line 765-786: The switch on matrix_size in the block that calls
runKernelTriInvRecUnroll (cases 16/32/64/128) lacks a default and thus silently
does nothing for unsupported sizes; either add a guarded default branch that
logs an error/throws/returns a failure status (ensuring M_inv is not left stale)
or validate matrix_size earlier in aicpu_orchestration_entry and fail fast with
a clear error; locate the switch using the symbol matrix_size and the
runKernelTriInvRecUnroll instantiations and implement one of these checks so
callers receive a loud, deterministic error for unsupported sizes.
- Line 23: Update the header/comment for args[3] to match the actual config
layout read by the kernel (where config values are extracted into matrix_size,
num_matrices, is_lower, block_dim); change the doc from "int64[3]: [matrix_size,
num_matrices, is_lower]" to "int64[4]: [matrix_size, num_matrices, is_lower,
block_dim]" so it accurately documents the values consumed by the code that
reads config into those four int64 variables.
- Around line 789-830: The wrapper run_tri_inv_rec_unroll_per_num_matrices
currently has unused template parameters NumTilesPerCubeIter and IsBSND; remove
them from its template parameter list so it becomes template<typename InputT,
typename OutputT> AICORE void run_tri_inv_rec_unroll_per_num_matrices(...),
leaving the internal forwards to run_tri_inv_rec_unroll(...) unchanged (those
calls still specify NumTilesPerCubeIter and IsBSND as before). Update any call
sites that instantiate run_tri_inv_rec_unroll_per_num_matrices (e.g., places
that passed <half, half, 1, false>) to drop the unused template arguments so the
wrapper is instantiated only with InputT and OutputT. Ensure symbols to edit:
run_tri_inv_rec_unroll_per_num_matrices and its callers; do not change
run_tri_inv_rec_unroll signature.
- Around line 77-89: The loop over seq_idx is unbounded and can read past
cu_seqlens via cu_seqlens[seq_idx + 1]; change the loop to be bounded by the
number of sequences (or cu_seqlens length) and handle the “not found” case:
accept an additional parameter like seq_count (or cu_seqlens_len) and iterate
using for (uint32_t seq_idx = 0; seq_idx + 1 < seq_count; ++seq_idx) (or check
seq_idx + 1 < cu_seqlens_len before indexing), keep the existing logic that
computes seq_end/seq_len/seq_num_chunks and returns when chunk_idx falls into
the range, and if the loop finishes without finding the chunk return a safe
default or error (e.g., an invalid tile info or throw/assert) to avoid
out-of-bounds reads; update all call sites of GetBSNDVarlenTileInfoFromCuSeqlens
accordingly.
In
`@examples/a2a3/tensormap_and_ringbuffer/triangular_inverse_example/kernels/orchestration/triangular_inverse_orch.cpp`:
- Line 20: The argument-layout comment is using the wrong tensor index: the
config is read via orch_args.tensor(3) in this file (see uses of
orch_args.tensor(3) around the config handling), but the doc comment currently
labels it as tensor(4) and omits tensor(3); update the comment so the config
line reads "tensor(3) = config (INPUT) int64[4]: [matrix_size, num_matrices,
is_lower, block_dim]" (and ensure other tensor indices in that comment are
sequential and accurate to the orch_args.tensor(N) usages).
---
Nitpick comments:
In
`@examples/a2a3/tensormap_and_ringbuffer/triangular_inverse_example/test_triangular_inverse.py`:
- Around line 91-93: Update the stale comment describing matrix construction to
reflect the actual behavior: note that the code builds strictly triangular fp16
matrices (zeros out the off-triangle and does not set diagonal values to
[0.5,1.5]) and that the unit diagonal is introduced only when computing the
reference inverse via "+ I" (i.e., the test uses strictly triangular input and
adds the identity in the reference inversion). Mention the
fp16/strictly-triangular inputs and the "+ I" reference inversion so future
readers understand how invertibility is handled.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 42bed645-9f72-456c-a3a0-2d1c29509b75
📒 Files selected for processing (3)
examples/a2a3/tensormap_and_ringbuffer/triangular_inverse_example/kernels/aic/kernel_tri_inv_rec_unroll.cppexamples/a2a3/tensormap_and_ringbuffer/triangular_inverse_example/kernels/orchestration/triangular_inverse_orch.cppexamples/a2a3/tensormap_and_ringbuffer/triangular_inverse_example/test_triangular_inverse.py
| * args[0] = M (INPUT) fp16 triangular matrices [num_matrices, N, N] | ||
| * args[1] = I_neg (INPUT) fp16 negative identity [N, N] | ||
| * args[2] = M_inv (OUTPUT) fp16 result [num_matrices, N, N] | ||
| * args[3] = config (INPUT) int64[3]: [matrix_size, num_matrices, is_lower] |
There was a problem hiding this comment.
Stale config layout in header doc. The config is read as four int64 values (matrix_size, num_matrices, is_lower, block_dim at Lines 843-846), but this comment still describes int64[3] and omits block_dim.
📝 Proposed doc fix
- * args[3] = config (INPUT) int64[3]: [matrix_size, num_matrices, is_lower]
+ * args[3] = config (INPUT) int64[4]: [matrix_size, num_matrices, is_lower, block_dim]📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| * args[3] = config (INPUT) int64[3]: [matrix_size, num_matrices, is_lower] | |
| * args[3] = config (INPUT) int64[4]: [matrix_size, num_matrices, is_lower, block_dim] |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@examples/a2a3/tensormap_and_ringbuffer/triangular_inverse_example/kernels/aic/kernel_tri_inv_rec_unroll.cpp`
at line 23, Update the header/comment for args[3] to match the actual config
layout read by the kernel (where config values are extracted into matrix_size,
num_matrices, is_lower, block_dim); change the doc from "int64[3]: [matrix_size,
num_matrices, is_lower]" to "int64[4]: [matrix_size, num_matrices, is_lower,
block_dim]" so it accurately documents the values consumed by the code that
reads config into those four int64 variables.
| for (uint32_t seq_idx = 0;; ++seq_idx) { | ||
| const uint32_t seq_end = static_cast<uint32_t>(cu_seqlens[seq_idx + 1]); | ||
| const uint32_t seq_len = seq_end - seq_start; | ||
| const uint32_t seq_num_chunks = CeilDiv(seq_len, matrix_size); | ||
| if (chunk_idx < accumulated_chunks + seq_num_chunks) { | ||
| const uint32_t local_chunk_idx = chunk_idx - accumulated_chunks; | ||
| const uint32_t row_start = seq_start + local_chunk_idx * matrix_size; | ||
| const uint32_t valid_size = min(static_cast<uint32_t>(seq_end - row_start), matrix_size); | ||
| return {row_start * num_bsnd_heads * matrix_size + head_idx * matrix_size, valid_size}; | ||
| } | ||
| accumulated_chunks += seq_num_chunks; | ||
| seq_start = seq_end; | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Find callers of the varlen helper and any non-null cu_seqlens passed in.
rg -nP --type=cpp -C3 '\bGetBSNDVarlenTileInfoFromCuSeqlens\s*\('
rg -nP --type=cpp -C2 '\bcu_seqlens\b'Repository: hw-native-sys/simpler
Length of output: 17997
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n --hidden --no-ignore-vcs '\bcu_seqlens\b' examples/a2a3/tensormap_and_ringbuffer/triangular_inverse_example/kernels/aic/kernel_tri_inv_rec_unroll.cpp
rg -n --hidden --no-ignore-vcs '\brunKernelTriInvRecUnroll\s*\(' examples/a2a3/tensormap_and_ringbuffer/triangular_inverse_example -S
rg -n --hidden --no-ignore-vcs '\brun_tri_inv_rec_unroll\s*\(' examples/a2a3/tensormap_and_ringbuffer/triangular_inverse_example -S
rg -n --hidden --no-ignore-vcs '\btriangular_inverse_orch\b' examples/a2a3/tensormap_and_ringbuffer/triangular_inverse_example -S
rg -n --hidden --no-ignore-vcs '\bcu_seqlens\s*=' examples/a2a3/tensormap_and_ringbuffer/triangular_inverse_example -S
rg -n --hidden --no-ignore-vcs '\bseqlens\b' examples/a2a3/tensormap_and_ringbuffer/triangular_inverse_example -SRepository: hw-native-sys/simpler
Length of output: 2652
Unbounded seq_idx loop can read past cu_seqlens
GetBSNDVarlenTileInfoFromCuSeqlensuses a sentinel-lessfor (uint32_t seq_idx = 0;; ++seq_idx)while indexingcu_seqlens[seq_idx + 1]; ifchunk_idxnever falls into any sequence’s chunk range, the loop will keep walking past the end ofcu_seqlens.- In this example, the varlen helper call is guarded by
if (cu_seqlens != nullptr), and the dispatch APIs defaultcu_seqlenstonullptr, so this path is effectively dormant here; still, the helper is unsafe when reused with malformed/truncated inputs—bound the loop by passing the number of sequences /cu_seqlenslength (or otherwise provide a termination condition).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@examples/a2a3/tensormap_and_ringbuffer/triangular_inverse_example/kernels/aic/kernel_tri_inv_rec_unroll.cpp`
around lines 77 - 89, The loop over seq_idx is unbounded and can read past
cu_seqlens via cu_seqlens[seq_idx + 1]; change the loop to be bounded by the
number of sequences (or cu_seqlens length) and handle the “not found” case:
accept an additional parameter like seq_count (or cu_seqlens_len) and iterate
using for (uint32_t seq_idx = 0; seq_idx + 1 < seq_count; ++seq_idx) (or check
seq_idx + 1 < cu_seqlens_len before indexing), keep the existing logic that
computes seq_end/seq_len/seq_num_chunks and returns when chunk_idx falls into
the range, and if the loop finishes without finding the chunk return a safe
default or error (e.g., an invalid tile info or throw/assert) to avoid
out-of-bounds reads; update all call sites of GetBSNDVarlenTileInfoFromCuSeqlens
accordingly.
| switch (matrix_size) { | ||
| case 16: | ||
| runKernelTriInvRecUnroll<InputT, OutputT, 16, NumTilesPerCubeIter, IsBSND>( | ||
| tensor_out, tensor_in, minus_eye_in, block_dim, num_matrices, num_bsnd_heads, is_lower, cu_seqlens | ||
| ); | ||
| break; | ||
| case 32: | ||
| runKernelTriInvRecUnroll<InputT, OutputT, 32, NumTilesPerCubeIter, IsBSND>( | ||
| tensor_out, tensor_in, minus_eye_in, block_dim, num_matrices, num_bsnd_heads, is_lower, cu_seqlens | ||
| ); | ||
| break; | ||
| case 64: | ||
| runKernelTriInvRecUnroll<InputT, OutputT, 64, NumTilesPerCubeIter, IsBSND>( | ||
| tensor_out, tensor_in, minus_eye_in, block_dim, num_matrices, num_bsnd_heads, is_lower, cu_seqlens | ||
| ); | ||
| break; | ||
| case 128: | ||
| runKernelTriInvRecUnroll<InputT, OutputT, 128, NumTilesPerCubeIter, IsBSND>( | ||
| tensor_out, tensor_in, minus_eye_in, block_dim, num_matrices, num_bsnd_heads, is_lower, cu_seqlens | ||
| ); | ||
| break; | ||
| } |
There was a problem hiding this comment.
Unsupported matrix_size silently produces no output. The switch has no default, so any matrix_size outside {16, 32, 64, 128} returns without writing M_inv, leaving stale/garbage memory with no signal. Consider validating matrix_size in aicpu_orchestration_entry (or adding a guarded default) so callers fail loudly instead of getting silently incorrect results.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@examples/a2a3/tensormap_and_ringbuffer/triangular_inverse_example/kernels/aic/kernel_tri_inv_rec_unroll.cpp`
around lines 765 - 786, The switch on matrix_size in the block that calls
runKernelTriInvRecUnroll (cases 16/32/64/128) lacks a default and thus silently
does nothing for unsupported sizes; either add a guarded default branch that
logs an error/throws/returns a failure status (ensuring M_inv is not left stale)
or validate matrix_size earlier in aicpu_orchestration_entry and fail fast with
a clear error; locate the switch using the symbol matrix_size and the
runKernelTriInvRecUnroll instantiations and implement one of these checks so
callers receive a loud, deterministic error for unsupported sizes.
| template <typename InputT, typename OutputT, uint32_t NumTilesPerCubeIter, bool IsBSND> | ||
| AICORE void run_tri_inv_rec_unroll_per_num_matrices( | ||
| __gm__ OutputT *tensor_out, __gm__ InputT *tensor_in, __gm__ InputT *minus_eye_in, uint32_t block_dim, | ||
| uint32_t matrix_size, uint32_t num_matrices, uint32_t num_bsnd_heads, uint32_t is_lower = 0, | ||
| __gm__ int32_t *cu_seqlens = nullptr | ||
| ) { | ||
| if (num_bsnd_heads == 0) { | ||
| if (num_matrices <= block_dim) { | ||
| run_tri_inv_rec_unroll<InputT, OutputT, 1 /* NumTilesPerCubeIter */, false /* IsBSND */>( | ||
| tensor_out, tensor_in, minus_eye_in, block_dim, matrix_size, num_matrices, num_bsnd_heads, is_lower, | ||
| cu_seqlens | ||
| ); | ||
| } else if (num_matrices <= 2 * block_dim) { | ||
| run_tri_inv_rec_unroll<InputT, OutputT, 2 /* NumTilesPerCubeIter */, false /* IsBSND */>( | ||
| tensor_out, tensor_in, minus_eye_in, block_dim, matrix_size, num_matrices, num_bsnd_heads, is_lower, | ||
| cu_seqlens | ||
| ); | ||
| } else { | ||
| run_tri_inv_rec_unroll<InputT, OutputT, 4 /* NumTilesPerCubeIter */, false /* IsBSND */>( | ||
| tensor_out, tensor_in, minus_eye_in, block_dim, matrix_size, num_matrices, num_bsnd_heads, is_lower, | ||
| cu_seqlens | ||
| ); | ||
| } | ||
| } else { | ||
| if (num_matrices <= block_dim) { | ||
| run_tri_inv_rec_unroll<InputT, OutputT, 1 /* NumTilesPerCubeIter */, true /* IsBSND */>( | ||
| tensor_out, tensor_in, minus_eye_in, block_dim, matrix_size, num_matrices, num_bsnd_heads, is_lower, | ||
| cu_seqlens | ||
| ); | ||
| } else if (num_matrices <= 2 * block_dim) { | ||
| run_tri_inv_rec_unroll<InputT, OutputT, 2 /* NumTilesPerCubeIter */, true /* IsBSND */>( | ||
| tensor_out, tensor_in, minus_eye_in, block_dim, matrix_size, num_matrices, num_bsnd_heads, is_lower, | ||
| cu_seqlens | ||
| ); | ||
| } else { | ||
| run_tri_inv_rec_unroll<InputT, OutputT, 4 /* NumTilesPerCubeIter */, true /* IsBSND */>( | ||
| tensor_out, tensor_in, minus_eye_in, block_dim, matrix_size, num_matrices, num_bsnd_heads, is_lower, | ||
| cu_seqlens | ||
| ); | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
rg -nP --type=cpp -C2 '\brun_tri_inv_rec_unroll_per_num_matrices\s*<'Repository: hw-native-sys/simpler
Length of output: 898
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE="examples/a2a3/tensormap_and_ringbuffer/triangular_inverse_example/kernels/aic/kernel_tri_inv_rec_unroll.cpp"
echo "== References to template params in this file =="
rg -n --fixed-strings "NumTilesPerCubeIter" "$FILE" || true
rg -n --fixed-strings "IsBSND" "$FILE" || true
echo "== All call sites using explicit template arguments =="
rg -nP --type=cpp '\brun_tri_inv_rec_unroll_per_num_matrices\s*<' "$FILE" || trueRepository: hw-native-sys/simpler
Length of output: 5231
Remove unused template parameters from run_tri_inv_rec_unroll_per_num_matrices
NumTilesPerCubeIter and IsBSND are not used for any compile-time branching inside run_tri_inv_rec_unroll_per_num_matrices—the body always forwards hardcoded 1/2/4 and false/true selected by runtime num_matrices / num_bsnd_heads, so the <half, half, 1, false> template arguments at the only call site are ignored and misleading.
♻️ Proposed signature + call-site cleanup
-template <typename InputT, typename OutputT, uint32_t NumTilesPerCubeIter, bool IsBSND>
+template <typename InputT, typename OutputT>
AICORE void run_tri_inv_rec_unroll_per_num_matrices(- run_tri_inv_rec_unroll_per_num_matrices<half, half, 1, false>(
+ run_tri_inv_rec_unroll_per_num_matrices<half, half>(
M_inv, M, I_neg, block_dim, matrix_size, num_matrices, 0, is_lower, nullptr
);🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@examples/a2a3/tensormap_and_ringbuffer/triangular_inverse_example/kernels/aic/kernel_tri_inv_rec_unroll.cpp`
around lines 789 - 830, The wrapper run_tri_inv_rec_unroll_per_num_matrices
currently has unused template parameters NumTilesPerCubeIter and IsBSND; remove
them from its template parameter list so it becomes template<typename InputT,
typename OutputT> AICORE void run_tri_inv_rec_unroll_per_num_matrices(...),
leaving the internal forwards to run_tri_inv_rec_unroll(...) unchanged (those
calls still specify NumTilesPerCubeIter and IsBSND as before). Update any call
sites that instantiate run_tri_inv_rec_unroll_per_num_matrices (e.g., places
that passed <half, half, 1, false>) to drop the unused template arguments so the
wrapper is instantiated only with InputT and OutputT. Ensure symbols to edit:
run_tri_inv_rec_unroll_per_num_matrices and its callers; do not change
run_tri_inv_rec_unroll signature.
| * tensor(0) = M (INPUT) fp16 triangular matrices [num_matrices * N * N] | ||
| * tensor(1) = I_neg (INPUT) fp16 negative identity [N * N] | ||
| * tensor(2) = M_inv (OUTPUT) fp16 result [num_matrices * N * N] | ||
| * tensor(4) = config (INPUT) int64[4]: [matrix_size, num_matrices, is_lower, block_dim] |
There was a problem hiding this comment.
Wrong tensor index in arg-layout doc. The config is the 4th argument, read via orch_args.tensor(3) (Lines 47, 49), but the comment labels it tensor(4) and skips tensor(3).
📝 Proposed doc fix
- * tensor(4) = config (INPUT) int64[4]: [matrix_size, num_matrices, is_lower, block_dim]
+ * tensor(3) = config (INPUT) int64[4]: [matrix_size, num_matrices, is_lower, block_dim]📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| * tensor(4) = config (INPUT) int64[4]: [matrix_size, num_matrices, is_lower, block_dim] | |
| * tensor(3) = config (INPUT) int64[4]: [matrix_size, num_matrices, is_lower, block_dim] |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@examples/a2a3/tensormap_and_ringbuffer/triangular_inverse_example/kernels/orchestration/triangular_inverse_orch.cpp`
at line 20, The argument-layout comment is using the wrong tensor index: the
config is read via orch_args.tensor(3) in this file (see uses of
orch_args.tensor(3) around the config handling), but the doc comment currently
labels it as tensor(4) and omits tensor(3); update the comment so the config
line reads "tensor(3) = config (INPUT) int64[4]: [matrix_size, num_matrices,
is_lower, block_dim]" (and ensure other tensor indices in that comment are
sequential and accurate to the orch_args.tensor(N) usages).
To reproduce:
pip install --no-build-isolation -e '.[test]' python examples/a2a3/tensormap_and_ringbuffer/triangular_inverse_example/test_triangular_inverse.py -p a2a3sim