Add manual-scope dependency mode to tensormap runtime #482
uv-xiao wants to merge 35 commits into hw-native-sys:main
Code Review
This pull request introduces manual dependency tracking to the tensormap_and_ringbuffer runtime, allowing for a hybrid model where same-scope dependencies are explicitly defined while cross-scope relations continue to use TensorMap discovery. The implementation includes significant updates to the orchestrator and scheduler to handle deferred task publication and manual edge recording, as well as a repository-wide restructuring of examples and tests to follow a new {arch}/{runtime} directory convention. Review feedback correctly identified critical issues with bitmask widths and bitwise shifts that could lead to overflows when tracking more than 16 tensor arguments. Suggestions were also made to improve memory safety during buffer reallocation and to simplify logic using standard library functions.
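The mask-width issue flagged above can be illustrated with a minimal sketch. It is not the PR's code: `kMaxTensorArgs`, `arg_bit`, `mark_local`, `is_local`, and `truncate_to_16` are hypothetical names, and the limit of 64 is an assumption based on the later commit note about widening manual-local masks to 64 bits.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical limit for illustration; the real MAX_TENSOR_ARGS is defined by the runtime.
constexpr int kMaxTensorArgs = 64;

// Returns a mask with bit `index` set. The shift operand is 64 bits wide,
// so every index in [0, 63] is well defined.
constexpr std::uint64_t arg_bit(int index) {
    return std::uint64_t{1} << index;
}

// Marks argument `index` as manual-local in `mask`.
inline std::uint64_t mark_local(std::uint64_t mask, int index) {
    assert(index >= 0 && index < kMaxTensorArgs);
    return mask | arg_bit(index);
}

// Tests whether argument `index` is marked in `mask`.
inline bool is_local(std::uint64_t mask, int index) {
    assert(index >= 0 && index < kMaxTensorArgs);
    return (mask & arg_bit(index)) != 0;
}

// Shows what a 16-bit mask would keep: bits 16 and above are silently dropped.
inline std::uint16_t truncate_to_16(std::uint64_t mask) {
    return static_cast<std::uint16_t>(mask);
}
```

With a 16-bit mask, marking argument 40 would be lost entirely, which is the truncation the review describes. Writing the shift as `1 << index` with a plain `int` operand would additionally be undefined for indices of 31 or more on typical platforms.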
Review threads (all outdated and resolved):
- src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h
- src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp (three threads)
- ...and_ringbuffer/paged_attention_partial_manual/kernels/orchestration/paged_attention_orch.cpp
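The reallocation concern raised in the review can be sketched as follows. This is an illustrative pattern only: `MetaBuffer`, `ensure_capacity`, and the `fatal_oom` flag are hypothetical stand-ins for the runtime's metadata buffers and its fatal error reporting.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical growable metadata buffer; names are illustrative only.
struct MetaBuffer {
    int* data = nullptr;
    std::size_t capacity = 0;
    bool fatal_oom = false;  // stands in for a runtime-level fatal error flag
};

// Grows `buf` so it can hold at least `needed` ints.
// The result of realloc goes into a temporary first: on failure realloc
// leaves the old block valid, so `buf.data` must not be overwritten.
inline bool ensure_capacity(MetaBuffer& buf, std::size_t needed) {
    if (needed <= buf.capacity) return true;
    std::size_t new_cap = buf.capacity ? buf.capacity : 8;
    while (new_cap < needed) new_cap *= 2;
    void* grown = std::realloc(buf.data, new_cap * sizeof(int));
    if (grown == nullptr) {
        buf.fatal_oom = true;  // live metadata is kept; the caller stops cleanly
        return false;
    }
    buf.data = static_cast<int*>(grown);
    buf.capacity = new_cap;
    return true;
}
```

The point of the temporary is that assigning the `realloc` result directly to `buf.data` would leak and lose the live buffer whenever the allocation fails.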
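Reviewer Guide: what this PR adds, how it works, and what still needs work

This draft adds a manual dependency mode to the tensormap_and_ringbuffer runtime. The design goal is a hybrid model:
- same-scope dependencies inside a manual scope are declared explicitly
- cross-scope dependencies continue to use TensorMap discovery

So the mental model is: manual mode replaces only same-scope auto-derivation; it does not remove outer-boundary correctness.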
1. User-facing feature: explicit manual scope mode
This PR adds an enum-based manual mode:

    PTO2_SCOPE(PTO2ScopeMode::MANUAL) {
        ...
    }

and separate manual submit APIs that return task ids:

    PTO2ManualSubmitResult qk = pto2_rt_submit_aic_task_manual(FUNC_QK_MATMUL, args_qk);
    PTO2ManualSubmitResult pv = pto2_rt_submit_aic_task_manual(FUNC_PV_MATMUL, args_pv);
    pto2_rt_add_dependency(qk.task_id, pv.task_id);

Why separate APIs instead of changing
2. What changed semantically

The important semantic split is:
- same-scope dependencies are wired explicitly
- cross-scope dependencies still go through TensorMap

That means this is not “disable TensorMap in manual scope”. That would be wrong for cross-scope correctness.

3. Visual flow

[Diagram not captured]

4. Minimal example

A small example of the intended usage pattern:

    PTO2_SCOPE() {
        Tensor out = make_tensor_external(out_ptr, out_shape, 2, DataType::FLOAT16);
        PTO2_SCOPE(PTO2ScopeMode::MANUAL) {
            Tensor tmp = create_tensor(tmp_shape, 2, DataType::FLOAT16);
            Arg qk_args;
            qk_args.add_input(q);
            qk_args.add_input(k);
            qk_args.add_output(tmp);
            auto qk = pto2_rt_submit_aic_task_manual(FUNC_QK, qk_args);
            Arg up_args;
            up_args.add_input(tmp);
            up_args.add_inout(out);
            auto up = pto2_rt_submit_aiv_task_manual(FUNC_UPDATE, up_args);
            pto2_rt_add_dependency(qk.task_id, up.task_id);
        }
    }

Interpretation:
- tmp is created inside the manual scope, so the qk -> up ordering on it has to be wired explicitly with pto2_rt_add_dependency
- out is an outer tensor passed as INOUT, so it still gets incoming boundary dependencies and still updates the writer frontier
5.
| Arg kind | Meaning | Manual-scope behavior |
|---|---|---|
| `INPUT` | read existing tensor | still seeds dependency from TensorMap / owner if outer |
| `INOUT` | read-modify-write existing tensor | still gets incoming boundary deps if outer, and still updates writer frontier |
| `OUTPUT_EXISTING` | overwrite existing outer tensor | no incoming overlap lookup, but still updates outgoing writer frontier |
| `OUTPUT` | fresh runtime-created tensor | no incoming dependency; later same-scope users must be wired explicitly |
Short version:
- manual mode only replaces same-scope auto-derivation
- manual mode does not remove outer-boundary correctness
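The table above can be read as a small decision function. The sketch below is a paraphrase, not the runtime's code: `ArgKind`, `BoundaryBehavior`, and `manual_scope_behavior` are hypothetical names, and treating a fresh `OUTPUT` as not touching the writer frontier is an assumption, since the table only says it has no incoming dependency.

```cpp
// Hypothetical mirror of the four argument kinds listed in the table.
enum class ArgKind { Input, Inout, OutputExisting, Output };

struct BoundaryBehavior {
    bool seeds_incoming_deps;      // looks up producers via TensorMap / owner
    bool updates_writer_frontier;  // becomes visible as a writer to later tasks
};

// `is_outer` is true when the tensor was created outside the current manual scope.
constexpr BoundaryBehavior manual_scope_behavior(ArgKind kind, bool is_outer) {
    switch (kind) {
        case ArgKind::Input:          return {is_outer, false};
        case ArgKind::Inout:          return {is_outer, true};
        case ArgKind::OutputExisting: return {false, true};
        case ArgKind::Output:         return {false, false};  // assumption, see lead-in
    }
    return {false, false};
}
```

For example, `manual_scope_behavior(ArgKind::Inout, true)` reports both an incoming boundary lookup and a frontier update, which is the "outer-boundary correctness" that manual mode keeps.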
6. What state the runtime now maintains
The runtime keeps manual-scope state in a narrow form:
- scope_tasks[]: tasks owned by the current scope
- manual_edges[]: explicit same-scope producer -> consumer edges
- manual_task_meta[]: compact per-task finalize metadata
At manual scope_end, it iterates tasks in submit order and:
- merges cached external producers with explicit same-scope edges
- dedups them
- realizes fanin/fanout into the scheduler
- batch-publishes the tasks
This is why the current bottleneck is still concentrated at manual scope_end.
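The scope_end steps described at this point in the thread can be sketched roughly as below. This is a simplified model under stated assumptions: `ManualTask`, `ManualEdge`, and `finalize_manual_scope` are hypothetical, the scheduler is reduced to a returned list of fan-in sets, and the linear scan over edges is chosen for clarity rather than speed.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

using TaskId = std::uint32_t;

struct ManualEdge {
    TaskId producer;
    TaskId consumer;
};

struct ManualTask {
    TaskId id;
    std::vector<TaskId> external_producers;  // cached boundary producers
};

// Walks tasks in submit order, merges cached external producers with the
// explicit same-scope edges, deduplicates, and returns the per-task fan-in
// in the order the tasks would be published.
inline std::vector<std::pair<TaskId, std::vector<TaskId>>>
finalize_manual_scope(const std::vector<ManualTask>& scope_tasks,
                      const std::vector<ManualEdge>& manual_edges) {
    std::vector<std::pair<TaskId, std::vector<TaskId>>> published;
    published.reserve(scope_tasks.size());
    for (const ManualTask& task : scope_tasks) {
        std::vector<TaskId> fanin = task.external_producers;
        for (const ManualEdge& edge : manual_edges) {
            if (edge.consumer == task.id) fanin.push_back(edge.producer);
        }
        std::sort(fanin.begin(), fanin.end());
        fanin.erase(std::unique(fanin.begin(), fanin.end()), fanin.end());
        published.emplace_back(task.id, std::move(fanin));
    }
    return published;
}
```

Because every task is visited serially and each visit touches the edge list, a loop of this shape makes it plausible that the cost concentrates at scope_end, as the comment notes.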
7. What manual_dep=true is, and what it is not
This PR also uses manual_dep=true in the partial-manual paged-attention example, but that should be read carefully.
It is not the semantic definition of manual mode.
It is only a per-tensor escape hatch that:
- skips TensorMap overlap lookup/insert for that tensor
- still keeps creator retention via owner_task_id
In the paged-attention example, the useful optimization was not “mark everything manual”.
The stable gain came mainly from suppressing repeated external-output overlap tracking on out / out_view, where same-scope ordering was already explicit.
8. What was added beyond the runtime APIs
This PR also adds:
- partial-manual paged-attention scenes
- guard coverage for invalid patterns:
- nested manual scope
- blocking tensor access inside manual scope
- self-dependency
- outer-write boundary regression coverage
- benchmark support for the partial-manual scenes
- doc updates and example-layout cleanup
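Two of the guarded patterns listed above, nested manual scope and self-dependency, can be modelled with a small scope stack. This is a hypothetical sketch: `ScopeStack` and `GuardError` are not runtime types, and rejecting `add_dependency` outside a manual scope is an assumption rather than something the thread states.

```cpp
#include <cstdint>
#include <vector>

enum class ScopeMode { Auto, Manual };
enum class GuardError { None, NestedManualScope, SelfDependency, NotInManualScope };

class ScopeStack {
public:
    // Rejects a manual scope opened while another manual scope is still active.
    GuardError begin(ScopeMode mode) {
        if (mode == ScopeMode::Manual && any_manual()) {
            return GuardError::NestedManualScope;
        }
        modes_.push_back(mode);
        return GuardError::None;
    }

    void end() {
        if (!modes_.empty()) modes_.pop_back();
    }

    // Explicit edges are only accepted in a manual scope and never from a task to itself.
    GuardError add_dependency(std::uint32_t producer, std::uint32_t consumer) const {
        if (modes_.empty() || modes_.back() != ScopeMode::Manual) {
            return GuardError::NotInManualScope;
        }
        if (producer == consumer) return GuardError::SelfDependency;
        return GuardError::None;
    }

private:
    bool any_manual() const {
        for (ScopeMode mode : modes_) {
            if (mode == ScopeMode::Manual) return true;
        }
        return false;
    }

    std::vector<ScopeMode> modes_;
};
```

An outer AUTO scope containing an inner MANUAL scope passes these checks, which matches the nesting used in the minimal example.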
9. Performance summary
Current state is better than the modified AUTO path on non-unroll paged attention, but still not at the target.
Key branch-local results from the design doc:
| Workload | Case | aicpu_build_graph | tensormap_and_ringbuffer | partial_manual |
|---|---|---|---|---|
| paged_attention | Case1 | 31318.9 us | 36996.3 us | 35187.6 us |
| paged_attention | Case2 | 16844.5 us | 19861.8 us | 18685.5 us |
| paged_attention_unroll | Case1 | 1412.7 us | 1323.9 us | 1321.3 us |
| paged_attention_unroll | Case2 | 705.5 us | 632.5 us | 637.5 us |
And the fresh rerun on current branch state gave:
- paged_attention / Case1
  - aicpu_build_graph = 30441.1 us
  - partial_manual = 34961.3 us
- paged_attention / Case2
  - aicpu_build_graph = 16832.6 us
  - partial_manual = 18144.5 us
So the main open problem is still:
- non-unroll paged-attention does not yet match aicpu_build_graph
- the remaining gap is concentrated in the manual scope_end replay / publish path
10. Why this PR is draft
This draft is ready for design and implementation review, but not yet for “performance target achieved”.
The key thing to keep in mind while reading the code is:
- this PR is trying to add explicit same-scope control
- while preserving TensorMap boundary correctness
- and keeping AUTO mode as the unchanged default path
Posting the delta since the last PR update.

What changed since the previous comment

The main runtime hot path has been simplified further:

Concretely, the recent changes were:

The intended mental model is now:

Fresh comparison

Fresh rerun settings:
- a2a3, device 6, -n 5, -c 6622890

Units are us; values in parentheses are orch times.

paged_attention

| Case | aicpu_build_graph | tensormap_and_ringbuffer_unmodified | tensormap_and_ringbuffer | tensormap_and_ringbuffer_partial_manual |
|---|---|---|---|---|
| Case1 | 31037.8 | 36992.8 (36991.9) | 36791.2 (36790.5) | 31563.9 (31407.2) |
| Case2 | 16719.2 | 18753.6 (18752.8) | 18615.9 (18615.1) | 16757.6 (16343.9) |

paged_attention_unroll
| Case | aicpu_build_graph | tensormap_and_ringbuffer_unmodified | tensormap_and_ringbuffer | tensormap_and_ringbuffer_partial_manual |
|---|---|---|---|---|
| Case1 | 1421.2 | 1320.0 (853.6) | 1322.5 (820.0) | 1327.0 (835.5) |
| Case2 | 707.8 | 632.5 (383.5) | 635.9 (391.8) | 633.7 (365.5) |
What the new numbers mean
- AUTO remains effectively zero-overhead versus the untouched tensormap baseline.
  - paged_attention / Case1: 36791.2 vs 36992.8 (-0.5%)
  - paged_attention / Case2: 18615.9 vs 18753.6 (-0.7%)
- Partial-manual now closes the non-unroll gap to aicpu_build_graph.
  - Case1: 31563.9 vs 31037.8 (+1.7%)
  - Case2: 16757.6 vs 16719.2 (+0.2%)
- On paged_attention_unroll, the AUTO path was already amortizing most of the orchestration cost, so partial-manual brings little extra benefit there. That is expected.
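The percentages quoted above are signed relative differences against the baseline column. A minimal sketch of that arithmetic, with hypothetical helper names, looks like this:

```cpp
#include <cmath>

// Signed relative difference of `candidate` versus `baseline`, in percent.
inline double relative_delta_percent(double candidate, double baseline) {
    return (candidate - baseline) / baseline * 100.0;
}

// Rounds to one decimal place, the precision used in the comparison above.
inline double round1(double value) {
    return std::round(value * 10.0) / 10.0;
}
```

For instance, `round1(relative_delta_percent(31563.9, 31037.8))` gives 1.7 and `round1(relative_delta_percent(36791.2, 36992.8))` gives -0.5, matching the Case1 figures for partial-manual and AUTO.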
Bottom line
The previous PR comment said the non-unroll target was still not met because manual mode was paying a large serial scope_end replay/publish cost.
That is no longer the case on the fresh rerun set above:
- AUTO keeps the zero-overhead property
- partial-manual is now in the same performance band as aicpu_build_graph on non-unroll paged attention
- the design doc has been updated to match the current implementation and benchmark matrix
- Capture the hybrid scoped model for tensormap_and_ringbuffer
- Define same-scope explicit edges versus cross-scope TensorMap behavior
- Record ownership, scope, nesting, and testing constraints before implementation

- Force outer-scope reads in manual scope through TensorMap boundary seeding
- Remove the invalid inner-created outer-alias case and keep Tensor layout unchanged
- Add explicit scope, tooling, and narrow-change requirements for the implementation PR

- add PTO2ScopeMode::MANUAL, manual submit APIs, and deferred scope_end replay for tensormap_and_ringbuffer
- add paged_attention_partial_manual plus paged_attention*_partial_manual ST coverage for nested outer-normal and inner-manual scopes
- repoint AGENTS.md/CLAUDE.md toward the .agents layout and add a placeholder so the directory is tracked

- add a benchmark_rounds runtime selector for the partial-manual scenes
- keep the current tensormap runtime on the direct selector path
- record fresh 2026-04-06 hardware comparison data in the design doc

- replace the interim comparison table with the newest fresh device-2 reruns
- keep the benchmark workflow section aligned with the partial-manual selector
- record the remaining AUTO-path and partial-manual performance gaps

- cache manual-local tensor classification to avoid repeated scope scans
- fuse manual publish and scope-end release into one scheduler pass
- limit manual scope sync to active rings and keep submit-path work deferred
- chunk paged_attention_partial_manual scopes and add carry deps between updates

Reduce the manual-scope chunk size in the heavy paged-attention scene and drop the extra cross-update dependency chain. The previous chunking shape can deadlock the dep pool under benchmark load, while the smaller chunk keeps the benchmark path stable and lowers the manual-scope overhead.

Manual submit must stay a cheap metadata-recording path and defer TensorMap lookup/insert plus dep-pool fanin wiring to manual scope_end.
This reverts the duplicated submit-time work from 0fd6fbc and restores the separate publish/on_scope_end order so manual scopes do not attach fanout edges after releasing their scope reference.

- update the manual dependency design doc to make submit-time boundary discovery and TensorMap-free manual scope_end explicit
- cache and retain external producers during manual submit, then merge them with explicit manual edges at publish time
- keep the heavy manual paged-attention benchmark on device 3 moving in the right direction without changing example code

- dedupe explicit manual edges when they are recorded and keep an exact incoming edge count per consumer
- append local explicit producers directly at scope_end and skip lock/task_state checks for unpublished same-scope producers
- keep overflow validation and dep-pool publish ordering unchanged so the optimization stays within existing scheduler invariants

- rewrite the non-unroll partial-manual example to use one manual scope per q tile
- move the hub allocation into the manual scope and serialize update tasks explicitly
- drop the chunked manual-scope pattern that inflated partial-manual orchestration cost

- mark external inputs and outputs as manual-dep boundaries in the non-unroll partial-manual paged attention example
- skip repeated overlap tracking for query, kv-cache, and final output views that are already ordered by the explicit manual dependency chain
- keep the manual-scope methodology while improving orch time on real device for both paged-attention cases

- replace the stale benchmark section with fresh 2026-04-08 device-2 results
- document how the benchmark wrapper selects new, unmodified, and partial-manual variants
- record that partial-manual improved on non-unroll but still misses the aicpu_build_graph target

- replace stale benchmark data in the design doc with fresh device-3 measurements for the four paged-attention runtime lanes
- document how benchmark_rounds.sh selects the partial-manual scenes
- record the non-unroll boundary-hint A/B results and the safety limits of using manual_dep=true as an example-level boundary annotation

- explain submit-time versus scope-end work in the manual-scope path
- document how in-scope and cross-scope tensors are classified and handled
- bind the kept example optimizations to measured orch gains and remove stale scope-end frontier wording

- add a short rationale section tying each manual-scope rule to the incorrect or too-expensive alternative it avoids
- keep the design doc aligned with the current implemented split between TensorMap boundary discovery and scope-end explicit-edge replay

- delete the copied tensormap_and_ringbuffer_unmodified runtime and ST scenes
- keep branch docs and benchmark helpers limited to supported runtimes
- enforce examples/{arch}/{runtime}/{name} in tracked command docs
- rewrite example and ST path references to use explicit arch prefixes

- add a small hardware test helper that respects PTO_TEST_DEVICE_ID
- fall back to the lowest NPU with no running processes from npu-smi
- avoid blocking manual-scope hardware tests behind busy device 0 on shared machines

- widen manual-local masks to 64 bits so manual submit handles all MAX_TENSOR_ARGS entries without truncation
- keep realloc failures from dropping live metadata buffers by setting a fatal out-of-memory runtime error before returning
- simplify the partial-manual paged-attention valid_len calculation with std::min

- realize external producer fanout during manual submit while keeping the publish barrier at scope_end
- shrink manual scope_end to replay only same-scope explicit edges
- keep manual-scope validation and boundary semantics unchanged

- remove dead manual replay metadata from the manual-scope path
- skip tensormap sync when a manual submit stays fully in-scope
- keep only the publish barrier and dep-pool watermark fixup at scope_end

On device 4 with 5 rounds, paged_attention_partial_manual improved from 35.27 ms / 35.12 ms orch to 31.60 ms / 31.44 ms orch for Case1, and from 19.80 ms / 19.38 ms orch to 17.93 ms / 17.49 ms orch for Case2.

- rewrite the design note to match the current manual submit and publish-only scope_end implementation
- record the fresh four-way paged-attention comparison and benchmark entrypoints, including the detached worktree flow for the old runtime
- remove the dead manual_dep_pool_reserve state from the orchestrator

- replace stale replay-heavy scope_end description with the current submit-time wiring model
- document how AUTO, MANUAL, and benchmark selectors map onto the paged-attention scenes
- record the fresh device-6 four-runtime comparison and the observed gains for partial-manual and zero-overhead AUTO

Merge manual scope-end validation and dep-pool watermark repair into a single pass.
This keeps the manual publish path behavior unchanged while trimming one serial walk over the scope task list.

- repair the rebased tensormap submit prologue and task-id write-back
- restore partial-manual hub kernel sources under the example trees
- repoint the partial-manual configs so hardware benchmarks build again

- update the rebased partial-manual unroll orchestration to match the current qk/pv kernel ABI by passing block_table as a tensor plus bt_offset as a scalar
- rerun the device validation for both unroll cases after the fix
- refresh the design doc with the rebase root cause and the new 4-way benchmark results on device 4 with PTO-ISA d96c8784
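One of the commits above mentions simplifying the valid_len calculation with std::min. The example's actual formula is not shown in this thread, so the following is only a plausible sketch of a clamped per-block length; the function name and all parameter names are assumptions.

```cpp
#include <algorithm>
#include <cstdint>

// Number of valid tokens in block `block_idx` when `seq_len` tokens are
// split into blocks of `block_size`. Hypothetical helper for illustration.
inline std::int64_t valid_len(std::int64_t seq_len,
                              std::int64_t block_size,
                              std::int64_t block_idx) {
    const std::int64_t remaining = seq_len - block_idx * block_size;
    // Clamp to [0, block_size] instead of branching on the last block.
    return std::min(block_size, std::max<std::int64_t>(remaining, 0));
}
```

The clamp replaces a hand-written `if` on the final partial block, which is the kind of simplification the review suggested.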
Force-pushed from 30b706a to e5fa1bc
Posting the current merge-forward progress after force-updating the PR head.

Rebase status

This PR is now rebased onto current main. The history was rewritten on purpose: the old PR head was based on an older mainline, while the current branch replays the manual-scope stack onto the newest runtime/sim/test infrastructure.

What remains the actual feature work

These are the branch changes that are still the core of this PR:

What changed because of the rebase to newest main

These are not new feature goals; they are alignment work required to land the branch on top of current main:

Rebase-specific fix in the current tip

The latest commit fixes a failure that appeared after the rebase. The fix updates the partial-manual unroll orchestration to pass the same logical inputs the rebased mainline kernels now expect:
- block_table as a tensor
- bt_offset as a scalar

So the failure came from stale example-side orchestration after the rebase, not from the manual-scope dependency runtime itself.

Verification after the rebase fix

Verified on real device after the ABI fix:

The design note has also been refreshed with the post-rebase benchmark commands and current comparison tables.

Reviewer-facing takeaway

The important thing to read in this force-push is:

Fresh rebased performance table

Settings:

Units are us; values in parentheses are orch times.

paged_attention

| Case | aicpu_build_graph | tensormap_and_ringbuffer_unmodified | tensormap_and_ringbuffer | tensormap_and_ringbuffer_partial_manual |
|---|---|---|---|---|
| Case1 | 29937.7 | 36095.9 (36094.9) | 39148.7 (39148.3) | 34186.3 (34025.7) |
| Case2 | 16762.7 | 18639.5 (18635.1) | 19813.0 (19812.7) | 18028.7 (17618.4) |

paged_attention_unroll
| Case | aicpu_build_graph | tensormap_and_ringbuffer_unmodified | tensormap_and_ringbuffer | tensormap_and_ringbuffer_partial_manual |
|---|---|---|---|---|
| Case1 | 1425.3 | 1325.6 (835.3) | 1173.2 (992.0) | 1160.4 (968.8) |
| Case2 | 693.0 | 628.7 (380.7) | 567.9 (435.6) | 561.9 (416.6) |

Summary
- `PTO2ScopeMode` so `PTO2_SCOPE()` stays AUTO by default and `PTO2_SCOPE(PTO2ScopeMode::MANUAL)` enables explicit same-scope dependencies
- `tensormap_and_ringbuffer` runtime while keeping cross-scope dependency discovery on TensorMap
- `docs/manual-dep-for-tensormap-design.md`
- `*_unmodified` baseline out of the branch and treat it as a worktree-only comparison point

Design Reference

The major design reference for this draft is:
- `docs/manual-dep-for-tensormap-design.md`

The intended model is:

This is intentionally not a full port of `aicpu_build_graph` into the PTO2 runtime.

Current Status

The original non-unroll paged-attention performance target is now met on the fresh rerun set recorded in the design note.

Fresh reruns with `a2a3`, device 6, `-n 5`, and `-c 6622890`:
- paged_attention / Case1
  - aicpu_build_graph = 31037.8 us
  - tensormap_and_ringbuffer_unmodified = 36992.8 us (orch 36991.9)
  - tensormap_and_ringbuffer = 36791.2 us (orch 36790.5)
  - tensormap_and_ringbuffer_partial_manual = 31563.9 us (orch 31407.2)
- paged_attention / Case2
  - aicpu_build_graph = 16719.2 us
  - tensormap_and_ringbuffer_unmodified = 18753.6 us (orch 18752.8)
  - tensormap_and_ringbuffer = 18615.9 us (orch 18615.1)
  - tensormap_and_ringbuffer_partial_manual = 16757.6 us (orch 16343.9)

That puts partial-manual within about +1.7% / +0.2% of `aicpu_build_graph` on the target non-unroll workload, while the AUTO path remains effectively zero-overhead versus the untouched tensormap baseline.

This PR is still draft because the branch now needs a merge-forward onto current `main`: rebasing onto `upstream/main` hits conflicts in the same runtime/manual-scope files and should be resolved before marking the PR ready.

What This PR Changes
- `pto2_rt_add_dependency(...)` inside manual scopes
- `scope_end()` has been reduced to validation, `dep_pool_mark` repair, and publish work rather than replay-heavy dependency reconstruction
- `*_unmodified` runtime duplication

Remaining Work / Risk
- `main` and resolve the manual-scope runtime conflicts cleanly
- `manual_dep=true` remains a sharp tool and is only safe when ordering/frontier requirements are already covered by other logic

Testing
- `python -m pytest tests/ut/test_runtime_builder.py -q`
- `python -m pytest tests/ut/test_manual_scope_boundary.py tests/ut/test_manual_scope_guards.py -q`
- `docs/manual-dep-for-tensormap-design.md`
- `tmp/bench_matrix_20260409_0006_direct/results.csv`