Add manual-scope dependency mode to tensormap runtime by uv-xiao · Pull Request #482 · hw-native-sys/simpler

uv-xiao · 2026-04-08T11:18:33Z

Summary

add PTO2ScopeMode so PTO2_SCOPE() stays AUTO by default and PTO2_SCOPE(PTO2ScopeMode::MANUAL) enables explicit same-scope dependencies
add manual submit and dependency APIs for the tensormap_and_ringbuffer runtime while keeping cross-scope dependency discovery on TensorMap
add partial-manual paged-attention scenes, guard/boundary regression coverage, benchmark support, and doc updates anchored in docs/manual-dep-for-tensormap-design.md
keep the untouched *_unmodified baseline out of the branch and treat it as a worktree-only comparison point

Design Reference

The major design reference for this draft is:

docs/manual-dep-for-tensormap-design.md

The intended model is:

same-scope dependency tracking inside manual scope: explicit
cross-scope dependency tracking: TensorMap plus owner retention
scope-local lifetime semantics: unchanged

This is intentionally not a full port of aicpu_build_graph into the PTO2 runtime.

Current Status

The original non-unroll paged-attention performance target is now met on the fresh rerun set recorded in the design note.

Fresh reruns with a2a3, device 6, -n 5, and -c 6622890:

paged_attention / Case1
- aicpu_build_graph = 31037.8 us
- tensormap_and_ringbuffer_unmodified = 36992.8 us (orch 36991.9)
- tensormap_and_ringbuffer = 36791.2 us (orch 36790.5)
- tensormap_and_ringbuffer_partial_manual = 31563.9 us (orch 31407.2)
paged_attention / Case2
- aicpu_build_graph = 16719.2 us
- tensormap_and_ringbuffer_unmodified = 18753.6 us (orch 18752.8)
- tensormap_and_ringbuffer = 18615.9 us (orch 18615.1)
- tensormap_and_ringbuffer_partial_manual = 16757.6 us (orch 16343.9)

That puts partial-manual within about +1.7% / +0.2% of aicpu_build_graph on the target non-unroll workload, while the AUTO path remains effectively zero-overhead versus the untouched tensormap baseline.

This PR is still a draft because the branch needs a merge-forward onto current main: rebasing onto upstream/main hits conflicts in the same runtime/manual-scope files, and those conflicts should be resolved before marking the PR ready.

What This PR Changes

runtime surface:
- enum-based scope mode selection with AUTO default
- manual submit path returning task ids for explicit same-scope wiring
- explicit pto2_rt_add_dependency(...) inside manual scopes
runtime semantics:
- AUTO path stays the default and is intended to remain zero-overhead outside manual mode
- manual scopes keep explicit same-scope dependencies while preserving TensorMap behavior across scope boundaries
- manual scope_end() has been reduced to validation, dep_pool_mark repair, and publish work rather than replay-heavy dependency reconstruction
examples and tests:
- partial-manual paged-attention scenes for non-unroll and unroll
- negative guard coverage for nested manual scopes, manual tensor access, and self-dependency
- outer-write boundary regression coverage
tooling and docs:
- benchmark script support for the partial-manual scenes
- updated docs for manual-dep design and example layout rules
- cleanup of branch-local *_unmodified runtime duplication

Remaining Work / Risk

rebase the branch onto current main and resolve the manual-scope runtime conflicts cleanly
hardware benchmarking is inherently noisy across busy devices; the design doc now records the fresh matrix and the exact rerun settings used
manual_dep=true remains a sharp tool and is only safe when ordering/frontier requirements are already covered by other logic

Testing

python -m pytest tests/ut/test_runtime_builder.py -q
python -m pytest tests/ut/test_manual_scope_boundary.py tests/ut/test_manual_scope_guards.py -q
direct hardware reruns for the four runtime lanes on paged-attention and paged-attention-unroll, summarized in docs/manual-dep-for-tensormap-design.md
tmp/bench_matrix_20260409_0006_direct/results.csv

gemini-code-assist

Code Review

This pull request introduces manual dependency tracking to the tensormap_and_ringbuffer runtime, allowing for a hybrid model where same-scope dependencies are explicitly defined while cross-scope relations continue to use TensorMap discovery. The implementation includes significant updates to the orchestrator and scheduler to handle deferred task publication and manual edge recording, as well as a repository-wide restructuring of examples and tests to follow a new {arch}/{runtime} directory convention. Review feedback correctly identified critical issues with bitmask widths and bitwise shifts that could lead to overflows when tracking more than 16 tensor arguments. Suggestions were also made to improve memory safety during buffer reallocation and to simplify logic using standard library functions.

src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h

src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp

...and_ringbuffer/paged_attention_partial_manual/kernels/orchestration/paged_attention_orch.cpp

uv-xiao · 2026-04-08T12:36:19Z

Reviewer Guide: what this PR adds, how it works, and what still needs work

This draft adds a manual dependency mode to tensormap_and_ringbuffer without turning the runtime into a full aicpu_build_graph clone.

The design goal is a hybrid model:

| Case | Dependency source |
| --- | --- |
| Same-scope tensors created and reused inside a manual scope | explicit `add_dependency(...)` |
| Cross-scope / outer tensors | existing TensorMap + owner retention path |
| Lifetime / ring ownership | unchanged scope semantics |

So the mental model is:

manual scope controls same-scope ordering
TensorMap still protects scope boundaries
AUTO mode remains the default

1. User-facing feature: explicit manual scope mode

PTO2_SCOPE() still means AUTO mode.

This PR adds an enum-based manual mode:

PTO2_SCOPE(PTO2ScopeMode::MANUAL) {
    ...
}

and separate manual submit APIs that return task ids:

PTO2ManualSubmitResult qk = pto2_rt_submit_aic_task_manual(FUNC_QK_MATMUL, args_qk);
PTO2ManualSubmitResult pv = pto2_rt_submit_aic_task_manual(FUNC_PV_MATMUL, args_pv);
pto2_rt_add_dependency(qk.task_id, pv.task_id);

Why separate APIs instead of changing TaskOutputTensors?

existing AUTO call sites stay unchanged
manual mode can expose task ids cleanly
tensor representation stays unified
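
The API shape above can be pictured with a small stand-in. This is only a sketch: `MockRuntime`, `ManualSubmitResult`, and their members are hypothetical names chosen for illustration, not the PTO2 runtime types or signatures. It shows why returning a task id from submit makes explicit same-scope wiring straightforward.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-ins; the real PTO2 types and signatures may differ.
struct ManualSubmitResult {
    std::int32_t task_id;  // handed back so the caller can wire edges explicitly
};

class MockRuntime {
public:
    // Records a task and returns its id (ids are assigned in submit order).
    ManualSubmitResult submit_manual(int func_id) {
        funcs_.push_back(func_id);
        return ManualSubmitResult{static_cast<std::int32_t>(funcs_.size() - 1)};
    }

    // Records an explicit producer -> consumer edge between two submitted tasks.
    void add_dependency(std::int32_t producer, std::int32_t consumer) {
        edges_.emplace_back(producer, consumer);
    }

    const std::vector<std::pair<std::int32_t, std::int32_t>>& edges() const {
        return edges_;
    }

private:
    std::vector<int> funcs_;
    std::vector<std::pair<std::int32_t, std::int32_t>> edges_;
};
```

Because the result is a separate struct, the AUTO submit path in this sketch would not need to change its return type at all.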

2. What changed semantically

The important semantic split is:

inside manual scope:
- tasks are recorded
- explicit same-scope edges are recorded
- scheduler publication is deferred to scope_end
at the boundary:
- outer tensor dependency discovery still happens through TensorMap / owner_task_id
- outer tensor writer frontier still updates during submit, not only at scope_end

That means this is not “disable TensorMap in manual scope”.

That would be wrong for cross-scope correctness.

3. Visual flow

AUTO scope today
---------------
submit task
  -> discover deps from owner + TensorMap
  -> wire scheduler fanin/fanout now
  -> publish now
  -> scope_end only releases lifetime ref

MANUAL scope in this PR
-----------------------
submit task inside PTO2_SCOPE(MANUAL)
  -> classify each tensor
       manual-local?  yes -> same-scope ordering must come from add_dependency
                      no  -> keep owner/TensorMap boundary discovery
  -> cache external fanins now
  -> publish outer writer frontier now
  -> record explicit same-scope edges only
  -> do NOT publish task yet

manual scope_end
  -> replay explicit same-scope edges
  -> merge with cached external fanins
  -> dedup producer set
  -> realize scheduler fanin/fanout
  -> batch publish tasks
  -> release scope lifetime ref
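
The publish-timing difference in the flow above can be modeled in a few lines. This is a toy under stated assumptions: `ToyScope` and `ScopeMode` are hypothetical names, and the model ignores dependency discovery entirely; it only shows that AUTO publishes on submit while MANUAL holds tasks until scope end.

```cpp
#include <cassert>
#include <vector>

enum class ScopeMode { AUTO, MANUAL };  // mirrors the idea of PTO2ScopeMode

// Toy scope: AUTO publishes on submit, MANUAL holds tasks until end().
class ToyScope {
public:
    explicit ToyScope(ScopeMode mode) : mode_(mode) {}

    void submit(int task_id) {
        if (mode_ == ScopeMode::AUTO) {
            published_.push_back(task_id);  // "publish now"
        } else {
            pending_.push_back(task_id);    // "do NOT publish task yet"
        }
    }

    // scope_end: batch-publish anything still pending, in submit order.
    void end() {
        published_.insert(published_.end(), pending_.begin(), pending_.end());
        pending_.clear();
    }

    const std::vector<int>& published() const { return published_; }

private:
    ScopeMode mode_;
    std::vector<int> pending_;
    std::vector<int> published_;
};
```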

4. Minimal example

A small example of the intended usage pattern:

PTO2_SCOPE() {
    Tensor out = make_tensor_external(out_ptr, out_shape, 2, DataType::FLOAT16);

    PTO2_SCOPE(PTO2ScopeMode::MANUAL) {
        Tensor tmp = create_tensor(tmp_shape, 2, DataType::FLOAT16);

        Arg qk_args;
        qk_args.add_input(q);
        qk_args.add_input(k);
        qk_args.add_output(tmp);
        auto qk = pto2_rt_submit_aic_task_manual(FUNC_QK, qk_args);

        Arg up_args;
        up_args.add_input(tmp);
        up_args.add_inout(out);
        auto up = pto2_rt_submit_aiv_task_manual(FUNC_UPDATE, up_args);

        pto2_rt_add_dependency(qk.task_id, up.task_id);
    }
}

Interpretation:

tmp is manual-local:
- same-scope dependency is explicit
- TensorMap is not needed for tmp -> up
out is outer / cross-scope:
- it still stays on the TensorMap boundary path
- later users outside the manual scope must still see the correct writer frontier

5. `INPUT` / `INOUT` / `OUTPUT` / `OUTPUT_EXISTING` in manual scope

This is the part most likely to confuse reviewers.

| Arg kind | Meaning | Manual-scope behavior |
| --- | --- | --- |
| `INPUT` | read existing tensor | still seeds dependency from TensorMap / owner if outer |
| `INOUT` | read-modify-write existing tensor | still gets incoming boundary deps if outer, and still updates writer frontier |
| `OUTPUT_EXISTING` | overwrite existing outer tensor | no incoming overlap lookup, but still updates outgoing writer frontier |
| `OUTPUT` | fresh runtime-created tensor | no incoming dependency; later same-scope users must be wired explicitly |

Short version:

manual mode only replaces same-scope auto-derivation
manual mode does not remove outer-boundary correctness
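
The arg-kind rules above can be encoded as two predicates. This is a sketch, not the runtime's code: `ArgKind` and both function names are hypothetical, and `OUTPUT` is marked as not updating the writer frontier only because the table mentions no frontier update for fresh tensors, which is an assumption.

```cpp
#include <cassert>

// Hypothetical enum; names follow the table, not the runtime headers.
enum class ArgKind { INPUT, INOUT, OUTPUT_EXISTING, OUTPUT };

// True when an outer (cross-scope) tensor of this kind still needs an
// incoming dependency lookup through TensorMap / owner.
constexpr bool needs_incoming_lookup(ArgKind kind) {
    return kind == ArgKind::INPUT || kind == ArgKind::INOUT;
}

// True when the table says this kind still updates the outgoing writer
// frontier. OUTPUT returning false is an assumption of this sketch.
constexpr bool updates_writer_frontier(ArgKind kind) {
    return kind == ArgKind::INOUT || kind == ArgKind::OUTPUT_EXISTING;
}
```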

6. What state the runtime now maintains

The runtime keeps manual-scope state in a narrow form:

scope_tasks[]: tasks owned by the current scope
manual_edges[]: explicit same-scope producer -> consumer edges
manual_task_meta[]: compact per-task finalize metadata

At manual scope_end, it iterates tasks in submit order and:

merges cached external producers with explicit same-scope edges
dedups them
realizes fanin/fanout into the scheduler
batch-publishes the tasks

This is why the current bottleneck is still concentrated at manual scope_end.
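
The merge-and-dedup step for a single consumer could look roughly like the following. It is a minimal sketch: `merge_producers` is a hypothetical helper, and sorting before `std::unique` is a simplification chosen here; the runtime may dedup in a different way.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Merge cached external producers with explicit same-scope producers for one
// consumer and drop duplicates. The result is sorted by task id as a side
// effect of the dedup strategy used in this sketch.
std::vector<std::int32_t> merge_producers(
    std::vector<std::int32_t> external,
    const std::vector<std::int32_t>& explicit_same_scope) {
    external.insert(external.end(), explicit_same_scope.begin(),
                    explicit_same_scope.end());
    std::sort(external.begin(), external.end());
    external.erase(std::unique(external.begin(), external.end()),
                   external.end());
    return external;
}
```

Doing this once per task in submit order is a serial walk, which is consistent with the cost being concentrated at scope_end in this version of the design.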

7. What `manual_dep=true` is, and what it is not

This PR also uses manual_dep=true in the partial-manual paged-attention example, but that should be read carefully.

It is not the semantic definition of manual mode.

It is only a per-tensor escape hatch that:

skips TensorMap overlap lookup/insert for that tensor
still keeps creator retention via owner_task_id

In the paged-attention example, the useful optimization was not “mark everything manual”.

The stable gain came mainly from suppressing repeated external-output overlap tracking on out / out_view, where same-scope ordering was already explicit.

8. What was added beyond the runtime APIs

This PR also adds:

partial-manual paged-attention scenes
guard coverage for invalid patterns:
- nested manual scope
- blocking tensor access inside manual scope
- self-dependency
outer-write boundary regression coverage
benchmark support for the partial-manual scenes
doc updates and example-layout cleanup
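
Two of the guarded patterns listed above, nested manual scopes and self-dependency, can be illustrated with a toy checker. Everything here is hypothetical: `GuardedScopes` is not a runtime type, and the real runtime reports such errors through its own mechanism rather than C++ exceptions.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Toy guard state for illustration only.
class GuardedScopes {
public:
    // Rejects opening a manual scope while another one is still open.
    void begin_manual() {
        if (in_manual_) throw std::logic_error("nested manual scope");
        in_manual_ = true;
    }

    void end_manual() { in_manual_ = false; }

    // Rejects an edge whose producer and consumer are the same task.
    void add_dependency(std::int32_t producer, std::int32_t consumer) {
        if (producer == consumer) throw std::logic_error("self-dependency");
        ++edge_count_;
    }

    int edge_count() const { return edge_count_; }

private:
    bool in_manual_ = false;
    int edge_count_ = 0;
};
```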

9. Performance summary

Current state is better than the modified AUTO path on non-unroll paged attention, but still not at the target.

Key branch-local results from the design doc:

| Workload | Case | `aicpu_build_graph` | `tensormap_and_ringbuffer` | `partial_manual` |
| --- | --- | --- | --- | --- |
| `paged_attention` | `Case1` | `31318.9 us` | `36996.3 us` | `35187.6 us` |
| `paged_attention` | `Case2` | `16844.5 us` | `19861.8 us` | `18685.5 us` |
| `paged_attention_unroll` | `Case1` | `1412.7 us` | `1323.9 us` | `1321.3 us` |
| `paged_attention_unroll` | `Case2` | `705.5 us` | `632.5 us` | `637.5 us` |

And the fresh rerun on current branch state gave:

paged_attention / Case1
- aicpu_build_graph = 30441.1 us
- partial_manual = 34961.3 us
paged_attention / Case2
- aicpu_build_graph = 16832.6 us
- partial_manual = 18144.5 us

So the main open problem is still:

non-unroll paged-attention does not yet match aicpu_build_graph
the remaining gap is concentrated in the manual scope_end replay / publish path

10. Why this PR is draft

This draft is ready for design and implementation review, but it does not yet claim that the performance target has been achieved.

The key thing to keep in mind while reading the code is:

this PR is trying to add explicit same-scope control
while preserving TensorMap boundary correctness
and keeping AUTO mode as the unchanged default path

uv-xiao · 2026-04-08T16:36:47Z

Posting the delta since the last PR update.

What changed since the previous comment

The main runtime hot path has been simplified further:

Before
manual submit
  -> cache boundary/manual bookkeeping
scope_end
  -> replay same-scope deps
  -> rebuild publish state
  -> publish

Now
manual submit
  -> discover boundary deps now
  -> wire same-scope explicit deps now
scope_end
  -> validate
  -> dep_pool_mark fixup
  -> publish

Concretely, the recent changes were:

moved manual external/boundary wiring to submit time
collapsed manual-scope bookkeeping so same-scope explicit edges no longer wait for a replay-heavy scope_end
removed dead manual bookkeeping state from the orchestrator
collapsed manual scope_end validation + dep_pool_mark repair into one pass
refreshed the design note so it matches the implemented runtime, not the older replay model

The intended mental model is now:

MANUAL scope

manual-local tensor
  -> skip TensorMap
  -> explicit pto2_rt_add_dependency(...)

boundary tensor
  -> owner retention
  -> TensorMap lookup / insert

scope_end
  -> small publish barrier only

Fresh comparison

Fresh rerun settings:

platform: a2a3
device: 6
rounds: 5
PTO-ISA commit: 6622890

Units are elapsed_us (orch_us).
aicpu_build_graph does not emit the same orch timing lines, so only elapsed is shown there.

`paged_attention`

| Case | `aicpu_build_graph` | `tensormap_and_ringbuffer_unmodified` | `tensormap_and_ringbuffer` | `tensormap_and_ringbuffer_partial_manual` |
| --- | --- | --- | --- | --- |
| `Case1` | `31037.8` | `36992.8 (36991.9)` | `36791.2 (36790.5)` | `31563.9 (31407.2)` |
| `Case2` | `16719.2` | `18753.6 (18752.8)` | `18615.9 (18615.1)` | `16757.6 (16343.9)` |

`paged_attention_unroll`

| Case | `aicpu_build_graph` | `tensormap_and_ringbuffer_unmodified` | `tensormap_and_ringbuffer` | `tensormap_and_ringbuffer_partial_manual` |
| --- | --- | --- | --- | --- |
| `Case1` | `1421.2` | `1320.0 (853.6)` | `1322.5 (820.0)` | `1327.0 (835.5)` |
| `Case2` | `707.8` | `632.5 (383.5)` | `635.9 (391.8)` | `633.7 (365.5)` |

What the new numbers mean

AUTO remains effectively zero-overhead versus the untouched tensormap baseline.
- paged_attention/Case1: 36791.2 vs 36992.8 (-0.5%)
- paged_attention/Case2: 18615.9 vs 18753.6 (-0.7%)
Partial-manual now closes the non-unroll gap to aicpu_build_graph.
- Case1: 31563.9 vs 31037.8 (+1.7%)
- Case2: 16757.6 vs 16719.2 (+0.2%)
On paged_attention_unroll, the AUTO path was already amortizing most of the orchestration cost, so partial-manual brings little extra benefit there. That is expected.
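
The percentages quoted above follow from a plain relative difference against the comparison lane. A small helper makes the arithmetic reproducible; `relative_delta_percent` is a name introduced here for illustration and is not part of the benchmark scripts.

```cpp
#include <cassert>
#include <cmath>

// Relative difference of `measured` against `baseline`, in percent.
// Negative values mean `measured` is faster than `baseline`.
double relative_delta_percent(double measured, double baseline) {
    return (measured - baseline) / baseline * 100.0;
}
```

For example, `relative_delta_percent(31563.9, 31037.8)` is roughly 1.7, matching the Case1 gap between partial-manual and aicpu_build_graph.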

Bottom line

The previous PR comment said the non-unroll target was still not met because manual mode was paying a large serial scope_end replay/publish cost.

That is no longer the case on the fresh rerun set above:

AUTO keeps the zero-overhead property
partial-manual is now in the same performance band as aicpu_build_graph on non-unroll paged attention
the design doc has been updated to match the current implementation and benchmark matrix

- Force outer-scope reads in manual scope through TensorMap boundary seeding
- Remove the invalid inner-created outer-alias case and keep Tensor layout unchanged
- Add explicit scope, tooling, and narrow-change requirements for the implementation PR

- add PTO2ScopeMode::MANUAL, manual submit APIs, and deferred scope_end replay for tensormap_and_ringbuffer
- add paged_attention_partial_manual plus paged_attention*_partial_manual ST coverage for nested outer-normal and inner-manual scopes
- repoint AGENTS.md/CLAUDE.md toward the .agents layout and add a placeholder so the directory is tracked

- cache manual-local tensor classification to avoid repeated scope scans
- fuse manual publish and scope-end release into one scheduler pass
- limit manual scope sync to active rings and keep submit-path work deferred
- chunk paged_attention_partial_manual scopes and add carry deps between updates

Reduce the manual-scope chunk size in the heavy paged-attention scene and drop the extra cross-update dependency chain. The previous chunking shape can deadlock the dep pool under benchmark load, while the smaller chunk keeps the benchmark path stable and lowers the manual-scope overhead.

Manual submit must stay a cheap metadata-recording path and defer TensorMap lookup/insert plus dep-pool fanin wiring to manual scope_end.

This reverts the duplicated submit-time work from 0fd6fbc and restores the separate publish/on_scope_end order so manual scopes do not attach fanout edges after releasing their scope reference.

- update the manual dependency design doc to make submit-time boundary discovery and TensorMap-free manual scope_end explicit
- cache and retain external producers during manual submit, then merge them with explicit manual edges at publish time
- keep the heavy manual paged-attention benchmark on device 3 moving in the right direction without changing example code

- dedupe explicit manual edges when they are recorded and keep an exact incoming edge count per consumer
- append local explicit producers directly at scope_end and skip lock/task_state checks for unpublished same-scope producers
- keep overflow validation and dep-pool publish ordering unchanged so the optimization stays within existing scheduler invariants

- rewrite the non-unroll partial-manual example to use one manual scope per q tile
- move the hub allocation into the manual scope and serialize update tasks explicitly
- drop the chunked manual-scope pattern that inflated partial-manual orchestration cost

- mark external inputs and outputs as manual-dep boundaries in the non-unroll partial-manual paged attention example
- skip repeated overlap tracking for query, kv-cache, and final output views that are already ordered by the explicit manual dependency chain
- keep the manual-scope methodology while improving orch time on real device for both paged-attention cases

- replace the stale benchmark section with fresh 2026-04-08 device-2 results
- document how the benchmark wrapper selects new, unmodified, and partial-manual variants
- record that partial-manual improved on non-unroll but still misses the aicpu_build_graph target

- replace stale benchmark data in the design doc with fresh device-3 measurements for the four paged-attention runtime lanes
- document how benchmark_rounds.sh selects the partial-manual scenes
- record the non-unroll boundary-hint A/B results and the safety limits of using manual_dep=true as an example-level boundary annotation

- explain submit-time versus scope-end work in the manual-scope path
- document how in-scope and cross-scope tensors are classified and handled
- bind the kept example optimizations to measured orch gains and remove stale scope-end frontier wording

- add a short rationale section tying each manual-scope rule to the incorrect or too-expensive alternative it avoids
- keep the design doc aligned with the current implemented split between TensorMap boundary discovery and scope-end explicit-edge replay

- delete the copied tensormap_and_ringbuffer_unmodified runtime and ST scenes
- keep branch docs and benchmark helpers limited to supported runtimes
- enforce examples/{arch}/{runtime}/{name} in tracked command docs
- rewrite example and ST path references to use explicit arch prefixes

- widen manual-local masks to 64 bits so manual submit handles all MAX_TENSOR_ARGS entries without truncation
- keep realloc failures from dropping live metadata buffers by setting a fatal out-of-memory runtime error before returning
- simplify the partial-manual paged-attention valid_len calculation with std::min
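
The mask-width fix can be illustrated in isolation. The sketch below is hedged: `kMaxTensorArgs`, `arg_bit`, and `arg_bit_narrow` are hypothetical names, the value 32 is an assumption made only for this example, and the 16-bit variant is a guess at the kind of truncation the review described rather than the original code.

```cpp
#include <cassert>
#include <cstdint>

// Assumed limit for this sketch only; the runtime's MAX_TENSOR_ARGS may differ.
constexpr int kMaxTensorArgs = 32;

// Builds a mask with bit `index` set. The shift is done on a 64-bit value,
// so every index below 64 is representable.
constexpr std::uint64_t arg_bit(int index) {
    return std::uint64_t{1} << index;
}

// Narrow variant kept only to show the truncation a 16-bit mask would cause:
// any bit at index 16 or above is lost in the conversion.
constexpr std::uint16_t arg_bit_narrow(int index) {
    return static_cast<std::uint16_t>(std::uint64_t{1} << index);
}
```

Shifting a `std::uint64_t{1}` rather than a plain `1` also avoids the separate problem of shifting a 32-bit `int` by 31 or more bits.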

- remove dead manual replay metadata from the manual-scope path
- skip tensormap sync when a manual submit stays fully in-scope
- keep only the publish barrier and dep-pool watermark fixup at scope_end

On device 4 with 5 rounds, paged_attention_partial_manual improved from 35.27 ms / 35.12 ms orch to 31.60 ms / 31.44 ms orch for Case1, and from 19.80 ms / 19.38 ms orch to 17.93 ms / 17.49 ms orch for Case2.

- rewrite the design note to match the current manual submit and publish-only scope_end implementation
- record the fresh four-way paged-attention comparison and benchmark entrypoints, including the detached worktree flow for the old runtime
- remove the dead manual_dep_pool_reserve state from the orchestrator

- replace stale replay-heavy scope_end description with the current submit-time wiring model
- document how AUTO, MANUAL, and benchmark selectors map onto the paged-attention scenes
- record the fresh device-6 four-runtime comparison and the observed gains for partial-manual and zero-overhead AUTO

- update the rebased partial-manual unroll orchestration to match the current qk/pv kernel ABI by passing block_table as a tensor plus bt_offset as a scalar
- rerun the device validation for both unroll cases after the fix
- refresh the design doc with the rebase root cause and the new 4-way benchmark results on device 4 with PTO-ISA d96c8784

uv-xiao · 2026-04-10T02:39:49Z

Posting the current merge-forward progress after force-updating the PR head to e5fa1bc.

Rebase status

This PR is now rebased onto current main (fb43ed3).

The history was rewritten on purpose: the old PR head was based on an older mainline, while the current branch replays the manual-scope stack onto the newest runtime/sim/test infrastructure.

old manual-scope stack
    |
    | rebase onto new main
    v
new main runtime/sim/test APIs
    +
manual-scope feature stack reapplied
    +
rebase-only fixes for drifted examples / scripts

What remains the actual feature work

These are the branch changes that are still the core of this PR:

add enum-based scope mode selection, with AUTO as default and PTO2_SCOPE(PTO2ScopeMode::MANUAL) for explicit same-scope dependency control
add manual submit APIs plus explicit pto2_rt_add_dependency(...)
keep AUTO on the zero-overhead path
keep boundary / cross-scope correctness on the TensorMap + owner-retention path
move the expensive manual replay logic out of the old scope_end-heavy model, so boundary discovery happens at submit time and scope_end is a much smaller publish barrier
add guard tests, boundary tests, partial-manual paged-attention scenes, benchmark support, and the detailed design/benchmark note in docs/manual-dep-for-tensormap-design.md

What changed because of the rebase to newest main

These are not new feature goals; they are alignment work required to land the branch on top of current main:

adapt the branch to the newer runtime/simulation substrate already merged in main
- handle-based DeviceRunner / runtime API refactors
- CPU-sim / TPUSH-TPOP isolation changes
- newer test and example harness layout
adapt the paged-attention examples to the newer mainline kernel ABI
- mainline paged-attention orchestration now passes block_table as a tensor and bt_offset as a scalar
clean up branch-only scaffolding that should not stay in the PR branch
- the unmodified tensormap baseline stays worktree-only for benchmarking and is no longer carried in this branch
refresh benchmark paths / scripts / docs so they match the post-rebase tree

Rebase-specific fix in the current tip

The latest commit (e5fa1bc, Fix: align rebased unroll partial-manual ABI) fixes a rebase-induced example mismatch, not a manual-scope runtime bug.

The issue was:

main changed paged_attention_unroll kernel ABI
    -> partial-manual unroll example still used old argument pattern
    -> rebased scene timed out / failed

The fix updates the partial-manual unroll orchestration to pass the same logical inputs the rebased mainline kernels now expect:

block_table as a tensor input
bt_offset as the per-block scalar offset

So the failure came from stale example-side orchestration after the rebase, not from the manual-scope dependency runtime itself.

Verification after the rebase fix

Verified on real device after the ABI fix:

paged_attention_unroll_partial_manual / Case1: pass
paged_attention_unroll_partial_manual / Case2: pass

The design note has also been refreshed with the post-rebase benchmark commands and current comparison tables.

Reviewer-facing takeaway

The important thing to read in this force-push is:

the manual-scope feature set is still the same product change
part of the diff churn comes from rebasing onto a significantly newer main
the latest tip mainly resolves rebase drift in examples / scripts so the rebased branch is runnable again on current main

Fresh rebased performance table

Settings:

platform: a2a3
device: 4
rounds: 5
PTO-ISA commit: d96c8784

Units are elapsed_us (orch_us). aicpu_build_graph does not emit the same orch timing lines, so only elapsed time is shown there.

`paged_attention`

| Case | `aicpu_build_graph` | `tensormap_and_ringbuffer_unmodified` | `tensormap_and_ringbuffer` | `tensormap_and_ringbuffer_partial_manual` |
| --- | --- | --- | --- | --- |
| `Case1` | `29937.7` | `36095.9 (36094.9)` | `39148.7 (39148.3)` | `34186.3 (34025.7)` |
| `Case2` | `16762.7` | `18639.5 (18635.1)` | `19813.0 (19812.7)` | `18028.7 (17618.4)` |

`paged_attention_unroll`

| Case | `aicpu_build_graph` | `tensormap_and_ringbuffer_unmodified` | `tensormap_and_ringbuffer` | `tensormap_and_ringbuffer_partial_manual` |
| --- | --- | --- | --- | --- |
| `Case1` | `1425.3` | `1325.6 (835.3)` | `1173.2 (992.0)` | `1160.4 (968.8)` |
| `Case2` | `693.0` | `628.7 (380.7)` | `567.9 (435.6)` | `561.9 (416.6)` |

gemini-code-assist bot reviewed Apr 8, 2026

uv-xiao mentioned this pull request Apr 9, 2026

[Bug] Rebased partial-manual manual scope deadlocks on paged_attention while AUTO mode remains healthy #495

Open

uv-xiao and others added 26 commits April 9, 2026 23:22

Add: document manual dependency scope design

876a553

- Capture the hybrid scoped model for tensormap_and_ringbuffer
- Define same-scope explicit edges versus cross-scope TensorMap behavior
- Record ownership, scope, nesting, and testing constraints before implementation

docs: refine manual tensormap dependency design

23a1fe2

Add unmodified tensormap runtime baseline

54e5324

Restore zero-overhead auto path

bf1fc84

Restore zero-overhead auto scope path

c4b5fa8

Add manual scope guard regression tests

5b10117

Harden manual scope guard coverage

0bb19c9

Add manual scope outer-write boundary test

58d3c1a

Support: add partial-manual benchmark selector

eed20b5

- add a benchmark_rounds runtime selector for the partial-manual scenes
- keep the current tensormap runtime on the direct selector path
- record fresh 2026-04-06 hardware comparison data in the design doc

Update: refresh manual-dep benchmark data

a3b96ba

- replace the interim comparison table with the newest fresh device-2 reruns
- keep the benchmark workflow section aligned with the partial-manual selector
- record the remaining AUTO-path and partial-manual performance gaps

Fix: remove manual scope membership scan

431a9ea

Fix: auto-pick free NPU for manual-scope tests

c72fa9d

- add a small hardware test helper that respects PTO_TEST_DEVICE_ID
- fall back to the lowest NPU with no running processes from npu-smi
- avoid blocking manual-scope hardware tests behind busy device 0 on shared machines
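
The selection rule in this commit message can be sketched as a pure function. This is an illustration under assumptions: the actual helper lives in the test tooling and is likely not written in C++, `NpuInfo` and `pick_device` are hypothetical names, and parsing of `npu-smi` output is left out, so the per-device process counts are passed in directly.

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

struct NpuInfo {
    int id;
    int running_processes;  // as would be parsed from npu-smi output
};

// Returns the override if one is given, otherwise the lowest-id NPU with no
// running processes, or -1 when every device is busy.
int pick_device(const char* env_override, const std::vector<NpuInfo>& npus) {
    if (env_override != nullptr && *env_override != '\0') {
        return std::atoi(env_override);
    }
    int best = -1;
    for (const NpuInfo& npu : npus) {
        if (npu.running_processes == 0 && (best < 0 || npu.id < best)) {
            best = npu.id;
        }
    }
    return best;
}
```

A caller would pass `std::getenv("PTO_TEST_DEVICE_ID")` as the first argument so that an explicit choice always wins over auto-selection.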

uv-xiao added 9 commits April 10, 2026 00:08

Update: move manual external wiring to submit

abace34

- realize external producer fanout during manual submit while keeping the publish barrier at scope_end
- shrink manual scope_end to replay only same-scope explicit edges
- keep manual-scope validation and boundary semantics unchanged

Update: collapse manual scope_end scan

a5332ef

Merge manual scope-end validation and dep-pool watermark repair into a single pass.

This keeps the manual publish path behavior unchanged while trimming one serial walk over the scope task list.

Support: ignore local worktrees

38bb942

Fix: restore rebased manual benchmark paths

01d0723

- repair the rebased tensormap submit prologue and task-id write-back
- restore partial-manual hub kernel sources under the example trees
- repoint the partial-manual configs so hardware benchmarks build again

uv-xiao force-pushed the manual-dep-for-tensormap branch from 30b706a to e5fa1bc Compare April 10, 2026 02:39

uv-xiao marked this pull request as ready for review April 10, 2026 03:05

uv-xiao wants to merge 35 commits into hw-native-sys:main from uv-xiao:manual-dep-for-tensormap