Add manual-scope dependency mode to tensormap runtime #482

Open
uv-xiao wants to merge 35 commits into hw-native-sys:main from uv-xiao:manual-dep-for-tensormap

uv-xiao (Contributor) commented Apr 8, 2026

Summary

  • add PTO2ScopeMode so PTO2_SCOPE() stays AUTO by default and PTO2_SCOPE(PTO2ScopeMode::MANUAL) enables explicit same-scope dependencies
  • add manual submit and dependency APIs for the tensormap_and_ringbuffer runtime while keeping cross-scope dependency discovery on TensorMap
  • add partial-manual paged-attention scenes, guard/boundary regression coverage, benchmark support, and doc updates anchored in docs/manual-dep-for-tensormap-design.md
  • keep the untouched *_unmodified baseline out of the branch and treat it as a worktree-only comparison point

Design Reference

The major design reference for this draft is:

  • docs/manual-dep-for-tensormap-design.md

The intended model is:

  • same-scope dependency tracking inside manual scope: explicit
  • cross-scope dependency tracking: TensorMap plus owner retention
  • scope-local lifetime semantics: unchanged

This is intentionally not a full port of aicpu_build_graph into the PTO2 runtime.

Current Status

The original non-unroll paged-attention performance target is now met on the fresh rerun set recorded in the design note.

Fresh reruns with a2a3, device 6, -n 5, and -c 6622890:

  • paged_attention / Case1
    • aicpu_build_graph = 31037.8 us
    • tensormap_and_ringbuffer_unmodified = 36992.8 us (orch 36991.9)
    • tensormap_and_ringbuffer = 36791.2 us (orch 36790.5)
    • tensormap_and_ringbuffer_partial_manual = 31563.9 us (orch 31407.2)
  • paged_attention / Case2
    • aicpu_build_graph = 16719.2 us
    • tensormap_and_ringbuffer_unmodified = 18753.6 us (orch 18752.8)
    • tensormap_and_ringbuffer = 18615.9 us (orch 18615.1)
    • tensormap_and_ringbuffer_partial_manual = 16757.6 us (orch 16343.9)

That puts partial-manual within about +1.7% (Case1) and +0.2% (Case2) of aicpu_build_graph on the target non-unroll workload, while the AUTO path remains effectively zero-overhead versus the untouched tensormap baseline.

This PR is still draft because the branch now needs a merge-forward onto current main: rebasing onto upstream/main hits conflicts in the same runtime/manual-scope files and should be resolved before marking the PR ready.

What This PR Changes

  • runtime surface:
    • enum-based scope mode selection with AUTO default
    • manual submit path returning task ids for explicit same-scope wiring
    • explicit pto2_rt_add_dependency(...) inside manual scopes
  • runtime semantics:
    • AUTO path stays the default and is intended to remain zero-overhead outside manual mode
    • manual scopes keep explicit same-scope dependencies while preserving TensorMap behavior across scope boundaries
    • manual scope_end() has been reduced to validation, dep_pool_mark repair, and publish work rather than replay-heavy dependency reconstruction
  • examples and tests:
    • partial-manual paged-attention scenes for non-unroll and unroll
    • negative guard coverage for nested manual scopes, manual tensor access, and self-dependency
    • outer-write boundary regression coverage
  • tooling and docs:
    • benchmark script support for the partial-manual scenes
    • updated docs for manual-dep design and example layout rules
    • cleanup of branch-local *_unmodified runtime duplication

Remaining Work / Risk

  • rebase the branch onto current main and resolve the manual-scope runtime conflicts cleanly
  • hardware benchmarking is inherently noisy across busy devices; the design doc now records the fresh matrix and the exact rerun settings used
  • manual_dep=true remains a sharp tool and is only safe when ordering/frontier requirements are already covered by other logic

Testing

  • python -m pytest tests/ut/test_runtime_builder.py -q
  • python -m pytest tests/ut/test_manual_scope_boundary.py tests/ut/test_manual_scope_guards.py -q
  • direct hardware reruns for the four runtime lanes on paged-attention and paged-attention-unroll, summarized in docs/manual-dep-for-tensormap-design.md
  • tmp/bench_matrix_20260409_0006_direct/results.csv

gemini-code-assist (bot) left a comment

Code Review

This pull request introduces manual dependency tracking to the tensormap_and_ringbuffer runtime, allowing for a hybrid model where same-scope dependencies are explicitly defined while cross-scope relations continue to use TensorMap discovery. The implementation includes significant updates to the orchestrator and scheduler to handle deferred task publication and manual edge recording, as well as a repository-wide restructuring of examples and tests to follow a new {arch}/{runtime} directory convention. Review feedback correctly identified critical issues with bitmask widths and bitwise shifts that could lead to overflows when tracking more than 16 tensor arguments. Suggestions were also made to improve memory safety during buffer reallocation and to simplify logic using standard library functions.
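The mask-width issue raised in the review can be illustrated with a minimal sketch. The names `kMaxTensorArgs`, `mark_manual_local`, and `is_manual_local` are hypothetical and not taken from the PR, and the limit of 64 is assumed only for illustration. The point is that the shifted literal must be widened before the shift; a 16-bit or 32-bit mask would silently lose the high argument indices.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical limit for illustration; the real MAX_TENSOR_ARGS may differ.
constexpr int kMaxTensorArgs = 64;

// Set bit `idx` in a 64-bit manual-local mask.
// The literal is widened to 64 bits before shifting. Writing `1 << idx`
// would shift a plain int, which is undefined once idx reaches int's width.
inline std::uint64_t mark_manual_local(std::uint64_t mask, int idx) {
    assert(idx >= 0 && idx < kMaxTensorArgs);
    return mask | (std::uint64_t{1} << idx);
}

// Test bit `idx` of the mask.
inline bool is_manual_local(std::uint64_t mask, int idx) {
    assert(idx >= 0 && idx < kMaxTensorArgs);
    return ((mask >> idx) & std::uint64_t{1}) != 0;
}
```

With this shape, marking argument 40 or 63 leaves a recoverable bit, whereas a narrower mask type would drop it.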

uv-xiao (Contributor, Author) commented Apr 8, 2026

Reviewer Guide: what this PR adds, how it works, and what still needs work

This draft adds a manual dependency mode to tensormap_and_ringbuffer without turning the runtime into a full aicpu_build_graph clone.

The design goal is a hybrid model:

| Case | Dependency source |
| --- | --- |
| Same-scope tensors created and reused inside a manual scope | explicit add_dependency(...) |
| Cross-scope / outer tensors | existing TensorMap + owner retention path |
| Lifetime / ring ownership | unchanged scope semantics |

So the mental model is:

  • manual scope controls same-scope ordering
  • TensorMap still protects scope boundaries
  • AUTO mode remains the default
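As a rough sketch of that mental model, the decision of who orders accesses to a tensor can be written as a two-input function. `ToyScopeMode`, `ToyDepSource`, and `dependency_source` are hypothetical names for illustration; the real runtime's classification logic is not shown in this thread.

```cpp
enum class ToyScopeMode { AUTO, MANUAL };
enum class ToyDepSource { TENSORMAP, EXPLICIT };

// Hypothetical helper: decide which mechanism orders accesses to a tensor.
// `created_in_current_scope` is true for tensors created inside the scope
// that is currently open.
constexpr ToyDepSource dependency_source(ToyScopeMode mode,
                                         bool created_in_current_scope) {
    if (mode == ToyScopeMode::MANUAL && created_in_current_scope) {
        // Same-scope tensor in a manual scope: the caller wires the edge.
        return ToyDepSource::EXPLICIT;
    }
    // AUTO scope, or a boundary tensor seen from a manual scope:
    // TensorMap plus owner retention stays responsible.
    return ToyDepSource::TENSORMAP;
}
```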

1. User-facing feature: explicit manual scope mode

PTO2_SCOPE() still means AUTO mode.

This PR adds an enum-based manual mode:

PTO2_SCOPE(PTO2ScopeMode::MANUAL) {
    ...
}

and separate manual submit APIs that return task ids:

PTO2ManualSubmitResult qk = pto2_rt_submit_aic_task_manual(FUNC_QK_MATMUL, args_qk);
PTO2ManualSubmitResult pv = pto2_rt_submit_aic_task_manual(FUNC_PV_MATMUL, args_pv);
pto2_rt_add_dependency(qk.task_id, pv.task_id);

Why separate APIs instead of changing TaskOutputTensors?

  • existing AUTO call sites stay unchanged
  • manual mode can expose task ids cleanly
  • tensor representation stays unified
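To make the "submit returns a task id, then wire explicitly" pattern concrete, here is a toy recorder. It is not the PTO2 runtime: `ToySubmitResult` and `ToyManualScope` are hypothetical stand-ins, and the real `PTO2ManualSubmitResult` may carry more than a task id.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Toy stand-in for the manual submit result; only the task id matters here.
struct ToySubmitResult {
    std::int32_t task_id;
};

// Toy recorder for one manual scope.
class ToyManualScope {
public:
    // Record a task and hand back its id for later explicit wiring.
    ToySubmitResult submit() {
        return ToySubmitResult{next_id_++};
    }

    // Record an explicit producer -> consumer edge between submitted tasks.
    void add_dependency(std::int32_t producer, std::int32_t consumer) {
        assert(producer >= 0 && producer < next_id_);
        assert(consumer >= 0 && consumer < next_id_);
        edges_.emplace_back(producer, consumer);
    }

    const std::vector<std::pair<std::int32_t, std::int32_t>>& edges() const {
        return edges_;
    }
    std::int32_t task_count() const { return next_id_; }

private:
    std::int32_t next_id_ = 0;
    std::vector<std::pair<std::int32_t, std::int32_t>> edges_;
};
```

Because the id comes back from a separate submit call, AUTO call sites never see it, which matches the first two bullets above.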

2. What changed semantically

The important semantic split is:

  • inside manual scope:
    • tasks are recorded
    • explicit same-scope edges are recorded
    • scheduler publication is deferred to scope_end
  • at the boundary:
    • outer tensor dependency discovery still happens through TensorMap / owner_task_id
    • outer tensor writer frontier still updates during submit, not only at scope_end

That means this is not “disable TensorMap in manual scope”.

That would be wrong for cross-scope correctness.

3. Visual flow

AUTO scope today
---------------
submit task
  -> discover deps from owner + TensorMap
  -> wire scheduler fanin/fanout now
  -> publish now
  -> scope_end only releases lifetime ref

MANUAL scope in this PR
-----------------------
submit task inside PTO2_SCOPE(MANUAL)
  -> classify each tensor
       manual-local?  yes -> same-scope ordering must come from add_dependency
                      no  -> keep owner/TensorMap boundary discovery
  -> cache external fanins now
  -> publish outer writer frontier now
  -> record explicit same-scope edges only
  -> do NOT publish task yet

manual scope_end
  -> replay explicit same-scope edges
  -> merge with cached external fanins
  -> dedup producer set
  -> realize scheduler fanin/fanout
  -> batch publish tasks
  -> release scope lifetime ref

4. Minimal example

A small example of the intended usage pattern:

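PTO2_SCOPE() {
    // q, k, out_ptr, and the *_shape arrays are assumed to be defined by the surrounding orchestration code.
    Tensor out = make_tensor_external(out_ptr, out_shape, 2, DataType::FLOAT16);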

    PTO2_SCOPE(PTO2ScopeMode::MANUAL) {
        Tensor tmp = create_tensor(tmp_shape, 2, DataType::FLOAT16);

        Arg qk_args;
        qk_args.add_input(q);
        qk_args.add_input(k);
        qk_args.add_output(tmp);
        auto qk = pto2_rt_submit_aic_task_manual(FUNC_QK, qk_args);

        Arg up_args;
        up_args.add_input(tmp);
        up_args.add_inout(out);
        auto up = pto2_rt_submit_aiv_task_manual(FUNC_UPDATE, up_args);

        pto2_rt_add_dependency(qk.task_id, up.task_id);
    }
}

Interpretation:

  • tmp is manual-local:
    • same-scope dependency is explicit
    • TensorMap is not needed for tmp -> up
  • out is outer / cross-scope:
    • it still stays on the TensorMap boundary path
    • later users outside the manual scope must still see the correct writer frontier

5. INPUT / INOUT / OUTPUT / OUTPUT_EXISTING in manual scope

This is the part most likely to confuse reviewers.

| Arg kind | Meaning | Manual-scope behavior |
| --- | --- | --- |
| INPUT | read existing tensor | still seeds dependency from TensorMap / owner if outer |
| INOUT | read-modify-write existing tensor | still gets incoming boundary deps if outer, and still updates writer frontier |
| OUTPUT_EXISTING | overwrite existing outer tensor | no incoming overlap lookup, but still updates outgoing writer frontier |
| OUTPUT | fresh runtime-created tensor | no incoming dependency; later same-scope users must be wired explicitly |

Short version:

  • manual mode only replaces same-scope auto-derivation
  • manual mode does not remove outer-boundary correctness
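The table can be read as a small decision function. The sketch below encodes one reading of it; `ToyArgKind`, `ToyBoundaryActions`, and `boundary_actions` are hypothetical names. It additionally assumes that manual-local tensors skip the TensorMap path entirely, which the table itself does not spell out for INOUT and OUTPUT_EXISTING.

```cpp
enum class ToyArgKind { INPUT, INOUT, OUTPUT_EXISTING, OUTPUT };

struct ToyBoundaryActions {
    bool lookup_incoming;  // seed incoming deps from TensorMap / owner
    bool update_frontier;  // record this task as the tensor's writer frontier
};

// `is_outer` is true when the tensor was created outside the manual scope.
constexpr ToyBoundaryActions boundary_actions(ToyArgKind kind, bool is_outer) {
    switch (kind) {
        case ToyArgKind::INPUT:
            return {is_outer, false};     // read: boundary lookup only
        case ToyArgKind::INOUT:
            return {is_outer, is_outer};  // read-modify-write: both sides
        case ToyArgKind::OUTPUT_EXISTING:
            return {false, is_outer};     // overwrite: no lookup, frontier only
        case ToyArgKind::OUTPUT:
            return {false, false};        // fresh tensor: explicit edges only
    }
    return {false, false};
}
```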

6. What state the runtime now maintains

The runtime keeps manual-scope state in a narrow form:

  • scope_tasks[]: tasks owned by the current scope
  • manual_edges[]: explicit same-scope producer -> consumer edges
  • manual_task_meta[]: compact per-task finalize metadata

At manual scope_end, it iterates tasks in submit order and:

  1. merges cached external producers with explicit same-scope edges
  2. dedups them
  3. realizes fanin/fanout into the scheduler
  4. batch-publishes the tasks

This is why the current bottleneck is still concentrated at manual scope_end.
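The merge and dedup steps above can be sketched as follows. This is a toy model under stated assumptions: `ToyTaskMeta`, `ToyManualScopeState`, and `finalize_manual_scope` are hypothetical, producers are plain integer ids, and local task ids are mapped into the same id space by adding `local_base`. Scheduler wiring and publication are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct ToyTaskMeta {
    std::vector<std::int32_t> external_producers;  // cached at submit time
};

struct ToyManualScopeState {
    std::vector<ToyTaskMeta> scope_tasks;  // index = local id, submit order
    std::vector<std::pair<std::int32_t, std::int32_t>> manual_edges;  // producer -> consumer
};

// For each task in submit order, return its deduplicated producer list.
std::vector<std::vector<std::int32_t>> finalize_manual_scope(
        const ToyManualScopeState& state, std::int32_t local_base) {
    std::vector<std::vector<std::int32_t>> fanin(state.scope_tasks.size());
    // Start from the external producers cached during submit.
    for (std::size_t i = 0; i < state.scope_tasks.size(); ++i) {
        fanin[i] = state.scope_tasks[i].external_producers;
    }
    // Merge the explicit same-scope edges.
    for (const auto& edge : state.manual_edges) {
        assert(edge.second >= 0 &&
               static_cast<std::size_t>(edge.second) < state.scope_tasks.size());
        fanin[static_cast<std::size_t>(edge.second)].push_back(local_base + edge.first);
    }
    // Dedup each producer set.
    for (auto& producers : fanin) {
        std::sort(producers.begin(), producers.end());
        producers.erase(std::unique(producers.begin(), producers.end()),
                        producers.end());
    }
    return fanin;
}
```

Every step here is a serial walk over the scope's tasks or edges, which is consistent with the cost concentrating at scope_end in this version of the design.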

7. What manual_dep=true is, and what it is not

This PR also uses manual_dep=true in the partial-manual paged-attention example, but that should be read carefully.

It is not the semantic definition of manual mode.

It is only a per-tensor escape hatch that:

  • skips TensorMap overlap lookup/insert for that tensor
  • still keeps creator retention via owner_task_id

In the paged-attention example, the useful optimization was not “mark everything manual”.

The stable gain came mainly from suppressing repeated external-output overlap tracking on out / out_view, where same-scope ordering was already explicit.
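A minimal sketch of that escape hatch, assuming a toy TensorMap keyed by tensor id rather than by overlapping regions. `ToyTensor`, `ToyTensorMap`, and `producers_for_read` are hypothetical names; only the two behaviors listed above (skip lookup, keep creator retention) are modeled.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct ToyTensor {
    std::int32_t id;
    std::int32_t owner_task_id;  // -1 if no runtime task created it
    bool manual_dep;             // per-tensor hint
};

// Toy TensorMap: remembers only the last writer per tensor id.
class ToyTensorMap {
public:
    void insert(std::int32_t tensor_id, std::int32_t writer_task) {
        last_writer_[tensor_id] = writer_task;
    }
    std::int32_t lookup(std::int32_t tensor_id) const {
        auto it = last_writer_.find(tensor_id);
        return it == last_writer_.end() ? -1 : it->second;
    }

private:
    std::unordered_map<std::int32_t, std::int32_t> last_writer_;
};

// Producers a reader of `t` must wait for.
std::vector<std::int32_t> producers_for_read(const ToyTensor& t,
                                             const ToyTensorMap& map) {
    std::vector<std::int32_t> producers;
    // Creator retention is kept regardless of the hint.
    if (t.owner_task_id >= 0) producers.push_back(t.owner_task_id);
    // The lookup is skipped only when the hint is set.
    if (!t.manual_dep) {
        std::int32_t writer = map.lookup(t.id);
        if (writer >= 0 && writer != t.owner_task_id) producers.push_back(writer);
    }
    return producers;
}
```

The sketch also shows why the hint is sharp: with `manual_dep` set, a later writer recorded in the map is simply not seen, so some other mechanism must already order it.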

8. What was added beyond the runtime APIs

This PR also adds:

  • partial-manual paged-attention scenes
  • guard coverage for invalid patterns:
    • nested manual scope
    • blocking tensor access inside manual scope
    • self-dependency
  • outer-write boundary regression coverage
  • benchmark support for the partial-manual scenes
  • doc updates and example-layout cleanup
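Two of the guarded patterns, nested manual scopes and self-dependency, can be sketched with a small scope stack. `ToyScopeGuards` and `ToyGuardError` are hypothetical; the sketch assumes "nested manual scope" means opening a manual scope while another manual scope is still active, and it does not model the blocking tensor access guard.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

enum class ToyScopeMode { AUTO, MANUAL };
enum class ToyGuardError { OK, NESTED_MANUAL_SCOPE, SELF_DEPENDENCY };

class ToyScopeGuards {
public:
    // Reject a manual scope opened while another manual scope is active.
    ToyGuardError scope_begin(ToyScopeMode mode) {
        if (mode == ToyScopeMode::MANUAL && manual_depth_ > 0) {
            return ToyGuardError::NESTED_MANUAL_SCOPE;
        }
        modes_.push_back(mode);
        if (mode == ToyScopeMode::MANUAL) ++manual_depth_;
        return ToyGuardError::OK;
    }

    void scope_end() {
        assert(!modes_.empty());
        if (modes_.back() == ToyScopeMode::MANUAL) --manual_depth_;
        modes_.pop_back();
    }

    // Reject an edge whose producer and consumer are the same task.
    static ToyGuardError check_dependency(std::int32_t producer,
                                          std::int32_t consumer) {
        return producer == consumer ? ToyGuardError::SELF_DEPENDENCY
                                    : ToyGuardError::OK;
    }

private:
    std::vector<ToyScopeMode> modes_;
    int manual_depth_ = 0;
};
```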

9. Performance summary

Partial-manual currently beats the modified AUTO path on non-unroll paged attention, but it does not yet reach the aicpu_build_graph target.

Key branch-local results from the design doc:

| Workload | Case | aicpu_build_graph | tensormap_and_ringbuffer | partial_manual |
| --- | --- | --- | --- | --- |
| paged_attention | Case1 | 31318.9 us | 36996.3 us | 35187.6 us |
| paged_attention | Case2 | 16844.5 us | 19861.8 us | 18685.5 us |
| paged_attention_unroll | Case1 | 1412.7 us | 1323.9 us | 1321.3 us |
| paged_attention_unroll | Case2 | 705.5 us | 632.5 us | 637.5 us |

And the fresh rerun on current branch state gave:

  • paged_attention / Case1
    • aicpu_build_graph = 30441.1 us
    • partial_manual = 34961.3 us
  • paged_attention / Case2
    • aicpu_build_graph = 16832.6 us
    • partial_manual = 18144.5 us

So the main open problem is still:

  • non-unroll paged-attention does not yet match aicpu_build_graph
  • the remaining gap is concentrated in the manual scope_end replay / publish path

10. Why this PR is draft

This draft is ready for design and implementation review, but it does not yet claim that the performance target has been achieved.

The key thing to keep in mind while reading the code is:

  • this PR is trying to add explicit same-scope control
  • while preserving TensorMap boundary correctness
  • and keeping AUTO mode as the unchanged default path

uv-xiao (Contributor, Author) commented Apr 8, 2026

Posting the delta since the last PR update.

What changed since the previous comment

The main runtime hot path has been simplified further:

Before
manual submit
  -> cache boundary/manual bookkeeping
scope_end
  -> replay same-scope deps
  -> rebuild publish state
  -> publish

Now
manual submit
  -> discover boundary deps now
  -> wire same-scope explicit deps now
scope_end
  -> validate
  -> dep_pool_mark fixup
  -> publish

Concretely, the recent changes were:

  • moved manual external/boundary wiring to submit time
  • collapsed manual-scope bookkeeping so same-scope explicit edges no longer wait for a replay-heavy scope_end
  • removed dead manual bookkeeping state from the orchestrator
  • collapsed manual scope_end validation + dep_pool_mark repair into one pass
  • refreshed the design note so it matches the implemented runtime, not the older replay model

The intended mental model is now:

MANUAL scope

manual-local tensor
  -> skip TensorMap
  -> explicit pto2_rt_add_dependency(...)

boundary tensor
  -> owner retention
  -> TensorMap lookup / insert

scope_end
  -> small publish barrier only
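The submit-time wiring model can be sketched as a toy scope in which edges update fanin and fanout immediately and scope_end only publishes. `ToyTask` and `ToyDeferredScope` are hypothetical names; validation, the dep_pool_mark fixup, and boundary discovery are deliberately left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct ToyTask {
    std::int32_t pending_fanin = 0;
    std::vector<std::int32_t> fanout;
    bool published = false;
};

class ToyDeferredScope {
public:
    // Record a task; it stays invisible to the scheduler until scope_end.
    std::int32_t submit() {
        tasks_.emplace_back();
        return static_cast<std::int32_t>(tasks_.size() - 1);
    }

    // Wire the explicit edge right away instead of replaying it later.
    void add_dependency(std::int32_t producer, std::int32_t consumer) {
        assert(producer != consumer);
        tasks_.at(static_cast<std::size_t>(producer)).fanout.push_back(consumer);
        ++tasks_.at(static_cast<std::size_t>(consumer)).pending_fanin;
    }

    // Publish barrier: one pass that marks tasks published and returns
    // the ids that have no pending producers.
    std::vector<std::int32_t> scope_end() {
        std::vector<std::int32_t> ready;
        for (std::size_t i = 0; i < tasks_.size(); ++i) {
            tasks_[i].published = true;
            if (tasks_[i].pending_fanin == 0) {
                ready.push_back(static_cast<std::int32_t>(i));
            }
        }
        return ready;
    }

    const ToyTask& task(std::int32_t id) const {
        return tasks_.at(static_cast<std::size_t>(id));
    }

private:
    std::vector<ToyTask> tasks_;
};
```

Compared with a replay at scope_end, the only work left at the barrier in this sketch is a single linear pass over the scope's tasks.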

Fresh comparison

Fresh rerun settings:

  • platform: a2a3
  • device: 6
  • rounds: 5
  • PTO-ISA commit: 6622890

Units are elapsed_us (orch_us).
aicpu_build_graph does not emit the same orch timing lines, so only elapsed is shown there.

paged_attention

| Case | aicpu_build_graph | tensormap_and_ringbuffer_unmodified | tensormap_and_ringbuffer | tensormap_and_ringbuffer_partial_manual |
| --- | --- | --- | --- | --- |
| Case1 | 31037.8 | 36992.8 (36991.9) | 36791.2 (36790.5) | 31563.9 (31407.2) |
| Case2 | 16719.2 | 18753.6 (18752.8) | 18615.9 (18615.1) | 16757.6 (16343.9) |

paged_attention_unroll

| Case | aicpu_build_graph | tensormap_and_ringbuffer_unmodified | tensormap_and_ringbuffer | tensormap_and_ringbuffer_partial_manual |
| --- | --- | --- | --- | --- |
| Case1 | 1421.2 | 1320.0 (853.6) | 1322.5 (820.0) | 1327.0 (835.5) |
| Case2 | 707.8 | 632.5 (383.5) | 635.9 (391.8) | 633.7 (365.5) |

What the new numbers mean

  • AUTO remains effectively zero-overhead versus the untouched tensormap baseline.

    • paged_attention/Case1: 36791.2 vs 36992.8 (-0.5%)
    • paged_attention/Case2: 18615.9 vs 18753.6 (-0.7%)
  • Partial-manual now closes the non-unroll gap to aicpu_build_graph.

    • Case1: 31563.9 vs 31037.8 (+1.7%)
    • Case2: 16757.6 vs 16719.2 (+0.2%)
  • On paged_attention_unroll, the AUTO path was already amortizing most of the orchestration cost, so partial-manual brings little extra benefit there. That is expected.
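The percentages quoted above follow from the elapsed times in the paged_attention table. A small helper makes the arithmetic reproducible; `relative_gap_percent` and `round_to_tenth` are hypothetical names introduced here.

```cpp
#include <cassert>
#include <cmath>

// Relative difference of `measured_us` against `baseline_us`, in percent.
inline double relative_gap_percent(double measured_us, double baseline_us) {
    assert(baseline_us > 0.0);
    return (measured_us - baseline_us) / baseline_us * 100.0;
}

// Round to one decimal place, matching how the comment reports the gaps.
inline double round_to_tenth(double x) {
    return std::round(x * 10.0) / 10.0;
}
```

For example, `round_to_tenth(relative_gap_percent(31563.9, 31037.8))` gives 1.7 for partial-manual against aicpu_build_graph on Case1, and `round_to_tenth(relative_gap_percent(36791.2, 36992.8))` gives -0.5 for AUTO against the untouched baseline.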

Bottom line

The previous PR comment said the non-unroll target was still not met because manual mode was paying a large serial scope_end replay/publish cost.

That is no longer the case on the fresh rerun set above:

  • AUTO keeps the zero-overhead property
  • partial-manual is now in the same performance band as aicpu_build_graph on non-unroll paged attention
  • the design doc has been updated to match the current implementation and benchmark matrix

uv-xiao and others added 26 commits April 9, 2026 23:22
- Capture the hybrid scoped model for tensormap_and_ringbuffer
- Define same-scope explicit edges versus cross-scope TensorMap behavior
- Record ownership, scope, nesting, and testing constraints before implementation
- Force outer-scope reads in manual scope through TensorMap boundary seeding
- Remove the invalid inner-created outer-alias case and keep Tensor layout unchanged
- Add explicit scope, tooling, and narrow-change requirements for the implementation PR
- add PTO2ScopeMode::MANUAL, manual submit APIs, and deferred scope_end replay for tensormap_and_ringbuffer
- add paged_attention_partial_manual plus paged_attention*_partial_manual ST coverage for nested outer-normal and inner-manual scopes
- repoint AGENTS.md/CLAUDE.md toward the .agents layout and add a placeholder so the directory is tracked
- add a benchmark_rounds runtime selector for the partial-manual scenes
- keep the current tensormap runtime on the direct selector path
- record fresh 2026-04-06 hardware comparison data in the design doc
- replace the interim comparison table with the newest fresh device-2 reruns
- keep the benchmark workflow section aligned with the partial-manual selector
- record the remaining AUTO-path and partial-manual performance gaps
- cache manual-local tensor classification to avoid repeated scope scans
- fuse manual publish and scope-end release into one scheduler pass
- limit manual scope sync to active rings and keep submit-path work deferred
- chunk paged_attention_partial_manual scopes and add carry deps between updates
Reduce the manual-scope chunk size in the heavy paged-attention scene and drop the extra cross-update dependency chain. The previous chunking shape can deadlock the dep pool under benchmark load, while the smaller chunk keeps the benchmark path stable and lowers the manual-scope overhead.
Manual submit must stay a cheap metadata-recording path and defer TensorMap lookup/insert plus dep-pool fanin wiring to manual scope_end.

This reverts the duplicated submit-time work from 0fd6fbc and restores the separate publish/on_scope_end order so manual scopes do not attach fanout edges after releasing their scope reference.
- update the manual dependency design doc to make submit-time boundary discovery and TensorMap-free manual scope_end explicit
- cache and retain external producers during manual submit, then merge them with explicit manual edges at publish time
- keep the heavy manual paged-attention benchmark on device 3 moving in the right direction without changing example code
- dedupe explicit manual edges when they are recorded and keep an exact incoming edge count per consumer
- append local explicit producers directly at scope_end and skip lock/task_state checks for unpublished same-scope producers
- keep overflow validation and dep-pool publish ordering unchanged so the optimization stays within existing scheduler invariants
- rewrite the non-unroll partial-manual example to use one manual scope per q tile
- move the hub allocation into the manual scope and serialize update tasks explicitly
- drop the chunked manual-scope pattern that inflated partial-manual orchestration cost
- mark external inputs and outputs as manual-dep boundaries in the
  non-unroll partial-manual paged attention example
- skip repeated overlap tracking for query, kv-cache, and final output
  views that are already ordered by the explicit manual dependency chain
- keep the manual-scope methodology while improving orch time on real
  device for both paged-attention cases
- replace the stale benchmark section with fresh 2026-04-08 device-2 results
- document how the benchmark wrapper selects new, unmodified, and partial-manual variants
- record that partial-manual improved on non-unroll but still misses the aicpu_build_graph target
- replace stale benchmark data in the design doc with fresh device-3
  measurements for the four paged-attention runtime lanes
- document how benchmark_rounds.sh selects the partial-manual scenes
- record the non-unroll boundary-hint A/B results and the safety limits of
  using manual_dep=true as an example-level boundary annotation
- explain submit-time versus scope-end work in the manual-scope path
- document how in-scope and cross-scope tensors are classified and handled
- bind the kept example optimizations to measured orch gains and remove
  stale scope-end frontier wording
- add a short rationale section tying each manual-scope rule to the
  incorrect or too-expensive alternative it avoids
- keep the design doc aligned with the current implemented split between
  TensorMap boundary discovery and scope-end explicit-edge replay
- delete the copied tensormap_and_ringbuffer_unmodified runtime and ST scenes
- keep branch docs and benchmark helpers limited to supported runtimes
- enforce examples/{arch}/{runtime}/{name} in tracked command docs
- rewrite example and ST path references to use explicit arch prefixes
- add a small hardware test helper that respects PTO_TEST_DEVICE_ID
- fall back to the lowest NPU with no running processes from npu-smi
- avoid blocking manual-scope hardware tests behind busy device 0 on shared machines
uv-xiao added 9 commits April 10, 2026 00:08
- widen manual-local masks to 64 bits so manual submit handles all MAX_TENSOR_ARGS entries without truncation
- keep realloc failures from dropping live metadata buffers by setting a fatal out-of-memory runtime error before returning
- simplify the partial-manual paged-attention valid_len calculation with std::min
- realize external producer fanout during manual submit while keeping the publish barrier at scope_end
- shrink manual scope_end to replay only same-scope explicit edges
- keep manual-scope validation and boundary semantics unchanged
- remove dead manual replay metadata from the manual-scope path
- skip tensormap sync when a manual submit stays fully in-scope
- keep only the publish barrier and dep-pool watermark fixup at scope_end

On device 4 with 5 rounds, paged_attention_partial_manual improved
from 35.27 ms / 35.12 ms orch to 31.60 ms / 31.44 ms orch for
Case1, and from 19.80 ms / 19.38 ms orch to 17.93 ms / 17.49 ms
orch for Case2.
- rewrite the design note to match the current manual submit and publish-only scope_end implementation
- record the fresh four-way paged-attention comparison and benchmark entrypoints, including the detached worktree flow for the old runtime
- remove the dead manual_dep_pool_reserve state from the orchestrator
- replace stale replay-heavy scope_end description with the current submit-time wiring model
- document how AUTO, MANUAL, and benchmark selectors map onto the paged-attention scenes
- record the fresh device-6 four-runtime comparison and the observed gains for partial-manual and zero-overhead AUTO
Merge manual scope-end validation and dep-pool watermark repair into a single pass.

This keeps the manual publish path behavior unchanged while trimming one serial walk over the scope task list.
- repair the rebased tensormap submit prologue and task-id write-back
- restore partial-manual hub kernel sources under the example trees
- repoint the partial-manual configs so hardware benchmarks build again
- update the rebased partial-manual unroll orchestration to match the current qk/pv kernel ABI by passing block_table as a tensor plus bt_offset as a scalar
- rerun the device validation for both unroll cases after the fix
- refresh the design doc with the rebase root cause and the new 4-way benchmark results on device 4 with PTO-ISA d96c8784
uv-xiao force-pushed the manual-dep-for-tensormap branch from 30b706a to e5fa1bc on April 10, 2026 02:39
uv-xiao (Contributor, Author) commented Apr 10, 2026

Posting the current merge-forward progress after force-updating the PR head to e5fa1bc.

Rebase status

This PR is now rebased onto current main (fb43ed3).

The history was rewritten on purpose: the old PR head was based on an older mainline, while the current branch replays the manual-scope stack onto the newest runtime/sim/test infrastructure.

old manual-scope stack
    |
    | rebase onto new main
    v
new main runtime/sim/test APIs
    +
manual-scope feature stack reapplied
    +
rebase-only fixes for drifted examples / scripts

What remains the actual feature work

These are the branch changes that are still the core of this PR:

  • add enum-based scope mode selection, with AUTO as default and PTO2_SCOPE(PTO2ScopeMode::MANUAL) for explicit same-scope dependency control
  • add manual submit APIs plus explicit pto2_rt_add_dependency(...)
  • keep AUTO on the zero-overhead path
  • keep boundary / cross-scope correctness on the TensorMap + owner-retention path
  • move the expensive manual replay logic out of the old scope_end-heavy model, so boundary discovery happens at submit time and scope_end is a much smaller publish barrier
  • add guard tests, boundary tests, partial-manual paged-attention scenes, benchmark support, and the detailed design/benchmark note in docs/manual-dep-for-tensormap-design.md

What changed because of the rebase to newest main

These are not new feature goals; they are alignment work required to land the branch on top of current main:

  • adapt the branch to the newer runtime/simulation substrate already merged in main
    • handle-based DeviceRunner / runtime API refactors
    • CPU-sim / TPUSH-TPOP isolation changes
    • newer test and example harness layout
  • adapt the paged-attention examples to the newer mainline kernel ABI
    • mainline paged-attention orchestration now passes block_table as a tensor and bt_offset as a scalar
  • clean up branch-only scaffolding that should not stay in the PR branch
    • the unmodified tensormap baseline stays worktree-only for benchmarking and is no longer carried in this branch
  • refresh benchmark paths / scripts / docs so they match the post-rebase tree

Rebase-specific fix in the current tip

The latest commit (e5fa1bc, Fix: align rebased unroll partial-manual ABI) fixes a rebase-induced example mismatch, not a manual-scope runtime bug.

The issue was:

main changed paged_attention_unroll kernel ABI
    -> partial-manual unroll example still used old argument pattern
    -> rebased scene timed out / failed

The fix updates the partial-manual unroll orchestration to pass the same logical inputs the rebased mainline kernels now expect:

  • block_table as a tensor input
  • bt_offset as the per-block scalar offset

So the failure came from stale example-side orchestration after the rebase, not from the manual-scope dependency runtime itself.

Verification after the rebase fix

Verified on real device after the ABI fix:

  • paged_attention_unroll_partial_manual / Case1: pass
  • paged_attention_unroll_partial_manual / Case2: pass

The design note has also been refreshed with the post-rebase benchmark commands and current comparison tables.

Reviewer-facing takeaway

The important thing to read in this force-push is:

  • the manual-scope feature set is still the same product change
  • part of the diff churn comes from rebasing onto a significantly newer main
  • the latest tip mainly resolves rebase drift in examples / scripts so the rebased branch is runnable again on current main

Fresh rebased performance table

Settings:

  • platform: a2a3
  • device: 4
  • rounds: 5
  • PTO-ISA commit: d96c8784

Units are elapsed_us (orch_us). aicpu_build_graph does not emit the same orch timing lines, so only elapsed time is shown there.

paged_attention

| Case | aicpu_build_graph | tensormap_and_ringbuffer_unmodified | tensormap_and_ringbuffer | tensormap_and_ringbuffer_partial_manual |
| --- | --- | --- | --- | --- |
| Case1 | 29937.7 | 36095.9 (36094.9) | 39148.7 (39148.3) | 34186.3 (34025.7) |
| Case2 | 16762.7 | 18639.5 (18635.1) | 19813.0 (19812.7) | 18028.7 (17618.4) |

paged_attention_unroll

| Case | aicpu_build_graph | tensormap_and_ringbuffer_unmodified | tensormap_and_ringbuffer | tensormap_and_ringbuffer_partial_manual |
| --- | --- | --- | --- | --- |
| Case1 | 1425.3 | 1325.6 (835.3) | 1173.2 (992.0) | 1160.4 (968.8) |
| Case2 | 693.0 | 628.7 (380.7) | 567.9 (435.6) | 561.9 (416.6) |

uv-xiao marked this pull request as ready for review on April 10, 2026 03:05