
[WIP] Perf: cache hash and prefetch chain in TensorMap lookup/insert#464

Open
chenshengxin2026 wants to merge 1 commit into hw-native-sys:main from chenshengxin2026:perf/orch-lookup-prefetch-hash-cache

Conversation

Contributor

chenshengxin2026 commented Apr 7, 2026

Summary

  • Cache the hash(addr) result from lookup() and reuse it in the subsequent insert() call for INOUT tensors, eliminating a redundant 64-bit multiply per tensor
  • Add software prefetch of next_in_bucket during chain traversal to hide memory latency on chains longer than one entry
  • Add lookup/insert/link_entry overloads that accept precomputed hash
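A rough sketch of the two ideas above follows. Note this is an illustrative assumption, not the actual PTO2TensorMap code: the struct layout, the names (`Entry`, `TensorMap`, `tensor_map_lookup_h`, `tensor_map_insert_h`), the bucket count, and the Fibonacci-hash constant are all hypothetical; only the `next_in_bucket` chaining and the lookup-then-insert pattern come from the PR description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NBUCKETS 64 /* power of two; hash yields a 6-bit bucket index */

typedef struct Entry {
    uint64_t addr;                /* tensor base address used as the key */
    int value;                    /* payload, e.g. a dependency-record index */
    struct Entry *next_in_bucket; /* collision chain */
} Entry;

typedef struct {
    Entry *buckets[NBUCKETS];
} TensorMap;

/* One 64-bit multiply (Fibonacci hashing). This multiply is the cost the
 * precomputed-hash overloads avoid repeating on the insert path. */
static inline uint64_t hash_addr(uint64_t addr) {
    return (addr * 0x9E3779B97F4A7C15ull) >> 58; /* top 6 bits -> 0..63 */
}

/* Lookup variant that also hands back the computed hash so a following
 * insert() can reuse it. While comparing the current entry, it issues a
 * software prefetch of the next chain entry, hiding memory latency on
 * chains longer than one entry. */
static Entry *tensor_map_lookup_h(TensorMap *m, uint64_t addr,
                                  uint64_t *hash_out) {
    uint64_t h = hash_addr(addr);
    *hash_out = h;
    for (Entry *e = m->buckets[h]; e != NULL; e = e->next_in_bucket) {
        __builtin_prefetch(e->next_in_bucket); /* hint only; NULL is safe */
        if (e->addr == addr)
            return e;
    }
    return NULL;
}

/* Insert variant taking the hash precomputed by lookup (chains at head). */
static void tensor_map_insert_h(TensorMap *m, Entry *e, uint64_t h) {
    e->next_in_bucket = m->buckets[h];
    m->buckets[h] = e;
}
```

For an INOUT tensor the call pattern would then be: `tensor_map_lookup_h(&m, addr, &h)` misses, and the subsequent `tensor_map_insert_h(&m, e, h)` reuses `h` instead of rehashing `addr`.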

Benchmark Results

Ascend910 (device 11, 100 rounds, 3 runs averaged):

| Example | Baseline (us) | Optimized (us) | Delta |
| --- | --- | --- | --- |
| alternating_matmul_add | 977.9 | 971.4 | -0.7% |
| benchmark_bgemm | 747.6 | 719.2 | -3.8% |
| paged_attention_unroll Case1 | 1165.0 | 1158.2 | -0.6% |
| paged_attention_unroll Case2 | 555.2 | 554.0 | -0.2% |
| batch_paged_attention | 3259.2 | 3239.0 | -0.6% |

The bgemm improvement is expected: it has the highest lookup+dep percentage (45.8% of orch time) and uses INOUT tensors extensively.


gemini-code-assist bot left a comment


Code Review

This pull request optimizes the PTO2TensorMap by caching hash values during lookup and reusing them during insertion for INOUT tensors in pto2_submit_mixed_task. Additionally, it introduces a prefetch instruction in the lookup loop to improve cache performance when traversing bucket chains. I have no feedback to provide.

chenshengxin2026 changed the title from "Perf: cache hash and prefetch chain in TensorMap lookup/insert" to "[WIP] Perf: cache hash and prefetch chain in TensorMap lookup/insert" on Apr 7, 2026
- Cache the hash(addr) result from lookup() and reuse it in the
  subsequent insert() call for INOUT tensors, eliminating a redundant
  64-bit multiply per tensor
- Add software prefetch of next_in_bucket during chain traversal to
  hide memory latency on chains longer than one entry
- Add lookup/insert/link_entry overloads that accept precomputed hash

Benchmarked on Ascend910 (device 11, 100 rounds, 3 runs averaged):
benchmark_bgemm -3.8%, other workloads -0.2% to -0.7%.
chenshengxin2026 force-pushed the perf/orch-lookup-prefetch-hash-cache branch from b688858 to a0eff4b on April 7, 2026 at 11:24