
[WIP] Perf: cache hash and prefetch chain in TensorMap lookup/insert#464

Open
chenshengxin2026 wants to merge 1 commit into hw-native-sys:main from chenshengxin2026:perf/orch-lookup-prefetch-hash-cache

Conversation

Contributor

chenshengxin2026 commented Apr 7, 2026

Summary

  • Cache the hash(addr) result from lookup() and reuse it in the subsequent insert() call for INOUT tensors, eliminating a redundant 64-bit multiply per tensor
  • Add software prefetch of next_in_bucket during chain traversal to hide memory latency on chains longer than one entry
  • Add lookup/insert/link_entry overloads that accept precomputed hash
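A rough sketch of the two ideas above follows. Note this is an illustrative assumption, not the actual PTO2TensorMap code: the struct layout, the names (`Entry`, `TensorMap`, `tensor_map_lookup_h`, `tensor_map_insert_h`), the bucket count, and the Fibonacci-hash constant are all hypothetical; only the `next_in_bucket` chaining and the lookup-then-insert pattern come from the PR description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NBUCKETS 64 /* power of two; hash yields a 6-bit bucket index */

typedef struct Entry {
    uint64_t addr;                /* tensor base address used as the key */
    int value;                    /* payload, e.g. a dependency-record index */
    struct Entry *next_in_bucket; /* collision chain */
} Entry;

typedef struct {
    Entry *buckets[NBUCKETS];
} TensorMap;

/* One 64-bit multiply (Fibonacci hashing). This multiply is the cost the
 * precomputed-hash overloads avoid repeating on the insert path. */
static inline uint64_t hash_addr(uint64_t addr) {
    return (addr * 0x9E3779B97F4A7C15ull) >> 58; /* top 6 bits -> 0..63 */
}

/* Lookup variant that also hands back the computed hash so a following
 * insert() can reuse it. While comparing the current entry, it issues a
 * software prefetch of the next chain entry, hiding memory latency on
 * chains longer than one entry. */
static Entry *tensor_map_lookup_h(TensorMap *m, uint64_t addr,
                                  uint64_t *hash_out) {
    uint64_t h = hash_addr(addr);
    *hash_out = h;
    for (Entry *e = m->buckets[h]; e != NULL; e = e->next_in_bucket) {
        __builtin_prefetch(e->next_in_bucket); /* hint only; NULL is safe */
        if (e->addr == addr)
            return e;
    }
    return NULL;
}

/* Insert variant taking the hash precomputed by lookup (chains at head). */
static void tensor_map_insert_h(TensorMap *m, Entry *e, uint64_t h) {
    e->next_in_bucket = m->buckets[h];
    m->buckets[h] = e;
}
```

For an INOUT tensor the call pattern would then be: `tensor_map_lookup_h(&m, addr, &h)` misses, and the subsequent `tensor_map_insert_h(&m, e, h)` reuses `h` instead of rehashing `addr`.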

Benchmark Results

Ascend910 (device 11, 100 rounds, 3 runs averaged):

| Example | Baseline (us) | Optimized (us) | Delta |
| --- | --- | --- | --- |
| alternating_matmul_add | 977.9 | 971.4 | -0.7% |
| benchmark_bgemm | 747.6 | 719.2 | -3.8% |
| paged_attention_unroll Case1 | 1165.0 | 1158.2 | -0.6% |
| paged_attention_unroll Case2 | 555.2 | 554.0 | -0.2% |
| batch_paged_attention | 3259.2 | 3239.0 | -0.6% |

The bgemm improvement is expected: it has the highest lookup+dep percentage (45.8% of orch time) and uses INOUT tensors extensively.


gemini-code-assist bot left a comment


Code Review

This pull request optimizes the PTO2TensorMap by caching hash values during lookup and reusing them during insertion for INOUT tensors in pto2_submit_mixed_task. Additionally, it introduces a prefetch instruction in the lookup loop to improve cache performance when traversing bucket chains. I have no feedback to provide.

chenshengxin2026 changed the title from "Perf: cache hash and prefetch chain in TensorMap lookup/insert" to "[WIP] Perf: cache hash and prefetch chain in TensorMap lookup/insert" on Apr 7, 2026
- Cache the hash(addr) result from lookup() and reuse it in the
  subsequent insert() call for INOUT tensors, eliminating a redundant
  64-bit multiply per tensor
- Add software prefetch of next_in_bucket during chain traversal to
  hide memory latency on chains longer than one entry
- Add lookup/insert/link_entry overloads that accept precomputed hash

Benchmarked on Ascend910 (device 11, 100 rounds, 3 runs averaged):
benchmark_bgemm -3.8%, other workloads -0.2% to -0.7%.
chenshengxin2026 force-pushed the perf/orch-lookup-prefetch-hash-cache branch from b688858 to a0eff4b on April 7, 2026 at 11:24