First part of a HashSortedMap #107
Merged
24 commits
10f276a
Add SIMD and no-hint benchmark variants for PrefixHashMap
aneubeck 77742a2
Fast version
aneubeck 08e46dc
make it generic
aneubeck 8464421
fix sse version
aneubeck 0244f8f
cleanup
aneubeck 7d09f3f
replace vec with box
aneubeck fba4bb2
add entry function
aneubeck 127798c
Revert unnecessary change.
aneubeck 7eaf609
Simplify enums
aneubeck 0ecf083
some documentation
aneubeck 200f837
reorganize
aneubeck 5fffdea
Merge branch 'main' into aneubeck/prefixmap
aneubeck 427d982
remove gxhash which doesn't compile with some configurations
aneubeck 7e8097d
Merge branch 'aneubeck/prefixmap' of https://github.com/github/rust-g…
aneubeck 7195a44
Update crates/hash-sorted-map/src/lib.rs
aneubeck 4e1a038
fix initial capacity (and typo)
aneubeck 865757a
lints + build errors
aneubeck 0ebfb79
and more :(
aneubeck d213be8
Update equivalence.rs
aneubeck f36c8df
Add test for growth with collisions
jorendorff 908d3e9
Update crates/hash-sorted-map/src/hash_sorted_map.rs
aneubeck 7cac054
Apply suggestions from code review
aneubeck 7d2a74b
address review comments
aneubeck 2a5c666
last comments
aneubeck
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,10 @@ | ||
| [package] | ||
| name = "hash-sorted-map" | ||
| authors = ["The blackbird team <support@github.com>"] | ||
| version = "0.1.0" | ||
| edition = "2021" | ||
| description = "A hash map with hash-ordered iteration and linear-time merge, designed for search-index term maps." | ||
| repository = "https://github.com/github/rust-gems" | ||
| license = "MIT" | ||
| keywords = ["hashmap", "sorted", "merge", "simd"] | ||
| categories = ["algorithms", "data-structures"] |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,171 @@ | ||
| # HashSortedMap vs. Rust Swiss Table (hashbrown): Optimization Analysis | ||
|
|
||
| ## Executive Summary | ||
|
|
||
| `HashSortedMap` is a Swiss-table-inspired hash map that uses **overflow | ||
| chaining** (instead of open addressing), **SIMD group scanning** (NEON/SSE2), | ||
| a **slot-hint fast path**, and an **optimized growth strategy**. It is generic | ||
| over key type, value type, and hash builder. | ||
|
|
||
| This document analyzes the design trade-offs versus | ||
| [hashbrown](https://github.com/rust-lang/hashbrown) and records the | ||
| experimental results that guided the current design. | ||
|
|
||
| --- | ||
|
|
||
| ## Architecture Comparison | ||
|
|
||
| ``` | ||
| ┌──────────────────────────────────────────────────────────────────┐ | ||
| │ hashbrown Swiss Table │ | ||
| │ │ | ||
| │ Single contiguous allocation (SoA): │ | ||
| │ [Padding] [T_n ... T_1 T_0] [CT_0 CT_1 ... CT_n] [CT_extra] │ | ||
| │ data control bytes (mirrored) │ | ||
| │ │ | ||
| │ • Open addressing, triangular probing │ | ||
| │ • 16-byte groups (SSE2) or 8-byte groups (NEON/generic) │ | ||
| │ • EMPTY / DELETED / FULL tag states │ | ||
| └──────────────────────────────────────────────────────────────────┘ | ||
|
|
||
| ┌──────────────────────────────────────────────────────────────────┐ | ||
| │ HashSortedMap │ | ||
| │ │ | ||
| │ Vec<Group<K,V>> where each Group (AoS): │ | ||
| │ { ctrl: [u8; 8], keys: [MaybeUninit<K>; 8], │ | ||
| │ values: [MaybeUninit<V>; 8], overflow: u32 } │ | ||
| │ │ | ||
| │ • Overflow chaining (linked groups) │ | ||
| │ • 8-byte groups with NEON/SSE2/scalar SIMD scan │ | ||
| │ • EMPTY / FULL tag states only (insertion-only, no deletion) │ | ||
| │ • Slot-hint fast path │ | ||
| └──────────────────────────────────────────────────────────────────┘ | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ## Optimizations Investigated | ||
|
|
||
| ### 1. SIMD Group Scanning ✅ Implemented | ||
|
|
||
| Platform-specific SIMD for control byte matching: | ||
| - **aarch64**: NEON `vceq_u8` + `vreinterpret_u64_u8` (8-byte groups) | ||
| - **x86_64**: SSE2 `_mm_cmpeq_epi8` + `_mm_movemask_epi8` (16-byte groups) | ||
| - **Fallback**: Scalar u64 zero-byte detection trick | ||
|
|
||
| **Benchmark result**: ~5% faster than scalar on Apple M-series. The gain is | ||
| modest because the slot-hint fast path often skips the group scan entirely. | ||
|
|
||
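The scalar fallback mentioned above can be sketched as follows. This is a minimal illustration of the zero-byte detection trick on an 8-byte control word, not the code from this PR: the names `match_tag` and `matching_slots` are hypothetical, and the exact (false-positive-free) form of the trick is used here as an assumption.

```rust
/// Returns a mask with the high bit set in every byte lane of `ctrl`
/// that equals `tag`. (Hypothetical helper, for illustration only.)
fn match_tag(ctrl: [u8; 8], tag: u8) -> u64 {
    const LO: u64 = 0x0101_0101_0101_0101;
    const HI: u64 = 0x8080_8080_8080_8080;
    let word = u64::from_le_bytes(ctrl);
    // Broadcast `tag` into all lanes and XOR: matching lanes become 0x00.
    let x = word ^ (LO * tag as u64);
    // Per lane: adding 0x7F to the low 7 bits sets the high bit iff they are
    // non-zero. The per-lane sum is at most 0xFE, so no carry crosses lanes.
    let t = (x & !HI).wrapping_add(!HI);
    // A lane is zero iff neither its low 7 bits nor its high bit were set.
    !(t | x) & HI
}

/// Converts the lane mask into slot indices, lowest slot first.
fn matching_slots(ctrl: [u8; 8], tag: u8) -> Vec<usize> {
    let mut mask = match_tag(ctrl, tag);
    let mut slots = Vec::new();
    while mask != 0 {
        slots.push((mask.trailing_zeros() / 8) as usize);
        mask &= mask - 1; // clear the lowest set bit
    }
    slots
}
```

The SIMD variants produce the same kind of per-lane match mask with a single compare instruction; the scalar form only needs ordinary 64-bit arithmetic.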
| ### 2. Open Addressing with Triangular Probing ❌ Rejected | ||
|
|
||
| Open addressing is not really an option for this hash map, since probing into neighbouring groups would prevent efficient sorted iteration. | ||
| It also showed no performance improvement over the linked overflow-buffer approach. | ||
| The main benefit of triangular probing is that it permits a much higher load factor and therefore lower memory consumption, which is not our main concern. | ||
|
|
||
| **Benchmark result**: **40% slower** than overflow chaining. With the AoS | ||
| layout, each group is ~112 bytes, so probing to the next group jumps over | ||
| large memory regions. Overflow chaining with the slot-hint fast path is | ||
| faster because most inserts land in the first group. | ||
|
|
||
| ### 3. SoA Memory Layout ❌ Rejected | ||
|
|
||
| Tested a SoA variant (`SoaHashSortedMap`) with separate control byte and | ||
| key/value arrays, combined with triangular probing. | ||
|
|
||
| **Benchmark result**: **Slowest variant** — even slower than AoS open | ||
| addressing. The two-Vec SoA layout doubles TLB/cache pressure versus | ||
| hashbrown's single-allocation layout. Without the single-allocation trick, | ||
| SoA is worse than AoS for this use case. | ||
|
|
||
| ### 4. Capacity Sizing ✅ Implemented | ||
|
|
||
| Without correct sizing, inserts always paid the penalty of a grow operation. | ||
|
|
||
| **Fix**: Changed to ~70% max load factor. This was the **single biggest improvement** — HashSortedMap went from 2× slower to matching hashbrown. | ||
|
|
||
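To make the ~70% figure concrete, here is a rough sketch of how a requested capacity could be turned into a group count. Everything in it is an assumption for illustration: the function name `groups_for`, the rounding to a power of two, and the use of 8 slots per group taken from the architecture diagram. The actual sizing code in this PR may differ.

```rust
/// Slots per group, as in the architecture diagram above.
const SLOTS_PER_GROUP: usize = 8;

/// Hypothetical sizing rule: the smallest power-of-two group count such that
/// `n` entries fill at most 70% of the primary slots.
fn groups_for(n: usize) -> usize {
    // ceil(n / (8 * 0.7)) computed in integers as ceil(10 * n / 56).
    let min_groups = (n * 10).div_ceil(SLOTS_PER_GROUP * 7);
    // Always allocate at least one group (assumed), then round up.
    min_groups.max(1).next_power_of_two()
}
```

Under these assumptions, 1000 entries need at least 179 groups, which rounds up to 256 groups (2048 slots), so the map is roughly half full after all inserts and never has to grow.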
| ### 5. Optimized Growth ✅ Implemented | ||
|
|
||
| The original `grow()` called the full `insert()` for each element (including | ||
| duplicate checking and overflow traversal). hashbrown uses: | ||
| - `find_insert_index` (skip duplicate check) | ||
| - `ptr::copy_nonoverlapping` (raw memory copy) | ||
| - Bulk counter updates | ||
|
|
||
| **Fix**: Added `insert_for_grow()` that skips duplicate checking, uses raw | ||
| pointer copies, and iterates occupied slots via bitmask. | ||
|
|
||
| **Benchmark result**: Growth is now **2× faster** than hashbrown (4.85 µs vs | ||
| 9.77 µs for 3 resize rounds). | ||
|
|
||
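Iterating occupied slots via a bitmask, as described for `insert_for_grow()`, can be sketched like this. The helper name `for_each_occupied` and the one-bit-per-slot `u8` mask are assumptions made for the example; the PR's own representation is not shown in this excerpt.

```rust
/// Calls `f` with the index of every set bit in `occupied`, lowest first.
/// Each bit is assumed to stand for one of the 8 slots in a group.
fn for_each_occupied(mut occupied: u8, mut f: impl FnMut(usize)) {
    while occupied != 0 {
        // Index of the lowest occupied slot.
        let slot = occupied.trailing_zeros() as usize;
        f(slot);
        // Clear the lowest set bit; `occupied` is non-zero, so no underflow.
        occupied &= occupied - 1;
    }
}
```

The loop runs once per occupied slot rather than once per slot, so sparsely filled groups cost almost nothing to walk during a resize.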
| ### 6. Branch Prediction Hints ⚠️ Mixed Results | ||
|
|
||
| Added `likely()`/`unlikely()` annotations and `#[cold] #[inline(never)]` on | ||
| the overflow path. | ||
|
|
||
| **Benchmark result**: Helped the scalar version (~2–6% faster) but **hurt the | ||
| SIMD version** by pessimizing NEON code generation. Removed from the SIMD | ||
| implementation, kept in the scalar version. | ||
|
|
||
| ### 7. Slot Hint Fast Path (Unique to HashSortedMap) | ||
|
|
||
| HashSortedMap checks a preferred slot before scanning the group: | ||
| ```rust | ||
| let hint = slot_hint(hash); // 3 bits from hash → slot index | ||
| if ctrl[hint] == EMPTY { /* direct insert */ } | ||
| if ctrl[hint] == tag && keys[hint] == key { /* direct hit */ } | ||
| ``` | ||
|
|
||
| hashbrown does **not** have this optimization; it always does a full SIMD | ||
| group scan. The difference in performance is probably due to the different overflow strategies and load factors. | ||
|
|
||
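A fuller, self-contained version of the snippet above might look like the following sketch for a single group with `u32` keys and values. The `EMPTY` value, the choice of hash bits for `slot_hint` and `tag`, the `Outcome` enum, and the plain-loop slow path are all assumptions made to keep the example small; overflow chaining and SIMD scanning are left out.

```rust
/// Assumed marker for an empty slot: the high bit is set, so it can never
/// equal a 7-bit tag.
const EMPTY: u8 = 0x80;

#[derive(Debug, PartialEq)]
enum Outcome {
    Inserted,
    Updated,
    GroupFull, // a real map would follow the overflow link here
}

struct Group {
    ctrl: [u8; 8],
    keys: [u32; 8],
    values: [u32; 8],
}

/// Assumed: the low 3 hash bits pick the preferred slot.
fn slot_hint(hash: u64) -> usize {
    (hash & 7) as usize
}

/// Assumed: the top 7 hash bits form the tag.
fn tag(hash: u64) -> u8 {
    (hash >> 57) as u8
}

impl Group {
    fn new() -> Self {
        Group { ctrl: [EMPTY; 8], keys: [0; 8], values: [0; 8] }
    }

    fn write(&mut self, slot: usize, t: u8, key: u32, value: u32) -> Outcome {
        self.ctrl[slot] = t;
        self.keys[slot] = key;
        self.values[slot] = value;
        Outcome::Inserted
    }

    fn insert(&mut self, hash: u64, key: u32, value: u32) -> Outcome {
        let (t, hint) = (tag(hash), slot_hint(hash));
        // Fast path 1: the preferred slot is free. Without deletions the key
        // cannot be stored elsewhere in the group, so insert directly.
        if self.ctrl[hint] == EMPTY {
            return self.write(hint, t, key, value);
        }
        // Fast path 2: the preferred slot already holds this key.
        if self.ctrl[hint] == t && self.keys[hint] == key {
            self.values[hint] = value;
            return Outcome::Updated;
        }
        // Slow path: scan the whole group (a SIMD match in the real map).
        let mut first_empty = None;
        for slot in 0..8 {
            if self.ctrl[slot] == t && self.keys[slot] == key {
                self.values[slot] = value;
                return Outcome::Updated;
            }
            if self.ctrl[slot] == EMPTY && first_empty.is_none() {
                first_empty = Some(slot);
            }
        }
        match first_empty {
            Some(slot) => self.write(slot, t, key, value),
            None => Outcome::GroupFull,
        }
    }
}
```

The first fast path is only sound because the map is insertion-only: a key that hashes to a given hint is always tried there first, and a slot never becomes empty again.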
| ### 8. Overflow Reserve Sizing ✅ Validated | ||
|
|
||
| Tested overflow reserves from 0% to 100% of primary groups: | ||
|
|
||
| | Reserve | Growth scenario (µs) | | ||
| |---------|----------------------| | ||
| | m/8 (12.5%, default) | 8.04 | | ||
| | m/4 (25%) | 8.33 | | ||
| | m/2 (50%) | 8.93 | | ||
| | m/1 (100%) | 10.31 | | ||
| | 0 (grow immediately) | 6.96 | | ||
|
|
||
| **Conclusion**: Smaller reserves are faster — growing early is cheaper than | ||
| traversing overflow chains. | ||
|
|
||
| ### 9. IdentityHasher Fix ✅ Implemented | ||
|
|
||
| The original `IdentityHasher` zero-extended u32 to u64, putting zeros in the | ||
| top 32 bits. Since hashbrown derives the 7-bit tag from `hash >> 57`, every | ||
| entry got the same tag — completely defeating control byte filtering. | ||
|
|
||
| **Fix**: Use `folded_multiply` to expand u32 keys to u64 with independent | ||
| entropy in both halves. Also changed trigram generation to use | ||
| `folded_multiply` instead of murmur3. | ||
|
|
||
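The tag collapse described above is easy to reproduce. The sketch below assumes only what the text states, namely that the tag is `hash >> 57`; the helper name `tag` is hypothetical, and `folded_multiply` is copied from the benchmark library in this PR together with its `ARBITRARY0` constant.

```rust
const ARBITRARY0: u64 = 0x243f6a8885a308d3;

/// 7-bit tag taken from the top bits of the hash, as described above.
fn tag(hash: u64) -> u8 {
    (hash >> 57) as u8
}

/// Folded multiply: full u64×u64→u128, then XOR the two halves.
fn folded_multiply(x: u64, y: u64) -> u64 {
    let full = (x as u128).wrapping_mul(y as u128);
    (full as u64) ^ ((full >> 64) as u64)
}

/// Zero-extension: the top 32 bits are always zero, so every tag is 0.
fn zero_extended(key: u32) -> u64 {
    key as u64
}

/// Scrambled expansion: the top bits now depend on the key.
fn scrambled(key: u32) -> u64 {
    folded_multiply(key as u64, ARBITRARY0)
}
```

With zero-extension, control-byte matching degenerates to comparing every occupied slot's key, because all tags are equal. After scrambling, different keys generally get different tags and the filter does its job again.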
| --- | ||
|
|
||
| ## Optimizations Not Implemented (and Why) | ||
|
|
||
| | Optimization | Reason | | ||
| |---------------------------------|------------------------------------------| | ||
| | **Tombstone / DELETED support** | Insertion-only map — no deletions needed | | ||
| | **In-place rehashing** | No tombstones to reclaim | | ||
| | **Control byte mirroring** | Not needed with overflow chaining (no wrap-around) | | ||
| | **Custom allocator support** | Out of scope for benchmarking | | ||
| | **Over-allocation utilization** | Uses `Vec` (no raw allocator control) | | ||
|
|
||
| --- | ||
|
|
||
| ## Summary of Impact | ||
|
|
||
| | Change | Effect on insert time | | ||
| |----------------------------|------------------------------| | ||
| | Capacity sizing fix | **−50%** (biggest win) | | ||
| | Optimized growth path | **−10%** on growth scenarios | | ||
| | SIMD group scanning | **−5%** | | ||
| | Branch hints (scalar only) | **−2–6%** | | ||
| | IdentityHasher fix | Enabled fair comparison | | ||
|
|
||
| The current HashSortedMap **matches hashbrown+FxHash** on pre-sized inserts, | ||
| **beats all hashbrown variants** on overwrites, and has **2× faster growth**. | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,89 @@ | ||
| # hash-sorted-map | ||
|
|
||
| A hash map whose groups are ordered by hash prefix, enabling efficient | ||
| sorted-order iteration and linear-time merging of two maps. | ||
|
|
||
| ## Motivation | ||
|
|
||
| In a search index, each document produces a **term map** (term → frequency). | ||
| At index time, term maps from many documents must be **merged** into a single | ||
| posting list, and the result is **serialized in hash-key order** so that | ||
| lookups can use a skip-list approach, leveraging the hash ordering to | ||
| efficiently jump to the right region of the serialized data. | ||
|
|
||
| A conventional hash map stores entries in arbitrary order, so merging two maps | ||
| requires collecting, sorting, and reshuffling all entries — an expensive step | ||
| that dominates indexing time for large term maps typical of code search, where | ||
| documents contain massive numbers of tokens. | ||
|
|
||
| `HashSortedMap` avoids this by organizing its groups by hash prefix. | ||
| Iterating through the groups in order yields entries sorted by their hashed | ||
| keys, which means: | ||
|
|
||
| - **Merging** two maps is a single linear scan (like merge-sort's merge step). | ||
| - **Serialization** in hash-key order requires no extra sorting or copying. | ||
|
|
||
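The merge step can be sketched on plain slices. This is an illustration under simplifying assumptions, not the map's own merge API: entries are `(hash, frequency)` pairs already sorted by hash, each hash identifies one term (collisions are ignored), and the function name `merge_sorted` is hypothetical.

```rust
use std::cmp::Ordering;

/// Merges two hash-sorted term lists in one linear pass, summing the
/// frequencies of terms that occur in both.
fn merge_sorted(a: &[(u64, u32)], b: &[(u64, u32)]) -> Vec<(u64, u32)> {
    let mut out = Vec::with_capacity(a.len() + b.len());
    let (mut i, mut j) = (0, 0);
    while i < a.len() && j < b.len() {
        match a[i].0.cmp(&b[j].0) {
            Ordering::Less => {
                out.push(a[i]);
                i += 1;
            }
            Ordering::Greater => {
                out.push(b[j]);
                j += 1;
            }
            Ordering::Equal => {
                // Same term in both maps: combine the frequencies.
                out.push((a[i].0, a[i].1 + b[j].1));
                i += 1;
                j += 1;
            }
        }
    }
    // At most one of these tails is non-empty.
    out.extend_from_slice(&a[i..]);
    out.extend_from_slice(&b[j..]);
    out
}
```

Each input entry is visited exactly once, and the output is again sorted by hash, so it can be serialized or merged further without a sort.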
| ## Design | ||
|
|
||
| `HashSortedMap<K, V, S>` is a Swiss-table-inspired hash map that uses: | ||
|
|
||
| - **Overflow chaining** instead of open addressing — groups that fill up link | ||
| to overflow groups rather than probing into neighbours. | ||
| - **Slot hint** — a preferred slot index derived from the hash, checked before | ||
| scanning the group. Gives a direct hit on most inserts at low load. | ||
| - **SIMD group scanning** — uses NEON on aarch64, SSE2 on x86\_64, and a | ||
| scalar fallback elsewhere to scan 8–16 control bytes in parallel. | ||
| - **AoS group layout** — each group stores its control bytes, keys, and values | ||
| together, keeping a single insert's data within 1–2 cache lines. | ||
| - **Optimized growth** — during resize, elements are re-inserted without | ||
| duplicate checking and copied via raw pointers. | ||
| - **Generic key/value/hasher** — supports any `K: Hash + Eq`, any | ||
| `S: BuildHasher`, and `Borrow<Q>`-based lookups. | ||
|
|
||
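The reason group order equals hash order can be shown with a small sketch. It assumes the primary group is selected by the top bits of the hash and that the number of groups is a power of two; the function name `group_index` and the `bits` parameter are hypothetical.

```rust
/// Selects a primary group from the top `bits` bits of the hash
/// (assumes `bits <= 64` and 2^bits primary groups).
fn group_index(hash: u64, bits: u32) -> usize {
    if bits == 0 {
        0 // a single group; also avoids shifting a u64 by 64
    } else {
        (hash >> (64 - bits)) as usize
    }
}
```

Because a right shift is monotonic, `h1 <= h2` implies `group_index(h1, bits) <= group_index(h2, bits)`. Walking the groups in index order therefore visits hashes in ascending prefix order, and only the entries within one group (and its overflow chain) still need to be ordered among themselves.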
| ## Benchmark results | ||
|
|
||
| All benchmarks insert 1000 random trigram hashes (scrambled with | ||
| `folded_multiply`) into maps with various configurations. Measured on Apple | ||
| M-series (aarch64). | ||
|
|
||
| ### Insert 1000 trigrams — pre-sized, no growth | ||
|
|
||
| | Rank | Map | Time (µs) | vs best | | ||
| |------|-----|-----------|---------| | ||
| | 🥇 | FoldHashMap | 2.44 | — | | ||
| | 🥈 | FxHashMap | 2.61 | +7% | | ||
| | 🥉 | hashbrown::HashMap | 2.67 | +9% | | ||
| | 4 | **HashSortedMap** | **2.71** | +11% | | ||
| | 5 | hashbrown+Identity | 2.74 | +12% | | ||
| 6 | AHashMap | 3.22 | +32% | ||
| 7 | std::HashMap+FNV | 3.27 | +34% | ||
| | 8 | std::HashMap | 8.49 | +248% | | ||
|
|
||
| ### Re-insert same keys (all overwrites) | ||
|
|
||
| | Map | Time (µs) | | ||
| |-----|-----------| | ||
| | **HashSortedMap** | **2.36** ✅ | | ||
| | hashbrown+Identity | 2.58 | | ||
|
|
||
| ### Growth from small (`with_capacity(128)`, 3 resize rounds) | ||
|
|
||
| Map | Time (µs) | Growth penalty (µs) | ||
| |-----|-----------|----------------| | ||
| | **HashSortedMap** | **4.85** | +2.14 | | ||
| | hashbrown+Identity | 9.77 | +7.03 | | ||
|
|
||
| ### Key takeaways | ||
|
|
||
| - **HashSortedMap matches the fastest hashbrown configurations** on pre-sized | ||
| first-time inserts and is **the fastest for overwrites**. | ||
| - **Growth is ~2× faster** than hashbrown thanks to the optimized | ||
| `insert_for_grow` path that skips duplicate checking and uses raw copies. | ||
| - The remaining gap to FoldHashMap (~11%) comes from foldhash's extremely | ||
| efficient hash function that pipelines well with hashbrown's SIMD scan. | ||
|
|
||
| ## Running | ||
|
|
||
| ```sh | ||
| cargo bench --bench performance | ||
| ``` | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,23 @@ | ||
| [package] | ||
| name = "hash-sorted-map-benchmarks" | ||
| edition = "2021" | ||
|
|
||
| [lib] | ||
| path = "lib.rs" | ||
| test = false | ||
|
|
||
| [[bench]] | ||
| name = "performance" | ||
| path = "performance.rs" | ||
| harness = false | ||
| test = false | ||
|
|
||
| [dependencies] | ||
| hash-sorted-map = { path = ".." } | ||
| criterion = "0.8" | ||
| rand = "0.10" | ||
| rustc-hash = "2" | ||
| ahash = "0.8" | ||
| hashbrown = "0.15" | ||
| foldhash = "0.1" | ||
| fnv = "1" | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,46 @@ | ||
| use std::hash::{BuildHasherDefault, Hasher}; | ||
|
|
||
| use rand::RngExt; | ||
|
|
||
| const ARBITRARY0: u64 = 0x243f6a8885a308d3; | ||
|
|
||
| /// Folded multiply: full u64×u64→u128, then XOR the two halves. | ||
| #[inline(always)] | ||
| pub fn folded_multiply(x: u64, y: u64) -> u64 { | ||
| let full = (x as u128).wrapping_mul(y as u128); | ||
| (full as u64) ^ ((full >> 64) as u64) | ||
| } | ||
|
|
||
| /// A hasher that passes u32 keys through without mixing, copying the key into | ||
| /// both halves of the u64. Suitable for keys that are already well-distributed. | ||
| #[derive(Default)] | ||
| pub struct IdentityHasher(u64); | ||
|
|
||
| impl Hasher for IdentityHasher { | ||
| fn write(&mut self, _bytes: &[u8]) { | ||
| unimplemented!("IdentityHasher only supports write_u32"); | ||
| } | ||
| fn write_u32(&mut self, i: u32) { | ||
| self.0 = (i as u64) | ((i as u64) << 32); | ||
| } | ||
| fn finish(&self) -> u64 { | ||
| self.0 | ||
| } | ||
| } | ||
|
|
||
| pub type IdentityBuildHasher = BuildHasherDefault<IdentityHasher>; | ||
|
|
||
| /// Generate `n` random trigrams as well-distributed u32 hashes. | ||
| /// Each trigram is packed into a u32, then scrambled with folded_multiply. | ||
| pub fn random_trigram_hashes(n: usize) -> Vec<u32> { | ||
| let mut rng = rand::rng(); | ||
| (0..n) | ||
| .map(|_| { | ||
| let a = rng.random_range(b'a'..=b'z') as u32; | ||
| let b = rng.random_range(b'a'..=b'z') as u32; | ||
| let c = rng.random_range(b'a'..=b'z') as u32; | ||
| let packed = a | (b << 8) | (c << 16); | ||
| folded_multiply(packed as u64, ARBITRARY0) as u32 | ||
| }) | ||
| .collect() | ||
| } |