Fast and Lean JavaScript Dictionaries #880
Conversation
It appears I used the new FFI syntax while the test runner is still using v1.11. Does the standard library support older Gleam versions as well?

Nope, you can update the CI version and the

Ah yes, you'll need to reformat that file to use the latest formatting style also.

Yeah, thank you :D
lpil left a comment
This is fantastic! I am so impressed! Thank you
I wonder if there's any further testing we want to do to check for regressions here? Property testing or larger example test suite? Or are we confident enough with the existing tests?
src/gleam/dict.gleam (Outdated)

```gleam
@external(erlang, "maps", "put")
@external(javascript, "../dict.mjs", "put")
fn put(key: k, value: v, transient: TransientDict(k, v)) -> TransientDict(k, v)
```
Could you give this a clearer name please, `put` is quite nondescript 🙏 One that communicates it will mutate when possible would be fab.
src/dict.mjs (Outdated)

```javascript
    case COLLISION_NODE:
      return withoutCollision(root, key);
}

export function put(key, value, transient) {
```
Same as above, a clearer name please 🙏
src/dict.mjs (Outdated)

```javascript
  }
  ++i;

function doPut(transient, node, key, value, hash, shift) {
```
Is there a better name for this? And also for `doRemove`?
If not, can we call it `insert` please, as that's the term we use. `put` is an Erlang term so it's good not to mix up our basic terminology.
Thank you!! The existing tests were already super useful for finding bugs, and I thought they were quite good already. I hope the names are better now, I'm really bad at those ~ 💜
Love the names, thank you.

Could be useful perhaps! What do you think? Do we have confidence to merge this now or do we want to do more beforehand?
Hi! Sorry this took so long. I added some tests that generate a random sequence of insert/delete operations and compare the result to proplists. I ran those a bunch locally and nothing broke (yet), so now I'm relatively confident that nothing will immediately break :)
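The differential-testing approach described above can be sketched roughly like this (an illustration, not the PR's actual test code: a naive copy-on-write object stands in for the real dictionary, and all helper names are made up):

```javascript
// Sketch of differential testing: apply a random sequence of
// insert/delete operations both to the structure under test and to a
// trivially-correct model, then compare the final contents.
function randomOps(rng, count, keySpace) {
  const ops = [];
  for (let i = 0; i < count; i++) {
    const key = `k${Math.floor(rng() * keySpace)}`;
    ops.push(rng() < 0.7 ? { op: "insert", key, value: i } : { op: "delete", key });
  }
  return ops;
}

function runDifferentialTest(ops) {
  let dict = {};            // structure "under test" (naive stand-in)
  const model = new Map();  // trivially-correct reference model
  for (const step of ops) {
    if (step.op === "insert") {
      dict = { ...dict, [step.key]: step.value }; // copy-on-write insert
      model.set(step.key, step.value);
    } else {
      const { [step.key]: _, ...rest } = dict;    // copy-on-write delete
      dict = rest;
      model.delete(step.key);
    }
  }
  // Compare the final contents in both directions.
  if (Object.keys(dict).length !== model.size) return false;
  for (const [k, v] of model) {
    if (dict[k] !== v) return false;
  }
  return true;
}

// A tiny deterministic RNG (mulberry32) so failures are reproducible
// from a seed, which matters a lot when shrinking a failing sequence.
function mulberry32(seed) {
  return function () {
    seed = (seed + 0x6d2b79f5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}
```

Running this for many seeds gives the same "nothing broke yet" confidence the comment describes, and a failing seed pins down a reproducible counterexample.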
idk why you're saying sorry 😁
lpil left a comment
This is such fantastic work!!! Thank you Yoshie! You continue to amaze me!
Thank you @yoshi-monster 💚
Hi! Sorry, I did a bunch of benchmarks, so wall of text incoming.
tl;dr: The new version is much simpler and potentially faster, although that's not super obvious yet until we optimise other bits of the runtime. I think it's still overall an improvement.
This PR rewrites the Dict implementation for the JavaScript target, implementing the CHAMP (Compressed Hash-Array Mapped Prefix-tree) data structure as described by M.J. Steindorfer and J.J. Vinju in Optimizing Hash-Array Mapped Tries for Fast and Lean Immutable JVM Collections (2015, link). It also adds some optimizations found in ClojureScript, attributed to Christophe Grand by P. Schluck and C. Rodgers in their 2017 talk Clojure Hash Maps: Plenty of room at the bottom (link). Namely, it adds internal "transient" objects that allow the dictionary to be mutated internally in controlled ways.
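To make the "transient" idea concrete, here is a minimal sketch (not code from this PR; the flat-array node layout and all names are made up for illustration): a node remembers an owner token, and an update may mutate in place only while that token is held.

```javascript
// Hypothetical sketch of transients: a node records the owner token of
// the edit batch that created it. An update may mutate in place only if
// the node belongs to the current batch; otherwise it path-copies.
class Node {
  constructor(owner, entries) {
    this.owner = owner;     // token identifying the edit batch, or null
    this.entries = entries; // entries, simplified here to a flat array
  }
}

function assoc(node, owner, index, value) {
  if (owner !== null && node.owner === owner) {
    node.entries[index] = value; // safe in-place mutation
    return node;
  }
  const copy = node.entries.slice();
  copy[index] = value;
  return new Node(owner, copy);  // persistent copy-on-write fallback
}

// Bulk construction: allocate one owner token, reuse it for every step
// (so all intermediate copies are avoided), then drop it to "freeze"
// the result back into an ordinary persistent node.
function fromPairs(values) {
  const owner = {};
  let node = new Node(owner, []);
  for (let i = 0; i < values.length; i++) {
    node = assoc(node, owner, i, values[i]);
  }
  node.owner = null; // later edits will copy again
  return node;
}
```

The controlled part is that the token never escapes: callers only ever see frozen nodes, so the API stays observably immutable.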
The current implementation is already quite good, so only minor performance improvements could be achieved for most operations in practice; I did a lot of benchmarks to figure out what's going on and what's possible, so sorry for the wall of tables.
Highlights

- faster `get` and `insert` operations

Establishing a baseline
I think it's useful to look at the absolute best performance we could achieve first to know what we're up against. All benchmarks were done on an M4 Pro 14-core chip using Node 24.7 and Bun 1.21.2, with mitata as the benchmark runner.
This measures the performance of raw JavaScript Map calls. Keep in mind that this is not a suitable replacement in practice, as it effectively only supports strings or numbers as keys (objects are compared by reference, not structurally) and does not provide a persistent/immutable API at all.
Without further ado, here's what we're up against:
It looks like it takes around 10ns to get an element, and around 50ns to insert one. The exact numbers here aren't that interesting, but I think it always helps to have a target and an intuition for how fast things should/could be. Note that this is still quite slow compared to JIT-optimised object shapes! These can be accessed and updated within a fraction of a nanosecond.
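As a rough illustration of how such a baseline can be measured (this is not the PR's benchmark code, which uses mitata for proper warmup and statistics; a plain timing loop stands in here, and all names are made up):

```javascript
// Minimal baseline measurement for built-in Map get/set. A real runner
// like mitata handles warmup, outliers, and statistics; this sketch
// only shows the shape of the measurement.
function timePerOp(label, n, fn) {
  const start = performance.now();
  for (let i = 0; i < n; i++) fn(i);
  const ns = ((performance.now() - start) / n) * 1e6; // ms -> ns per op
  console.log(`${label}: ~${ns.toFixed(1)} ns/op`);
  return ns;
}

const map = new Map();
for (let i = 0; i < 1000; i++) map.set(`key${i}`, i);

let sink = 0; // accumulate results so the JIT cannot delete the loop body
timePerOp("Map.get", 1e6, i => { sink += map.get(`key${i % 1000}`); });
timePerOp("Map.set", 1e6, i => { map.set(`key${i % 1000}`, i); });
```

The `sink` accumulator matters: without an observable use of the result, the JIT is free to remove the whole loop and the "benchmark" measures nothing.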
It's also useful to think about what these numbers mean: at the 4 GHz or so my processor claims, one cycle is 0.25ns. Most instructions take a bunch of cycles to retire, so we can simplify a ton and just say 1ns = 1 instruction on average. A number like 4ns (for `get`) means that the CPU executed a single-digit number of instructions to get the value we wanted! Sometimes adding a branch increased latency by ~5ns; there are certainly more low-level optimisations to be done by someone with deeper knowledge of the code v8 generates, but we are actually at the point of counting instructions here!

Implementation
The rewritten dictionary is a faithful implementation of the version presented in Steindorfer and Vinju's paper, with the addition of transients to allow for fast updates. The class-based JavaScript API has been fully removed; this allows the dictionary and the hash function to be dead-code-eliminated almost entirely even if it is referenced incidentally through `string.inspect` or `dynamic.classify`. Specialised algorithms for `map`, `insert`, and `has_key` are provided to speed up common dictionary operations. I cleaned up the FFI API to no longer have the additional layer of indirection through `gleam_stdlib.mjs`; I think having the dictionary self-contained inside its own file makes the code easier to maintain. Many of the public functions in the dictionary module have been replaced by more efficient versions on both targets.

Instead of 4 different node types, the new version only has a single internal node type (plus a "deformed" version on hash collisions), eliminating most of the incidental complexity found within the old implementation. Overall, the dictionary went from roughly 560 LOC to around 290, a reduction of almost 50%.
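For readers unfamiliar with the structure, the bitmap trick that lets a single node type stay compact can be sketched like this (an illustrative CHAMP-style lookup, not the PR's actual code; the node layout and names are assumptions):

```javascript
// Illustrative CHAMP-style lookup. Each node holds two 32-bit bitmaps
// over the 32 possible slots at this level: `datamap` marks slots whose
// key/value pair is stored inline, `nodemap` marks slots that point at
// a child node. The popcount of the bits below a slot gives its index
// into the dense `content` array, so no empty slots are ever stored.
const SHIFT = 5;
const MASK = 31;

function popcount(x) {
  x -= (x >> 1) & 0x55555555;
  x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
  x = (x + (x >> 4)) & 0x0f0f0f0f;
  return (x * 0x01010101) >> 24;
}

function get(node, key, hash, shift) {
  const bit = 1 << ((hash >>> shift) & MASK);
  if (node.datamap & bit) {
    // Inline entries are stored as [k0, v0, k1, v1, ...] at the front.
    const idx = 2 * popcount(node.datamap & (bit - 1));
    return node.content[idx] === key ? node.content[idx + 1] : undefined;
  }
  if (node.nodemap & bit) {
    // Child nodes are stored from the back of the same array.
    const idx = node.content.length - 1 - popcount(node.nodemap & (bit - 1));
    return get(node.content[idx], key, hash, shift + SHIFT);
  }
  return undefined; // slot empty at this level: key is absent
}
```

For simplicity this sketch compares keys with `===` and ignores hash collisions; the real implementation uses the structural `isEqual` and a dedicated collision node.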
The new `delete` algorithm also respects the CHAMP invariant, effectively banning nested node paths that only hold a single entry. Instead, lone entries are "pulled up" into the parent node recursively whenever possible. As a consequence, all CHAMP trees are guaranteed to be in their single, canonical, compact tree representation. This is particularly interesting for equality checks: instead of having to walk all elements of the dictionary, the nodes can be compared directly. Nodes that are reference-equal or have different bitmaps don't have to be compared further. Dictionaries no longer need their own `equals` or `hashCode` implementation; standard structural equality not only works, but also improves the complexity from O(n log n) to O(log n) on average.

With this PR, key order while iterating can change compared to the old version. Since dictionaries don't guarantee any order, no effort has been put into preserving the old order.
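The equality fast path that canonical form enables can be sketched as follows (illustrative only, reusing the hypothetical bitmap node layout from above rather than the PR's real code):

```javascript
// Sketch of equality over canonical CHAMP nodes. Because every set of
// entries has exactly one tree shape, differing bitmaps immediately
// prove two dictionaries unequal, and shared (reference-equal)
// subtrees never need to be walked at all.
function nodesEqual(a, b, leafEqual) {
  if (a === b) return true;                  // shared structure: O(1)
  if (a.datamap !== b.datamap) return false; // different inline slots
  if (a.nodemap !== b.nodemap) return false; // different child slots
  // Same bitmaps imply the same content layout, so compare pairwise.
  for (let i = 0; i < a.content.length; i++) {
    const x = a.content[i];
    const y = b.content[i];
    if (x !== null && typeof x === "object" && "datamap" in x) {
      if (!nodesEqual(x, y, leafEqual)) return false; // child node
    } else if (!leafEqual(x, y)) {
      return false;                                   // inline key/value
    }
  }
  return true;
}
```

Without the canonical-form invariant this shortcut would be unsound: two equal dictionaries could have different shapes (and thus different bitmaps) depending on their insertion/deletion history.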
Benchmarks
Result construction, `isEqual`, and the hash function contribute significantly to the runtime of various dict operations. To make things more directly comparable, I provide "real" numbers including their overhead, as well as "adjusted" variants where all of these operations have been replaced by no-ops (or strict equality in the case of `isEqual`). These measure the performance of the data structure itself more directly, which also makes them more comparable to the "baseline" measurements provided above.

get

Since `get` is so fast, I generated an array of 1000 random keys and fetched all of them in a loop. The numbers reported are the average for 1 single `get` operation. Hopefully this avoids measuring artifacts and memory locality effects.

It's hard to explain why using results would be relatively slower the bigger the dictionary gets. Interestingly, the same effect doesn't happen when fetching the same element 1000 times, or when fetching random elements but using object literals (`{ success: true, value: ... }`) for results. Benchmarking is hard 😔 While we know that objects are faster than the current results, I also think it's unlikely that they would be that much faster.
In the Lustre diff benchmark (which is mostly `get` and `has_key`), performance is improved by 2-3 times compared to v0.65.0, getting within 30% of using native mutable maps, and even beating them for common cases.

Overall, without overhead, `get` has been improved by roughly 30% for small dictionaries, and is up to 5 times faster for very large ones. The adjusted values compare favourably even to built-in maps on v8, suggesting that a custom persistent immutable data structure can be as fast as natively implemented data structures provided by the runtime. Since Bun's maps are much more performant in this benchmark, I suspect that v8's native implementation could also be improved. For Bun and JavaScriptCore, the new dictionary seems to be especially fast.

insert

Both old and new versions beat built-in maps consistently for single inserts into maps of up to 100 elements. While the performance of CHAMP starts to become better for the largest dictionaries, it is more affected by outliers in its internal structure, making the 1k elements case particularly slow.
from_list

`from_list` shows the power of transients, making it over twice as fast as a copying implementation for large dictionaries. A version exceeding native Maps' speed can be achieved, but would require making the single-insert case slower.

fold

The old `fold` implementation used `to_list` under the hood, so performance improved massively. Not only that, but the new dictionary can be iterated faster than built-in maps, too!

Discussion
All of the tested operations perform at least as fast as in the current implementation. Most show significant performance increases as the dictionaries get bigger, but (except for `insert`) this doesn't seem to come at a cost for small dictionaries. While CHAMP features some highly specialized update routines, it is still less than half the amount of code of the current version.

Eliminating the additional overhead, we can see that CHAMP can be competitive with built-in maps for get and insert operations up to dictionaries of around 1000 elements. Many of the benchmarks show a significant overhead for both result construction and `isEqual`. Both versions of the dictionary readily beat built-in maps once iteration or persistence is involved.

Future work
While the immediate space for optimizations has been thoroughly explored, there might still be improvements from finding better mutating algorithms, or from exploring caching with the MEMCHAMP variant. While equality already uses the internal structure of the map, I suspect that similar optimizations could be applied to all set-like operations (`merge`, `union`, `intersection`, `difference`, etc.), using the internal structure to quickly combine nodes instead of working on the element level.

Since bulk operations are particularly fast, I think adding the missing `merge_list` or `insert_from_list` functions might be useful. To avoid the overhead from results, a function similar to `get_or_default` could be provided.

Moving the dictionary into the compiler and parameterising it over the `getHash` and `isEqual` functions used would allow the compiler to inject monomorphised versions of these functions for the concrete key type used, skipping the generic implementations altogether. Long term, escape analysis or similar techniques could be used to insert transients automatically.

While working on the dictionary, I noticed that the tag is ignored in the hash code of all custom types, meaning that `Ok(0)` and `Error(0)` hash to the same value. This mostly affects variants without attached data, which all hash to `0`, causing the dictionary to degrade to a linear search. I'll open up a discussion at some point since I have some additional thoughts about the hash and equals functions ^^

Related work
The results presented here and by Steindorfer and Vinju have been confirmed independently by the Scala, Clojure and ClojureScript communities, all implementing a variant of CHAMP as their default `HashMap` implementation. I also referenced this Go implementation for its handling of transients.

Appendix: Benchmark Code
Benchmark code