
Conversation

@yoshi-monster (Contributor) commented Nov 16, 2025

Hi! Sorry, I did a bunch of benchmarks, so wall of text incoming.

tl;dr: The new version is much simpler and potentially faster, although the speedup isn't super obvious right now and won't be until we optimise other bits of the runtime. I think it's still an overall improvement.


This PR rewrites the Dict implementation for the JavaScript target, implementing the CHAMP (Compressed Hash-Array Mapped Prefix-tree) data structure as described by M.J. Steindorfer and J.J. Vinju in Optimizing Hash-Array Mapped Tries for Fast and Lean Immutable JVM Collections (2015, link). It also adds some optimizations found in ClojureScript, attributed to Christopher Grand by P. Schluck and C. Rodgers in their 2017 talk Clojure Hash Maps: Plenty of room at the bottom (link). Namely, it adds internal "transient" objects that allow the dictionary to be mutated internally in controlled ways.

The current implementation is already quite good, so only minor performance improvements could be achieved for most operations in practice; I did a lot of benchmarks to figure out what's going on and what's possible, so sorry for the wall of tables.

Highlights

  • 50% reduction in code size
  • 10-30% faster get and insert operations
  • O(log n) equality checks, orders of magnitude faster bulk operations and iteration

Establishing a baseline

I think it's useful to look at the absolute best performance we could achieve first to know what we're up against. All benchmarks were done on an M4 Pro 14-core chip using Node 24.7 and Bun 1.21.2, with mitata as the benchmark runner.

This measures the performance of raw JavaScript Map calls. Keep in mind that this is not a suitable replacement in practice: Map compares keys by reference, so effectively only strings or numbers work as structural keys, and it does not provide a persistent/immutable API at all.

Without further ado, here's what we're up against:

|                  | 10    | 100   | 1k    | 10k   | 100k  | 1m    |
| ---------------- | ----- | ----- | ----- | ----- | ----- | ----- |
| node / get       | 7.6ns | 11ns  | 14ns  | 13ns  | 17ns  | 28ns  |
| node / insert    | 54ns  | 52ns  | 53ns  | 54ns  | 54ns  | 88ns  |
| node / from_list | 400ns | 2.5us | 27us  | 654us | 5.0ms | 57ms  |
| node / fold      | 30ns  | 223ns | 2.2us | 22us  | 230us | 6.8ms |
| bun / get        | 4.6ns | 4.9ns | 6.3ns | 6.8ns | 8.4ns | 12ns  |
| bun / insert     | 49ns  | 53ns  | 53ns  | 54ns  | 53ns  | 74ns  |
| bun / from_list  | 394ns | 4.3us | 35us  | 345us | 4.2ms | 68ms  |
| bun / fold       | 1us   | 598ns | 6us   | 61us  | 561us | 6ms   |

It looks like it takes around 10ns to get an element, and around 50ns to insert one. The exact numbers here aren't that interesting, but I think it always helps to have a target and an intuition for how fast things should/could be. Note that this is still quite slow compared to JIT-optimised object shapes! These can be accessed and updated within a fraction of a nanosecond.

It's also useful to think about what these numbers mean: at the 4 GHz or so my processor claims, one cycle is 0.25ns. Most instructions take a bunch of cycles to retire, so we can simplify a ton and just say 1ns = 1 instruction on average. A number like 4ns (for get) means that the CPU executed a single-digit number of instructions to get the value we wanted! Sometimes adding a branch increased latency by ~5ns; there are certainly more low-level optimisations to be done by someone with deeper knowledge of the code v8 generates, but we are actually at the point of counting instructions here!

Implementation

The rewritten dictionary is a faithful implementation of the version presented in Steindorfer and Vinju's paper, with the addition of transients to allow for fast updates. The class-based JavaScript API has been fully removed; this allows the dictionary and the hash function to be dead-code-eliminated almost entirely even if they are referenced incidentally through string.inspect or dynamic.classify. Specialised algorithms for map, insert, and has_key are provided to speed up common dictionary operations. I cleaned up the FFI API so there is no longer an additional layer of indirection through gleam_stdlib.mjs; I think having the dictionary self-contained inside its own file makes the code easier to maintain. Many of the public functions in the dictionary module have been replaced by more efficient versions on both targets.

Instead of 4 different node types, the new version only has a single internal node type (plus a "deformed" variant used on hash collisions), eliminating most of the incidental complexity found in the old implementation. Overall, the dictionary went from roughly 560 LOC to around 290, a reduction of almost 50%.
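For context, this is roughly what a CHAMP node looks like and how the bitmaps are used; the names (makeNode, datamap, nodemap) are illustrative rather than the exact identifiers in src/dict.mjs:

```js
// Illustrative sketch of a CHAMP node, not the exact code in src/dict.mjs.
// A node carries two 32-bit bitmaps and one flat array:
//   datamap - which of the 32 hash slots hold an inline [key, value] entry
//   nodemap - which of the 32 hash slots hold a child node
//   array   - the inline entries, followed by the child nodes
function makeNode(datamap, nodemap, array) {
  return { datamap, nodemap, array };
}

// Each tree level consumes 5 bits of the 32-bit hash to pick one of 32 slots.
function mask(hash, shift) {
  return (hash >>> shift) & 0b11111;
}

// Counting the set bits below the slot's bit gives the index into `array`.
function index(bitmap, bit) {
  return popcount(bitmap & (bit - 1));
}

function popcount(x) {
  x -= (x >> 1) & 0x55555555;
  x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
  x = (x + (x >> 4)) & 0x0f0f0f0f;
  return (x * 0x01010101) >> 24;
}
```

A lookup hashes the key, takes 5 bits per level, and checks the corresponding bit: set in datamap means the entry is stored inline in this node, set in nodemap means descend into the child, neither means the key is absent.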

The new delete algorithm also respects the CHAMP invariant, effectively banning nested node paths that only hold a single entry. Instead, lone entries are "pulled up" into the parent node recursively whenever possible. As a consequence, every CHAMP tree is guaranteed to be in its single, canonical, compact representation. This is particularly interesting for equality checks: instead of having to walk all elements of the dictionary, the nodes can be compared directly, and nodes that are reference-equal or have different bitmaps don't have to be compared further. Dictionaries no longer need their own equals or hashCode implementation; standard structural equality not only works, it also improves the average complexity of equality checks from O(n log n) to O(log n).
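To illustrate (not the literal code): because every dict has exactly one tree shape, the generic structural comparison can short-circuit on reference equality and on the bitmaps before touching any elements:

```js
// Sketch of why canonical trees make structural equality cheap.
// `entryOrChildEqual` stands in for the prelude's generic isEqual here.
function nodesEqual(a, b) {
  if (a === b) return true;                   // shared (persistent) subtrees: O(1)
  if (a.datamap !== b.datamap) return false;  // different inline entries present
  if (a.nodemap !== b.nodemap) return false;  // different child layout
  for (let i = 0; i < a.array.length; ++i) {
    // entries compare their key/value, child nodes recurse into nodesEqual
    if (!entryOrChildEqual(a.array[i], b.array[i])) return false;
  }
  return true;
}
```

After an insert into a shared dictionary most subtrees remain reference-equal, which is where the O(log n) average comes from.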

With this PR, iteration order can differ from the old version. Since dictionaries don't guarantee any order, no effort has been put into preserving it.

Benchmarks

Result construction, isEqual, and the hash function contribute significantly to the runtime of various dict operations. To make things more directly comparable, I provide "real" numbers including their overhead as well as "adjusted" variants where all of these operations have been replaced by no-ops, or strict equality in the case of isEqual. These measure the performance of the data structure itself more directly, which also makes them more comparable to the "baseline" measurements provided above.
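Concretely, "adjusted" means stubbing the generic helpers out along these lines; the keys in the benchmarks are plain integers, and the exact stubs aren't reproduced here:

```js
// Rough idea of the "adjusted" variants: trivial stand-ins so that only the
// data structure itself is measured. Not the real prelude/stdlib helpers.
const isEqual = (a, b) => a === b; // strict equality instead of deep structural equality
const getHash = (key) => key | 0;  // identity hash, good enough for the integer keys used here
const ok = (value) => value;       // skip allocating an Ok(...) Result wrapper
```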

get

| old / new       | 10        | 100       | 1k        | 10k       | 100k      | 1m        |
| --------------- | --------- | --------- | --------- | --------- | --------- | --------- |
| node / real     | 93/85ns   | 105/107ns | 121/120ns | 136/130ns | 198/187ns | 682/418ns |
| bun / real      | 42/35ns   | 48/39ns   | 55/44ns   | 76/64ns   | 165/130ns | 765/583ns |
| node / adjusted | 6.0/5.0ns | 9.2/6.9ns | 12/7.0ns  | 17/11ns   | 52/19ns   | 315/43ns  |
| bun / adjusted  | 5.7/4.7ns | 12/3.2ns  | 9.5/3.2ns | 15/6.6ns  | 55/15ns   | 265/21ns  |

Since get is so fast, I generated an array of 1000 random keys and fetched them all in a loop; the numbers reported are the average for a single get operation. Hopefully this avoids measurement artifacts and memory locality effects. It's hard to explain why using results would become relatively slower the bigger the dictionary gets. Interestingly, the same effect doesn't happen when fetching the same element 1000 times, or when fetching random elements but using object literals ({ success: true, value: ... }) for results:

| old / new      | 10      | 100     | 1k        | 10k       | 100k      | 1m        |
| -------------- | ------- | ------- | --------- | --------- | --------- | --------- |
| node / same    | 87/80ns | 99/97ns | 103/102ns | 101/103ns | 104/109ns | 114/119ns |
| node / literal | 26/25ns | 32/27ns | 32/26ns   | 49/32ns   | 96/78ns   | 552/217ns |

Benchmarking is hard 😔 While we know that object literals are faster than the current Result objects, I also think it's unlikely that they would really be that much faster.

In the Lustre diff benchmark (which is mostly get and has_key), performance is improved by 2-3 times compared to v0.65.0, getting within 30% of using native mutable maps, and even beating them for common cases.

Overall, without overhead, get has been improved by roughly 30% for small dictionaries, and is up to 5 times faster for very large ones. The adjusted values compare favourably even to built-in maps on v8, suggesting that a custom persistent immutable data structure can be as fast as natively implemented data structures provided by the runtime. Since Bun's maps are much more performant in this benchmark, I suspect that v8's native implementation could also be improved. On Bun and JavaScriptCore, the speedup from the new dictionary is especially large.

insert

| old / new     | 10      | 100     | 1k        | 10k       | 100k      | 1m        |
| ------------- | ------- | ------- | --------- | --------- | --------- | --------- |
| node / update | 50/54ns | 89/93ns | 109/109ns | 128/122ns | 217/189ns | 629/422ns |
| bun / update  | 41/29ns | 87/72ns | 106/87ns  | 148/114ns | 287/235ns | 799/615ns |
| node / insert | 33/39ns | 48/46ns | 83/129ns  | 94/89ns   | 144/114ns | 345/234ns |
| bun / insert  | 37/37ns | 53/48ns | 99/120ns  | 111/103ns | 181/131ns | 618/257ns |

Both old and new versions beat built-in maps consistently for single inserts into dictionaries of up to 100 elements. While CHAMP's performance pulls ahead for the largest dictionaries, it is more affected by outliers in its internal structure, which makes the 1k-element case particularly slow.

from_list

| old / new       | 10        | 100       | 1k       | 10k       | 100k     | 1m        |
| --------------- | --------- | --------- | -------- | --------- | -------- | --------- |
| node / real     | 295/251ns | 7.5/6.6us | 96/82us  | 958/988us | 17/11ms  | 259/139ms |
| bun / real      | 292/253ns | 8.8/6.2us | 104/59us | 1.2/0.8ms | 25/11ms  | 392/148ms |
| node / adjusted | 244/165ns | 4.2/4.5us | 52/42us  | 801/419us | 14/7.4ms | 213/102ms |
| bun / adjusted  | 195/312ns | 5.0/6.6us | 65/70us  | 853/532us | 16/8.1ms | 259/91ms  |

from_list shows the power of transients, making it over twice as fast as a copying implementation for large dictionaries. A version exceeding native Map's speed could be achieved, but would require making the single-insert case slower.
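Schematically, the transient-based bulk construction looks like this; the helper names are illustrative (the mutating insert was renamed during review) and the real function consumes a Gleam list rather than a plain iterable:

```js
// Sketch of building a dictionary through a transient. A transient pairs a
// tree with an "owner" token: nodes created during this build carry the token
// and may be mutated in place, while any shared nodes are still copied.
function fromEntries(entries) {
  let transient = newTransient();                       // empty tree + fresh owner token
  for (const [key, value] of entries) {
    transient = insertMutating(transient, key, value);  // mutates owned nodes in place
  }
  return freeze(transient);                             // drop the token; result is immutable
}
```

Only nodes allocated during the build are ever mutated, so the usual persistence guarantees still hold for everything the caller can observe.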

fold

| old / new | 10        | 100       | 1k       | 10k      | 100k      | 1m       |
| --------- | --------- | --------- | -------- | -------- | --------- | -------- |
| node      | 210/15ns  | 1.6/0.2us | 15/0.9us | 338/15us | 2.6/0.3ms | 51/2.0ms |
| bun       | 100/8.8ns | 1.2/0.2us | 12/0.9us | 110/14us | 1.9/0.4ms | 39/3.9ms |

The old fold implementation used to_list under the hood, so performance improved massively. Not only that, but the new dictionary can be iterated faster than built-in maps, too!
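For comparison, the new fold can walk each node's flat array directly instead of first materialising a list; schematically (illustrative shape, matching the node sketch above):

```js
// Sketch: fold by recursing through each node's flat array.
// Inline entries are [key, value] pairs; everything else is a child node.
function foldNode(node, initial, fun) {
  let accumulator = initial;
  for (const slot of node.array) {
    if (Array.isArray(slot)) {
      accumulator = fun(accumulator, slot[0], slot[1]); // inline entry
    } else {
      accumulator = foldNode(slot, accumulator, fun);   // child node, recurse
    }
  }
  return accumulator;
}
```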

Discussion

All of the tested operations perform at least as fast as the current implementation. Most show significant performance increases as the dictionaries get bigger, and (except for insert) this doesn't seem to come at a cost for small dictionaries. While CHAMP features some highly specialized update routines, it is still less than half as much code as the current version.

With the additional overhead eliminated, we can see that CHAMP is competitive with built-in maps for get and insert operations up to dictionaries of around 1000 elements. Many of the benchmarks show a significant overhead for both result construction and isEqual. Both versions of the dictionary readily beat built-in maps once iteration or persistence is involved.

Future work

While the immediate space for optimizations has been thoroughly explored, there might still be improvements by finding better mutating algorithms, or by exploring caching with the MEMCHAMP variant. While equality already uses the internal structure of the map, I suspect that similar optimizations could be done to all set-like operations (merge, union, intersection, difference, etc), using the internal structure to quickly combine nodes instead of working on the element level.

Since bulk operations are particularly fast, I think adding the missing merge_list or insert_from_list function might be useful. To avoid the overhead from results, a function similar to get_or_default could be provided.
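Neither function exists yet; a hypothetical get_or_default on the JavaScript side could avoid the Result allocation entirely (lookup and NOT_FOUND are made-up internals here):

```js
// Hypothetical, not part of this PR: return the stored value or a
// caller-supplied fallback without ever constructing an Ok/Error wrapper.
const NOT_FOUND = Symbol("not found");

export function get_or_default(dict, key, fallback) {
  const found = lookup(dict, key, NOT_FOUND); // sketch: internal lookup with a sentinel default
  return found === NOT_FOUND ? fallback : found;
}
```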

Moving the dictionary into the compiler and parameterising it over the getHash and isEqual functions used would allow the compiler to inject monomorphised versions of these functions for the concrete key type, skipping the generic implementations altogether. Long term, escape analysis or similar techniques could be used to insert transients automatically.

While working on the dictionary, I noticed that the tag is ignored in the hash code of all custom types, meaning that Ok(0) and Error(0) hash to the same value. This mostly affects variants without attached data, which all hash to 0, causing the dictionary to degrade to a linear search. I'll open up a discussion at some point since I have some additional thoughts about the hash and equals functions ^^
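For illustration only (this is not what the prelude currently does, and fixing it is out of scope for this PR), mixing the variant tag into the hash could look something like this; hashString and getHash stand in for the real helpers:

```js
// Illustrative: fold the variant's tag (here its constructor name) into the
// hash before its fields, so Ok(0) and Error(0), or two different no-field
// variants, no longer collide.
function hashCustomType(value) {
  let hash = hashString(value.constructor.name);        // start from the tag
  for (const field of Object.values(value)) {
    hash = (Math.imul(hash, 31) + getHash(field)) | 0;  // then mix in each field
  }
  return hash;
}
```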

Related work

The results presented here and by Steindorfer and Vinju have been confirmed independently by the Scala, Clojure, and ClojureScript communities, all implementing a variant of CHAMP as their default HashMap implementation. I also referenced this Go implementation for its handling of transients.

Appendix: Benchmark Code

import { run, bench, boxplot, lineplot, summary, do_not_optimize } from "mitata";
import * as New from "./build/dev/javascript/gleam_stdlib/gleam/dict.mjs";
import * as Old from "./old-dict.mjs";

import * as List from "./build/dev/javascript/gleam_stdlib/gleam/list.mjs";
import { toList } from "./build/dev/javascript/prelude.mjs";

lineplot(() => {
  summary(() => {
    bench(`new.get($size)`, function* (state) {
      const size = state.get("size");
      const dict = New.from_list(List.map(List.range(1, size), (i) => [i, i]));
      yield {
        [0]: () =>
          Array.from(
            { length: 1000 },
            () => 1 + Math.trunc(Math.random() * size),
          ),
        bench: (is) => {
          for (let i = 0; i < 1000; ++i) do_not_optimize(New.get(dict, is[i]));
        },
      };
    })
      .range("size", 10, 1000000, 10)
      .gc("inner");
    bench(`old.get($size)`, function* (state) {
      const size = state.get("size");
      const dict = Old.from_list(List.map(List.range(1, size), (i) => [i, i]));
      yield {
        [0]: () =>
          Array.from(
            { length: 1000 },
            () => 1 + Math.trunc(Math.random() * size),
          ),
        bench: (is) => {
          for (let i = 0; i < 1000; ++i) do_not_optimize(Old.get(dict, is[i]));
        },
      };
    })
      .range("size", 10, 1000000, 10)
      .gc("inner");
    bench(`map.get($size)`, function* (state) {
      const size = state.get("size");
      const dict = new Map(List.map(List.range(1, size), (i) => [i, i]));
      yield {
        [0]: () =>
          Array.from(
            { length: 1000 },
            () => 1 + Math.trunc(Math.random() * size),
          ),
        bench: (is) => {
          for (let i = 0; i < 1000; ++i) do_not_optimize(dict.get(is[i]));
        },
      };
    }).range("size", 10, 1000000, 10);
  });
});

lineplot(() => {
  summary(() => {
    bench(`old.insert($size)`, function* (state) {
      const size = state.get("size");
      const dict = Old.from_list(List.map(List.range(1, size), (i) => [i, i]));
      yield {
        [0]: () => 1 + Math.trunc(Math.random() * 0xffffff),
        bench: (x) => {
          for (let i = -1000; i < 0; ++i) do_not_optimize(Old.insert(dict, x, i));
        },
      };
    })
      .range("size", 10, 1000000, 10)
      .gc("inner");

    bench(`new.insert($size)`, function* (state) {
      const size = state.get("size");
      const dict = New.from_list(List.map(List.range(1, size), (i) => [i, i]));
      yield {
        [0]: () => 1 + Math.trunc(Math.random() * 0xffffff),
        bench: (x) => {
          for (let i = -1000; i < 0; ++i) do_not_optimize(New.insert(dict, x, i));
        },
      };
    })
      .range("size", 10, 1000000, 10)
      .gc("inner");
    bench(`map.insert($size)`, function* (state) {
      const size = state.get("size");
      const dict = new Map(List.map(List.range(1, size), (i) => [i, i]));
      yield {
        [0]: () => Math.trunc(Math.random() * 0xffffff),
        bench: (i) => {
          dict.set(i, i);
          return do_not_optimize(dict);
        },
      };
    })
      .range("size", 10, 1000000, 10)
      .gc("inner");
  });
});

lineplot(() => {
  summary(() => {
    bench(`new.from_list($size)`, function* (state) {
      const size = state.get("size");

      const list = List.map(List.range(1, size), (i) => [i, i]);

      yield {
        [0]: () => list,
        bench: (list) => do_not_optimize(New.from_list(list)),
      };
    })
      .range("size", 10, 1000000, 10)
      .gc("inner");

    bench(`old.from_list($size)`, function* (state) {
      const size = state.get("size");

      const list = List.map(List.range(1, size), (i) => [i, i]);

      yield {
        [0]: () => list,
        bench: (list) => do_not_optimize(Old.from_list(list)),
      };
    })
      .range("size", 10, 1000000, 10)
      .gc("inner");

    bench(`map.from_list($size)`, function* (state) {
      const size = state.get("size");

      const list = List.map(List.range(1, size), (i) => [i, i]);

      yield {
        [0]: () => list,
        bench: (list) => do_not_optimize(new Map(list)),
      };
    })
      .range("size", 10, 1000000, 10)
      .gc("inner");
  });
});

lineplot(() => {
  summary(() => {
    bench(`new.fold($size)`, function* (state) {
      const size = state.get("size");

      const dict = New.from_list(List.map(List.range(1, size), (i) => [i, i]));

      yield {
        [0]: () => dict,
        bench: (dict) => do_not_optimize(New.fold(dict, 0, (s, k, v) => s + v)),
      };
    })
      .range("size", 10, 1000000, 10)
      .gc("inner");

    bench(`old.fold($size)`, function* (state) {
      const size = state.get("size");

      const dict = Old.from_list(List.map(List.range(1, size), (i) => [i, i]));

      yield {
        [0]: () => dict,
        bench: (dict) => do_not_optimize(Old.fold(dict, 0, (s, k, v) => s + v)),
      };
    })
      .range("size", 10, 1000000, 10)
      .gc("inner");

    bench(`map.fold($size)`, function* (state) {
      const size = state.get("size");

      const dict = new Map(List.map(List.range(1, size), (i) => [i, i]));

      yield {
        [0]: () => dict,
        bench: (dict) => {
          let s = 0;
          for (const [k, v] of dict.entries()) {
            s = s + v;
          }
          return do_not_optimize(s);
        },
      };
    })
      .range("size", 10, 1000000, 10)
      .gc("inner");
  });
});

lineplot(() => {
  summary(() => {
    bench(`old.remove($size)`, function* (state) {
      const size = state.get("size");
      const dict = Old.from_list(List.map(List.range(1, size), (i) => [i, i]));
      yield {
        [0]: () => Math.trunc(Math.random() * size) + 1,
        bench: (x) => {
          do_not_optimize(Old.delete$(dict, x));
        },
      };
    })
      .range("size", 10, 1000000, 10)
      .gc("inner");

    bench(`new.remove($size)`, function* (state) {
      const size = state.get("size");
      const dict = New.from_list(List.map(List.range(1, size), (i) => [i, i]));
      yield {
        [0]: () => Math.trunc(Math.random() * size) + 1,
        bench: (x) => {
          do_not_optimize(New.delete$(dict, x));
        },
      };
    })
      .range("size", 10, 1000000, 10)
      .gc("inner");
  });
});

await run();

@yoshi-monster (Contributor, Author)

It appears I used the new FFI syntax while the test runner is still using v1.11. Does the standard library support older Gleam versions as well?

@GearsDatapacks (Member)

Nope; you can update the CI version and the gleam version in gleam.toml to 1.13. We'll need to update the rest of the stdlib FFI at some point too.

@GearsDatapacks (Member)

Ah yes, you'll need to reformat that file to use the latest formatting style also

@yoshi-monster (Contributor, Author)

Yeah, thank you :D

@lpil (Member) left a comment

This is fantastic! I am so impressed! Thank you

I wonder if there's any further testing we want to do to check for regressions here? Property testing or larger example test suite? Or are we confident enough with the existing tests?


@external(erlang, "maps", "put")
@external(javascript, "../dict.mjs", "put")
fn put(key: k, value: v, transient: TransientDict(k, v)) -> TransientDict(k, v)
Member:

Could you give this a clearer name please, put is quite nondescript 🙏 One that communicates it will mutate when possible would be fab.

src/dict.mjs Outdated
case COLLISION_NODE:
return withoutCollision(root, key);
}
export function put(key, value, transient) {
Member:

Same as above, a clearer name please 🙏

src/dict.mjs Outdated
}
++i;

function doPut(transient, node, key, value, hash, shift) {
Member:

Is there a better name for this? And also for doRemove?

If not, can we call it insert please, as that's the term we use. put is an Erlang term so good not to mix up our basic terminology.

@yoshi-monster (Contributor, Author)

Thank you!!

The existing tests were already super useful for finding bugs and I thought they were quite good already; for iv, most of the bugs I had happened whenever the internal structure had to change, especially when the depth of the tree changes. We could maybe do some probabilistic tests around those places, where we generate random sequences of inserts/deletes and compare the results to proplists?
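Something along these lines, written here against the compiled JavaScript module like the benchmark harness (the real test would live in the Gleam test suite):

```js
// Sketch of a randomised check: apply the same random insert/delete sequence
// to the new dict and to a plain array of [key, value] pairs, then compare.
import * as Dict from "./build/dev/javascript/gleam_stdlib/gleam/dict.mjs";

let dict = Dict.new$();
let model = []; // the "proplist" reference model
for (let step = 0; step < 10_000; ++step) {
  const key = Math.trunc(Math.random() * 100); // small key space, so inserts and deletes overlap
  if (Math.random() < 0.7) {
    dict = Dict.insert(dict, key, step);
    model = [[key, step], ...model.filter(([k]) => k !== key)];
  } else {
    dict = Dict.delete$(dict, key);
    model = model.filter(([k]) => k !== key);
  }
  // a fuller test would also compare every lookup, not just the sizes
  if (Dict.size(dict) !== model.length) throw new Error(`size mismatch at step ${step}`);
}
```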

I hope the names are better now, I'm really bad at those

~ 💜

@lpil (Member) commented Nov 23, 2025

Love the names, thank you.

> maybe do some probabilistic tests around those places where we generate random sequences of inserts/deletes and compare the results to proplists?

Could be useful perhaps! What do you think? Do we have confidence to merge this now or do we want to do more beforehand?

@yoshi-monster (Contributor, Author)

Hi! Sorry this took so long.

I added some tests that generate a random sequence of insert/delete operations and compare the result to proplists. I ran those a bunch locally and nothing broke (yet), so now I'm relatively confident that nothing will immediately break :)

@lpil (Member) commented Dec 5, 2025

idk why you're saying sorry 😁

@lpil (Member) left a comment

This is such fantastic work!!! Thank you Yoshie! You continue to amaze me!

@inoas-nbw

Thank you @yoshi-monster 💚

@lpil merged commit baea2ff into gleam-lang:main on Dec 5, 2025
7 checks passed