Reduce heap allocations #28
Conversation
In the 2^32 benchmark this avoids about 500k temporary heap allocations during key generation when running for about 30s. They likely cost only a little performance, but we can avoid them without making the code much more complicated.
Each rayon worker job had to allocate the full `packed_leaf_input`. We now use `for_each_init` to preallocate a vector for every rayon worker instead. We overwrite the entire vector in every job, so there is no need to even `fill(0)` it per job. This drops another ~100k allocations when running the 2^32 bench over 30s, bringing us down to only 3k temporary allocations total in that time frame.
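A minimal sketch of the pattern described above, with a hypothetical buffer length and job type (the real `compute_tree_leaves` signature differs):

```rust
use rayon::prelude::*;

// Hypothetical buffer length, for illustration only.
const PACKED_LEAF_INPUT_LEN: usize = 64;

fn compute_leaves_sketch(jobs: &[u64]) {
    jobs.par_iter().for_each_init(
        // One scratch buffer per rayon worker job batch; note that `init`
        // may still run more than once per thread (rayon-rs/rayon#742).
        || vec![0u64; PACKED_LEAF_INPUT_LEN],
        |packed_leaf_input, job| {
            // Every element is overwritten, so no `fill(0)` is needed.
            for (i, slot) in packed_leaf_input.iter_mut().enumerate() {
                *slot = job.wrapping_add(i as u64);
            }
            // ... hash `packed_leaf_input` into a leaf here ...
        },
    );
}
```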
This way we essentially avoid all allocations, i.e. we get a single allocation per thread. `for_each_init` is known to call its init closure multiple times per thread due to rayon's work stealing / splitting approach. See: rayon-rs/rayon#742
No need for a `Vec` in these two branches, as we know at compile time how much data each input requires. This is only relevant if `apply` is part of a hot code path, which is normally unlikely. Still, the code is not significantly longer, only a bit uglier :( It gets rid of a large number of allocations when running the 2^8 benchmark case.
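The general idea, as a sketch with made-up names and sizes (not the actual `apply` code): when the length is a compile-time constant, a stack array replaces the heap `Vec`.

```rust
// Hypothetical input length known at compile time.
const MSG_LEN: usize = 24;

fn pack_fixed(msg: &[u8; MSG_LEN]) -> [u8; MSG_LEN + 1] {
    // Before (conceptually): let mut buf = Vec::with_capacity(MSG_LEN + 1);
    // After: a stack array, so no heap allocation at all.
    let mut buf = [0u8; MSG_LEN + 1];
    buf[0] = 0x01; // hypothetical domain separator
    buf[1..].copy_from_slice(msg);
    buf
}
```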
Can't hurt to have this in here.
Somehow this is a case where cargo fmt has no opinion. Earlier, using `for_each_init` changed the indentation, but this part didn't want to "come back" to what it was before...
Add examples following the benchmarks for the smallest and largest case.
examples/single_keygen.rs
Outdated
I think we can remove this file, no? I imagine it's just used for the benchmark, but it shouldn't be pushed to main, no?
Sure, we can do that. I personally find it nice to have files for easy profiling like that, but they are also easy to write again. I'll remove them.
examples/single_keygen_2_32.rs
Outdated
I think we can remove this file too, no? Same as above: just used for the benchmark, but it shouldn't be pushed to main.
Cargo.toml
Outdated
```toml
p3-koala-bear = { git = "https://github.com/Plonky3/Plonky3.git", rev = "a33a312" }
p3-symmetric = { git = "https://github.com/Plonky3/Plonky3.git", rev = "a33a312" }

thread_local = "1.1.9"
```
I find it very risky to rely on this kind of external crate. We should minimize the number of external deps, and this one looks like a personal project, so I find it pretty risky. I would love to avoid this and would prefer another alternative such as `for_each_init` or something equivalent in std.
Yeah, that's what I thought and why I kept the `for_each_init` approach. I'll have a look at the std `thread_local!` macro: https://doc.rust-lang.org/std/macro.thread_local.html.
Using the stdlib macro now.
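A minimal sketch of the std approach, with placeholder names and a hypothetical buffer length:

```rust
use std::cell::RefCell;

// Hypothetical buffer length, for illustration only.
const PACKED_LEAF_INPUT_LEN: usize = 64;

thread_local! {
    // Allocated lazily, once per OS thread (so once per rayon worker).
    static PACKED_LEAF_INPUT: RefCell<Vec<u64>> =
        RefCell::new(vec![0u64; PACKED_LEAF_INPUT_LEN]);
}

fn process_job_sketch(job: u64) {
    PACKED_LEAF_INPUT.with(|buf| {
        let mut buf = buf.borrow_mut();
        // The whole buffer is overwritten, so no reset to zero is needed.
        for (i, slot) in buf.iter_mut().enumerate() {
            *slot = job.wrapping_add(i as u64);
        }
        // ... hash the buffer here ...
    });
}
```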
Pushed the changes addressing all comments!
```rust
for (i, x) in it.remainder().iter().enumerate() {
    state[i] += *x;
}
// was a remainder, so permute. No need to mutate `state` as we *add* only anyway
```
What do you mean here by "no need to mutate state"? Since you pass a mutable reference of state to the permutation?
The point is that before, we zero-padded. But because we only *add* to the state, we don't have to add the zeroes for the previously padded data.
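A sketch of that argument, with plain integers standing in for field elements and placeholder names: zero-padding the last partial chunk and then adding it is the same as adding only the remainder elements.

```rust
// `state` stands in for the sponge state, `remainder` for the final
// partial chunk; integers replace field elements for illustration.
fn absorb_remainder(state: &mut [u64], remainder: &[u64]) {
    // Padding `remainder` with zeros up to the rate and adding the padded
    // chunk would add 0 to the trailing positions, a no-op. So we add
    // only the elements that are actually there.
    for (i, x) in remainder.iter().enumerate() {
        state[i] = state[i].wrapping_add(*x);
    }
    // ... permute `state` afterwards, as in the snippet above ...
}
```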
I agree with @tcoratger that this comment needs to be modified
src/symmetric/tweak_hash/poseidon.rs
Outdated
```rust
if out_idx < OUT_LEN {
    perm.permute_mut(&mut state);
}
```
To match exactly the logic we had before, I think we don't need the if statement. This piece of code is an important factor for security, so we should not diminish the number of permutations. Due to the if after `out_idx += chunk_size`, I think the number of permutations was not exactly the same.
A good exercise for this kind of sensitive refactoring is to check the outputs of a call before and after the modification: they should be the same, since we didn't change any logic here.
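A sketch of that exercise, with hypothetical function names (the pre-refactor implementation retained as `poseidon_sponge_old` alongside a `poseidon_sponge_new` until equality is confirmed):

```rust
#[cfg(test)]
mod refactor_equivalence {
    use super::*;

    #[test]
    fn new_sponge_matches_old() {
        // Cover empty input, partial, exact, and multi-chunk lengths.
        for len in [0usize, 1, 7, 8, 9, 63, 64, 65] {
            let input: Vec<u64> = (0..len as u64).collect();
            assert_eq!(
                poseidon_sponge_old(&input),
                poseidon_sponge_new(&input),
                "output mismatch for input length {len}"
            );
        }
    }
}
```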
Suggested change:
```diff
-if out_idx < OUT_LEN {
-    perm.permute_mut(&mut state);
-}
+perm.permute_mut(&mut state);
```
Maybe I screwed that up. Will check next year, sorry.
So, finally had a look at this again.
The point is that we don't need to do that permutation, because `state` is purely a local variable. The permutation would happen a final time after the last data has been copied into `out`. Yes, the code on main does that, but to absolutely no effect: the output data is unchanged. And since `state` is not passed into the function, we don't mutate anything relating to further calculations either. We just get rid of a useless permutation.
When I wrote the code I read `extra_elements` as the remainder elements, not as the elements to be padded. Oops. In the previous commit, `remainder` would be non-zero when the input length was an exact multiple of the rate. It's cleaner to just use the remainder iterator directly here too and get rid of the variable.
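A sketch of the squeeze loop as argued above, with placeholder types (integers instead of field elements) and a free function standing in for `perm.permute_mut`:

```rust
fn permute(_state: &mut [u64]) {
    // stand-in for perm.permute_mut(&mut state)
}

fn squeeze<const OUT_LEN: usize>(state: &mut [u64], rate: usize) -> [u64; OUT_LEN] {
    let mut out = [0u64; OUT_LEN];
    let mut out_index = 0;
    while out_index < OUT_LEN {
        let take = rate.min(OUT_LEN - out_index);
        out[out_index..out_index + take].copy_from_slice(&state[..take]);
        out_index += take;
        // Only permute if more output is still needed: `state` is local,
        // so a permutation after the final copy is unobservable.
        if out_index < OUT_LEN {
            permute(state);
        }
    }
    out
}
```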
Addressed your comment about the permutations and fixed an actual bug in the remainder handling: I mistook `extra_elements` for the remainder elements.
tcoratger left a comment
Thanks, looks good to me. I compared locally to check that the new `poseidon_sponge` function gives the exact same result as the old one across the whole test suite.
@b-wagn As it touches the `poseidon_sponge` function, could you take a look as well?
src/symmetric/tweak_hash/poseidon.rs
Outdated
```rust
// 3. squeeze
let mut out = [A::ZERO; OUT_LEN];
let mut out_idx = 0;
```
Throughout the codebase we mostly used "index" and not "idx". We should try to keep such conventions.
src/symmetric/tweak_hash/poseidon.rs
Outdated
```rust
// 1. fill in all full chunks and permute
let mut it = input.chunks_exact(rate);
for chunk in &mut it {
    // iterate the chunks
```
I don't think this comment helps in any way, tbh. What does "iterate the chunks" mean? It was easier for me to just read the line of code below.
Maybe: "add the chunk elements into the first rate many elements of the state."
Done.
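For reference, the absorb loop with the agreed-upon comment might read roughly like this (a sketch with an integer state standing in for field elements and the permutation elided):

```rust
fn absorb_full_chunks(input: &[u64], state: &mut [u64], rate: usize) {
    // 1. fill in all full chunks and permute
    let mut it = input.chunks_exact(rate);
    for chunk in &mut it {
        // add the chunk elements into the first `rate` many elements of the state
        for (s, x) in state.iter_mut().zip(chunk) {
            *s = s.wrapping_add(*x);
        }
        // perm.permute_mut(&mut state) would follow here
    }
    // `it.remainder()` is handled afterwards, as discussed above
    let _ = it.remainder();
}
```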
b-wagn left a comment
Left a few comments on the Poseidon Sponge. I think it would be good to get comments from @khovratovich as well, as he wrote the function initially.
```rust
let mut out = vec![];
while out.len() < OUT_LEN {
    out.extend_from_slice(&state[..rate]);
```

```rust
// 2. fill the remainder and extend with zeros
```
I would modify as follows:
2. Fill the remainder and pad with zeros. NOTE: this padding is secure for constant-size inputs (as in this application) but may be insecure elsewhere.
After looking at performance profiling yesterday, I decided to have a look at heap profiling today. I didn't expect to find much that actually affects performance, but I did find some unnecessary heap allocations. The new code is barely longer.
Essentially we now avoid heap allocations in:
- `poseidon_sponge`
- `apply`, in 2 of 3 code branches (not easily doable in the general hashing of N runtime values)
- `compute_tree_leaves`, which now avoids reallocating the `packed_leaf_input` vector for each rayon job. We now use the `thread_local` crate for thread local storage, so each thread only allocates the vector once. Because we fully overwrite the vector anyway, we don't even need to set it back to zero. Initially I used rayon's `for_each_init` here. While that works, it does not guarantee to init only once per worker, because of how rayon does work stealing and splitting: `for_each_init` calls `init` several times per thread (rayon-rs/rayon#742). The code that used `for_each_init` is still in commit Vindaar@e3e5ceb.

I've added two simple examples, which only do the keygen step for 2^8 and 2^32 elements, to make it easier to heap profile and performance profile exactly that.
For the 2^8 case there were ~1900 total allocations before these fixes; of those we get rid of ~160. For the 2^32 case we go from about 58000 temporary allocations to 41.
Heaptrack output for 2^32 before
Heaptrack output for 2^32 after
Note on performance
As one can expect, `malloc` / `memmove` etc. were only small fractions in the performance profiling (#27). So it is no surprise that performance is essentially identical for our current benchmark setup. But depending on how the code is used in other setups, and given that the code is not significantly more complex, it may still be considered a win.