
@RaduBerinde
Contributor

During peeling, we can calculate the third index by XORing all known
indexes instead of looking it up in the little table. This yields a
modest improvement.
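The trick can be illustrated with a tiny sketch (hypothetical index values, not the library's actual code): each key maps to three slot indexes, and keeping their XOR around lets us recover any one of them from the other two without consulting a table.

```python
# Hypothetical slot indexes for one key (illustration only).
i1, i2, i3 = 17, 42, 99

# Stored once, when the key is queued for peeling.
xor_all = i1 ^ i2 ^ i3

# During peeling, the third index falls out by XOR -- no table lookup.
recovered = xor_all ^ i1 ^ i2
assert recovered == i3
print(recovered)  # 99
```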

On an Apple M1:

```
❯ benchstat /tmp/before.txt /tmp/after.txt
name                               old time/op    new time/op    delta
BinaryFusePopulate/8/n=1000000-10    29.0ms ± 5%    27.3ms ± 1%  -5.77%  (p=0.000 n=10+9)

name                               old MKeys/s    new MKeys/s    delta
BinaryFusePopulate/8/n=1000000-10      34.5 ± 5%      36.6 ± 1%  +6.06%  (p=0.000 n=10+9)
```

On an n4d-standard-8 (AMD Turin):

```
name                              old time/op    new time/op    delta
BinaryFusePopulate/8/n=1000000-8    34.9ms ± 1%    34.4ms ± 1%  -1.44%  (p=0.000 n=10+9)

name                              old MKeys/s    new MKeys/s    delta
BinaryFusePopulate/8/n=1000000-8      28.7 ± 1%      29.1 ± 1%  +1.45%  (p=0.000 n=10+9)
```

@lemire
Member

lemire commented Jan 14, 2026

@RaduBerinde

Ok. I had an AI write the following Python script:

```python
#!/usr/bin/env python3
import subprocess
import statistics
import collections

def run_benchmark():
    """Run the benchmark command and return its stdout."""
    result = subprocess.run(
        ["go", "test", "-run=^$", "-bench=BenchmarkBinaryFusePopulate"],
        capture_output=True,
        text=True
    )
    return result.stdout

def extract_mkeys_per_sec(output):
    """Extract (benchmark name, MKeys/s value) pairs from the benchmark output."""
    pairs = []
    for line in output.split('\n'):
        if 'MKeys/s' in line:
            parts = line.split()
            # The benchmark name is the first field, minus the -GOMAXPROCS suffix.
            name = parts[0].rsplit('-', 1)[0]
            # The value immediately precedes the 'MKeys/s' unit.
            for i, part in enumerate(parts):
                if part == 'MKeys/s':
                    try:
                        value = float(parts[i-1])
                        pairs.append((name, value))
                    except (ValueError, IndexError):
                        pass
                    break
    return pairs

def main():
    mkeys_dict = collections.defaultdict(list)
    num_runs = 10

    print(f"Running benchmark {num_runs} times...")

    for i in range(num_runs):
        print(f"Run {i+1}/{num_runs}")
        output = run_benchmark()
        for name, val in extract_mkeys_per_sec(output):
            mkeys_dict[name].append(val)

    print()  # Blank line before the tables.

    for bits in [8, 16]:
        print(f"## {bits}-bit")
        print("| n | Min | Max | Average | Median |")
        print("|-----|-----|-----|---------|--------|")
        for num in [10000, 100000, 1000000]:
            key = f"BenchmarkBinaryFusePopulate/{bits}/n={num}"
            vals = mkeys_dict.get(key)
            if vals:
                print(f"| {num} | {min(vals):.2f} | {max(vals):.2f} | "
                      f"{statistics.mean(vals):.2f} | {statistics.median(vals):.2f} |")
            else:
                print(f"| {num} | N/A | N/A | N/A | N/A |")
        print()  # Blank line between tables.

if __name__ == "__main__":
    main()
```

It runs the benchmarks 10 times and reports statistics on the number of millions of keys populated per second.

Using Go 1.25 and an Apple M4...

For your PR I get...

8-bit

| n | Min | Max | Average | Median |
|---------|-------|-------|---------|--------|
| 10000 | 79.03 | 80.52 | 79.69 | 79.59 |
| 100000 | 58.44 | 60.09 | 59.18 | 59.22 |
| 1000000 | 48.55 | 50.28 | 49.64 | 49.82 |

16-bit

| n | Min | Max | Average | Median |
|---------|-------|-------|---------|--------|
| 10000 | 76.38 | 78.36 | 77.28 | 77.13 |
| 100000 | 53.61 | 55.28 | 54.33 | 54.30 |
| 1000000 | 43.02 | 45.92 | 44.96 | 45.13 |

From our main branch I get...

8-bit

| n | Min | Max | Average | Median |
|---------|-------|-------|---------|--------|
| 10000 | 71.69 | 80.71 | 78.35 | 79.72 |
| 100000 | 52.98 | 60.53 | 58.04 | 58.31 |
| 1000000 | 47.28 | 50.74 | 49.59 | 49.77 |

16-bit

| n | Min | Max | Average | Median |
|---------|-------|-------|---------|--------|
| 10000 | 72.80 | 79.98 | 77.18 | 77.91 |
| 100000 | 26.54 | 58.09 | 52.94 | 56.08 |
| 1000000 | 46.38 | 50.00 | 48.73 | 49.05 |

So, interestingly, I am getting no gain whatsoever in the 8-bit case, but a measurable gain in the 16-bit case when we have lots of data.

Could you run my script (or the equivalent) and see what you find?

@RaduBerinde
Contributor Author

RaduBerinde commented Jan 14, 2026

The output in the PR description was generated with benchstat (`go install github.com/cockroachdb/benchstat@latest`; it's forked from `golang.org/x/perf/cmd/benchstat`, I don't remember what we added). I do something like this:

```
$ go test . -run=- -bench=BinaryFusePopulate -count 10 |& tee after.txt
$ git checkout HEAD^
$ go test . -run=- -bench=BinaryFusePopulate -count 10 |& tee before.txt
$ benchstat before.txt after.txt
```

It shows the p-value derived from the Mann-Whitney U test (the probability of seeing a difference at least this large by chance if there were no real change).
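For illustration, here is a from-scratch sketch of that test: a Python reimplementation of the two-sided Mann-Whitney U test using the normal approximation (adequate for samples of roughly 8 or more). This is not benchstat's actual code, which lives in `golang.org/x/perf` and differs in detail.

```python
import math

def mann_whitney_p(a, b):
    """Two-sided p-value of the Mann-Whitney U test (normal approximation)."""
    # Pool both samples, remembering which sample each value came from.
    combined = sorted([(v, 0) for v in a] + [(v, 1) for v in b])

    # Assign 1-based ranks, giving tied values their average rank.
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        avg = (i + j + 1) / 2  # mean of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg
        i = j

    # Rank sum of the first sample, then the U statistic.
    r_a = sum(r for r, (_, src) in zip(ranks, combined) if src == 0)
    n_a, n_b = len(a), len(b)
    u = r_a - n_a * (n_a + 1) / 2

    # Normal approximation: compare U against its mean and standard deviation.
    mu = n_a * n_b / 2
    sigma = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    if sigma == 0:
        return 1.0
    z = abs(u - mu) / sigma
    return math.erfc(z / math.sqrt(2))  # two-sided tail probability

# Clearly separated samples give a small p; identical samples give p = 1.
print(mann_whitney_p([1, 2, 3, 4, 5], [10, 11, 12, 13, 14]) < 0.05)  # True
print(mann_whitney_p([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))              # 1.0
```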

Anyway, here's the output of your script on my Apple M1:

This PR, 8-bit

| n | Min | Max | Average | Median |
|---------|-------|-------|---------|--------|
| 10000 | 44.06 | 46.18 | 45.06 | 45.13 |
| 100000 | 20.26 | 40.17 | 37.81 | 39.77 |
| 1000000 | 34.02 | 37.09 | 36.36 | 36.60 |

This PR, 16-bit

| n | Min | Max | Average | Median |
|---------|-------|-------|---------|--------|
| 10000 | 44.94 | 49.35 | 47.88 | 48.68 |
| 100000 | 36.75 | 39.05 | 38.64 | 38.89 |
| 1000000 | 33.09 | 34.74 | 34.17 | 34.41 |

main, 8-bit

| n | Min | Max | Average | Median |
|---------|-------|-------|---------|--------|
| 10000 | 21.42 | 47.20 | 42.67 | 44.33 |
| 100000 | 35.98 | 38.49 | 37.32 | 37.12 |
| 1000000 | 33.25 | 35.33 | 34.13 | 33.98 |

main, 16-bit

| n | Min | Max | Average | Median |
|---------|-------|-------|---------|--------|
| 10000 | 41.11 | 46.73 | 44.10 | 44.21 |
| 100000 | 32.43 | 38.01 | 36.62 | 36.89 |
| 1000000 | 31.51 | 33.39 | 32.51 | 32.47 |

It's entirely possible that the M4's improved architecture makes the local-table solution just as good. It is odd that we see an improvement only in the 16-bit case; the element width shouldn't make the change any more or less effective.

I think server machines are more important to look at. I already posted for GCE AMD-based n4d machines.
Here's the output for an N4 machine (INTEL(R) XEON(R) PLATINUM 8581C CPU @ 2.10GHz):

```
name                               old MKeys/s    new MKeys/s    delta
BinaryFusePopulate/8/n=10000-8         25.9 ± 1%      25.4 ± 3%  -1.72%  (p=0.003 n=10+8)
BinaryFusePopulate/8/n=100000-8        23.3 ± 1%      23.1 ± 1%  -1.04%  (p=0.001 n=7+9)
BinaryFusePopulate/8/n=1000000-8       17.0 ± 2%      16.9 ± 2%    ~     (p=0.529 n=10+8)
BinaryFusePopulate/16/n=10000-8        27.5 ± 1%      26.8 ± 1%  -2.49%  (p=0.000 n=9+10)
BinaryFusePopulate/16/n=100000-8       23.9 ± 1%      24.1 ± 1%  +0.66%  (p=0.020 n=10+10)
BinaryFusePopulate/16/n=1000000-8      16.8 ± 2%      17.0 ± 2%    ~     (p=0.066 n=9+9)
```

Overall it looks like a small improvement, but it's close enough that it's not too convincing. No problem whatsoever if you decide to close this.

On a more general note, I tried a bunch of potential improvements, but the code looks like it is already exceptionally well optimized. I had another small potential improvement, skipping the `t2hash[index1]&t2hash[index2]&t2hash[index3] == 0` duplicate test in the first iteration, but the gain was even smaller than this PR's :)

@RaduBerinde
Contributor Author

Rebased onto the newest main.

@RaduBerinde
Contributor Author

I was experimenting and hit a Go bug 😆 golang/go#77181

@RaduBerinde
Contributor Author

RaduBerinde commented Jan 14, 2026

What do you think of something like this instead: https://github.com/RaduBerinde/xorfilter/pull/new/other-index-2

I get a more significant improvement on n4d (AMD Turin) but a smaller improvement on M1:
M1:

```
name                               old MKeys/s    new MKeys/s    delta
BinaryFusePopulate/8/n=1000000-10      36.6 ± 1%      37.9 ± 1%  +3.67%  (p=0.000 n=9+10)
```

N4d:

```
name                              old MKeys/s    new MKeys/s    delta
BinaryFusePopulate/8/n=1000000-8      28.6 ± 2%      32.2 ± 2%  +12.66%  (p=0.000 n=10+10)
```

On N4 (Intel):

```
name                              old MKeys/s    new MKeys/s    delta
BinaryFusePopulate/8/n=1000000-8      20.6 ± 3%      22.9 ± 3%  +11.01%  (p=0.000 n=10+10)
```

@RaduBerinde
Contributor Author

I am working on a branchless version of the above. Let me close this for now.

@lemire
Member

lemire commented Jan 15, 2026

> No problem whatsoever if you decide to close this.

Just to be clear, I wasn't about to close this or reject it or any such thing.

:-)

You closed it.

Ok, let us move to the other PR.
