Minor performance improvement in binary fuse build #53
Conversation
Ok. I had an AI write the following Python script:

```python
#!/usr/bin/env python3
import subprocess
import statistics
import collections


def run_benchmark():
    """Run the benchmark command and return the output."""
    result = subprocess.run(
        ["go", "test", "-run=^$", "-bench=BenchmarkBinaryFusePopulate"],
        capture_output=True,
        text=True
    )
    return result.stdout


def extract_mkeys_per_sec(output):
    """Extract benchmark name and MKeys/s values from the benchmark output."""
    pairs = []
    lines = output.split('\n')
    for line in lines:
        if 'MKeys/s' in line:
            parts = line.split()
            # The benchmark name is the first part, without the suffix
            name = parts[0].rsplit('-', 1)[0]
            # Find the value before 'MKeys/s'
            for i, part in enumerate(parts):
                if part == 'MKeys/s':
                    try:
                        value = float(parts[i-1])
                        pairs.append((name, value))
                    except (ValueError, IndexError):
                        pass
    return pairs


def main():
    mkeys_dict = collections.defaultdict(list)
    num_runs = 10
    print(f"Running benchmark {num_runs} times...")
    for i in range(num_runs):
        print(f"Run {i+1}/{num_runs}")
        output = run_benchmark()
        pairs = extract_mkeys_per_sec(output)
        for name, val in pairs:
            mkeys_dict[name].append(val)
    print("\n")  # Blank line before the tables
    for bits in [8, 16]:
        print(f"## {bits}-bit")
        print("| n | Min | Max | Average | Median |")
        print("|-----|-----|-----|---------|--------|")
        for num in [10000, 100000, 1000000]:
            key = f"BenchmarkBinaryFusePopulate/{bits}/n={num}"
            if key in mkeys_dict and mkeys_dict[key]:
                vals = mkeys_dict[key]
                min_val = min(vals)
                max_val = max(vals)
                avg_val = statistics.mean(vals)
                med_val = statistics.median(vals)
                print(f"| {num} | {min_val:.2f} | {max_val:.2f} | {avg_val:.2f} | {med_val:.2f} |")
            else:
                print(f"| {num} | N/A | N/A | N/A | N/A |")
        print()  # Empty line between tables


if __name__ == "__main__":
    main()
```

It runs the benchmarks 10 times and gives me some stats about the number of millions of keys per second populated. Using Go 1.25 and an Apple M4... For your PR I get...

8-bit
16-bit
From our main branch I get... 8-bit
16-bit
So, interestingly, I am getting no gain whatsoever in the 8-bit case, but a measurable gain in the 16-bit case when we have lots of data. Could you run my script or the equivalent and see what you find?
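For reference, the extraction logic in the script expects `go test -bench` lines that end with a custom metric. Here is a minimal sketch of that parsing step, applied to a hypothetical output line (the numbers are made up for illustration):

```python
# A hypothetical `go test -bench` output line (numbers are made up),
# run through the same extraction logic the script uses.
line = "BenchmarkBinaryFusePopulate/8/n=1000000-10  40  29012345 ns/op  34.50 MKeys/s"

parts = line.split()
name = parts[0].rsplit('-', 1)[0]                 # drop the -GOMAXPROCS suffix
value = float(parts[parts.index('MKeys/s') - 1])  # number preceding the unit

print(name, value)  # BenchmarkBinaryFusePopulate/8/n=1000000 34.5
```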
The output in the PR description was generated with benchstat. It shows the p-value derived from the Mann-Whitney U-test (the probability that you'd get the difference by chance). Anyway, here's the output of your script on my Apple M1:

This PR, 8-bit
This PR, 16-bit
main, 8-bit
main, 16-bit
It's completely possible that the M4's improved architecture makes the local-table solution just as good. It is odd that we see an improvement only in the 16-bit case; the bit width shouldn't make the change any more or less effective. I think server machines are more important to look at. I already posted results for GCE AMD-based machines. Overall it looks like a small improvement, but it's close enough that it's not too convincing. No problem whatsoever if you decide to close this. On a more general note, I tried a bunch of potential improvements, but the code looks like it is already exceptionally well optimized. I had another small potential improvement - to skip the
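As an aside, the p-value benchstat reports from the Mann-Whitney U-test can be sketched in a few lines. This is an illustrative pure-Python version using the normal approximation without tie correction, with made-up sample data (benchstat's actual implementation is more careful):

```python
# Illustrative sketch of a two-sided Mann-Whitney U test via the
# normal approximation (no tie correction). Sample data is made up;
# this only demonstrates the idea behind the reported p-value.
import math

def mann_whitney_p(a, b):
    """Approximate probability of a rank difference at least this
    large arising if a and b come from the same distribution."""
    n1, n2 = len(a), len(b)
    # U counts, over all cross-sample pairs, how often a wins (ties count 0.5).
    u = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)
    mu = n1 * n2 / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mu) / sigma
    # Two-sided p-value via the standard normal CDF.
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

before = [34.5, 34.2, 34.8, 34.6, 34.4]  # hypothetical MKeys/s samples
after  = [36.6, 36.4, 36.7, 36.5, 36.6]
print(f"{mann_whitney_p(before, after):.4f}")  # small p: unlikely by chance
```

With fully separated samples like these, the approximation yields a p-value well below 0.05; identical samples yield a p-value of 1.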
During peeling, we can calculate the third index by XORing all known indexes instead of looking it up in the little table. This yields a modest improvement.

On an Apple M1:

```
❯ benchstat /tmp/before.txt /tmp/after.txt
name                               old time/op    new time/op    delta
BinaryFusePopulate/8/n=1000000-10    29.0ms ± 5%    27.3ms ± 1%  -5.77%  (p=0.000 n=10+9)

name                               old MKeys/s    new MKeys/s    delta
BinaryFusePopulate/8/n=1000000-10      34.5 ± 5%      36.6 ± 1%  +6.06%  (p=0.000 n=10+9)
```

On an n4d-standard-8 (AMD Turin):

```
name                              old time/op    new time/op    delta
BinaryFusePopulate/8/n=1000000-8    34.9ms ± 1%    34.4ms ± 1%  -1.44%  (p=0.000 n=10+9)

name                              old MKeys/s    new MKeys/s    delta
BinaryFusePopulate/8/n=1000000-8      28.7 ± 1%      29.1 ± 1%  +1.45%  (p=0.000 n=10+9)
```
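The XOR observation can be illustrated in isolation. This is a hypothetical Python sketch of the idea, not the PR's actual Go code: if the XOR of a key's three slot indexes is stored, any one index can be recovered from the other two with XOR alone, with no lookup table:

```python
# Hypothetical sketch (not the PR's actual Go code): a key occupies
# three slots h0, h1, h2. Instead of a small lookup table mapping
# "which slot am I?" to the other two, keep t = h0 ^ h1 ^ h2 and
# recover the remaining index by XOR.
h0, h1, h2 = 17, 42, 99   # made-up slot indexes for one key
t = h0 ^ h1 ^ h2          # XOR of all three indexes

# Given any two known indexes, XOR with t yields the third:
assert t ^ h0 ^ h1 == h2
assert t ^ h1 ^ h2 == h0
assert t ^ h0 ^ h2 == h1
print("third index recovered:", t ^ h0 ^ h1)  # 99
```

This works because XOR is its own inverse: XORing `t` with two of the indexes cancels them out, leaving the third.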
Force-pushed from 787d8e2 to d75fdf8.
Rebased onto the newest main.
I was experimenting and hit a Go bug 😆 golang/go#77181
What do you think of something like this instead: https://github.com/RaduBerinde/xorfilter/pull/new/other-index-2

I get a more significant improvement on n4d (AMD Turin) but a smaller improvement on M1:

N4d:

On N4 (Intel):
I am working on a branchless version of the above. Let me close this for now. |
Just to be clear, I wasn't about to close this or reject it or any such thing. :-) You closed it. Ok, let us move to the other PR.