-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Misc hash / hash aggregation performance improvements #19910
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
run benchmarks |
|
🤖 |
| let entry = self.map.find_mut(hash, |header| { | ||
| // compare value if hashes match | ||
| if header.len != value_len { | ||
| if header.hash != hash { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment says "compare value if hashes match" but actual implementation is comparing lengths 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am double checking the reasoning, if the hash values aren't the same we know the values can not be the same either due to the requirements of the Hash trait
However (as codex points out to me) there is some small subtle chance that if two different values collide on hash, and their inline encodings match, they will be treated as equal even though the byte sequences differ which is what the old len check avoided
So I think we still need the old length checks too 🤔
| let entry = self.map.find_mut(hash, |header| { | ||
| // compare value if hashes match | ||
| if header.len != value_len { | ||
| if header.hash != hash { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should save some random access if a lot of values share the same length
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we also need to update this place to check header.len
| let v = self.builder.get_value(header.view_idx); | ||
|
|
||
| if v.len() != value.len() { | ||
| if header.hash != hash { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should save some random access if a lot of values share the same length
|
run benchmark sql_aggregation |
|
🤖 Hi @Dandandan, thanks for the request (#19910 (comment)).
Please choose one or more of these with |
|
run benchmark aggregate_query_sql aggregate_vectorized |
|
🤖: Benchmark completed Details
|
|
run benchmarks |
|
🤖 |
|
🤖: Benchmark completed Details
|
|
🤖 |
|
Benchmark script failed with exit code 101. Last 10 lines of output: Click to expand |
|
🤖 |
|
🤖: Benchmark completed Details
|
|
run benchmarks |
|
🤖 |
|
🤖: Benchmark completed Details
|
|
Unfortunately the benchmark runner is very noisy lately 😢 |
|
run benchmark tpch tpcds |
|
🤖 |
|
🤖: Benchmark completed Details
|
|
🤖 |
|
🤖: Benchmark completed Details
|
|
run benchmark tpch_mem |
|
🤖 |
|
🤖: Benchmark completed Details
|
| hash, | ||
| |&(g, _)| unsafe { self.values.get_unchecked(g).is_eq(key) }, | ||
| |&(g, h)| unsafe { | ||
| hash == h && self.values.get_unchecked(g).is_eq(key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I double checked that Hashbrown says
https://docs.rs/hashbrown/latest/hashbrown/struct.HashTable.html#method.entry
This method will call eq for all entries with the given hash, but may also call it for entries with a different hash. eq should only return true for the desired entry, at which point the search is stopped.
So checking the hash first makes senes to me as it will avoid a memory access
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me
datafusion/common/src/utils/proxy.rs
Outdated
| self.reserve(bump_elements, &hasher); | ||
| } | ||
|
|
||
| // still need to insert the element since first try failed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see a first attempt to insert -- maybe this comment needs to be updated
|
Thank you @Dandandan |
| if header.hash != hash { | ||
| return false; | ||
| } | ||
| let v = self.builder.get_value(header.view_idx); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw in profiles this is pretty expensive (also self.builder.append_value and always comparing by bytes - I think we are not really using the potential of binary views here, I opened #19961
|
run benchmarks |
|
🤖 |
|
🤖: Benchmark completed Details
|
Which issue does this PR close?
Rationale for this change
Just a couple of optimizations for hash table lookups usage in hash aggregate.
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?