In Apache DataFusion we use hashbrown extensively, and I noticed that in hash aggregation, once the hash table (we use the raw API there) becomes very large, we are memory bound. A natural solution would be some kind of prefetching (FYI, we are on stable Rust), but there is currently no way to do that.
We have something like this (simplified):
```rust
for (row, &target_hash) in batch_hashes.iter().enumerate() {
    let entry = self.map.entry(
        target_hash,
        // eq
        |(exist_hash, group_idx)| {
            target_hash == *exist_hash && group_rows.row(row) == group_values.row(*group_idx)
        },
        // hasher
        |(hash, _)| *hash,
    );
    let group_idx = match entry {
        // Existing group_index for this group value
        Entry::Occupied(o) => {
            let (_hash, group_idx) = o.get();
            *group_idx
        }
        // Need to create a new entry for the group
        Entry::Vacant(v) => {
            // Add a new entry to aggr_state and save the newly created index
            let group_idx = group_values.num_rows();
            group_values.push(group_rows.row(row));
            v.insert((target_hash, group_idx));
            group_idx
        }
    };
    groups.push(group_idx);
}
```

Allowing us to prefetch from the map would help performance in our case.
Because we insert into the table on a miss, we would have to reserve capacity for that many items beforehand for the prefetching to be valuable (otherwise a resize between the prefetch and the probe would move the buckets).