
Allow to prefetch buckets #677

@rluvaton

Description

In Apache DataFusion we use hashbrown extensively, and I noticed that in hash aggregation, when the hash table (we use the raw API there) becomes very large, we are memory bound. A natural solution would be some kind of prefetching (FYI, we are on stable Rust), but there is currently no way to do that.

We have something like this (simplified):

```rust
for (row, &target_hash) in batch_hashes.iter().enumerate() {
    let entry = self.map.entry(
        target_hash,
        // eq
        |(exist_hash, group_idx)| {
            target_hash == *exist_hash && group_rows.row(row) == group_values.row(*group_idx)
        },
        // hasher
        |(hash, _)| *hash,
    );

    let group_idx = match entry {
        // Existing group_index for this group value
        Entry::Occupied(o) => {
            let (_hash, group_idx) = o.get();
            *group_idx
        }
        // Need to create new entry for the group
        Entry::Vacant(v) => {
            // Add new entry to aggr_state and save newly created index
            let group_idx = group_values.num_rows();
            group_values.push(group_rows.row(row));
            v.insert((target_hash, group_idx));
            group_idx
        }
    };
    groups.push(group_idx);
}
```

Being able to prefetch buckets from the map could help performance in our case.

Because we insert into the table on a miss, we would have to reserve capacity for the expected number of items beforehand for the prefetching to be valuable (otherwise a resize mid-batch would invalidate the prefetched bucket locations).
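A minimal sketch of the intended pattern, using `std::collections::HashMap` in place of the raw hashbrown API (keys stand in for group values). The `PREFETCH_DISTANCE` constant and the commented-out `prefetch_read` call are assumptions illustrating where the requested API would be used, not an existing hashbrown method:

```rust
use std::collections::HashMap;

// How far ahead of the current probe to issue a prefetch; a tuning assumption.
const PREFETCH_DISTANCE: usize = 8;

/// Assign a dense group index to every key in the batch, mirroring the
/// occupied/vacant logic of the loop above (simplified to plain u64 keys).
fn assign_group_ids(batch_keys: &[u64]) -> Vec<usize> {
    // Reserve up front so inserts cannot trigger a resize mid-batch,
    // which would invalidate any prefetched bucket locations.
    let mut map: HashMap<u64, usize> = HashMap::with_capacity(batch_keys.len());
    let mut groups = Vec::with_capacity(batch_keys.len());

    for (i, &key) in batch_keys.iter().enumerate() {
        // This is where a prefetch API would be used (hypothetical name):
        // if let Some(&ahead) = batch_keys.get(i + PREFETCH_DISTANCE) {
        //     map.prefetch_read(hash_of(ahead));
        // }

        let next_id = map.len();
        // Occupied => reuse the existing group index; Vacant => create a new one.
        let id = *map.entry(key).or_insert(next_id);
        groups.push(id);
    }
    groups
}
```

For example, `assign_group_ids(&[5, 5, 7, 5, 7])` yields `[0, 0, 1, 0, 1]`: each distinct key gets the next dense index on first sight and is reused afterwards.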
