Add Support for Bloom Filter for mosaic files #47
Open
ArnavBalyan wants to merge 2 commits into
Conversation
Member
Author
cc @JingsongLi @XiaoHongbo-Hope ty
Member
Author
Fixed format issue
Member
Author
cc @JingsongLi @XiaoHongbo-Hope gentle reminder could you pls take a look when possible thanks!
sure
+1
Member
Author
Thanks a lot! cc @JingsongLi could you also pls review when time permits |
Bloom filter structure:

Changes in the file format:
Algorithm
We use the Split Block Bloom Filter (SBBF) variant. The bitset is partitioned into 256-bit blocks, each made up of eight 32-bit words. Each insert touches exactly one block and sets one bit per word inside that block (8 bits total), so every probe is a single cache-line read followed by 8 mask-and-test operations.
Hashing uses xxHash64 with seed 0. The high half of the hash picks the block (Lemire fast range), and the low half drives the 8 bit positions via 8 hardcoded salts. The salts and seed match the Apache Parquet Bloom filter spec, so the bit pattern produced by a given input is byte-identical across the two formats.
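For reviewers who want to trace the bit math, here is a minimal Python sketch of the insert and probe paths described above. It is an illustration, not the code in this PR. The class and method names are hypothetical, and the sketch takes an already computed 64-bit hash instead of calling xxHash64, since that is not in the Python standard library. The salt constants are the ones published in the Apache Parquet Bloom filter spec.

```python
# Salt constants from the Apache Parquet Bloom filter spec.
SALTS = (
    0x47B6137B, 0x44974D91, 0x8824AD5B, 0xA2B7289D,
    0x705495C7, 0x2DF1424B, 0x9EFC4947, 0x5C6BFB31,
)
WORDS_PER_BLOCK = 8  # 8 x 32-bit words = one 256-bit block


class SplitBlockBloom:
    """Hypothetical SBBF sketch operating on precomputed 64-bit hashes."""

    def __init__(self, num_blocks):
        if num_blocks < 1:
            raise ValueError("need at least one block")
        self.num_blocks = num_blocks
        self.words = [0] * (num_blocks * WORDS_PER_BLOCK)

    def _block_base(self, h):
        # Lemire fast range: map the high 32 bits onto [0, num_blocks).
        block = ((h >> 32) * self.num_blocks) >> 32
        return block * WORDS_PER_BLOCK

    @staticmethod
    def _masks(h):
        # Low 32 bits times each salt (mod 2^32); the top 5 bits of the
        # product select one bit in the corresponding 32-bit word.
        low = h & 0xFFFFFFFF
        return [1 << (((low * s) & 0xFFFFFFFF) >> 27) for s in SALTS]

    def insert_hash(self, h):
        h &= 0xFFFFFFFFFFFFFFFF
        base = self._block_base(h)
        for i, mask in enumerate(self._masks(h)):
            self.words[base + i] |= mask

    def might_contain_hash(self, h):
        h &= 0xFFFFFFFFFFFFFFFF
        base = self._block_base(h)
        return all(self.words[base + i] & mask
                   for i, mask in enumerate(self._masks(h)))
```

A probe returns `False` only if at least one of the 8 bits is missing, so an inserted hash is never reported absent. False positives are still possible, as with any Bloom filter.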
Sizing
Where the bloom blob sits in a Mosaic file
For each row group, the writer first emits the bucket data region (K buckets, all columns, compressed as today), and immediately after it emits the bloom region (M blobs, one per bloomed column). The two regions are siblings inside the row group. Bloom blobs are never inlined into a column chunk's bytes, and they are not aggregated into a single trailing section at the end of the file. This means readers can range read a bloom blob without pulling any column data into cache, and writers do not need to seek back over earlier row groups.
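The write order above can be sketched roughly as follows. This is an illustration under assumptions, not the writer in this PR: the function names are hypothetical, bucket payloads are treated as opaque already-compressed bytes, blobs are emitted in column-index order, and offsets are taken as positions in the output stream.

```python
import io


def write_row_group(out, bucket_payloads, bloom_blobs):
    """Append one row group to `out` in a single forward pass.

    bucket_payloads: list of already-compressed bucket bytes (assumed opaque).
    bloom_blobs: dict mapping column index -> serialized bloom bytes.
    Returns (column_index, offset, total_bytes) for each bloom blob written.
    """
    # 1. Bucket data region: all K buckets, written back to back.
    for payload in bucket_payloads:
        out.write(payload)

    # 2. Bloom region: one blob per bloomed column, right after the data.
    entries = []
    for column_index in sorted(bloom_blobs):  # ordering is an assumption
        blob = bloom_blobs[column_index]
        entries.append((column_index, out.tell(), len(blob)))
        out.write(blob)
    return entries


def read_bloom(file_bytes, entry):
    """Range read a single bloom blob without touching bucket data."""
    _, offset, total_bytes = entry
    return file_bytes[offset:offset + total_bytes]
```

Because the writer only appends, it never seeks back. A reader that has the `(offset, total_bytes)` pair can fetch the blob with one range request.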
The row group index tail gains one new section per bloom: column index, bloom offset, bloom total bytes (all varint encoded, ~12 bytes per blob). Row groups with no blooms cost one byte.
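To make the size estimate concrete, here is a small sketch of how such an index section could be encoded. It rests on two assumptions the description does not confirm: the varints are unsigned LEB128, and the section starts with a varint count of blobs (which would explain the one-byte cost when there are no blooms). The function names are hypothetical.

```python
def encode_varint(n):
    """Unsigned LEB128: 7 payload bits per byte, high bit = 'more follows'."""
    if n < 0:
        raise ValueError("varint must be non-negative")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def encode_bloom_index(entries):
    """entries: iterable of (column_index, bloom_offset, bloom_total_bytes).

    Assumed layout: varint count, then three varints per blob.
    """
    entries = list(entries)
    out = bytearray(encode_varint(len(entries)))
    for column_index, offset, total_bytes in entries:
        out += encode_varint(column_index)
        out += encode_varint(offset)
        out += encode_varint(total_bytes)
    return bytes(out)
```

Under these assumptions, a row group with no blooms encodes as the single byte `0x00`, and an entry for column 3 at offset 2^32 with a 1 MiB blob takes 1 + 5 + 3 = 9 bytes, in line with the ~12 bytes per blob quoted above.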