Add Support for Bloom Filter for mosaic files by ArnavBalyan · Pull Request #47 · apache/paimon-mosaic

ArnavBalyan · 2026-05-27T08:47:44Z

Add bloom filter support. Today, every scan or point-lookup predicate pushed down to the reader forces a full row group scan.
This PR adds a bloom filter to each column chunk, which can be used to check whether a value may be present in that column chunk.
There are 2 main additions:
1. Each column chunk gets its own bloom filter.
2. Each row group gets additional metadata recording the location of its bloom filters.
Benchmark: NYC Taxi dataset (https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
File size: without bloom: 51.10 MiB, with bloom: 56.85 MiB
A small benchmark was created to measure the bytes skipped on a sample query:
- Query I/O without bloom: 51.10 MiB
- Query I/O with bloom: 3.19 KB (roughly 16000x reduction for an absent-value lookup)

Bloom filter structure:

Changes in the file format:

Algorithm

We use the Split Block Bloom Filter (SBBF) variant. The bitset is partitioned into 256-bit blocks, each made of eight 32-bit words. Each insert touches exactly one block and sets one bit per word inside that block (8 bits total), so every probe is a single cache line read followed by 8 mask-and-test operations.
Hashing uses xxHash64 with seed 0. The high half of the hash picks the block (Lemire fast range), and the low half drives the 8 bit positions via 8 hardcoded salts. The salts and seed match the Apache Parquet bloom spec, so the bit pattern produced by a given input is byte-identical across the two formats.
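
For reviewers who want to see the probe path end to end, here is a minimal sketch of the block selection and mask logic described above. It is not the code in this PR: the class name `SplitBlockBloom` and the helper `hash64` are hypothetical, and `hash64` uses BLAKE2b from the Python standard library as a stand-in because xxHash64 is not in the stdlib, so the resulting bit patterns will not match Mosaic or Parquet. The eight salts are the ones published in the Apache Parquet bloom filter spec.

```python
import hashlib

# Salt constants from the Apache Parquet split block bloom filter spec.
SALTS = (
    0x47B6137B, 0x44974D91, 0x8824AD5B, 0xA2B7289D,
    0x705495C7, 0x2DF1424B, 0x9EFC4947, 0x5C6BFB31,
)
WORDS_PER_BLOCK = 8  # 8 x 32-bit words = one 256-bit block


def hash64(value: bytes) -> int:
    """Stand-in 64-bit hash (assumption: the real writer uses xxHash64, seed 0)."""
    return int.from_bytes(hashlib.blake2b(value, digest_size=8).digest(), "little")


class SplitBlockBloom:
    """Hypothetical in-memory split block bloom filter."""

    def __init__(self, num_blocks: int) -> None:
        self.num_blocks = num_blocks
        self.words = [0] * (num_blocks * WORDS_PER_BLOCK)

    def _block_and_mask(self, h: int):
        # High 32 bits pick the block via Lemire's fast range reduction.
        block = ((h >> 32) * self.num_blocks) >> 32
        # Low 32 bits, multiplied by each salt, pick one bit in each word.
        low = h & 0xFFFFFFFF
        mask = [1 << (((low * salt) & 0xFFFFFFFF) >> 27) for salt in SALTS]
        return block, mask

    def insert(self, value: bytes) -> None:
        block, mask = self._block_and_mask(hash64(value))
        base = block * WORDS_PER_BLOCK
        for i, bit in enumerate(mask):
            self.words[base + i] |= bit

    def might_contain(self, value: bytes) -> bool:
        block, mask = self._block_and_mask(hash64(value))
        base = block * WORDS_PER_BLOCK
        return all(self.words[base + i] & bit for i, bit in enumerate(mask))
```

A `False` from `might_contain` is definitive, which is what lets the reader skip the column chunk; a `True` only means the value may be present and the chunk still has to be read.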

Sizing

The writer takes two knobs per column: an NDV (number of distinct values) estimate and a target false positive probability (fpp, default 0.01). It picks the smallest power-of-two number of blocks that satisfies the target. At fpp 0.01 this works out to ~10.5 bits per inserted value, so a column with 100k expected NDV ends up with around 130 KiB of bloom data per row group.
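
As a rough cross-check of the numbers above, the sketch below sizes a filter from NDV and fpp. It assumes the closed-form bit estimate commonly used for split block filters with 8 bits set per value, `bits = -8 * ndv / ln(1 - fpp^(1/8))`, followed by rounding the block count up to a power of two. The function name `bloom_size_bytes` is hypothetical and the writer in this PR may compute the size differently.

```python
import math

BLOCK_BITS = 256
BLOCK_BYTES = BLOCK_BITS // 8


def bloom_size_bytes(ndv: int, fpp: float = 0.01) -> int:
    """Estimate bloom size in bytes (assumed formula, not taken from the PR)."""
    if ndv <= 0 or not 0.0 < fpp < 1.0:
        raise ValueError("ndv must be positive and fpp must be in (0, 1)")
    # Bits needed for the target fpp when each value sets 8 bits in one block.
    bits = -8.0 * ndv / math.log(1.0 - fpp ** (1.0 / 8.0))
    blocks = max(1, math.ceil(bits / BLOCK_BITS))
    # Round the block count up to the next power of two.
    blocks_pow2 = 1 << (blocks - 1).bit_length()
    return blocks_pow2 * BLOCK_BYTES


size = bloom_size_bytes(100_000, 0.01)
print(size, size * 8 / 100_000)  # bytes, bits per value
```

Under these assumptions, 100k NDV at fpp 0.01 rounds up to 4096 blocks, i.e. 131072 bytes (128 KiB) or about 10.5 bits per value, which lines up with the figures quoted above.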

Where the bloom blob sits in a Mosaic file

For each row group, the writer first emits the bucket data region (K buckets, all columns, compressed as today), and immediately after it emits the bloom region (M blobs, one per bloomed column). The two regions are siblings inside the row group. Bloom blobs are never inlined into a column chunk's bytes, and they are not aggregated into a single trailing section at the end of the file. This means readers can range read a bloom blob without pulling any column data into cache, and writers do not need to seek back over earlier row groups.
The row group index tail gains one new section per bloom: column index, bloom offset, bloom total bytes (all varint encoded, ~12 bytes per blob). Row groups with no blooms cost one byte.
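
To make the index tail entries concrete, here is a small sketch of how a (column index, bloom offset, bloom total bytes) triple could be varint encoded. It assumes unsigned LEB128 varints and a leading varint count, which is one way a row group with no blooms would cost exactly one byte. The actual byte layout in this PR may differ, and the function names are hypothetical.

```python
from typing import List, Tuple

# (column index, bloom offset, bloom total bytes)
BloomEntry = Tuple[int, int, int]


def encode_varint(n: int) -> bytes:
    """Unsigned LEB128: 7 payload bits per byte, high bit marks continuation."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_bloom_index(entries: List[BloomEntry]) -> bytes:
    # Assumed layout: entry count, then one triple per bloom blob.
    out = bytearray(encode_varint(len(entries)))
    for column, offset, size in entries:
        out += encode_varint(column) + encode_varint(offset) + encode_varint(size)
    return bytes(out)


def decode_bloom_index(buf: bytes, pos: int = 0) -> Tuple[List[BloomEntry], int]:
    count, pos = decode_varint(buf, pos)
    entries = []
    for _ in range(count):
        column, pos = decode_varint(buf, pos)
        offset, pos = decode_varint(buf, pos)
        size, pos = decode_varint(buf, pos)
        entries.append((column, offset, size))
    return entries, pos
```

With the offset and size in hand, a reader can issue a single range read for the bloom blob and decide whether to fetch the bucket data region at all.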

ArnavBalyan · 2026-05-27T08:50:53Z

cc @JingsongLi @XiaoHongbo-Hope ty

ArnavBalyan · 2026-05-28T13:36:17Z

Fixed format issue

ArnavBalyan · 2026-05-31T13:00:09Z

cc @JingsongLi @XiaoHongbo-Hope gentle reminder could you pls take a look when possible thanks!

XiaoHongbo-Hope · 2026-05-31T13:01:32Z

> cc @JingsongLi @XiaoHongbo-Hope gentle reminder could you pls take a look when possible thanks!

sure

XiaoHongbo-Hope · 2026-05-31T15:00:31Z

+1

ArnavBalyan · 2026-06-01T01:55:26Z

Thanks a lot! cc @JingsongLi could you also pls review when time permits

update

7292808

ArnavBalyan changed the title ~~[POC] Add Support for Bloom Filter for Mosaic~~ Add Support for Bloom Filter for Mosaic May 27, 2026

ArnavBalyan changed the title ~~Add Support for Bloom Filter for Mosaic~~ Add Support for Bloom Filter May 27, 2026

update

c2c22b4

ArnavBalyan changed the title ~~Add Support for Bloom Filter~~ Add Support for Bloom Filter for mosaic files May 28, 2026

ArnavBalyan wants to merge 2 commits into apache:main from ArnavBalyan:arnavb/bloom-filter
