
feat(parquet): parquet page store prototype#9661

Open
kszucs wants to merge 8 commits into apache:main from kszucs:page-store

Conversation

@kszucs
Member

@kszucs kszucs commented Apr 3, 2026

page_store_concept

Which issue does this PR close?

This is a prototype opened for early discussion, but it aims to close #9592.

Rationale for this change

Storing multiple versions of a dataset is expensive. CDC-based page deduplication
can eliminate most of that redundancy with no special storage backend required.

What changes are included in this PR?

  • parquet::arrow::page_store: PageStoreWriter and PageStoreReader
    • Writer re-encodes pages using CDC and writes each as a {blake3}.page blob
      into a shared store directory. Identical pages across files are stored once.
    • Reader reassembles data from a lightweight manifest-only Parquet file.
  • parquet-page-store CLI (page_store, cli features): write, read, reconstruct
  • parquet/examples/page_store_dedup/ — end-to-end demo on real data (OpenHermes-2.5)

On four variants of an 800 MB dataset (filtered, augmented, appended): 3.1 GB → 563 MB (82% reduction, 5.6×).
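The write-if-absent step behind that reduction can be sketched in a few lines. This is a hypothetical illustration, not the PR's API: `store_page` is an invented name, and std's `DefaultHasher` stands in for blake3 (which is not in the standard library) purely so the example is self-contained.

```rust
// Hypothetical sketch of content-addressed page storage: hash the page
// bytes, write the blob only if a page with that hash is not already in
// the store. Identical pages across files collapse to one blob.
use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::Hasher;
use std::path::{Path, PathBuf};

fn store_page(store_dir: &Path, page_bytes: &[u8]) -> std::io::Result<PathBuf> {
    let mut hasher = DefaultHasher::new(); // stand-in for blake3
    hasher.write(page_bytes);
    let path = store_dir.join(format!("{:016x}.page", hasher.finish()));
    if !path.exists() {
        // First occurrence of this content: persist it.
        fs::write(&path, page_bytes)?;
    }
    // Repeated content returns the existing blob path, storing nothing new.
    Ok(path)
}

fn main() -> std::io::Result<()> {
    let dir = std::env::temp_dir().join("page_store_demo");
    fs::create_dir_all(&dir)?;
    let a = store_page(&dir, b"page-data")?;
    let b = store_page(&dir, b"page-data")?; // deduplicated: same path as `a`
    assert_eq!(a, b);
    println!("stored at {}", a.display());
    Ok(())
}
```

The CDC re-encoding matters because it makes page boundaries content-defined, so an insert or filter early in a file shifts bytes without changing most downstream page hashes.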

Are these changes tested?

Yes — round-trips, multi-page, multi-row-group, nested types, cross-file dedup, page integrity, and reader error cases.

Are there any user-facing changes?

Additive only, gated behind the page_store feature flag (off by default). The API and manifest format are explicitly unstable in this PR.

kszucs added 4 commits April 3, 2026 19:20
…pache#9637)

The CDC chunker's value_offset diverged from actual leaf array positions
when null list entries had non-empty child offset ranges (valid per the
Arrow columnar format spec). This caused slice_for_chunk to produce
incorrect non_null_indices, leading to an out-of-bounds panic in
write_mini_batch.

Track non-null value counts (nni) separately from leaf slot counts in
the chunker, and use them in slice_for_chunk to correctly index into
non_null_indices regardless of gaps in the leaf array.
@github-actions github-actions bot added the parquet Changes to the parquet crate label Apr 3, 2026
@kszucs
Member Author

kszucs commented Apr 3, 2026

depends on #9644

@kszucs
Member Author

kszucs commented Apr 6, 2026

@alamb I opened it for further discussion; we could integrate it more deeply to support a generic store instead of writing/reading local files directly.

cc @etseidl

@martindurant

Some general comments on this effort, without details of the implementation.

First, I think this is a GREAT idea, and something I wish I had had the time to start myself (the kerchunk project had aspirations for byte-range redirection in parquet). It is high time that the parquet format was given modern features beyond what a "layer over parquet" (i.e., iceberg) can do.

  • it should be mentioned somewhere, that the idea of the PR was (partly?) inspired by the blockwise deduplication possible via Xet. I don't know of another storage system that works quite the same; ipfs has content-addressing at the file level, for instance.
  • the metadata files are not legal parquet files; they cannot be loaded because they refer to byte ranges that don't exist. This means that none of the data is accessible without the specific code in here, and that would pose a problem for adoption, I think. The metadata files essentially give the schema evolution in a similar way to iceberg.
  • I think the format of the .page files is literally the binary of each page (header + def + rep + compress(values)). That should be made clearer. These are also not valid parquet data by themselves.
  • page statistics are (I think) stored only in the pages, but skipping the load of a page file would be a great thing to be able to do, so it might make sense to surface these in a central place and even add other per-page information to allow skipping pages when reading
  • I could imagine combining the latter two points: the parquet files could include all the pages. Right now, the hash is the filename, but you could have the pages in a real parquet file and only need to store the hash->offset,size information to get the benefits of dedup, but still allow the data to be read directly. You would need to make use of the ColumnChunk.file_path value. At least, I think I see a possibility.
  • (aside) where there is a structure like list[record[required: field1, required: field2]], the def and rep levels for the two leaf fields must be identical, so there are other duplications in the data; the reader shouldn't even need to load them a second time, since the offset/index arrays are the same.
  • dedup is even more important for remote storage. I realise you might be operating on locally mounted remotes, but I think direct interaction with remote storage and byte ranges should be considered. For instance, listing the 3500 page files of the example pipeline would pose a significant runtime cost.
  • the idea here might work even better for the feather2 format and perhaps others. For feather2, in-file pointers/links are the norm (flatbuffers style) and instead of def and rep levels, you store the actual validity/index arrays, so they can be directly deduplicated separately. I think feather2 "chunk" sizes are probably more like row-groups than pages though.
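The page-skipping point above can be made concrete with a sketch of a manifest entry that surfaces per-page statistics centrally, so a reader can decide not to fetch a .page blob at all. Every field and function name here is invented for illustration; the PR's manifest format is explicitly unstable and may look nothing like this.

```rust
// Hypothetical central manifest entry carrying per-page statistics,
// in the spirit of parquet's column-index min/max. With these surfaced
// outside the blobs, a reader can skip fetching pages entirely.
#[derive(Debug)]
struct PageEntry {
    blake3: [u8; 32],      // content address of the page blob
    compressed_size: u64,  // bytes that would have to be fetched
    num_values: u64,
    min: Option<Vec<u8>>,  // encoded bounds, as in parquet Statistics
    max: Option<Vec<u8>>,
}

/// Returns true when the page provably cannot contain `needle`,
/// so its blob need not be read at all.
fn can_skip(entry: &PageEntry, needle: &[u8]) -> bool {
    match (&entry.min, &entry.max) {
        (Some(min), Some(max)) => needle < min.as_slice() || needle > max.as_slice(),
        _ => false, // no stats: must read the page to decide
    }
}

fn main() {
    let entry = PageEntry {
        blake3: [0; 32],
        compressed_size: 4096,
        num_values: 1000,
        min: Some(b"apple".to_vec()),
        max: Some(b"mango".to_vec()),
    };
    assert!(can_skip(&entry, b"zebra"));    // outside [apple, mango]
    assert!(!can_skip(&entry, b"banana"));  // inside the bounds: must read
    println!("skip zebra: {}", can_skip(&entry, b"zebra"));
}
```

A central statistics table like this would also be a natural place for the hash → (offset, size) mapping suggested above, if pages were packed into a real parquet file.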

@alamb
Contributor

alamb commented Apr 9, 2026

I wonder if this is something interesting to @XiangpengHao who is working on some other ideas related to ObjectStore APIs to existing systems (e.g. files / io_uring) https://github.com/XiangpengHao/t4


Development

Successfully merging this pull request may close these issues.

Create a "end to end" Content Addressable Storage / CDC Chunking Example

3 participants