
feat(parquet): parquet page store prototype#9661

Open
kszucs wants to merge 8 commits into apache:main from kszucs:page-store

Conversation

@kszucs
Member

@kszucs kszucs commented Apr 3, 2026

page_store_concept

Which issue does this PR close?

This is a prototype opened for early discussion, but it aims to close #9592.

Rationale for this change

Storing multiple versions of a dataset is expensive. CDC-based page deduplication
can eliminate most of that redundancy with no special storage backend required.

What changes are included in this PR?

  • parquet::arrow::page_store: PageStoreWriter and PageStoreReader
    • Writer re-encodes pages using CDC and writes each as a {blake3}.page blob
      into a shared store directory. Identical pages across files are stored once.
    • Reader reassembles data from a lightweight manifest-only Parquet file.
  • parquet-page-store CLI (page_store, cli features): write, read, reconstruct
  • parquet/examples/page_store_dedup/ — end-to-end demo on real data (OpenHermes-2.5)

On four variants of an 800 MB dataset (filtered, augmented, appended): 3.1 GB → 563 MB (82% reduction, 5.6×).
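The write-if-absent step behind that reduction can be sketched in a few lines. This is a hypothetical illustration, not the PR's API: `store_page` is an invented name, and std's `DefaultHasher` stands in for blake3 (which is not in the standard library) purely so the example is self-contained.

```rust
// Hypothetical sketch of content-addressed page storage: hash the page
// bytes, write the blob only if a page with that hash is not already in
// the store. Identical pages across files collapse to one blob.
use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::Hasher;
use std::path::{Path, PathBuf};

fn store_page(store_dir: &Path, page_bytes: &[u8]) -> std::io::Result<PathBuf> {
    let mut hasher = DefaultHasher::new(); // stand-in for blake3
    hasher.write(page_bytes);
    let path = store_dir.join(format!("{:016x}.page", hasher.finish()));
    if !path.exists() {
        // First occurrence of this content: persist it.
        fs::write(&path, page_bytes)?;
    }
    // Repeated content returns the existing blob path, storing nothing new.
    Ok(path)
}

fn main() -> std::io::Result<()> {
    let dir = std::env::temp_dir().join("page_store_demo");
    fs::create_dir_all(&dir)?;
    let a = store_page(&dir, b"page-data")?;
    let b = store_page(&dir, b"page-data")?; // deduplicated: same path as `a`
    assert_eq!(a, b);
    println!("stored at {}", a.display());
    Ok(())
}
```

The CDC re-encoding matters because it makes page boundaries content-defined, so an insert or filter early in a file shifts bytes without changing most downstream page hashes.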

Are these changes tested?

Yes — round-trips, multi-page, multi-row-group, nested types, cross-file dedup, page integrity, and reader error cases.

Are there any user-facing changes?

Additive only, gated behind the page_store feature flag (off by default). The API and manifest format are explicitly unstable in this PR.

kszucs added 4 commits April 3, 2026 19:20
…pache#9637)

The CDC chunker's value_offset diverged from actual leaf array positions
when null list entries had non-empty child offset ranges (valid per the
Arrow columnar format spec). This caused slice_for_chunk to produce
incorrect non_null_indices, leading to an out-of-bounds panic in
write_mini_batch.

Track non-null value counts (nni) separately from leaf slot counts in
the chunker, and use them in slice_for_chunk to correctly index into
non_null_indices regardless of gaps in the leaf array.
@github-actions github-actions bot added the parquet Changes to the parquet crate label Apr 3, 2026
@kszucs
Member Author

kszucs commented Apr 3, 2026

depends on #9644

@kszucs
Member Author

kszucs commented Apr 6, 2026

@alamb I opened it for further discussion; we could integrate it more deeply to support a generic store instead of writing/reading local files directly.

cc @etseidl

@martindurant

Some general comments on this effort, without details of the implementation.

First, I think this is a GREAT idea, and something I wish I had had the time to start myself (the kerchunk project had aspirations for byte-range redirection in parquet). It is high time that the parquet format was given modern features beyond what a "layer over parquet" (i.e., iceberg) can do.

  • it should be mentioned somewhere, that the idea of the PR was (partly?) inspired by the blockwise deduplication possible via Xet. I don't know of another storage system that works quite the same; ipfs has content-addressing at the file level, for instance.
  • the metadata files are not legal parquet files; they cannot be loaded because they refer to byte ranges that don't exist. This means that none of the data is accessible without the specific code in here, and that would pose a problem for adoption, I think. The metadata files essentially give the schema evolution in a similar way to iceberg.
  • I think the format of the .page files is literally the binary of each page (header + def + rep + compress(values)). That should be made clearer. These are also not valid parquet data by themselves.
  • page statistics are (I think) stored only in the pages, but skipping the load of a page file would be a great thing to be able to do, so it might make sense to surface these in a central place and even add other per-page information to allow skipping pages when reading
  • I could imagine combining the latter two points: the parquet files could include all the pages. Right now, the hash is the filename, but you could have the pages in a real parquet file and only need to store the hash->offset,size information to get the benefits of dedup, but still allow the data to be read directly. You would need to make use of the ColumnChunk.file_path value. At least, I think I see a possibility.
  • (aside) where there is a structure like list[record[required: field1, required: field2]], the def and rep levels for the two leaf fields must be identical, so there are other duplications in the data; the reader shouldn't even need to load them a second time, since the offset/index arrays are the same.
  • dedup is even more important for remote storage. I realise you might be operating on locally mounted remotes, but I think direct interaction with remote storage and byte ranges should be considered. For instance, listing the 3500 page files of the example pipeline would pose a significant runtime cost.
  • the idea here might work even better for the feather2 format and perhaps others. For feather2, in-file pointers/links are the norm (flatbuffers style) and instead of def and rep levels, you store the actual validity/index arrays, so they can be directly deduplicated separately. I think feather2 "chunk" sizes are probably more like row-groups than pages though.
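The page-skipping point above can be made concrete with a sketch of a manifest entry that surfaces per-page statistics centrally, so a reader can decide not to fetch a .page blob at all. Every field and function name here is invented for illustration; the PR's manifest format is explicitly unstable and may look nothing like this.

```rust
// Hypothetical central manifest entry carrying per-page statistics,
// in the spirit of parquet's column-index min/max. With these surfaced
// outside the blobs, a reader can skip fetching pages entirely.
#[derive(Debug)]
struct PageEntry {
    blake3: [u8; 32],      // content address of the page blob
    compressed_size: u64,  // bytes that would have to be fetched
    num_values: u64,
    min: Option<Vec<u8>>,  // encoded bounds, as in parquet Statistics
    max: Option<Vec<u8>>,
}

/// Returns true when the page provably cannot contain `needle`,
/// so its blob need not be read at all.
fn can_skip(entry: &PageEntry, needle: &[u8]) -> bool {
    match (&entry.min, &entry.max) {
        (Some(min), Some(max)) => needle < min.as_slice() || needle > max.as_slice(),
        _ => false, // no stats: must read the page to decide
    }
}

fn main() {
    let entry = PageEntry {
        blake3: [0; 32],
        compressed_size: 4096,
        num_values: 1000,
        min: Some(b"apple".to_vec()),
        max: Some(b"mango".to_vec()),
    };
    assert!(can_skip(&entry, b"zebra"));    // outside [apple, mango]
    assert!(!can_skip(&entry, b"banana"));  // inside the bounds: must read
    println!("skip zebra: {}", can_skip(&entry, b"zebra"));
}
```

A central statistics table like this would also be a natural place for the hash → (offset, size) mapping suggested above, if pages were packed into a real parquet file.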

@alamb
Contributor

alamb commented Apr 9, 2026

I wonder if this is something interesting to @XiangpengHao who is working on some other ideas related to ObjectStore APIs to existing systems (e.g. files / io_uring) https://github.com/XiangpengHao/t4


Development

Successfully merging this pull request may close these issues.

Create a "end to end" Content Addressable Storage / CDC Chunking Example

3 participants