feat(parquet): add content defined chunking for arrow writer#9450

Open
kszucs wants to merge 15 commits intoapache:mainfrom
kszucs:content-defined-chunking

Conversation


@kszucs kszucs commented Feb 20, 2026

Which issue does this PR close?

  • Closes #NNN.

Rationale for this change

Rust implementation of apache/arrow#45360

Traditional Parquet writing splits data pages at fixed sizes, so a single inserted or deleted row shifts all subsequent pages, causing nearly every byte to be re-uploaded to content-addressable storage (CAS) systems. CDC instead determines page boundaries via a rolling gearhash over column values, so unchanged data produces identical pages across different writes, enabling storage cost reductions and faster uploads.
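To illustrate the core idea, here is a minimal, self-contained sketch of gearhash boundary selection over a byte stream. All parameters here (the derived table, the toy min/max sizes, the mask) are illustrative only; the chunker in this PR operates on column values and ships its own fixed gear table:

```rust
const MIN_CHUNK: usize = 4;
const MAX_CHUNK: usize = 16;
const MASK: u64 = (1 << 6) - 1; // toy mask: boundary when low 6 bits are zero

// Deterministic pseudo-random gear table (a real implementation ships a
// precomputed constant table; this derivation is just for the sketch).
fn gear_table() -> [u64; 256] {
    let mut t = [0u64; 256];
    let mut x: u64 = 0x9E3779B97F4A7C15;
    for e in t.iter_mut() {
        x = x.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
        *e = x;
    }
    t
}

/// Return the cut positions chosen by the rolling gearhash: a boundary is
/// placed when the hash matches the mask (after MIN_CHUNK) or MAX_CHUNK is hit.
fn chunk_boundaries(data: &[u8]) -> Vec<usize> {
    let table = gear_table();
    let mut boundaries = vec![];
    let (mut hash, mut len) = (0u64, 0usize);
    for (i, &b) in data.iter().enumerate() {
        hash = (hash << 1).wrapping_add(table[b as usize]);
        len += 1;
        if (len >= MIN_CHUNK && hash & MASK == 0) || len >= MAX_CHUNK {
            boundaries.push(i + 1);
            hash = 0;
            len = 0;
        }
    }
    boundaries
}

fn main() {
    let data: Vec<u8> = (0u8..=255).cycle().take(1024).collect();
    let cuts = chunk_boundaries(&data);
    // Identical input always yields identical boundaries, which is what
    // makes unchanged data deduplicate across rewrites.
    assert_eq!(cuts, chunk_boundaries(&data));
    println!("{} chunks", cuts.len());
}
```

Because boundaries depend only on a sliding window of recent content, an insertion perturbs the cuts near the edit and the hash resynchronizes shortly after, leaving the remaining chunks byte-identical.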

See more details in https://huggingface.co/blog/parquet-cdc

The original C++ implementation apache/arrow#45360

Evaluation tool: https://github.com/huggingface/dataset-dedupe-estimator — I have already integrated this PR there to verify that deduplication effectiveness is on par with parquet-cpp (lower is better):

[chart: deduplicated size comparison against parquet-cpp]

What changes are included in this PR?

  • Content-defined chunker at parquet/src/column/chunker/
  • Arrow writer integration in ArrowColumnWriter
  • Writer properties via the CdcOptions struct (min_chunk_size, max_chunk_size, norm_level)
  • ColumnDescriptor: added a repeated_ancestor_def_level field for iterating nested field values

Are these changes tested?

Yes — unit tests are located in cdc.rs and ported from the C++ implementation.

Are there any user-facing changes?

New experimental API, disabled by default — no behavior change for existing code:

// Simple toggle (256 KiB min, 1 MiB max, norm_level 0)
let props = WriterProperties::builder()
    .set_content_defined_chunking(true)
    .build();

// Explicit CDC parameters
let props = WriterProperties::builder()
    .set_cdc_options(CdcOptions { min_chunk_size: 128 * 1024, max_chunk_size: 512 * 1024, norm_level: 1 })
    .build();

@github-actions github-actions bot added the parquet Changes to the parquet crate label Feb 20, 2026
@kszucs kszucs marked this pull request as ready for review February 25, 2026 08:12
@kszucs kszucs requested review from alamb and etseidl February 25, 2026 11:19
let mut path: Vec<String> = vec![];
path.extend(path_so_far.iter().copied().map(String::from));
- leaves.push(Arc::new(ColumnDescriptor::new(
+ let mut desc = ColumnDescriptor::new(
I didn't want to break the API of ColumnDescriptor, so setting repeated_ancestor_def_level below.

@kszucs kszucs Feb 25, 2026


We don't necessarily need to store the codegen script in the repository. Alternatively, we could just reference https://github.com/apache/arrow/blob/main/cpp/src/parquet/chunker_internal_generated.h as the source for cdc_generated.rs; it will likely never be regenerated anyway.
