Skip to content

Conversation

@rok
Copy link
Member

@rok rok commented Sep 10, 2025

@github-actions github-actions bot added the parquet Changes to the parquet crate label Sep 10, 2025
@rok rok force-pushed the plaintext_statistics branch from 241c532 to a5a44f4 Compare September 23, 2025 19:50
@rok rok force-pushed the plaintext_statistics branch 2 times, most recently from 3a7de46 to e2ebc45 Compare October 28, 2025 09:14
@rok rok force-pushed the plaintext_statistics branch 2 times, most recently from 03e52aa to 5459ffc Compare November 4, 2025 19:54
@rok rok force-pushed the plaintext_statistics branch 2 times, most recently from 078b622 to 686b38f Compare November 12, 2025 00:44
@rok rok marked this pull request as ready for review November 19, 2025 14:11
@rok rok requested a review from adamreeve November 19, 2025 16:37
Copy link
Contributor

@adamreeve adamreeve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good thanks Rok, I think this can just be tidied up a bit to reduce some duplication

let is_footer_encrypted = file_encryptor.properties().encrypt_footer();

if !is_footer_encrypted {
// Temporarily clear crypto_metadata so statistics get included in encrypted blob
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need this step if we check for encrypted_column_metadata instead of crypto_metadata in serialize_column_meta_data (https://github.com/apache/arrow-rs/pull/8305/files#r2543948275)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed and removed.

@rok
Copy link
Member Author

rok commented Dec 3, 2025

Thanks for the review @adamreeve and nice test suggestions, more explicitly presented cases help reason about this.
I think I covered all comments and this is ready for another round of review.

Copy link
Contributor

@adamreeve adamreeve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks Rok! Just one minor comment

@rok
Copy link
Member Author

rok commented Dec 3, 2025

@etseidl would you have time to give this a quick look before we merge?

@rok rok requested a review from etseidl December 3, 2025 09:07
Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rok and @adamreeve, this looks good to me. Just one question (because I'm too lazy to read the spec >.<).

if let Some(page_encoding_stats) = column_chunk.page_encoding_stats() {
last_field_id = page_encoding_stats.write_thrift_field(w, 13, last_field_id)?;

if should_write_column_stats(column_chunk) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the idea here to only write the required bits of the column meta_data so readers won't complain? If so, then should all of the fields below also be skipped?

Copy link
Contributor

@adamreeve adamreeve Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I agree all the below fields should be skipped too.

I checked what the C++ implementation does and it's actually a bit different between the plaintext footer case and when the footer is encrypted but the column is encrypted with a different key: https://github.com/apache/arrow/blob/cbd36b817fc77812f8df1a15bf24314de3b27f29/cpp/src/parquet/metadata.cc#L1748-L1755.

When there's a plaintext footer, only the statistics and encoding_stats are stripped out of the unencrypted metadata, similar to what we're proposing here. But when the footer is also encrypted, it's assumed that the reader can handle when the whole metadata field isn't set so this is completely skipped, which is what was done before.

It might make sense to match what C++ does and keep the previous behaviour of excluding the full ColumnChunk metadata field when the footer is encrypted, and check that the reader can handle this when it doesn't have the column key. That should have the benefit of not increasing the footer size too much. And then in the plaintext footer case, also exclude the other fields below.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch @etseidl !
I've moved the other sensitive metadata into the conditional and added tests. (except for geospatial since it requires #[cfg(feature = "geospatial")] and it seems straightforward enough to not cover here?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rok, looks good now. I'll leave it to @adamreeve if we want to revert to excluding the meta_data entirely when the footer is encrypted.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per @adamreeve suggestion this now omits metadata in case encrypted metadata is present and strips out stats if it is not.

Comment on lines 761 to 765
// We always encrypt column metadata separately and store in
// encrypted_column_metadata. This allows skipping the plaintext meta_data
// field in the writer to reduce footer size. If encrypted_column_metadata
// were not set, the reader would not be able to read the column metadata.
Some(file_encryptor.get_footer_encryptor()?)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is the right approach. If the footer is already encrypted we shouldn't need to double-encrypt the column metadata. This goes against what the spec describes and might be incompatible with some readers, plus it will require extra decryption time when reading.

If it's too complex to wire through the information about whether the footer is a plaintext or not to serialize_column_meta_data, I think we should stick with what you had before, or go back to what we currently do in the main branch where the full column metadata is skipped rather than just the stats. My understanding is that writing the metadata in plaintext without the stats was for backwards compatibility with readers that expect the column metadata to always be present when encryption was first being added. But that might not be such a concern now if readers know that field is optional.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, forgot to reply here. I've introduced bool ColumnChunkMeteData.plaintext_footer_mode that now helps writer decide if it should write metadata or not. See change.

@rok rok force-pushed the plaintext_statistics branch from a6a3ec2 to 5e195dc Compare December 23, 2025 21:09
Copy link
Contributor

@adamreeve adamreeve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all looks good to me now thanks Rok, and it's nice that we match what the C++ library does.

@rok rok force-pushed the plaintext_statistics branch from bb283d4 to 2bf1e29 Compare January 7, 2026 14:35
@rok rok requested a review from etseidl January 7, 2026 14:42
@rok
Copy link
Member Author

rok commented Jan 7, 2026

Happy new year @etseidl !
I think we're happy with this now, feel free to do another review when you have time.

Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy New Year @rok! Looks good to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Parquet] Provide only encrypted column stats in plaintext footer

3 participants