-
Notifications
You must be signed in to change notification settings - Fork 1.1k
[Parquet] Provide only encrypted column stats in plaintext footer #8305
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
241c532 to
a5a44f4
Compare
3a7de46 to
e2ebc45
Compare
03e52aa to
5459ffc
Compare
078b622 to
686b38f
Compare
adamreeve
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good thanks Rok, I think this can just be tidied up a bit to reduce some duplication
parquet/src/file/metadata/writer.rs
Outdated
| let is_footer_encrypted = file_encryptor.properties().encrypt_footer(); | ||
|
|
||
| if !is_footer_encrypted { | ||
| // Temporarily clear crypto_metadata so statistics get included in encrypted blob |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need this step if we check for encrypted_column_metadata instead of crypto_metadata in serialize_column_meta_data (https://github.com/apache/arrow-rs/pull/8305/files#r2543948275)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed and removed.
|
Thanks for the review @adamreeve and nice test suggestions, more explicitly presented cases help reason about this. |
adamreeve
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, thanks Rok! Just one minor comment
|
@etseidl would you have time to give this a quick look before we merge? |
etseidl
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @rok and @adamreeve, this looks good to me. Just one question (because I'm too lazy to read the spec >.<).
| if let Some(page_encoding_stats) = column_chunk.page_encoding_stats() { | ||
| last_field_id = page_encoding_stats.write_thrift_field(w, 13, last_field_id)?; | ||
|
|
||
| if should_write_column_stats(column_chunk) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the idea here to only write the required bits of the column meta_data so readers won't complain? If so, then should all of the fields below also be skipped?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I agree all the below fields should be skipped too.
I checked what the C++ implementation does and it's actually a bit different between the plaintext footer case and when the footer is encrypted but the column is encrypted with a different key: https://github.com/apache/arrow/blob/cbd36b817fc77812f8df1a15bf24314de3b27f29/cpp/src/parquet/metadata.cc#L1748-L1755.
When there's a plaintext footer, only the statistics and encoding_stats are stripped out of the unencrypted metadata, similar to what we're proposing here. But when the footer is also encrypted, it's assumed that the reader can handle when the whole metadata field isn't set so this is completely skipped, which is what was done before.
It might make sense to match what C++ does and keep the previous behaviour of excluding the full ColumnChunk metadata field when the footer is encrypted, and check that the reader can handle this when it doesn't have the column key. That should have the benefit of not increasing the footer size too much. And then in the plaintext footer case, also exclude the other fields below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch @etseidl !
I've moved the other sensitive metadata into the conditional and added tests. (except for geospatial since it requires #[cfg(feature = "geospatial")] and it seems straightforward enough to not cover here?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @rok, looks good now. I'll leave it to @adamreeve if we want to revert to excluding the meta_data entirely when the footer is encrypted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per @adamreeve suggestion this now omits metadata in case encrypted metadata is present and strips out stats if it is not.
parquet/src/file/metadata/writer.rs
Outdated
| // We always encrypt column metadata separately and store in | ||
| // encrypted_column_metadata. This allows skipping the plaintext meta_data | ||
| // field in the writer to reduce footer size. If encrypted_column_metadata | ||
| // were not set, the reader would not be able to read the column metadata. | ||
| Some(file_encryptor.get_footer_encryptor()?) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is the right approach. If the footer is already encrypted we shouldn't need to double-encrypt the column metadata. This goes against what the spec describes and might be incompatible with some readers, plus it will require extra decryption time when reading.
If it's too complex to wire through the information about whether the footer is a plaintext or not to serialize_column_meta_data, I think we should stick with what you had before, or go back to what we currently do in the main branch where the full column metadata is skipped rather than just the stats. My understanding is that writing the metadata in plaintext without the stats was for backwards compatibility with readers that expect the column metadata to always be present when encryption was first being added. But that might not be such a concern now if readers know that field is optional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, forgot to reply here. I've introduced bool ColumnChunkMeteData.plaintext_footer_mode that now helps writer decide if it should write metadata or not. See change.
a6a3ec2 to
5e195dc
Compare
adamreeve
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This all looks good to me now thanks Rok, and it's nice that we match what the C++ library does.
…pted_column_metadata is present.
Co-authored-by: Adam Reeve <adreeve@gmail.com>
bb283d4 to
2bf1e29
Compare
|
Happy new year @etseidl ! |
etseidl
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy New Year @rok! Looks good to me.
Rationale for this change
See #8304 and especially #8304 (comment)