[Parquet]: GH-563: Make path_in_schema optional#9678
[Parquet]: GH-563: Make path_in_schema optional#9678etseidl wants to merge 2 commits intoapache:mainfrom
path_in_schema optional#9678Conversation
|
less bloat for the win! |
|
A lot of other parquet implementations require this field, due to their generated thrift parser, even if they do not actually use the field for anything. I would be totally in favor of deprecating and skipping the field, but maybe a more compatible alternative would be to write the field as an empty list instead. |
Well, parquet-java actually uses the field to populate its version of |
See also related mailing list discussion: |
alamb
left a comment
There was a problem hiding this comment.
TLDR is I think this is a great idea. I also think it woudl be ok to merge this into arrow-rs even if there is not a broader consensus on the mailing list that we should do it in the format itself
My thinking is that some usecases are basically using parquet with the same read/writer and compatibility with older java based implementations is not important. This is the same thing for some of the newer encodings too
Letting users choose between "better/faster/stronger" and "more compatible" I think is very much a good idea
| /// | ||
| /// The `path_in_schema` field in the Thrift metadata is redundant and wastes a great | ||
| /// deal of space. Parquet file footers can be made much smaller by omitting this field. | ||
| /// Because the field was originally a mandatory field, this property defaults to `true` |
There was a problem hiding this comment.
If we choose to go this way I think it would help to give some more context here on what types of readers would be affected (basically all the parquet-java based readers prior to late 2026)
We could also perhaps provide a link to the discussion: https://lists.apache.org/thread/czm2bk45wwtkhhpqxqvmx9dk5wkwk1kt
| /// Should the writer should emit the `path_in_schema` element of the | ||
| /// `ColumnMetaData` Thrift struct. |
There was a problem hiding this comment.
I think it is worth making it more apparently that this will cause incompatibilities with some older readers:
| /// Should the writer should emit the `path_in_schema` element of the | |
| /// `ColumnMetaData` Thrift struct. | |
| /// Should the writer should emit the `path_in_schema` element of the | |
| /// `ColumnMetaData` Thrift struct. WARNING: this will make the parquet | |
| /// file unreadable by some older parquet readers. See LINK HERE for details |
|
I added a list of related PRs to this issue's description: |
Which issue does this PR close?
none
Rationale for this change
This is a proof of concept implementation for apache/parquet-format#563
What changes are included in this PR?
Since version 57.0.0, this crate has been tolerant of a missing
path_in_schema. This PR adds options to cease writing the field as well. The option defaults to continuing to write the field.See related discussion on parquet mailing list: https://lists.apache.org/thread/czm2bk45wwtkhhpqxqvmx9dk5wkwk1kt
Are these changes tested?
Not as yet
Are there any user-facing changes?
No, this only adds an optional behavior change that defaults to no change
Related PRs
path_in_schemaoptional parquet-format#563ColumnMetaData.path_in_schemaoptional parquet-format#564path_in_schemaoptional parquet-java#3470