Parquet, Data: Implementation of ParquetFormatModel#15253
Conversation
data/src/main/java/org/apache/iceberg/data/GenericFormatModels.java
Outdated
Show resolved
Hide resolved
parquet/src/main/java/org/apache/iceberg/parquet/ParquetFormatModel.java
Outdated
Show resolved
Hide resolved
parquet/src/main/java/org/apache/iceberg/parquet/ParquetFormatModel.java
Outdated
Show resolved
Hide resolved
parquet/src/main/java/org/apache/iceberg/parquet/ParquetFormatModel.java
Show resolved
Hide resolved
5ed39b4 to
eeb9191
Compare
| FormatModelRegistry.register( | ||
| ParquetFormatModel.create( | ||
| Record.class, | ||
| Void.class, |
There was a problem hiding this comment.
I found Void.class for schema type weird when reviewing the Orc PR. I see Ryan has a previous comment that suggested Void.class.
I checked the uber PR #12298 . For Spark InternalRow, the schema type is Spark StructType. For Flink RowData, the schema type is RowType.
For Iceberg generic Record, shouldn't the schema type be Iceberg Schema or StructType?
There was a problem hiding this comment.
I initially used Schema, but I don’t have a strong preference since it isn’t used yet. Once we start using it, we can adjust the type as needed. The FormatModelRegistry expects the caller to supply the generic parameters, so until this is actually used, the specific type doesn’t make a difference.
There was a problem hiding this comment.
Iceberg GenericRecord requires an Iceberg Schema or StructType to create the object. Hence, I thought the schema type shouldn't be Void.class.
I do agree that this is an implementation code that we can easily update later. There is no interface change here.
There was a problem hiding this comment.
The problem with Schema is that this is almost certainly going to be null when passed to the read and write function because there isn't a separate engine schema. I'd prefer using Void in that case so that people don't actually pass something through. But this gets erased anyway so it doesn't really matter. Callers can do whatever they like anyway, I guess. That's the weird thing about having types that are erased and then re-added using @SuppressWarnings("unchecked").
There was a problem hiding this comment.
Actually, should we mandate that engineSchema is not passed for this? That would make this easier because it would be rejected by the Parquet/Record builders. I'm in favor, but not strongly.
There was a problem hiding this comment.
I see your point, but introducing a new flag across all format models solely to disable engine schema feels like unnecessary complexity to me. And if we later decide to support shredding in the generic models as well, Void would need to become Schema, and we’d have to undo all of these changes anyway.
Let me merge this as it is, and cycle back to this later if we see value in adding these validations
rdblue
left a comment
There was a problem hiding this comment.
Following on the thread from @stevenzwu, I'd probably disable passing engineSchema at all for this (or at least a non-null one) but I'm fine either way. Merge when you're happy with it. Thanks, @pvary!
|
Merged to main. |
Part of: #12298
Implementation of the new API: #12774
ParquetFormatModel and related tests