Parquet, Data: Implementation of ParquetFormatModel by pvary · Pull Request #15253 · apache/iceberg

pvary · 2026-02-07T08:37:59Z

Part of: #12298
Implementation of the new API: #12774

ParquetFormatModel and related tests

data/src/main/java/org/apache/iceberg/data/GenericFormatModels.java

data/src/test/java/org/apache/iceberg/data/TestGenericFormatModels.java

parquet/src/main/java/org/apache/iceberg/parquet/ParquetFormatModel.java

stevenzwu · 2026-02-11T05:25:18Z

data/src/main/java/org/apache/iceberg/data/GenericFormatModels.java

+    FormatModelRegistry.register(
+        ParquetFormatModel.create(
+            Record.class,
+            Void.class,


I found Void.class for schema type weird when reviewing the Orc PR. I see Ryan has a previous comment that suggested Void.class.

I checked the uber PR #12298 . For Spark InternalRow, the schema type is Spark StructType. For Flink RowData, the schema type is RowType.

For Iceberg generic Record, shouldn't the schema type be Iceberg Schema or StructType?

I initially used Schema, but I don’t have a strong preference since it isn’t used yet. Once we start using it, we can adjust the type as needed. The FormatModelRegistry expects the caller to supply the generic parameters, so until this is actually used, the specific type doesn’t make a difference.

Iceberg GenericRecord requires an Iceberg Schema or StructType to create the object. Hence, I thought the schema type shouldn't be Void.class.

I do agree that this is an implementation code that we can easily update later. There is no interface change here.

The problem with Schema is that this is almost certainly going to be null when passed to the read and write function because there isn't a separate engine schema. I'd prefer using Void in that case so that people don't actually pass something through. But this gets erased anyway so it doesn't really matter. Callers can do whatever they like anyway, I guess. That's the weird thing about having types that are erased and then re-added using @SuppressWarnings("unchecked").

Actually, should we mandate that engineSchema is not passed for this? That would make this easier because it would be rejected by the Parquet/Record builders. I'm in favor, but not strongly.

I see your point, but introducing a new flag across all format models solely to disable engine schema feels like unnecessary complexity to me. And if we later decide to support shredding in the generic models as well, Void would need to become Schema, and we’d have to undo all of these changes anyway.

Let me merge this as it is, and cycle back to this later if we see value in adding these validations

rdblue

Following on the thread from @stevenzwu, I'd probably disable passing engineSchema at all for this (or at least a non-null one) but I'm fine either way. Merge when you're happy with it. Thanks, @pvary!

pvary · 2026-02-15T09:53:36Z

Merged to main.
Thanks for the review @rdblue and @stevenzwu!

github-actions bot added parquet core data labels Feb 7, 2026

pvary mentioned this pull request Feb 7, 2026

Core, Data: File Format API interfaces #12774

Merged

pvary force-pushed the parquet_model branch from 185b101 to d80116d Compare February 7, 2026 09:31

pvary mentioned this pull request Feb 7, 2026

Core, Arrow: Implementation of ArrowFormatModel #15258

Merged