Skip to content

Parquet, Data: Implementation of ParquetFormatModel#15253

Merged
pvary merged 1 commit intoapache:mainfrom
pvary:parquet_model
Feb 15, 2026
Merged

Parquet, Data: Implementation of ParquetFormatModel#15253
pvary merged 1 commit intoapache:mainfrom
pvary:parquet_model

Conversation

@pvary
Copy link
Contributor

@pvary pvary commented Feb 7, 2026

Part of: #12298
Implementation of the new API: #12774

ParquetFormatModel and related tests

@pvary pvary force-pushed the parquet_model branch 2 times, most recently from 5ed39b4 to eeb9191 Compare February 10, 2026 22:23
FormatModelRegistry.register(
ParquetFormatModel.create(
Record.class,
Void.class,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found Void.class for schema type weird when reviewing the Orc PR. I see Ryan has a previous comment that suggested Void.class.

I checked the uber PR #12298 . For Spark InternalRow, the schema type is Spark StructType. For Flink RowData, the schema type is RowType.

For Iceberg generic Record, shouldn't the schema type be Iceberg Schema or StructType?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially used Schema, but I don’t have a strong preference since it isn’t used yet. Once we start using it, we can adjust the type as needed. The FormatModelRegistry expects the caller to supply the generic parameters, so until this is actually used, the specific type doesn’t make a difference.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Iceberg GenericRecord requires an Iceberg Schema or StructType to create the object. Hence, I thought the schema type shouldn't be Void.class.

I do agree that this is an implementation code that we can easily update later. There is no interface change here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with Schema is that this is almost certainly going to be null when passed to the read and write function because there isn't a separate engine schema. I'd prefer using Void in that case so that people don't actually pass something through. But this gets erased anyway so it doesn't really matter. Callers can do whatever they like anyway, I guess. That's the weird thing about having types that are erased and then re-added using @SuppressWarnings("unchecked").

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, should we mandate that engineSchema is not passed for this? That would make this easier because it would be rejected by the Parquet/Record builders. I'm in favor, but not strongly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point, but introducing a new flag across all format models solely to disable engine schema feels like unnecessary complexity to me. And if we later decide to support shredding in the generic models as well, Void would need to become Schema, and we’d have to undo all of these changes anyway.

Let me merge this as it is, and cycle back to this later if we see value in adding these validations

Copy link
Contributor

@rdblue rdblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following on the thread from @stevenzwu, I'd probably disable passing engineSchema at all for this (or at least a non-null one) but I'm fine either way. Merge when you're happy with it. Thanks, @pvary!

@pvary pvary merged commit 235ab0f into apache:main Feb 15, 2026
32 checks passed
@pvary
Copy link
Contributor Author

pvary commented Feb 15, 2026

Merged to main.
Thanks for the review @rdblue and @stevenzwu!

@pvary pvary changed the title Core, Parquet, Data: Implementation of ParquetFormatModel Parquet, Data: Implementation of ParquetFormatModel Feb 15, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants