Skip to content

Core, Orc, Data: Implementation of ORCFormatModel#15255

Merged
pvary merged 1 commit intoapache:mainfrom
pvary:orc_model
Feb 15, 2026
Merged

Core, Orc, Data: Implementation of ORCFormatModel#15255
pvary merged 1 commit intoapache:mainfrom
pvary:orc_model

Conversation

@pvary
Copy link
Contributor

@pvary pvary commented Feb 7, 2026

Part of: #12298
Implementation of the new API: #12774

OrcFormatModel and related tests

Copy link
Contributor

@stevenzwu stevenzwu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left the Void.class schema class comment to the Parquet PR

file,
conf,
schema,
// This is a behavioral change. Previously there were an error if metadata columns were
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we do this behavior change in a separate PR? or it is required to get the build pass?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is needed, so ORC handles the partitioned reads the same way as the other readers

Copy link
Contributor

@stevenzwu stevenzwu Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is intuitive why we are remove the constant and metadata field ids from the read schema. but I need some help to understand how are they related to the partitioned reads?

Also can you point me to the code how Parquet readers handle this situation?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also we should probably remove the This is a behavioral change from the code comment. It should be just a PR comment. code comment should just explain why we are stripping away the constant and metadata fields from the schema.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the comment.

In nutshell, the VectorizedSparkOrcReaders, GenericOrcReader.buildReader, etc functions need the full schema to create the "readers" for the constant columns, but the physical reader don't need them. Every caller currently makes sure that the columns which are not necessary are removed when the physical reader is created. This will not work when the generic parametrization is used. We can do it in the ORCFormatModel, but this change is just fixing a bug in my opinion.

@pvary
Copy link
Contributor Author

pvary commented Feb 13, 2026

Rebased on top of main

Copy link
Contributor

@singhpk234 singhpk234 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM too, thanks @pvary !

@pvary pvary merged commit 83fe4ff into apache:main Feb 15, 2026
32 checks passed
@pvary
Copy link
Contributor Author

pvary commented Feb 15, 2026

Merged to main.
Thanks @stevenzwu and @singhpk234 for the review!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants