Core, Orc, Data: Implementation of ORCFormatModel #15255
Conversation
stevenzwu
left a comment
I left the Void.class schema class comment on the Parquet PR
file,
conf,
schema,
// This is a behavioral change. Previously there were an error if metadata columns were
Can we do this behavior change in a separate PR? Or is it required to get the build to pass?
This is needed so that ORC handles partitioned reads the same way as the other readers.
It is intuitive why we remove the constant and metadata field ids from the read schema, but I need some help understanding how they are related to partitioned reads.
Also, can you point me to the code showing how the Parquet readers handle this situation?
Also, we should probably remove the "This is a behavioral change" part from the code comment; that belongs in the PR discussion. The code comment should just explain why we are stripping the constant and metadata fields from the schema.
Removed the comment.
In a nutshell, the VectorizedSparkOrcReaders, GenericOrcReader.buildReader, etc. functions need the full schema to create the "readers" for the constant columns, but the physical reader doesn't need them. Every caller currently makes sure that the unnecessary columns are removed when the physical reader is created. This will not work when the generic parametrization is used. We could do it in the ORCFormatModel, but in my opinion this change is just fixing a bug.
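For context, here is a minimal sketch of the kind of schema pruning described above, assuming the usual `idToConstant` map that Iceberg readers build for constant and metadata values. `pruneConstantColumns` is a hypothetical helper for illustration, not code from this PR:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

class ReadSchemaPruning {
  private ReadSchemaPruning() {
  }

  // Hypothetical helper: drop top-level fields whose values come from constants
  // (partition values, metadata columns). The full read schema is still used to
  // build the constant-column "readers", but only the remaining fields need to
  // be requested from the physical ORC reader.
  static Schema pruneConstantColumns(Schema readSchema, Map<Integer, ?> idToConstant) {
    List<Types.NestedField> physicalColumns = readSchema.columns().stream()
        .filter(field -> !idToConstant.containsKey(field.fieldId()))
        .collect(Collectors.toList());
    return new Schema(physicalColumns);
  }
}
```

Doing this pruning in one place (e.g. inside the format model) instead of in every caller is the gist of the change being discussed.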
Rebased on top of main.
singhpk234
left a comment
LGTM too, thanks @pvary!
Merged to main.
Part of: #12298
Implementation of the new API: #12774
OrcFormatModel and related tests