Arrow: Fix ClassCastException in vectorized reader on int-to-long pro… #16343
Open
xndai wants to merge 1 commit into
Conversation
…motion with INT logical type

Fix ClassCastException: BigIntVector cannot be cast to IntVector when reading Parquet files with an INT(32, true) logical type annotation after promoting a column from int to long.

The vectorized reader's LogicalTypeVisitor now allocates vectors based on the Parquet physical type instead of deriving them from the (potentially promoted) Iceberg schema type.

Root Cause:

In VectorizedArrowReader.allocateFieldVector(), the Arrow field was created from the Iceberg schema type (which reflects the promoted LongType), producing a BigIntVector. The LogicalTypeVisitor then cast this vector to IntVector based on the Parquet file's INT(32) logical type, causing the mismatch.

The non-vectorized reader (BaseParquetReaders) already handles this correctly by checking the expected Iceberg type and using IntAsLongReader for promotion. The vectorized reader relies on the accessor layer for widening (IntAccessor.getLong() widens int to long), so the fix ensures the vector matches the physical data layout.

Tests:
- testIntToLongPromotionWithLogicalType: verifies reading after promotion when the file has an INT(32, true) annotation (the reported crash)
- testIntToLongPromotionWithoutLogicalType: verifies reading after promotion when the file has a bare INT32
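For readers unfamiliar with the accessor-layer widening the description relies on, here is a minimal standalone sketch (not Iceberg's actual classes; the array and method names are hypothetical stand-ins): the vector keeps int32 physical data matching the Parquet file, and the promoted long type is produced by Java's lossless int-to-long conversion at read time, analogous to what IntAccessor.getLong() does.

```java
public class WideningSketch {
  // Hypothetical stand-in for an Arrow IntVector's backing int32 data;
  // after the fix, this stays int32 even when the Iceberg schema says long.
  static final int[] physicalInt32Data = {1, -2, Integer.MAX_VALUE};

  // Accessor-style read: widening int to long is exact for every int value,
  // so no data is lost when serving the promoted LongType column.
  static long getLong(int rowId) {
    return (long) physicalInt32Data[rowId];
  }

  public static void main(String[] args) {
    for (int i = 0; i < physicalInt32Data.length; i++) {
      System.out.println(getLong(i));
    }
  }
}
```

This is why allocating the vector from the physical type is safe for int-to-long promotion: the widening is deferred to the accessor instead of being baked into the vector's layout.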
CTTY (Contributor) reviewed May 15, 2026

LGTM! just one minor comment
// Iceberg has no unsigned integer type. Reading UINT32 into a 32-bit signed value would
// silently produce negative results for inputs above Integer.MAX_VALUE. UINT8 and UINT16
// both fit losslessly in a signed int32 and are allowed, matching the policy in
// BaseParquetReaders for the non-vectorized path.
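The quoted comment's overflow concern can be demonstrated directly (a small standalone example, not Iceberg code): a valid UINT32 value above Integer.MAX_VALUE wraps to a negative number when reinterpreted as a signed 32-bit int, while the largest UINT16 value still fits in a signed int.

```java
public class UnsignedDemo {
  public static void main(String[] args) {
    long uint32Value = 3_000_000_000L;        // a valid UINT32, > Integer.MAX_VALUE
    int asSignedInt = (int) uint32Value;      // silent wraparound to a negative value
    System.out.println(asSignedInt);
    // The unsigned interpretation can be recovered, but a reader that
    // treats the int as signed would just see the negative number.
    System.out.println(Integer.toUnsignedLong(asSignedInt));
    int maxUint16 = 0xFFFF;                   // 65535 fits losslessly in a signed int32
    System.out.println(maxUint16);
  }
}
```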
Contributor
why do we remove this comment? this still looks relevant
Fixes #16341