Validate FIXED_LEN_BYTE_ARRAY length for DECIMAL and INTERVAL types#9985
Conversation
|
|
||
| fn check_decimal_length(type_length: i32) -> Result<()> { | ||
| // Arrow's widest decimal is Decimal256, which needs at most 32 bytes. | ||
| if !(1..=32).contains(&type_length) { |
There was a problem hiding this comment.
I found this a confusing way to check bounds -- I would expect something like
if type_length < 1 || type_length > 32 {There was a problem hiding this comment.
Makes sense, updated!
| // TODO: This should check the type length for the decimal and interval types | ||
| match (info.logical_type_ref(), info.converted_type()) { | ||
| (Some(LogicalType::Decimal { scale, precision }), _) => { | ||
| check_decimal_length(type_length)?; |
There was a problem hiding this comment.
should this be checking exact lengths? The only valid lengths in this case are 16 and 32 , right?
There was a problem hiding this comment.
Good catch on the wording! You're totally right—any byte length is valid here, not just 16 or 32.
As per the Parquet spec for fixed_len_byte_array, the precision is simply limited by the array size. As long as the precision fits within the byte length, it's completely valid.
That upper bound of 32 we were looking at is actually an Arrow constraint (since Decimal256 is 256 bits, which caps it at 32 bytes), rather than a limitation from Parquet itself. Checking for a range of 1..=32 makes perfect sense here, especially since we definitely see perfectly legitimate Parquet files out in the wild with byte lengths like 4 or 7.
The Rust `PrimitiveTypeBuilder` rejects bad `type_length` at construction, but schemas decoded directly from Thrift bypass that path. Previously a malformed DECIMAL length silently produced the wrong Arrow type, and an INTERVAL of length != 12 was accepted unchecked. Validate `type_length` in `from_fixed_len_byte_array`, mirroring the existing FLOAT16 pattern: 1..=32 bytes for DECIMAL (Decimal128 below 16, Decimal256 up to 32), exactly 12 bytes for INTERVAL.
- simplify bounds check to `type_length < 1 || type_length > 32` - clarify the Decimal128/Decimal256 split with an inline comment
3a1b099 to
9b2a4f3
Compare
Which issue does this PR close?
Rationale for this change
from_fixed_len_byte_arrayinparquet/src/arrow/schema/primitive.rsdoes not validatetype_length. WhilePrimitiveTypeBuilder::build()enforces these constraints during schema construction (schema/types.rs:477for INTERVAL,:565-580for DECIMAL), schemas decoded directly from Thrift bypass that validation path entirely. As a result:DECIMALwith atype_lengthoutside1..=32was silently routed throughDecimal128/Decimal256using invalid parameters.INTERVALwith atype_length != 12silently returnedInterval(DayTime)regardless.The same function already rejects
FLOAT16whentype_length != 2. This PR mirrors that pattern for DECIMAL and INTERVAL, closing the TODO introduced in #1682.What changes are included in this PR?
check_decimal_lengthhelper to rejecttype_lengthvalues outside1..=32for bothLogicalType::DecimalandConvertedType::DECIMAL.type_length == 12check forConvertedType::INTERVAL.Are these changes tested?
Yes. Added five new tests in
parquet/src/arrow/schema/primitive.rs::testscovering:{-1, 0, 33}for DECIMAL,{0, 11, 13}for INTERVAL)Decimal128, 32 →Decimal256, 12 →Interval(DayTime))To exercise the reader-side check, the tests construct a valid
Type::PrimitiveTypevia the builder and directly modify thetype_lengthon the resulting enum, simulating a malformed schema decoded from Thrift.Are there any user-facing changes?
No public API changes. The only behavior change is on the reader side: schemas with an out-of-range
type_lengthfor DECIMAL or INTERVAL will now return aParquetError::Generalinstead of silently producing a mismatched Arrow type.