Skip to content

Validate FIXED_LEN_BYTE_ARRAY length for DECIMAL and INTERVAL types#9985

Merged
etseidl merged 5 commits into
apache:mainfrom
CynicDog:validate-type-length-decimal-interval
May 28, 2026
Merged

Validate FIXED_LEN_BYTE_ARRAY length for DECIMAL and INTERVAL types#9985
etseidl merged 5 commits into
apache:mainfrom
CynicDog:validate-type-length-decimal-interval

Conversation

@CynicDog
Copy link
Copy Markdown
Contributor

@CynicDog CynicDog commented May 16, 2026

Which issue does this PR close?

Rationale for this change

from_fixed_len_byte_array in parquet/src/arrow/schema/primitive.rs does not validate type_length. While PrimitiveTypeBuilder::build() enforces these constraints during schema construction (schema/types.rs:477 for INTERVAL, :565-580 for DECIMAL), schemas decoded directly from Thrift bypass that validation path entirely. As a result:

  • DECIMAL with a type_length outside 1..=32 was silently routed through Decimal128 / Decimal256 using invalid parameters.
  • INTERVAL with a type_length != 12 silently returned Interval(DayTime) regardless.

The same function already rejects FLOAT16 when type_length != 2. This PR mirrors that pattern for DECIMAL and INTERVAL, closing the TODO introduced in #1682.

What changes are included in this PR?

  • Added a check_decimal_length helper to reject type_length values outside 1..=32 for both LogicalType::Decimal and ConvertedType::DECIMAL.
  • Added an inline type_length == 12 check for ConvertedType::INTERVAL.

Are these changes tested?

Yes. Added five new tests in parquet/src/arrow/schema/primitive.rs::tests covering:

  • Invalid lengths ({-1, 0, 33} for DECIMAL, {0, 11, 13} for INTERVAL)
  • Valid lengths (16 → Decimal128, 32 → Decimal256, 12 → Interval(DayTime))

To exercise the reader-side check, the tests construct a valid Type::PrimitiveType via the builder and directly modify the type_length on the resulting enum, simulating a malformed schema decoded from Thrift.

Are there any user-facing changes?

No public API changes. The only behavior change is on the reader side: schemas with an out-of-range type_length for DECIMAL or INTERVAL will now return a ParquetError::General instead of silently producing a mismatched Arrow type.

@github-actions github-actions Bot added the parquet Changes to the parquet crate label May 16, 2026
Copy link
Copy Markdown
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this PR @CynicDog

I left some comments

Comment thread parquet/src/arrow/schema/primitive.rs Outdated

fn check_decimal_length(type_length: i32) -> Result<()> {
// Arrow's widest decimal is Decimal256, which needs at most 32 bytes.
if !(1..=32).contains(&type_length) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found this a confusing way to check bounds -- I would expect something like

if type_length < 1 || type_length > 32 {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, updated!

// TODO: This should check the type length for the decimal and interval types
match (info.logical_type_ref(), info.converted_type()) {
(Some(LogicalType::Decimal { scale, precision }), _) => {
check_decimal_length(type_length)?;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be checking exact lengths? The only valid lengths in this case are 16 and 32 , right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch on the wording! You're totally right—any byte length is valid here, not just 16 or 32.

As per the Parquet spec for fixed_len_byte_array, the precision is simply limited by the array size. As long as the precision fits within the byte length, it's completely valid.

That upper bound of 32 we were looking at is actually an Arrow constraint (since Decimal256 is 256 bits, which caps it at 32 bytes), rather than a limitation from Parquet itself. Checking for a range of 1..=32 makes perfect sense here, especially since we definitely see perfectly legitimate Parquet files out in the wild with byte lengths like 4 or 7.

@CynicDog CynicDog requested a review from alamb May 22, 2026 12:02
CynicDog added 2 commits May 23, 2026 10:32
The Rust `PrimitiveTypeBuilder` rejects bad `type_length` at construction,
but schemas decoded directly from Thrift bypass that path. Previously a
malformed DECIMAL length silently produced the wrong Arrow type, and an
INTERVAL of length != 12 was accepted unchecked.

Validate `type_length` in `from_fixed_len_byte_array`, mirroring the
existing FLOAT16 pattern: 1..=32 bytes for DECIMAL (Decimal128 below 16,
Decimal256 up to 32), exactly 12 bytes for INTERVAL.
- simplify bounds check to `type_length < 1 || type_length > 32`
- clarify the Decimal128/Decimal256 split with an inline comment
@CynicDog CynicDog force-pushed the validate-type-length-decimal-interval branch from 3a1b099 to 9b2a4f3 Compare May 23, 2026 01:33
Copy link
Copy Markdown
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me -- thank you @CynicDog

Copy link
Copy Markdown
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks @CynicDog

@etseidl etseidl merged commit 1377761 into apache:main May 28, 2026
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Validate FIXED_LEN_BYTE_ARRAY type_length for DECIMAL and INTERVAL in Parquet → Arrow schema conversion

3 participants