[VL] Support type widening in Parquet reader (SPARK-40876) #11683

@baibaichen

Labels: enhancement, VELOX


Description

Enable the GlutenParquetTypeWideningSuite test suite for Spark 4.0 and 4.1, which validates Parquet type widening support (SPARK-40876).

Background

GlutenParquetTypeWideningSuite has 84 tests covering two types of Parquet type conversions:

  1. Physical→Logical type restoration: reading a Parquet int32 column annotated as INT(8) back as TINYINT (safe, because the writer guarantees the value range)
  2. Schema evolution widening: reading data written as IntegerType with a wider read schema of LongType, DoubleType, or DecimalType (Spark 4.0 feature)
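
As a rough illustration of why some of these conversions are safe and others need a precision check, here is a minimal sketch. It is a simplified model written for this issue, not Spark's or Velox's actual rule set; the helper names (`integral_digits`, `int_fits_decimal`, `is_integer_widening`) are hypothetical.

```python
# Simplified model of read-side widening checks.
# Assumption: the rules below are illustrative only and do not mirror
# the exact logic in Spark or Velox.

# Bit widths of the signed integral types involved.
INT_BITS = {"byte": 8, "short": 16, "int": 32, "long": 64}


def integral_digits(bits: int) -> int:
    """Decimal digits needed for the largest magnitude of a signed `bits`-bit integer."""
    return len(str(2 ** (bits - 1)))


def is_integer_widening(source: str, target: str) -> bool:
    """True if every `source` value is representable in the integral `target`."""
    return INT_BITS[source] <= INT_BITS[target]


def int_fits_decimal(source: str, precision: int, scale: int) -> bool:
    """True if DECIMAL(precision, scale) has enough integer digits for any `source` value."""
    return precision - scale >= integral_digits(INT_BITS[source])


if __name__ == "__main__":
    print(is_integer_widening("int", "long"))   # int32 -> int64 never loses data
    print(int_fits_decimal("int", 10, 0))       # 10 integer digits cover int32
    print(int_fits_decimal("int", 9, 0))        # 9 integer digits do not
    print(int_fits_decimal("int", 12, 2))       # scale eats into the integer digits
```

In this model, an int32 column can be read as a decimal only when the target leaves at least 10 digits before the decimal point, which is the kind of check category B below refers to as a Decimal precision check.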

Currently the suite is disabled with 74 out of 84 tests failing. The failures fall into four categories:

| Category | Count | Issue | Fix |
|---|---|---|---|
| A | 13 | Velox doesn't support INT→DOUBLE/REAL/DECIMAL widening | Velox C++ convertType() extension |
| B | 29 | Exception type mismatch + no Decimal precision check | Exception translation + C++ precision check |
| C | 31 | Parquet V2 encoding assertions + Decimal conversion limits | Disable native writer + test overrides + Velox C++ |
| D | 1 | parquet-mr only decimal narrowing overflow→null | Exclude (cannot reproduce with native reader) |

Plan

This will be addressed in 3 PRs:

  1. PR 1, Exception translation: Add translateException() to convert Velox type errors to SchemaColumnConvertNotSupportedException. Enable the suite with appropriate excludes/overrides for tests that pass without C++ changes.

  2. PR 2, SPARK-18108 + Revert OAP: Fix partition column type conflicts. Import upstream Velox PR #15173.

  3. PR 3, Type widening implementation: Velox C++ changes for INT→DOUBLE/REAL/DECIMAL and Decimal→Decimal widening. This requires the upstream Velox PR to land first; the remaining tests are enabled afterwards.
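
The exception translation step in PR 1 could look roughly like the sketch below. It is written in Python for brevity and only shows the shape of the idea. The exception class is a local stand-in named after Spark's SchemaColumnConvertNotSupportedException, and the native error message pattern is invented for illustration; the real Velox message format and the real translateException() signature may differ.

```python
import re


class SchemaColumnConvertNotSupportedException(Exception):
    """Local stand-in for the Spark exception of the same name."""

    def __init__(self, column: str, physical_type: str, logical_type: str):
        super().__init__(
            f"column: {column}, physicalType: {physical_type}, logicalType: {logical_type}"
        )
        self.column = column
        self.physical_type = physical_type
        self.logical_type = logical_type


# Hypothetical native error text; the actual Velox wording is an assumption.
_TYPE_ERROR = re.compile(
    r"Converted type (?P<physical>\w+) is not allowed "
    r"for requested type (?P<logical>\w+) in column (?P<column>\w+)"
)


def translate_exception(err: Exception) -> Exception:
    """Map a native type-mismatch error to the exception type the tests expect."""
    match = _TYPE_ERROR.search(str(err))
    if match is None:
        return err  # not a type mismatch: pass the original error through
    translated = SchemaColumnConvertNotSupportedException(
        match["column"], match["physical"], match["logical"]
    )
    translated.__cause__ = err  # keep the native error for debugging
    return translated


if __name__ == "__main__":
    native = RuntimeError(
        "Converted type INT32 is not allowed for requested type DECIMAL in column a"
    )
    print(repr(translate_exception(native)))
```

Errors that do not match the pattern are returned unchanged, so unrelated native failures keep their original type.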

Test Results (Target)

| Result | Spark 4.0 | Spark 4.1 |
|---|---|---|
| ✅ Passed | 46 | 46 |
| 🟢 Override (passed) | 35 | 35 |
| ❌ Excluded | 3 | 3 |
| Total | 84 | 84 |

Sub-issue of #11550.

This issue was written with the assistance of AI.
