Summary
Follow-up to #502. The data conversion layer now supports LargeListArray (64-bit offsets) via ProjectRecordBatch, but the Parquet reader's schema validation still rejects LARGE_LIST types. Additionally, the reader needs to expose Arrow's list_type property to allow users to request LargeListArray output.
Problem
ValidateParquetSchemaEvolution in parquet_schema_util.cc:177-180 only accepts ::arrow::Type::LIST:
case TypeId::kList:
  if (arrow_type->id() == ::arrow::Type::LIST) {
    return {};
  }
  break;
- Arrow's Parquet reader defaults to Type::LIST output. Without exposing ArrowReaderProperties::set_list_type(), users cannot request LargeListArray output.
Proposed Solution
1. Update schema validation to accept both list types
case TypeId::kList:
  if (arrow_type->id() == ::arrow::Type::LIST ||
      arrow_type->id() == ::arrow::Type::LARGE_LIST) {
    return {};
  }
  break;
2. Add kListType to ReaderProperties
Expose a property to configure the Arrow list type preference.
3. Pass through to Arrow reader
In ParquetReader::Impl::Open(), call arrow_reader_properties.set_list_type() with the configured value.
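The pass-through in steps 2 and 3 could look roughly like the sketch below. It is illustrative only: `ListTypeId`, `FakeArrowReaderProperties`, `ReaderOptionsSketch`, and `MakeArrowProperties` are hypothetical stand-ins so the snippet compiles without Arrow or Iceberg headers. In the real change the value would come from the proposed kListType entry in ReaderProperties and be handed to ArrowReaderProperties::set_list_type().

```cpp
// Hypothetical stand-in for the two Arrow type ids involved
// (::arrow::Type::LIST and ::arrow::Type::LARGE_LIST).
enum class ListTypeId { kList, kLargeList };

// Hypothetical stand-in for the list_type part of ArrowReaderProperties.
// The default mirrors the issue text: the reader produces LIST by default.
class FakeArrowReaderProperties {
 public:
  void set_list_type(ListTypeId type) { list_type_ = type; }
  ListTypeId list_type() const { return list_type_; }

 private:
  ListTypeId list_type_ = ListTypeId::kList;
};

// Hypothetical stand-in for the reader-side option that kListType would carry.
struct ReaderOptionsSketch {
  ListTypeId list_type = ListTypeId::kList;
};

// Sketch of what Open() would do: copy the configured preference into the
// Arrow reader properties before the Arrow reader is built.
FakeArrowReaderProperties MakeArrowProperties(const ReaderOptionsSketch& options) {
  FakeArrowReaderProperties properties;
  properties.set_list_type(options.list_type);
  return properties;
}
```

Keeping LIST as the default means existing callers see no change in output; only callers that opt in receive LargeListArray.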
Why This Is Safe
- Iceberg's ListType doesn't distinguish between LIST and LARGE_LIST
- The projection layer (ProjectRecordBatch) already handles both via templated ProjectListArrayImpl<>
- Both represent the same logical "list" concept, just with different offset sizes
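To illustrate the last point, here is a minimal sketch of why one template can serve both layouts. It is not the actual ProjectListArrayImpl<> code: `ListColumn` and `ListAt` are hypothetical names, and the column is simplified to `int` values with no validity bitmap. The only thing that differs between the two instantiations is the offset width.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified list column: offsets[i]..offsets[i + 1] delimits
// the i-th list inside the flat values buffer.
template <typename OffsetT>
struct ListColumn {
  std::vector<OffsetT> offsets;  // size is number of lists + 1
  std::vector<int> values;       // flattened child values
};

// Extracts the i-th list. The logic is identical for 32-bit and 64-bit
// offsets; only OffsetT changes.
template <typename OffsetT>
std::vector<int> ListAt(const ListColumn<OffsetT>& column, std::size_t i) {
  const auto begin = static_cast<std::ptrdiff_t>(column.offsets[i]);
  const auto end = static_cast<std::ptrdiff_t>(column.offsets[i + 1]);
  return std::vector<int>(column.values.begin() + begin,
                          column.values.begin() + end);
}
```

Instantiating `ListColumn<std::int32_t>` and `ListColumn<std::int64_t>` with the same offsets and values yields the same lists from `ListAt`, which is the sense in which the two Arrow types are logically interchangeable here.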
Files to Change
src/iceberg/parquet/parquet_schema_util.cc - Update ValidateParquetSchemaEvolution
src/iceberg/parquet/parquet_reader.cc - Pass list_type to ArrowReaderProperties
src/iceberg/reader.h - Add kListType to ReaderProperties
src/iceberg/test/parquet_test.cc - Add integration tests
Related