[SPARK-56844][SQL] Support ArrayType / MapType / StructType in ConstantColumnVector and FileSourceMetadataAttribute#55844
Conversation
…ntColumnVector and FileSourceMetadataAttribute
### What changes were proposed in this pull request?
Lift the restriction on complex types in file source constant metadata
attributes. Before this change, `FileSourceMetadataAttribute.isSupportedType`
rejected `ArrayType`, `MapType`, and `StructType` outright (citing
`ColumnVectorUtils.populate` as the limiting factor), even though the
underlying machinery (`RowToColumnConverter`, the array/map layout in
`OffHeapColumnVector`, and the struct-child scaffolding in
`ConstantColumnVector`) already supports them.
Concretely:
- `FileSourceMetadataAttribute.isSupportedType` now allows complex types
recursively, contingent on their element types being supported.
- `ColumnVectorUtils.populate` gains struct/array/map branches:
- Struct: recurse into pre-allocated child `ConstantColumnVector`s.
- Array/map: allocate a one-row `OffHeapColumnVector` backing and
reuse the existing `RowToColumnConverter` (wrapped in a single-field
struct schema) to write the constant value. The resulting view is
handed to the constant vector along with ownership of the backing.
- `ConstantColumnVector` gains optional ownership of a backing
`WritableColumnVector` (closed by `close()`), exposed via new
`setArrayWithBacking` / `setMapWithBacking` methods. The original
`setArray` / `setMap` are unchanged (caller retains ownership).
- `ConstantColumnVector`'s constructor now pre-allocates struct children
so `populate`'s struct recursion has a target. `setChild` closes any
previously-set child to avoid leaking the auto-allocated one.
### Why are the changes needed?
A natural use case is a file source with metadata such as access control
lists, tags, or per-file annotations whose values are best expressed as
arrays or structs. Today these must be encoded as variants or strings,
even though the column vector implementation can handle the native
types.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Updated `ColumnVectorUtilsSuite` to replace the "not supported" cases
with positive cases that populate array / map / struct constants from
`InternalRow` and assert the resulting values. Existing
`ConstantColumnVectorSuite` cases continue to exercise the same paths.
### Was this patch authored or co-authored using generative AI tooling?
Yes, generated by Claude (Anthropic).
cloud-fan
left a comment
There was a problem hiding this comment.
The PR widens FileSourceMetadataAttribute.isSupportedType to accept ArrayType/MapType/StructType recursively, and teaches ColumnVectorUtils.populate how to write those types into a ConstantColumnVector. The underlying columnar infrastructure (RowToColumnConverter, OffHeapColumnVector) already supports the element types, so this is mostly plumbing.
Two intentional code paths:
- Struct recurses into per-field
ConstantColumnVectorchildren (auto-allocated by the constructor). KeepsgetChild(i)returning aConstantColumnVector, which existing code (e.g.setCalendarInterval,setVariant) relies on. - Array/map allocates a
WritableColumnVector(1, t)backing, drives it viaRowToColumnConverterwith a one-field wrapper struct schema, and hands theColumnarArray/ColumnarMapview plus ownership of the backing to theConstantColumnVectorvia the newsetArrayWithBacking/setMapWithBacking.ConstantColumnVector.close()releases the owned backing.
This asymmetry is justified — array/map element data is variable-length and doesn't fit the "one constant value per row" model, so it needs real backing storage and explicit ownership transfer.
A few comments below — the main one is matching the codebase's on-heap/off-heap convention.
Coverage suggestion (not blocking): consider adding to ColumnVectorUtilsSuite (a) a struct with a null field — exercises the inner.isNullAt(i) ? null : ... branch in populate's struct path — and (b) a struct-of-struct case.
…, fix comment, add coverage - Add `populate(ConstantColumnVector, InternalRow, int, MemoryMode)` overload so the array/map backing allocation matches the codebase's on-heap/off-heap convention (consistent with OrcFileFormat, OrcScan, ColumnarEvaluatorFactory, AggregatePushDownUtils, InMemoryRelation, VectorizedParquetRecordReader). The no-arg `populate` delegates with `MemoryMode.ON_HEAP` so test and benchmark call sites are unchanged. Production callers (`VectorizedParquetRecordReader`, `OrcColumnarBatchReader`, `FileScanRDD`) now pass their existing memory mode. - Fix stale comment in the struct populate branch. - Match surrounding test style in the nested-array test (use `i` not `_`). - Add two coverage cases in `ColumnVectorUtilsSuite`: struct with a null field (exercises the per-field null branch) and struct-of-struct (exercises the recursive constructor pre-allocation). Generated-by: Claude (Anthropic)
cloud-fan
left a comment
There was a problem hiding this comment.
LGTM. 3 addressed, 0 remaining, 1 new (1 newly introduced — import order). Thanks for threading MemoryMode through cleanly and adding the null-field / nested-struct test cases.
Generated-by: Claude (Anthropic)
What changes were proposed in this pull request?
Allow
ArrayType,MapType, andStructTypein file source constant metadata attributes. Concretely:FileSourceMetadataAttribute.isSupportedTypeallows complex types recursively, contingent on their element types being supported. Behavior for primitives, decimal, string, binary, interval, and variant is unchanged.ColumnVectorUtils.populategains struct/array/map branches:ConstantColumnVectors.OffHeapColumnVectorbacking and reuse the existingRowToColumnConverter(wrapped in a single-field struct schema) to write the constant value. The view is handed to the constant vector along with ownership of the backing.ConstantColumnVectorgains optional ownership of a backingWritableColumnVector(closed byclose()), exposed via newsetArrayWithBacking/setMapWithBacking. The originalsetArray/setMapare unchanged (caller retains ownership).ConstantColumnVector's constructor pre-allocates struct children sopopulate's struct recursion has a target.setChildcloses any previously-set child to avoid leaking the auto-allocated one.Notes on correctness: the recursive copy for array and map reuses
RowToColumnConverter, which already drives row-to-columnar conversion across all supported types (RowToColumnarExec). No new per-type dispatch logic is introduced.Why are the changes needed?
FileSourceMetadataAttribute.isSupportedTypeis the lone gate preventing complex-typed file source constant metadata; the underlying machinery already supports them.Does this PR introduce any user-facing change?
No.
How was this patch tested?
New cases in
ColumnVectorUtilsSuitereplacing the previous "not supported" assertions:fill array of intsfill array of stringsfill map of int -> booleanfill structfill nested array<struct>(covers element-level nulls)fill null arrayExisting
ConstantColumnVectorSuitecases continue to exercise the same paths.Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude (Anthropic)