Skip to content

[SPARK-56844][SQL] Support ArrayType / MapType / StructType in ConstantColumnVector and FileSourceMetadataAttribute#55844

Open
mzhang wants to merge 3 commits into
apache:masterfrom
mzhang:mzhang/SPARK-56844-constant-column-vector-complex-types
Open

[SPARK-56844][SQL] Support ArrayType / MapType / StructType in ConstantColumnVector and FileSourceMetadataAttribute#55844
mzhang wants to merge 3 commits into
apache:masterfrom
mzhang:mzhang/SPARK-56844-constant-column-vector-complex-types

Conversation

@mzhang
Copy link
Copy Markdown
Contributor

@mzhang mzhang commented May 13, 2026

What changes were proposed in this pull request?

Allow ArrayType, MapType, and StructType in file source constant metadata attributes. Concretely:

  • FileSourceMetadataAttribute.isSupportedType allows complex types recursively, contingent on their element types being supported. Behavior for primitives, decimal, string, binary, interval, and variant is unchanged.
  • ColumnVectorUtils.populate gains struct/array/map branches:
    • Struct: recurse into pre-allocated child ConstantColumnVectors.
    • Array/map: allocate a one-row OffHeapColumnVector backing and reuse the existing RowToColumnConverter (wrapped in a single-field struct schema) to write the constant value. The view is handed to the constant vector along with ownership of the backing.
  • ConstantColumnVector gains optional ownership of a backing WritableColumnVector (closed by close()), exposed via new setArrayWithBacking / setMapWithBacking. The original setArray / setMap are unchanged (caller retains ownership).
  • ConstantColumnVector's constructor pre-allocates struct children so populate's struct recursion has a target. setChild closes any previously-set child to avoid leaking the auto-allocated one.

Notes on correctness: the recursive copy for array and map reuses RowToColumnConverter, which already drives row-to-columnar conversion across all supported types (RowToColumnarExec). No new per-type dispatch logic is introduced.

Why are the changes needed?

FileSourceMetadataAttribute.isSupportedType is the lone gate preventing complex-typed file source constant metadata; the underlying machinery already supports them.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New cases in ColumnVectorUtilsSuite replacing the previous "not supported" assertions:

  • fill array of ints
  • fill array of strings
  • fill map of int -> boolean
  • fill struct
  • fill nested array<struct> (covers element-level nulls)
  • fill null array

Existing ConstantColumnVectorSuite cases continue to exercise the same paths.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude (Anthropic)

…ntColumnVector and FileSourceMetadataAttribute

### What changes were proposed in this pull request?

Lift the restriction on complex types in file source constant metadata
attributes. Before this change, `FileSourceMetadataAttribute.isSupportedType`
rejected `ArrayType`, `MapType`, and `StructType` outright (citing
`ColumnVectorUtils.populate` as the limiting factor), even though the
underlying machinery (`RowToColumnConverter`, the array/map layout in
`OffHeapColumnVector`, and the struct-child scaffolding in
`ConstantColumnVector`) already supports them.

Concretely:

- `FileSourceMetadataAttribute.isSupportedType` now allows complex types
  recursively, contingent on their element types being supported.
- `ColumnVectorUtils.populate` gains struct/array/map branches:
  - Struct: recurse into pre-allocated child `ConstantColumnVector`s.
  - Array/map: allocate a one-row `OffHeapColumnVector` backing and
    reuse the existing `RowToColumnConverter` (wrapped in a single-field
    struct schema) to write the constant value. The resulting view is
    handed to the constant vector along with ownership of the backing.
- `ConstantColumnVector` gains optional ownership of a backing
  `WritableColumnVector` (closed by `close()`), exposed via new
  `setArrayWithBacking` / `setMapWithBacking` methods. The original
  `setArray` / `setMap` are unchanged (caller retains ownership).
- `ConstantColumnVector`'s constructor now pre-allocates struct children
  so `populate`'s struct recursion has a target. `setChild` closes any
  previously-set child to avoid leaking the auto-allocated one.

### Why are the changes needed?

A natural use case is a file source with metadata such as access control
lists, tags, or per-file annotations whose values are best expressed as
arrays or structs. Today these must be encoded as variants or strings,
even though the column vector implementation can handle the native
types.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Updated `ColumnVectorUtilsSuite` to replace the "not supported" cases
with positive cases that populate array / map / struct constants from
`InternalRow` and assert the resulting values. Existing
`ConstantColumnVectorSuite` cases continue to exercise the same paths.

### Was this patch authored or co-authored using generative AI tooling?

Yes, generated by Claude (Anthropic).
Copy link
Copy Markdown
Contributor

@cloud-fan cloud-fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR widens FileSourceMetadataAttribute.isSupportedType to accept ArrayType/MapType/StructType recursively, and teaches ColumnVectorUtils.populate how to write those types into a ConstantColumnVector. The underlying columnar infrastructure (RowToColumnConverter, OffHeapColumnVector) already supports the element types, so this is mostly plumbing.

Two intentional code paths:

  • Struct recurses into per-field ConstantColumnVector children (auto-allocated by the constructor). Keeps getChild(i) returning a ConstantColumnVector, which existing code (e.g. setCalendarInterval, setVariant) relies on.
  • Array/map allocates a WritableColumnVector(1, t) backing, drives it via RowToColumnConverter with a one-field wrapper struct schema, and hands the ColumnarArray/ColumnarMap view plus ownership of the backing to the ConstantColumnVector via the new setArrayWithBacking/setMapWithBacking. ConstantColumnVector.close() releases the owned backing.

This asymmetry is justified — array/map element data is variable-length and doesn't fit the "one constant value per row" model, so it needs real backing storage and explicit ownership transfer.

A few comments below — the main one is matching the codebase's on-heap/off-heap convention.

Coverage suggestion (not blocking): consider adding to ColumnVectorUtilsSuite (a) a struct with a null field — exercises the inner.isNullAt(i) ? null : ... branch in populate's struct path — and (b) a struct-of-struct case.

…, fix comment, add coverage

- Add `populate(ConstantColumnVector, InternalRow, int, MemoryMode)` overload so the
  array/map backing allocation matches the codebase's on-heap/off-heap convention
  (consistent with OrcFileFormat, OrcScan, ColumnarEvaluatorFactory,
  AggregatePushDownUtils, InMemoryRelation, VectorizedParquetRecordReader). The
  no-arg `populate` delegates with `MemoryMode.ON_HEAP` so test and benchmark call
  sites are unchanged. Production callers (`VectorizedParquetRecordReader`,
  `OrcColumnarBatchReader`, `FileScanRDD`) now pass their existing memory mode.
- Fix stale comment in the struct populate branch.
- Match surrounding test style in the nested-array test (use `i` not `_`).
- Add two coverage cases in `ColumnVectorUtilsSuite`: struct with a null field
  (exercises the per-field null branch) and struct-of-struct (exercises the
  recursive constructor pre-allocation).

Generated-by: Claude (Anthropic)
@mzhang mzhang requested a review from cloud-fan May 13, 2026 04:03
Copy link
Copy Markdown
Contributor

@cloud-fan cloud-fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. 3 addressed, 0 remaining, 1 new (1 newly introduced — import order). Thanks for threading MemoryMode through cleanly and adding the null-field / nested-struct test cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants