Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
84 changes: 84 additions & 0 deletions docs/source/contributor-guide/spark_expressions_support.md
Original file line number Diff line number Diff line change
Expand Up @@ -129,36 +129,120 @@
### array_funcs

- [x] array
- Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
- Spark 3.5.8 (audited 2026-05-27): baseline. `CreateArray(children, useStringTypeWhenEmpty)`; element type is the common type of children. Comet routes via `CometCreateArray` (native `make_array`) and special-cases the empty-array case to dodge a known DataFusion `coerce_types` issue (#3338).
- Spark 4.0.1 (audited 2026-05-27): semantics unchanged.
- Spark 4.1.1 (audited 2026-05-27): adds `contextIndependentFoldable` override; runtime semantics unchanged.
- [x] array_append
- Spark 3.4.3 (audited 2026-05-27): standalone `BinaryExpression`, evaluated directly. Comet routes via `CometArrayAppend`.
- Spark 3.5.8 (audited 2026-05-27): identical to 3.4.3.
- Spark 4.0.1 (audited 2026-05-27): now `RuntimeReplaceable` and rewritten to `ArrayInsert(arr, Literal(-1), elem)`. `CometArrayAppend` is therefore unreachable; dispatch goes through `CometArrayInsert` (which carries its own `Incompatible` notes documented at the `array_insert` entry).
- Spark 4.1.1 (audited 2026-05-27): identical to 4.0.1.
- [x] array_compact
- Spark 3.4.3 (audited 2026-05-27): `RuntimeReplaceable` -> `ArrayFilter(arr, IsNotNull(lambda))`. Comet receives the rewritten form, dispatches through `CometArrayFilter`, which delegates back to `CometArrayCompact.convert` for the actual proto emission. The native path uses Comet's `spark_array_compact` UDF rather than DataFusion's `array_remove_all` because DataFusion 53 changed `array_remove_all`'s NULL semantics.
- Spark 3.5.8 (audited 2026-05-27): identical to 3.4.3.
- Spark 4.0.1 (audited 2026-05-27): the replacement is wrapped in `KnownNotContainsNull(...)` (analysis-only hint, no semantic change).
- Spark 4.1.1 (audited 2026-05-27): identical to 4.0.1.
- [x] array_contains
- Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
- Spark 3.5.8 (audited 2026-05-27): baseline. `ArrayContains(left, right) extends BinaryExpression with NullIntolerant with Predicate`; `inputTypes` uses `findWiderTypeWithoutStringPromotionForTwo`. Wired as `CometScalarFunction("array_contains")`.
- Spark 4.0.1 (audited 2026-05-27): `NullIntolerant` trait replaced by `nullIntolerant: Boolean`; `checkInputDataTypes` adopts `DataTypeUtils.sameType` (collation-aware in 4.x).
- Spark 4.1.1 (audited 2026-05-27): identical to 4.0.1.
- Known limitation: no NaN-canonicalization guard in `getSupportLevel`. For `Float`/`Double` arrays containing NaN, Spark's `SQLOrderingUtil` may produce different results than DataFusion's IEEE comparison (https://github.com/apache/datafusion-comet/issues/4481).
- [x] array_distinct
- Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
- Spark 3.5.8 (audited 2026-05-27): baseline. `ArrayDistinct(child)` over `ArraySetLike`; uses `SQLOpenHashSet` so NaN and `+0.0`/`-0.0` are canonicalized. Wired as `CometScalarFunction("array_distinct")`.
- Spark 4.0.1 (audited 2026-05-27): `NullIntolerant` -> `nullIntolerant` field refactor.
- Spark 4.1.1 (audited 2026-05-27): identical to 4.0.1.
- Known divergence: DataFusion `array_distinct` uses hash-based equality without NaN/signed-zero canonicalization, so float/double arrays may produce different results (https://github.com/apache/datafusion-comet/issues/4481).
- [x] array_except
- Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
- Spark 3.5.8 (audited 2026-05-27): baseline. `ArrayExcept(left, right) extends ArrayBinaryLike with ComplexTypeMergingExpression`; result preserves left-side first occurrences not present in right. Comet routes via `CometArrayExcept` and unconditionally flags `Incompatible` ("Null handling and ordering may differ from Spark"); also falls back for `BinaryType` / `StructType` element types.
- Spark 4.0.1 (audited 2026-05-27): `nullIntolerant = true` moves into `ArrayBinaryLike`; the overflow path uses `arrayFunctionWithElementsExceedLimitError`.
- Spark 4.1.1 (audited 2026-05-27): identical to 4.0.1.
- Known divergence: same NaN/signed-zero canonicalization gap as `array_distinct` for float/double arrays (https://github.com/apache/datafusion-comet/issues/4481).
- [x] array_insert
- Spark 3.4.3 audited 2026-04-02
- Spark 3.5.8 audited 2026-04-02
- Spark 4.0.1 audited 2026-04-02 (pos=0 error message differs from Spark)
- Spark 4.1.1 (audited 2026-05-27): identical to 4.0.1.
- [x] array_intersect
- Spark 3.4.3 audited 2026-04-24 (result element order may differ from Spark when the right array is longer than the left; DataFusion probes the longer side)
- Spark 3.5.8 audited 2026-04-24 (same ordering incompatibility as 3.4.3)
- Spark 4.0.1 audited 2026-04-24 (ordering incompatibility as above; collated strings now fall back to Spark)
- Spark 4.1.1 (audited 2026-05-27): identical to 4.0.1.
- [x] array_join
- Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
- Spark 3.5.8 (audited 2026-05-27): baseline. `ArrayJoin(array, delimiter, nullReplacement)`. Comet routes via `CometArrayJoin` to DataFusion's `array_to_string` and is unconditionally flagged `Incompatible` ("Null handling may differ from Spark", #3178).
- Spark 4.0.1 (audited 2026-05-27): `inputTypes` widened to `AbstractArrayType(StringTypeWithCollation(supportsTrimCollation = true))`; non-binary collations not propagated (https://github.com/apache/datafusion-comet/issues/2190).
- Spark 4.1.1 (audited 2026-05-27): adds `contextIndependentFoldable` override; runtime unchanged.
- [x] array_max
- Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
- Spark 3.5.8 (audited 2026-05-27): baseline. `ArrayMax(child) extends UnaryExpression with ImplicitCastInputTypes`; skips NULL elements; for float/double Spark's `SQLOrderingUtil` treats NaN as greater than any non-NaN. Wired as `CometScalarFunction("array_max")`.
- Spark 4.0.1 (audited 2026-05-27): `NullIntolerant` -> `nullIntolerant` field refactor.
- Spark 4.1.1 (audited 2026-05-27): identical to 4.0.1.
- Known divergence: DataFusion's `array_max` uses Arrow `partial_cmp`-based ordering, so float/double arrays containing NaN may produce different results (https://github.com/apache/datafusion-comet/issues/4482).
- [x] array_min
- Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
- Spark 3.5.8 (audited 2026-05-27): mirror of `ArrayMax` with `evalInternal` returning the minimum. Same NULL-skip and NaN-ordering semantics. Wired as `CometScalarFunction("array_min")`.
- Spark 4.0.1 (audited 2026-05-27): same trait refactor as `array_max`.
- Spark 4.1.1 (audited 2026-05-27): identical to 4.0.1.
- Known divergence: same NaN-handling gap as `array_max` (https://github.com/apache/datafusion-comet/issues/4482).
- [x] array_position
- Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
- Spark 3.5.8 (audited 2026-05-27): baseline. `ArrayPosition(left, right)`; returns 1-based `LongType` position, 0 if not found, NULL if either input is NULL. `CometArrayPosition` falls back for all-foldable args (constant folding handles those) and for unsupported element types (binary/struct/map/null).
- Spark 4.0.1 (audited 2026-05-27): `NullIntolerant` -> `nullIntolerant` field refactor.
- Spark 4.1.1 (audited 2026-05-27): identical to 4.0.1.
- [ ] array_prepend
- [x] array_remove
- Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
- Spark 3.5.8 (audited 2026-05-27): baseline. `ArrayRemove(left, right)`; removes all occurrences equal to `right`. Wired as `CometScalarFunction("array_remove")`. Falls back via `ArraysBase.isTypeSupported` for binary/struct/map/null child types.
- Spark 4.0.1 (audited 2026-05-27): `NullIntolerant` -> `nullIntolerant` field refactor.
- Spark 4.1.1 (audited 2026-05-27): identical to 4.0.1.
- [x] array_repeat
- Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
- Spark 3.5.8 (audited 2026-05-27): baseline. `ArrayRepeat(left, right) extends BinaryExpression with ExpectsInputTypes`; `inputTypes = Seq(AnyDataType, IntegerType)`. NULL count yields NULL; count <= 0 yields empty array; count > `MAX_ROUNDED_ARRAY_LENGTH` throws at runtime. Comet wraps the call in `CaseWhen(IsNotNull(right), array_repeat(...), null)`.
- Spark 4.0.1 (audited 2026-05-27): error message uses `createArrayWithElementsExceedLimitError(prettyName, count)`; semantics unchanged.
- Spark 4.1.1 (audited 2026-05-27): identical to 4.0.1.
- [x] array_union
- Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
- Spark 3.5.8 (audited 2026-05-27): baseline. `ArrayUnion(left, right) extends ArrayBinaryLike with ComplexTypeMergingExpression`; result is left-side distinct elements followed by new right-side elements. Wired as `CometScalarFunction("array_union")`.
- Spark 4.0.1 (audited 2026-05-27): `nullIntolerant = true` moves into `ArrayBinaryLike`; overflow path uses `arrayFunctionWithElementsExceedLimitError`.
- Spark 4.1.1 (audited 2026-05-27): identical to 4.0.1.
- Known divergence: same NaN/signed-zero canonicalization gap as `array_distinct` (https://github.com/apache/datafusion-comet/issues/4481). Result ordering versus DataFusion is also unverified; compare the `array_intersect` ordering caveat.
- [x] arrays_overlap
- Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
- Spark 3.5.8 (audited 2026-05-27): baseline. `ArraysOverlap(left, right)`; three-valued logic (TRUE if any common non-null element, NULL if a null is present and no overlap is found in non-nulls, FALSE otherwise). Comet routes via `CometArraysOverlap` to the native `spark_arrays_overlap` UDF, which implements the same three-valued logic.
- Spark 4.0.1 (audited 2026-05-27): `NullIntolerant` -> `nullIntolerant` field refactor.
- Spark 4.1.1 (audited 2026-05-27): identical to 4.0.1.
- [x] arrays_zip
- Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
- Spark 3.5.8 (audited 2026-05-27): baseline. `ArraysZip(children, names)`; returns an array of structs, padding shorter inputs with NULL. Comet routes via `CometArraysZip` and rejects unsupported child element types (anything outside primitives, decimals, dates/timestamps, strings, binary, and nested arrays/structs of those).
- Spark 4.0.1 (audited 2026-05-27): the length-mismatch error switches from `IllegalArgumentException` to `SparkIllegalArgumentException("_LEGACY_ERROR_TEMP_3235")`; runtime unchanged.
- Spark 4.1.1 (audited 2026-05-27): identical to 4.0.1.
- [x] element_at
- Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
- Spark 3.5.8 (audited 2026-05-27): baseline. `ElementAt(left, right, defaultValueOutOfBound, failOnError)`; group label `map_funcs`. Comet supports only `ArrayType` input; `MapType` input falls back.
- Spark 4.0.1 (audited 2026-05-27): `NullIntolerant` -> `nullIntolerant` field refactor; group label changes to `collection_funcs`; ANSI default flips to `true` so out-of-bound throws by default. Comet wires `failOnError` through to native `ListExtract`.
- Spark 4.1.1 (audited 2026-05-27): identical to 4.0.1.
- [x] flatten
- Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
- Spark 3.5.8 (audited 2026-05-27): baseline. `Flatten(child) extends UnaryExpression`; returns NULL if any inner sub-array is NULL. Comet routes via `CometFlatten` and falls back for child types containing `BinaryType` / `StructType` / `MapType` (limitation of `ArraysBase.isTypeSupported`).
- Spark 4.0.1 (audited 2026-05-27): `NullIntolerant` -> `nullIntolerant` field refactor.
- Spark 4.1.1 (audited 2026-05-27): identical to 4.0.1.
- [x] get
- Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
- Spark 3.5.8 (audited 2026-05-27): baseline. `GetArrayItem(child, ordinal, failOnError)`; `inputTypes = Seq(AnyDataType, IntegralType)`. Comet routes via `CometGetArrayItem`, wiring `failOnError` through to the proto.
- Spark 4.0.1 (audited 2026-05-27): semantics unchanged; ANSI default flips to `true`.
- Spark 4.1.1 (audited 2026-05-27): `inputTypes` tightened to `Seq(ArrayType, IntegralType)` (analysis-time only); runtime unchanged.
- [ ] sequence
- [ ] shuffle
- [ ] slice
- [x] sort_array
- Spark 3.4.3 (audited 2026-05-27): identical to 3.5.8.
- Spark 3.5.8 (audited 2026-05-27): baseline. `SortArray(base, ascendingOrder) extends BinaryExpression with ArraySortLike`; the second arg must be a `Literal(_: Boolean, BooleanType)`. Comet `CometSortArray` flags `Incompatible` under strict floating-point and falls back for nested arrays whose innermost element is `Struct` or `Null`.
- Spark 4.0.1 (audited 2026-05-27): trait set changes substantively: `ArraySortLike` and `NullIntolerant` are removed, `nullIntolerant = true` becomes an override, and `ascendingOrder` is widened to accept any foldable boolean (not just `Literal`). Comet's `CometSortArray` still requires a `Literal`, so the new foldable form falls back at convert time.
- Spark 4.1.1 (audited 2026-05-27): identical to 4.0.1.

### bitwise_funcs

Expand Down
14 changes: 9 additions & 5 deletions spark/src/main/scala/org/apache/comet/serde/arrays.scala
Original file line number Diff line number Diff line change
Expand Up @@ -334,10 +334,12 @@ object CometArrayCompact extends CometExpressionSerde[Expression] {

object CometArrayExcept extends CometExpressionSerde[ArrayExcept] with CometExprShim {

override def getIncompatibleReasons(): Seq[String] = Seq(
"Null handling and ordering may differ from Spark")
private val incompatReason = "Null handling and ordering may differ from Spark"

override def getIncompatibleReasons(): Seq[String] = Seq(incompatReason)

override def getSupportLevel(expr: ArrayExcept): SupportLevel = Incompatible(None)
override def getSupportLevel(expr: ArrayExcept): SupportLevel = Incompatible(
Some(incompatReason))

@tailrec
def isTypeSupported(dt: DataType): Boolean = {
Expand Down Expand Up @@ -376,9 +378,11 @@ object CometArrayExcept extends CometExpressionSerde[ArrayExcept] with CometExpr

object CometArrayJoin extends CometExpressionSerde[ArrayJoin] {

override def getIncompatibleReasons(): Seq[String] = Seq("Null handling may differ from Spark")
private val incompatReason = "Null handling may differ from Spark"

override def getIncompatibleReasons(): Seq[String] = Seq(incompatReason)

override def getSupportLevel(expr: ArrayJoin): SupportLevel = Incompatible(None)
override def getSupportLevel(expr: ArrayJoin): SupportLevel = Incompatible(Some(incompatReason))

override def convert(
expr: ArrayJoin,
Expand Down
Loading