Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
63 changes: 63 additions & 0 deletions docs/source/contributor-guide/spark_expressions_support.md
Original file line number Diff line number Diff line change
Expand Up @@ -364,78 +364,141 @@
### math_funcs

- [x] `%`
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `Remainder(left, right, evalMode)` signature identical across versions. Native path uses Rust `spark_modulo` UDF; non-ANSI returns NULL on divide-by-zero, ANSI raises `DIVIDE_BY_ZERO` / `REMAINDER_BY_ZERO`. `CometRemainder` rejects `EvalMode.TRY`, so `try_mod` (Spark 4.0+) falls back to Spark (https://github.com/apache/datafusion-comet/issues/4484).
- [x] `*`
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `Multiply(left, right, evalMode)` signature identical. Decimal results exceeding `DECIMAL128_MAX_PRECISION` go through `WideDecimalBinaryExpr` (Decimal256 intermediate); smaller decimals and primitives use DataFusion `BinaryExpr`. ANSI integer overflow uses Rust `checked_mul`. Interval multiplication falls back.
- [x] `+`
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `Add(left, right, evalMode)` with the same Decimal / ANSI plumbing as `*`. `Date + Int8/16/32` dispatches to the Rust `date_add` UDF to work around DataFusion's Date32 + Interval-only kernel.
- [x] `-`
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `Subtract(left, right, evalMode)` mirrors `+`. `Date - Int8/16/32` uses the Rust `date_sub` UDF.
- [x] `/`
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `Divide(left, right, evalMode)`. Non-ANSI mode wraps the divisor in `If(EqualTo(right, 0), null, right)` so DataFusion never throws. Decimal output is wrapped in `CheckOverflow(failOnError = ANSI)`; ANSI surfaces `NUMERIC_VALUE_OUT_OF_RANGE`, non-ANSI returns NULL.
- [x] abs
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `Abs(child, failOnError)` over `NumericType` plus the two interval types. `failOnError` (ANSI) is propagated to the native `abs` UDF, which throws `ARITHMETIC_OVERFLOW` on `Int.MinValue` / `Long.MinValue` / Decimal MIN. `DayTimeIntervalType` and `YearMonthIntervalType` fall back to Spark. Spark 4.0 / 4.1 do the `NullIntolerant` -> `nullIntolerant: Boolean` refactor; behaviour unchanged.
- [x] acos
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `UnaryMathExpression(math.acos, "ACOS")` unchanged across versions; wired as `CometScalarFunction("acos")` to DataFusion's `acos` UDF. NaN for `|x| > 1`.
- [x] acosh
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): custom `StrictMath.log(x + sqrt(x*x - 1))` unchanged across versions. NaN for `x < 1`. Routes to DataFusion's `acosh`.
- [x] asin
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `UnaryMathExpression(math.asin, "ASIN")` unchanged. NaN for `|x| > 1`.
- [x] asinh
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): special-cases `Double.NegativeInfinity` to avoid `log(NaN)`, otherwise `StrictMath.log(x + sqrt(x*x + 1))`. Identical across versions.
- [x] atan
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `UnaryMathExpression(math.atan, "ATAN")` unchanged.
- [x] atan2
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `BinaryMathExpression(math.atan2, "ATAN2")` with both inputs adjusted by `+0.0` to flip `-0.0` to `+0.0`. `CometAtan2` reproduces this by wrapping each child in `Add(child, Literal.default(child.dataType))` before dispatching to DataFusion's `atan2`.
- [x] atanh
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): custom `0.5 * (log1p(x) - log1p(-x))` (SPARK-28519). NaN for `|x| > 1`, +/-Infinity for `x = +/-1`.
- [x] bin
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `Bin(child)` over `LongType -> StringType`. Spark 4.x gains `DefaultStringProducingExpression` and the `nullIntolerant: Boolean` refactor; no behaviour change. Routes to datafusion-spark `SparkBin`.
- [ ] bround
- [x] cbrt
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): passthrough to DataFusion `cbrt`.
- [x] ceil
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): one-arg `ceil(expr)` supported (`LongType` / `DoubleType` / `DecimalType` with scale >= 0). Decimal with negative scale falls back at convert time. The two-arg `ceil(expr, scale)` form (`RoundCeil`) is not wired and falls back to Spark.
- [x] ceiling
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): registry alias for `Ceil`. Same support as `ceil`.
- [ ] conv
- [x] cos
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `UnaryMathExpression(math.cos, "COS")` unchanged across versions.
- [x] cosh
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `UnaryMathExpression(math.cosh, "COSH")` unchanged.
- [x] cot
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): custom `1 / math.tan(x)`. DataFusion's `cot` is also `1.0 / tan(x)`, so the result matches.
- [x] csc
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): custom `1 / math.sin(x)`. Routed to datafusion-spark's `SparkCsc` (registered in `jni_api.rs`).
- [x] degrees
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `UnaryMathExpression(math.toDegrees, "DEGREES")` unchanged across versions.
- [x] div
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `IntegralDivide(left, right, evalMode)`. Non-decimal operands are cast to `DecimalType(19, 0)`; result is recomputed per `IntegralDivide.resultDecimalType`, wrapped in `CheckOverflow`, then cast to `Long`. ANSI overflow for `Long.MinValue div -1` and decimal-overflow ANSI cases are covered by existing tests.
- [ ] e
- [x] exp
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `UnaryMathExpression(StrictMath.exp, "EXP")` unchanged. ULP-level differences vs DataFusion `exp` are possible but unflagged.
- [x] expm1
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `UnaryMathExpression(StrictMath.expm1, "EXPM1")` unchanged.
- [x] factorial
- 3.4.3 (audited 2026-05-15): identical to v3.5.8.
- 3.5.8 (audited 2026-05-15): canonical reference; `extends UnaryExpression with ImplicitCastInputTypes with NullIntolerant`. Returns NULL for NULL input or values outside `[0, 20]`.
- 4.0.1 (audited 2026-05-15): `NullIntolerant` trait replaced by `nullIntolerant: Boolean` method override; behavior unchanged.
- 4.1.1 (audited 2026-05-27): identical to 4.0.1.
- [x] floor
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): mirror of `ceil`. Two-arg `floor(expr, scale)` form (`RoundFloor`) falls back to Spark.
- [x] greatest
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): NULL-skipping variadic. Wired as `CometScalarFunction("greatest")` to DataFusion's `GreatestFunc`. Comet does not gate input types, so interval inputs and other Spark-only orderings rely on the native UDF accepting them; no explicit fallback path.
- [x] hex
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): accepts `LongType` / `BinaryType` / `StringType`. Spark 4.x widens `StringType` to `StringTypeWithCollation` and preserves collation in `dataType`; `CometHex` passes `expr.dataType` to native `SparkHex`, which always returns `Utf8` -- collation propagation may diverge on Spark 4.x.
- [ ] hypot
- [x] least
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): mirror of `greatest`; same caveats. Spark 4.1.1 adds `contextIndependentFoldable` (no Comet impact).
- [x] ln
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): registry alias for `Log`. Comet wires through `CometLog` to DataFusion `ln` with a `nullIfNegative` rewrite to match Spark's NULL behaviour for `x <= 0`.
- [x] log
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): one-arg `log(x)` -> `CometLog` (DataFusion `ln`); two-arg `log(base, x)` -> `CometLogarithm` (custom `spark_log` UDF, returns NULL when `base <= 0` or `x <= 0` to match `Logarithm.nullSafeEval`).
- [x] log10
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `UnaryLogExpression(StrictMath.log10, "LOG10")`; returns NULL for `x <= 0`. Possible ULP differences from `StrictMath`.
- [ ] log1p
- [x] log2
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `UnaryLogExpression(StrictMath.log(x) / StrictMath.log(2), "LOG2")`; returns NULL for `x <= 0`.
- [x] mod
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): registry alias for `Remainder`. Same support as `%`.
- [x] negative
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `UnaryMinus(child, failOnError)` -> Rust `NegativeExpr`. ANSI overflow is detected for `Int8/Int16/Int32/Int64` and `IntervalYearMonth/IntervalDayTime`. Float / Double / Decimal cannot overflow on negate. Spark 4.0 `NullIntolerant` -> `nullIntolerant: Boolean` refactor; no impact.
- [x] pi
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `LeafMathExpression(math.Pi, "PI")`; foldable, so Spark `ConstantFolding` rewrites it to a `Literal` before Comet sees the plan. The `CometScalarFunction("pi")` registration is exercised only when `ConstantFolding` is excluded.
- [ ] pmod
- [x] positive
- Spark 3.4.3, 3.5.8 (audited 2026-05-27): `UnaryPositive(child)` is a regular expression. There is no Comet serde for `UnaryPositive`, so projections containing `+col` silently disable Comet for the projection on 3.4/3.5.
- Spark 4.0.1, 4.1.1 (audited 2026-05-27): `UnaryPositive` is `RuntimeReplaceable` with `replacement = child`; the optimizer removes it before Comet sees the plan, so the gap is transparent on 4.x.
- [x] pow
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `Pow(left, right) extends BinaryMathExpression(StrictMath.pow, "POWER")`; routes to DataFusion `pow`. ULP-level differences possible.
- [x] power
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): registry alias for `Pow`. Same support as `pow`.
- [x] radians
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `UnaryMathExpression(math.toRadians, "RADIANS")` unchanged across versions.
- [x] rand
- See `misc_funcs / rand`.
- [x] randn
- See `misc_funcs / randn`.
- [ ] random
- [ ] randstr
- [x] rint
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `UnaryMathExpression(math.rint, "ROUND")` with `funcName = "rint"`. Passthrough to DataFusion `rint` (round-half-to-even).
- [x] round
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): HALF_UP rounding for integer / decimal. Float / Double child types always fall back to Spark because `BigDecimal`-via-`toString` rounding cannot be precisely matched (documented inline in `CometRound`). ANSI `failOnError` is propagated for integer overflow. `BRound` (HALF_EVEN) is not wired.
- [x] sec
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): custom `1 / math.cos(x)`. Routed to datafusion-spark's `SparkSec`.
- [x] shiftleft
- See `bitwise_funcs / <<` (audited in PR #4479). Same support as the operator alias added in 4.0.
- [x] sign
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): registry alias for `Signum`. Same support as `signum`.
- [x] signum
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `Signum(child)` over `DoubleType`. Spark also restricts to the two interval types via `inputTypes`; Comet handles only the `Double` case via DataFusion `signum`.
- [x] sin
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `UnaryMathExpression(math.sin, "SIN")` unchanged.
- [x] sinh
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `UnaryMathExpression(math.sinh, "SINH")` unchanged.
- [x] sqrt
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `UnaryMathExpression(math.sqrt, "SQRT")` unchanged.
- [x] tan
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `UnaryMathExpression(math.tan, "TAN")` unchanged.
- [x] tanh
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `UnaryMathExpression(math.tanh, "TANH")` unchanged.
- [x] try_add
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `TryAdd` is `RuntimeReplaceable` and rewrites to `Add(.., EvalMode.TRY)` for numeric inputs (datetime / interval go through `TryEval(Add(.., ANSI))` and fall back). Numeric path uses the Rust `checked_add` UDF, returning NULL on overflow. Decimal goes through `WideDecimalBinaryExpr` with `EvalMode.Try`.
- [x] try_divide
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `TryDivide` rewrites to `Divide(.., EvalMode.TRY)`. The `nullIfWhenPrimitive` wrapper swaps zero divisors to NULL; integer / float divide uses `checked_div`; decimal uses `decimal_div` + `CheckOverflow(failOnError = false)` returning NULL.
- [ ] try_mod
- [x] try_multiply
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): rewrites to `Multiply(.., EvalMode.TRY)`. Integer path uses `checked_mul`; decimal uses `WideDecimalBinaryExpr` with `EvalMode.Try`, returning NULL on overflow.
- [x] try_subtract
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): rewrites to `Subtract(.., EvalMode.TRY)`. Integer path uses `checked_sub`; decimal uses `WideDecimalBinaryExpr` as needed.
- [x] unhex
- Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 (audited 2026-05-27): `Unhex(child, failOnError)`. Spark 4.x widens input to `StringTypeWithCollation` and wraps the inner call in try/catch; Comet `CometUnhex` forwards `failOnError` to native `spark_unhex` but does not gate on collation.
- [ ] uniform
- [x] width_bucket
- Spark 3.5.8 (audited 2026-05-27): introduced; not available in 3.4.3.
- Spark 4.0.1, 4.1.1 (audited 2026-05-27): same semantics; `NullIntolerant` -> `nullIntolerant: Boolean` refactor.
- Known limitation: wired via per-version `CometExprShim` rather than a `CometExpressionSerde`, so it bypasses the support-level framework and the auto-generated compatibility doc (https://github.com/apache/datafusion-comet/issues/4485). Native path uses datafusion-spark `SparkWidthBucket`; interval input types are not exercised by Comet tests.

### misc_funcs

Expand Down