Describe the bug
Spark documents that array_max and array_min treat NaN as greater than any non-NaN value for Float/Double element arrays (Spark uses SQLOrderingUtil.compareFloats/compareDoubles). DataFusion's array_max/array_min go through Arrow's partial_cmp-based kernels, which produce IEEE semantics where NaN comparisons are unordered.
For arrays containing NaN, the two implementations produce different results:
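array_max(array(double('NaN'), 1.0, 2.0)) returns NaN in Spark but may return 2.0 or NULL in Comet, depending on kernel behaviour.
array_min(array(double('NaN'), 1.0, 2.0)) returns 1.0 in both, but the Comet result is fragile: it relies on how the kernel happens to handle unordered NaN comparisons rather than on a defined NaN ordering.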
Surfaced by the array-expressions audit (collection PR queue). The single covering literal test in CometArrayExpressionSuite uses array(double('-Infinity'), 0.0, double('Infinity')) and does not contain a NaN, so the divergence is currently uncaught by CI.
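To illustrate why a partial_cmp-style comparison makes the result unpredictable, here is a minimal Rust sketch. It is not Arrow's or DataFusion's actual kernel; `compare_ieee` and `naive_max` are hypothetical helpers written only to show that IEEE comparisons against NaN are unordered, so a simple `>` scan returns a different answer depending on where the NaN sits in the array.

```rust
use std::cmp::Ordering;

/// IEEE comparison: returns None whenever either operand is NaN.
/// (Hypothetical helper for illustration, not an Arrow API.)
fn compare_ieee(a: f64, b: f64) -> Option<Ordering> {
    a.partial_cmp(&b)
}

/// Max via a plain `>` scan seeded with the first element.
/// Any comparison involving NaN evaluates to false, so a leading NaN
/// is never replaced and a later NaN is never picked up.
/// (Hypothetical helper for illustration, not the real kernel.)
fn naive_max(values: &[f64]) -> Option<f64> {
    let mut iter = values.iter().copied();
    let first = iter.next()?;
    Some(iter.fold(first, |acc, v| if v > acc { v } else { acc }))
}
```

With these definitions, `naive_max(&[f64::NAN, 1.0, 2.0])` yields NaN while `naive_max(&[1.0, f64::NAN, 2.0])` yields 2.0, which is the kind of position-dependent behaviour that Spark's total ordering avoids.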
Steps to reproduce
SELECT array_max(array(CAST('NaN' AS DOUBLE), 1.0, 2.0));
-- Spark: NaN
-- Comet: varies (likely 2.0 or NULL)
SELECT array_min(array(CAST('NaN' AS DOUBLE), 1.0, 2.0));
-- Spark: 1.0
-- Comet: varies
Expected behavior
Either implement Spark's NaN ordering on the Comet side or downgrade array_max / array_min to Incompatible(Some(...)) for FloatType / DoubleType element arrays so they only run with spark.comet.expression.ArrayMax.allowIncompatible=true (and the matching ArrayMin flag).
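For the first option, a Spark-compatible ordering on the native side could look roughly like the Rust sketch below. This is only an illustration under stated assumptions, not the proposed patch: the function names are hypothetical, the array is modelled as a slice of `Option<f64>` rather than an Arrow array, NaN is assumed to compare equal to NaN, and null elements are skipped as Spark's array_max / array_min documentation describes.

```rust
use std::cmp::Ordering;

/// Total ordering where NaN is greater than any non-NaN value.
/// Assumption: NaN compares equal to NaN. Hypothetical name.
fn spark_compare_f64(x: f64, y: f64) -> Ordering {
    match (x.is_nan(), y.is_nan()) {
        (true, true) => Ordering::Equal,
        (true, false) => Ordering::Greater,
        (false, true) => Ordering::Less,
        // Neither operand is NaN here, so partial_cmp always returns Some.
        (false, false) => x.partial_cmp(&y).unwrap_or(Ordering::Equal),
    }
}

/// Maximum of the non-null elements; None if there are none.
fn spark_array_max(values: &[Option<f64>]) -> Option<f64> {
    values
        .iter()
        .flatten() // skip null (None) elements
        .copied()
        .max_by(|a, b| spark_compare_f64(*a, *b))
}

/// Minimum of the non-null elements; None if there are none.
fn spark_array_min(values: &[Option<f64>]) -> Option<f64> {
    values
        .iter()
        .flatten() // skip null (None) elements
        .copied()
        .min_by(|a, b| spark_compare_f64(*a, *b))
}
```

Under this ordering the result no longer depends on element position: any array containing a NaN has NaN as its maximum, and its minimum is the smallest non-NaN element when one exists.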
Additional context
- Comet serdes: CometArrayMax, CometArrayMin in spark/src/main/scala/org/apache/comet/serde/arrays.scala.
- Spark reference: ArrayMax.evalInternal / ArrayMin.evalInternal in collectionOperations.scala; uses getInterpretedOrdering which routes through SQLOrderingUtil for floats and doubles.