
[Bug] array_distinct / array_union / array_except do not canonicalize NaN like Spark #4481

@andygrove

Describe the bug

Spark's array_distinct, array_union, and array_except use SQLOpenHashSet.withNaNCheckFunc to canonicalize NaN equality: every NaN bit pattern collapses to a single bucket, and +0.0 / -0.0 are treated as equal. DataFusion's array_distinct, array_union, and array_except use hash-based set logic where NaN is not equal to NaN (IEEE semantics) and +0.0 and -0.0 are kept as distinct elements.

So for Float/Double element arrays containing NaN or signed zero, the results diverge silently:

  • array_distinct(array(double('NaN'), double('NaN'))) returns one element in Spark, two in Comet.
  • array_union(array(0.0D), array(-0.0D)) returns one element in Spark, two in Comet. (The D suffix matters: a bare 0.0 literal is a decimal in Spark SQL, not a double.)
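
To make the mismatch concrete, here is a minimal Rust sketch of an order-preserving distinct parameterised by an equality predicate. It is illustrative only: `distinct_by`, `ieee_eq`, and `bitwise_eq` are hypothetical names, and the sketch does not claim to reproduce DataFusion's actual implementation. It shows that neither plain IEEE `==` nor raw bit-pattern comparison, on its own, yields the Spark results listed above.

```rust
/// Naive order-preserving distinct using a caller-supplied equality predicate.
/// (Hypothetical helper for illustration; not part of Comet or DataFusion.)
fn distinct_by(values: &[f64], eq: fn(f64, f64) -> bool) -> Vec<f64> {
    let mut out: Vec<f64> = Vec::new();
    for &v in values {
        // Keep v only if no previously kept element compares equal to it.
        if !out.iter().any(|&seen| eq(seen, v)) {
            out.push(v);
        }
    }
    out
}

/// IEEE 754 equality: NaN != NaN, but +0.0 == -0.0.
fn ieee_eq(a: f64, b: f64) -> bool {
    a == b
}

/// Raw bit-pattern equality: identical NaN payloads match,
/// but +0.0 and -0.0 differ in the sign bit.
fn bitwise_eq(a: f64, b: f64) -> bool {
    a.to_bits() == b.to_bits()
}
```

With `ieee_eq`, two NaNs survive as two elements while the signed zeros merge; with `bitwise_eq`, identical NaNs merge but the signed zeros (and NaNs with different payloads) stay separate. Matching the Spark behavior described here needs an explicit canonicalization step instead.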

Surfaced by the array-expressions audit (collection PR queue).

Steps to reproduce

SELECT array_distinct(array(CAST('NaN' AS DOUBLE), CAST('NaN' AS DOUBLE)));
-- Spark: [NaN]      Comet: [NaN, NaN]

-- Use the D suffix: bare 0.0 / -0.0 literals are parsed as DECIMAL in Spark SQL.
SELECT array_union(array(0.0D), array(-0.0D));
-- Spark: [0.0]      Comet: [0.0, -0.0]

Expected behavior

Either canonicalize NaN/signed-zero on the Comet side before invoking the DataFusion set operations, or downgrade array_distinct / array_union / array_except to Incompatible(Some(...)) whenever the element type is FloatType or DoubleType. (array_intersect already has a related Incompatible flag for ordering; the NaN gap is separate.)
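
The first option could look roughly like the following Rust sketch, which maps each double to a canonical key before set insertion. This is an assumption-laden illustration, not Comet's or DataFusion's API: `canonical_bits` and `distinct_canonical` are hypothetical names, and the choice to emit the first-seen original value (rather than the canonical one) is also an assumption.

```rust
use std::collections::HashSet;

/// Map an f64 to a canonical bit pattern so that every NaN shares one key
/// and -0.0 shares a key with +0.0. (Hypothetical helper, not a Comet API.)
fn canonical_bits(x: f64) -> u64 {
    if x.is_nan() {
        // Collapse all NaN payloads and signs to a single representative.
        f64::NAN.to_bits()
    } else if x == 0.0 {
        // IEEE equality is true for both +0.0 and -0.0; use the +0.0 bits.
        0.0f64.to_bits()
    } else {
        x.to_bits()
    }
}

/// Order-preserving distinct keyed on the canonical bit pattern.
/// Assumption: the first-seen original value is the one that is kept.
fn distinct_canonical(values: &[f64]) -> Vec<f64> {
    let mut seen: HashSet<u64> = HashSet::new();
    let mut out: Vec<f64> = Vec::new();
    for &v in values {
        // HashSet::insert returns true only when the key was not present yet.
        if seen.insert(canonical_bits(v)) {
            out.push(v);
        }
    }
    out
}
```

Under these assumptions, `distinct_canonical(&[f64::NAN, f64::NAN])` and `distinct_canonical(&[0.0, -0.0])` each keep a single element, which matches the Spark results in the reproduction steps. The same key function could be reused for union and except by canonicalizing both inputs before the set operation.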

Additional context

  • Comet serdes: CometScalarFunction("array_distinct") (registered in QueryPlanSerde.scala), CometArrayUnion and CometArrayExcept in spark/src/main/scala/org/apache/comet/serde/arrays.scala.
  • Spark reference: SQLOpenHashSet.withNaNCheckFunc and ArraySetLike in collectionOperations.scala.
  • Related: array_intersect carries an ordering Incompatible flag but does not document the NaN gap either.
