Skip to content

chore(audit): audit remaining array expressions across Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1#4483

Open
andygrove wants to merge 1 commit into
apache:mainfrom
andygrove:worktree-audit-array-funcs
Open

chore(audit): audit remaining array expressions across Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1#4483
andygrove wants to merge 1 commit into
apache:mainfrom
andygrove:worktree-audit-array-funcs

Conversation

@andygrove
Copy link
Copy Markdown
Member

Which issue does this PR close?

Closes #.

Rationale for this change

Continuation of the per-category expression audit. Same pattern as #4480 (predicate), #4479 (bitwise), #4478 (map), #4476 (hash), #4475 (conditional), #4474 (misc), #4473 (collection), #4470 (json), #4469 (struct), using the updated audit-comet-expression skill in #4468.

What changes are included in this PR?

Support-doc audit notes

Add per-version audit sub-bullets to all 19 not-yet-audited array expressions (array, array_append, array_compact, array_contains, array_distinct, array_except, array_join, array_max, array_min, array_position, array_remove, array_repeat, array_union, arrays_overlap, arrays_zip, element_at, flatten, get, sort_array). Add 4.1.1 audit lines to the already-audited array_insert and array_intersect.

Highlights from the cross-version review:

  • Spark 4.0 across the category does the NullIntolerant -> nullIntolerant: Boolean field refactor and (for elements participating in collation) widens StringType inputs to StringTypeWithCollation.
  • ArrayAppend becomes RuntimeReplaceable in 4.0 (rewrites to ArrayInsert(arr, -1, elem)); CometArrayAppend is unreachable in 4.0+.
  • ArrayCompact is always RuntimeReplaceable; Comet dispatches through CometArrayFilter.
  • SortArray 4.0 widens ascendingOrder from a Literal to any foldable boolean; CometSortArray still requires Literal.
  • ElementAt and GetArrayItem ANSI default flips to true in 4.0.

Support-level consistency fixes (in arrays.scala)

  • CometArrayExcept: Incompatible(None) -> Incompatible(Some(reason)) via a shared private val so the EXPLAIN message matches the compatibility-guide text.
  • CometArrayJoin: same fix.

Tracking issues filed for follow-up

Existing #3178 already documents the array_join null-handling gap and is referenced from the array_join sub-bullet. Other findings from the audit (e.g. lifting convert-time restrictions into getSupportLevel for CometArrayPosition, CometArrayRemove, CometElementAt, CometFlatten, CometSortArray.ascendingOrder-foldable; relabeling CometArrayCompact/CometArrayAppend dead registrations) are noted in the doc but are too invasive for this audit PR; they can ride in follow-ups.

Audit process

Audited using the audit-comet-expression skill (4 Spark versions per #4468), driven by 3 parallel agents covering creation/access, set/search, and aggregate/misc groups.

How are these changes tested?

  • ./mvnw test -Dsuites="org.apache.comet.CometArrayExpressionSuite" -Dtest=none (39 tests pass)
  • ./mvnw test -Dsuites="org.apache.comet.CometSqlFileTestSuite expressions/array/" -Dtest=none (34 tests pass)
  • make core succeeds.

….5.8, 4.0.1, 4.1.1

Add per-version audit sub-bullets to all 19 not-yet-audited array
expressions: `array`, `array_append`, `array_compact`,
`array_contains`, `array_distinct`, `array_except`, `array_join`,
`array_max`, `array_min`, `array_position`, `array_remove`,
`array_repeat`, `array_union`, `arrays_overlap`, `arrays_zip`,
`element_at`, `flatten`, `get`, `sort_array`. Also add the 4.1.1
audit line to the already-audited `array_insert` and `array_intersect`.

Highlights from the cross-version review:

- Spark 4.0 across the category does the `NullIntolerant` -> `nullIntolerant` field refactor and (for elements participating in collation) widens `StringType` inputs to `StringTypeWithCollation`.
- `ArrayAppend` becomes `RuntimeReplaceable` in 4.0 (rewrites to `ArrayInsert(arr, -1, elem)`); `CometArrayAppend` is unreachable in 4.0+.
- `ArrayCompact` is always `RuntimeReplaceable`; Comet dispatches through `CometArrayFilter`.
- `SortArray` 4.0 widens `ascendingOrder` from a `Literal` to any foldable boolean; `CometSortArray` still requires `Literal`.
- `ElementAt` and `GetArrayItem` ANSI default flips to true in 4.0.

Apply two support-level consistency fixes surfaced by the audit:

- `CometArrayExcept`: `Incompatible(None)` -> `Incompatible(Some(reason))` via a shared `private val` so the EXPLAIN message matches the compatibility-guide text.
- `CometArrayJoin`: same fix.

Tracking issues filed for the correctness gaps found:

- apache#4481 `array_distinct` / `array_union` / `array_except` do not canonicalize NaN / signed-zero like Spark's `SQLOpenHashSet`.
- apache#4482 `array_max` / `array_min` disagree with Spark on NaN ordering.

Existing apache#3178 already documents the `array_join` null-handling gap and is referenced from the `array_join` sub-bullet.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant