chore(audit): audit hash expressions across Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 by andygrove · Pull Request #4476 · apache/datafusion-comet

andygrove · 2026-05-27T23:23:27Z

Which issue does this PR close?

Closes #.

Rationale for this change

Continuation of the per-category expression audit. Same pattern as #4475 (conditional), #4474 (misc), #4473 (collection), #4470 (json), #4469 (struct), using the updated audit-comet-expression skill in #4468.

What changes are included in this PR?

Support-doc audit notes

Add per-version audit sub-bullets to crc32, hash, md5, sha, sha1, sha2, and xxhash64. sha is a registry alias of Sha1. Spark 4.0 only adds the DefaultStringProducingExpression trait and the nullIntolerant: Boolean field refactor on the four String-producing expressions (Md5, Sha1, Sha2, Crc32); no runtime behaviour change across the category.

Support-level consistency fixes (in `hash.scala`)

- Refactor `HashUtils` to return reasons (`unsupportedReasonFor`, `supportLevelForChildren`, `unsupportedReasons`) instead of calling `withInfo` from inside the helper. The recursive type check no longer has side effects on the expression tree at type-check time; the audit skill calls out such side effects as the canonical antipattern.
- `CometXxHash64`, `CometMurmur3Hash`, `CometSha1`, `CometSha2`: override `getSupportLevel` and `getUnsupportedReasons` so that the unsupported-child-type restriction and (for `Sha2`) the non-foldable-`numBits` restriction reach both the dispatcher's EXPLAIN message and the compatibility doc generator.
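The shape of this refactor can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in `hash.scala`: the helper names `unsupportedReasonFor`, `supportLevelForChildren`, and `unsupportedReasons` come from this PR, but their signatures, the `DType` hierarchy, the `SupportLevel` types, and the reason strings are all assumptions made for the example.

```scala
object HashSupportSketch {
  // Hypothetical stand-ins for Spark data types (not Spark's DataType hierarchy).
  sealed trait DType
  case object IntT extends DType
  case object StringT extends DType
  case object TimeT extends DType
  final case class DecimalT(precision: Int) extends DType
  final case class ArrayT(element: DType) extends DType
  final case class StructT(fields: List[DType]) extends DType

  // Hypothetical, simplified support levels.
  sealed trait SupportLevel
  case object Compatible extends SupportLevel
  final case class Unsupported(reason: String) extends SupportLevel

  // Static list of restrictions, e.g. for a doc generator (assumed shape).
  val unsupportedReasons: Seq[String] = Seq(
    "DecimalType with precision > 18 is not supported",
    "TimeType is not supported")

  // Pure recursive check: returns a reason instead of tagging the expression.
  def unsupportedReasonFor(dt: DType): Option[String] = dt match {
    case DecimalT(p) if p > 18 => Some(unsupportedReasons(0))
    case TimeT                 => Some(unsupportedReasons(1))
    case ArrayT(element)       => unsupportedReasonFor(element)
    case StructT(fields)       => fields.flatMap(f => unsupportedReasonFor(f)).headOption
    case _                     => None
  }

  // The caller decides what to do with the reason (EXPLAIN output, docs, ...).
  def supportLevelForChildren(children: Seq[DType]): SupportLevel =
    children.flatMap(c => unsupportedReasonFor(c)).headOption match {
      case Some(reason) => Unsupported(reason)
      case None         => Compatible
    }
}
```

Because the check only returns values, the same result can feed both an EXPLAIN message and generated documentation without the type check touching the expression tree.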

Tracking issues filed for follow-up

None. The TimeType gap (Spark 4.0+) is covered by the existing #4418 EPIC; the DecimalType-precision-18 gap is a documented semantic difference (Spark hashes via Java BigDecimal), already declared by the new HashUtils.unsupportedReasons.
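For context on the precision-18 boundary, the sketch below shows which input is fed to the hash function on each side of it. It assumes the split matches Spark's interpreted hash path for decimals (unscaled `Long` up to 18 digits, `BigDecimal` unscaled bytes above that). `DecimalHashInputSketch`, `HashInput`, and `hashInput` are hypothetical names for illustration only, and the hash function itself is left out.

```scala
import java.math.{BigDecimal => JBigDecimal}

object DecimalHashInputSketch {
  // 18 is the largest precision whose unscaled value always fits in a Long.
  val MaxLongDigits = 18

  // Hypothetical description of what gets hashed.
  sealed trait HashInput
  final case class AsLong(unscaled: Long) extends HashInput
  final case class AsBytes(bytes: Vector[Byte]) extends HashInput

  // Chooses the hash input from the declared precision of the column type,
  // not from the magnitude of the individual value.
  def hashInput(value: JBigDecimal, precision: Int): HashInput =
    if (precision <= MaxLongDigits) AsLong(value.unscaledValue().longValueExact())
    else AsBytes(value.unscaledValue().toByteArray.toVector)
}
```

The same numeric value can therefore hash differently depending on the declared precision, which is why the wide-decimal case is reported as a restriction rather than handled silently.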

Audit process

Audited directly using the audit-comet-expression skill (4 Spark versions per #4468). Four serde objects plus the shared HashUtils helper.

How are these changes tested?

- `./mvnw test -Dsuites="org.apache.comet.CometHashExpressionSuite" -Dtest=none` (37 tests pass)
- `make core` succeeds with the serde refactor.

…, 4.1.1

Add per-version audit sub-bullets to `crc32`, `hash`, `md5`, `sha`, `sha1`, `sha2`, and `xxhash64` in `docs/source/contributor-guide/spark_expressions_support.md`. `sha` is a registry alias of `Sha1`. Spark 4.0 only adds the `DefaultStringProducingExpression` trait and the `nullIntolerant` field refactor across this category; no runtime behaviour change.

Apply support-level consistency fixes surfaced by the audit:

- Refactor `HashUtils` to return reasons (`unsupportedReasonFor`, `supportLevelForChildren`, `unsupportedReasons`) instead of calling `withInfo` from inside the helper. The recursive type check no longer side-effects on the expression tree at type-check time.
- `CometXxHash64`, `CometMurmur3Hash`, `CometSha1`, `CometSha2`: override `getSupportLevel` and `getUnsupportedReasons` so the unsupported-child-type and (for Sha2) the non-foldable-numBits restrictions reach the dispatcher and the compatibility doc.

No correctness divergences were found, so no new tracking issues are filed. The known `TimeType` gap (Spark 4.0+) is covered by the existing apache#4418 EPIC; the `DecimalType`-precision-18 gap is a documented Spark semantic difference (BigDecimal hashing).

andygrove wants to merge 1 commit into apache:main from andygrove:worktree-audit-hash-funcs
