chore(audit): audit hash expressions across Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1#4476
Open
andygrove wants to merge 1 commit into
Open
chore(audit): audit hash expressions across Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1#4476andygrove wants to merge 1 commit into
andygrove wants to merge 1 commit into
Conversation
…, 4.1.1 Add per-version audit sub-bullets to `crc32`, `hash`, `md5`, `sha`, `sha1`, `sha2`, and `xxhash64` in `docs/source/contributor-guide/spark_expressions_support.md`. `sha` is a registry alias of `Sha1`. Spark 4.0 only adds the `DefaultStringProducingExpression` trait and the `nullIntolerant` field refactor across this category; no runtime behaviour change. Apply support-level consistency fixes surfaced by the audit: - Refactor `HashUtils` to return reasons (`unsupportedReasonFor`, `supportLevelForChildren`, `unsupportedReasons`) instead of calling `withInfo` from inside the helper. The recursive type check no longer side-effects on the expression tree at type-check time. - `CometXxHash64`, `CometMurmur3Hash`, `CometSha1`, `CometSha2`: override `getSupportLevel` and `getUnsupportedReasons` so the unsupported-child-type and (for Sha2) the non-foldable-numBits restrictions reach the dispatcher and the compatibility doc. No correctness divergences were found, so no new tracking issues are filed. The known `TimeType` gap (Spark 4.0+) is covered by the existing apache#4418 EPIC; the `DecimalType`-precision-18 gap is a documented Spark semantic difference (BigDecimal hashing).
This was referenced May 27, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
Closes #.
Rationale for this change
Continuation of the per-category expression audit. Same pattern as #4475 (conditional), #4474 (misc), #4473 (collection), #4470 (json), #4469 (struct), using the updated
audit-comet-expressionskill in #4468.What changes are included in this PR?
Support-doc audit notes
Add per-version audit sub-bullets to
crc32,hash,md5,sha,sha1,sha2, andxxhash64.shais a registry alias ofSha1. Spark 4.0 only adds theDefaultStringProducingExpressiontrait and thenullIntolerant: Booleanfield refactor on the fourString-producing expressions (Md5,Sha1,Sha2,Crc32); no runtime behaviour change across the category.Support-level consistency fixes (in
hash.scala)HashUtilsto return reasons (unsupportedReasonFor,supportLevelForChildren,unsupportedReasons) instead of callingwithInfofrom inside the helper. The recursive type check no longer side-effects on the expression tree at type-check time, which the audit skill calls out as the canonical antipattern.CometXxHash64,CometMurmur3Hash,CometSha1,CometSha2: overridegetSupportLevelandgetUnsupportedReasonsso the unsupported-child-type and (forSha2) the non-foldable-numBitsrestrictions reach both the dispatcher's EXPLAIN message and the compatibility doc generator.Tracking issues filed for follow-up
None. The
TimeTypegap (Spark 4.0+) is covered by the existing #4418 EPIC; theDecimalType-precision-18 gap is a documented semantic difference (Spark hashes via JavaBigDecimal), already declared by the newHashUtils.unsupportedReasons.Audit process
Audited directly using the
audit-comet-expressionskill (4 Spark versions per #4468). Four serde objects plus the sharedHashUtilshelper.How are these changes tested?
./mvnw test -Dsuites="org.apache.comet.CometHashExpressionSuite" -Dtest=none(37 tests pass)make coresucceeds with the serde refactor.