chore(audit): audit cast across Spark 3.4.3, 3.5.8, 4.0.1, 4.1.1 #4493
Open
andygrove wants to merge 1 commit into
Add the per-version audit sub-bullet to `cast` in `docs/source/contributor-guide/spark_expressions_support.md`.

Cast is wired through `CometCast` (`spark/src/main/scala/org/apache/comet/expressions/CometCast.scala`) with a per-source-type support matrix and per-eval-mode (`LEGACY`/`ANSI`/`TRY`) handling. The native side implements explicit narrowing-numeric arms with Spark-shaped error messages and falls through to DataFusion `cast_with_options` for the rest.

Spark's `Cast.canCast` numeric-to-numeric matrix is unchanged across 3.4 / 3.5 / 4.0 / 4.1. Spark 4.0 adds `VariantType` and collation-aware `StringType`; Spark 4.1 adds `TimeType` and geospatial types with their own arms. None of those new arms are in `CometCast`, so they fall back to Spark.

Tracking issues filed for the gaps found:
- apache#4488 `CAST(binary AS string)` uses unsafe `from_utf8_unchecked`.
- apache#4489 Spark 4.0 collated-string casts are implicitly unhandled, no test.
- apache#4490 Spark 4.1 `TimeType` casts have no explicit `Unsupported` arm.
- apache#4491 `CAST(map AS map)` falls back even though native `cast_map_to_map` exists.
- apache#4492 `spark.sql.legacy.castComplexTypesToString.enabled` is not honoured.

Existing apache#1371 (`CAST(float|double AS decimal)` rounding) and apache#2190 (Spark 4.0 collation umbrella) are also referenced.

The audit was driven by 3 parallel agents covering high-level + numeric, string conversions, and datetime + complex.
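To illustrate the per-eval-mode handling mentioned in the commit message, here is a minimal Rust sketch of a narrowing `i64` to `i32` cast. The `EvalMode` enum, the `cast_long_to_int` function, and the error text are hypothetical names chosen for illustration and are not taken from the Comet source; the real native arms operate on Arrow arrays and emit Spark's own error messages.

```rust
/// Hypothetical mirror of the three cast evaluation modes (LEGACY / ANSI / TRY).
#[derive(Clone, Copy, Debug, PartialEq)]
enum EvalMode {
    Legacy,
    Ansi,
    Try,
}

/// Narrow an i64 to i32 under the given mode (illustrative only).
/// - Legacy: keep the low 32 bits, like a Java `(int)` cast.
/// - Ansi: return an error on overflow.
/// - Try: return `None` (SQL NULL) on overflow.
fn cast_long_to_int(value: i64, mode: EvalMode) -> Result<Option<i32>, String> {
    match i32::try_from(value) {
        // The value fits: every mode agrees on the result.
        Ok(v) => Ok(Some(v)),
        // The value overflows: the modes diverge.
        Err(_) => match mode {
            EvalMode::Legacy => Ok(Some(value as i32)),
            // Placeholder message, not Spark's actual error text.
            EvalMode::Ansi => Err(format!("overflow casting {} from BIGINT to INT", value)),
            EvalMode::Try => Ok(None),
        },
    }
}
```

Under these assumptions, an input one past `i32::MAX` wraps to `i32::MIN` in `Legacy`, produces an `Err` in `Ansi`, and yields `Ok(None)` in `Try`.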
Which issue does this PR close?
Closes #.
Rationale for this change
Completes the per-category expression audit started in earlier PRs (#4469 struct, #4470 json, #4473 collection, #4474 misc, #4475 conditional, #4476 hash, #4478 map, #4479 bitwise, #4480 predicate, #4483 array, #4486 math), using the updated `audit-comet-expression` skill in #4468. `cast` is the only expression in `conversion_funcs`; the other entries (`bigint`, `binary`, `boolean`, ...) are SQL-syntactic shorthands for `CAST(... AS X)` and resolve to `Cast`.

What changes are included in this PR?
Support-doc audit notes
Add a per-version audit sub-bullet to `cast` in `docs/source/contributor-guide/spark_expressions_support.md`. Highlights:
- `Cast(child, dataType, timeZoneId, evalMode)` signature is stable across 3.4 / 3.5 / 4.0 / 4.1; per-version `EvalMode` resolution lives in `CometExprShim`.
- Per-source-type support matrix in `CometCast.scala` that returns `Compatible` / `Incompatible(reason)` / `Unsupported(reason)` for each (from, to) pair.
- Spark's `canCast` matrix is unchanged across all four versions; Spark 4.0 adds `VariantType` and collated `StringType`; Spark 4.1 adds `TimeType` and geospatial types with their own arms. None of those new arms are in `CometCast`, so they fall back to Spark.

Support-level consistency fixes
None in this PR. The audit surfaced cosmetic inconsistencies (e.g. `canCastFromFloat/Double/Decimal` ignore `evalMode`; the static `getUnsupportedReasons` / `getIncompatibleReasons` don't enumerate the per-pair reasons returned by `isSupported`; one `case _: Incompatible()` with no reason in the array-of-binary branch), but these touch a load-bearing file and are deferred to follow-ups.

Tracking issues filed for follow-up
- #4488 `CAST(<binary> AS STRING)` uses unsafe `String::from_utf8_unchecked` (undefined behaviour for non-UTF8 inputs).
- #4489 Spark 4.0 collated-string casts are implicitly unhandled, no test.
- #4490 Spark 4.1 `TimeType` casts have no explicit `Unsupported` arm; they fall back implicitly but don't appear in the auto-generated compat doc.
- #4491 `CAST(<map> AS <map>)` falls back even though native `cast_map_to_map` exists.
- #4492 `spark.sql.legacy.castComplexTypesToString.enabled=true` is not honoured.

Existing #1371 (`CAST(float|double AS decimal)` rounding) and #2190 (Spark 4.0 collation umbrella) are also referenced from the doc sub-bullet.

Audit process
Audited using the `audit-comet-expression` skill (4 Spark versions per #4468), driven by 3 parallel agents covering high-level + numeric, string conversions, and datetime + complex types.

How are these changes tested?
- `make core` succeeds (no code changes; doc only).
- `CometCastSuite` and the cast-related SQL fixtures remain unchanged.
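
The `from_utf8_unchecked` concern raised in the tracking issues can be illustrated with the Rust standard library alone. The following is a minimal sketch, assuming the only goal is to avoid undefined behaviour on invalid bytes: either validate with `std::str::from_utf8`, or substitute replacement characters with `String::from_utf8_lossy`. The function names `binary_to_string_checked` and `binary_to_string_lossy` are hypothetical, and which behaviour (if either) matches Spark's `CAST(binary AS string)` is a question for the tracking issue, not something this sketch settles.

```rust
use std::borrow::Cow;

/// Validate the bytes before viewing them as a string.
/// Returns `None` for invalid UTF-8 instead of invoking undefined behaviour.
fn binary_to_string_checked(bytes: &[u8]) -> Option<&str> {
    std::str::from_utf8(bytes).ok()
}

/// Replace each invalid sequence with U+FFFD (the Unicode replacement character).
/// Borrows when the input is already valid UTF-8 and allocates only otherwise.
fn binary_to_string_lossy(bytes: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(bytes)
}
```

Both functions are safe Rust; the trade-off is one validation pass over the bytes, which `from_utf8_unchecked` skips.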