Describe the bug
Spark 4.0+ supports collated StringType (e.g. STRING COLLATE UTF8_LCASE). CometCast.isSupported matches both source and target string types via case (DataTypes.StringType, _) => ... and case (_, DataTypes.StringType) => ..., where DataTypes.StringType is the singleton default-collation StringType instance.
Scala pattern equality means a non-default-collation StringType instance should NOT match DataTypes.StringType, and the cast would fall through to the default unsupported(...) branch and fall back to Spark. This appears to be the intended (safe) behaviour, but it is implicit: there is no isStringCollationType guard like the other string-touching serdes use (arrays.scala::CometArrayIntersect, QueryPlanSerde::supportedScalarSortElementType). And there is no test that asserts the fallback for CAST(c AS STRING COLLATE UTF8_LCASE) and CAST(c AS STRING) when c is collated.
If the equality ever yields true (e.g. via a future refactor of StringType.equals), Comet would silently route collated-string casts through the byte-oriented native path, producing incorrect results for downstream collation-aware comparisons / aggregations / hashing.
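The matching behaviour described above can be sketched with a small self-contained model. This is not Comet or Spark code: ModelStringType, ModelTypes and CastModel are hypothetical stand-ins, and the collation ids are arbitrary. It only illustrates that a stable-identifier pattern is checked with ==, so the result depends entirely on how equals is defined.

```scala
// Hypothetical stand-in for Spark's StringType; not the real class.
final class ModelStringType(val collationId: Int) {
  // Equality takes the collation into account, as the issue assumes.
  override def equals(other: Any): Boolean = other match {
    case s: ModelStringType => s.collationId == collationId
    case _ => false
  }
  override def hashCode(): Int = collationId
}

object ModelTypes {
  // Plays the role of DataTypes.StringType: the default-collation singleton.
  val StringType: ModelStringType = new ModelStringType(0)
}

object CastModel {
  // ModelTypes.StringType is a stable-identifier pattern, so each case is
  // tested with ==. Only values equal to the default singleton match.
  def isSupported(from: Any, to: Any): Boolean = (from, to) match {
    case (ModelTypes.StringType, _) => true
    case (_, ModelTypes.StringType) => true
    case _ => false // models the unsupported(...) fallback
  }
}
```

In this model, isSupported returns false when both sides are new ModelStringType(1), which corresponds to the fallback. If equals were changed to ignore collationId, the same call would return true with no change to the match itself, which is the silent regression this issue is concerned about.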
Surfaced by the cast audit (collection PR queue). Tracked under the umbrella #2190 for Spark 4.0 collation support.
Expected behavior
Either:
- Add an explicit isStringCollationType guard in CometCast.isSupported so the fallback is declared rather than relying on Scala pattern semantics, OR
- Add tests asserting that CAST to/from a non-default StringType collation falls back to Spark and does not run native.
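A rough sketch of what an explicit guard could look like, again on a self-contained model rather than the real Comet types. ModelType, StringModel, GuardedCast, the Support hierarchy and the reason strings are all hypothetical, and the isStringCollationType here is a local stand-in whose signature is an assumption, not the existing helper. The point is that the collation check runs before any pattern match, so the fallback no longer depends on equality semantics.

```scala
// Hypothetical model types; the real check would operate on Spark's DataType.
sealed trait ModelType
case object IntModel extends ModelType
final case class StringModel(collationId: Int) extends ModelType

object GuardedCast {
  // Assumption for this model: id 0 stands for the default collation.
  val DefaultCollationId = 0

  sealed trait Support
  case object Compatible extends Support
  final case class Unsupported(reason: String) extends Support

  // Local stand-in for an isStringCollationType-style helper.
  def isStringCollationType(t: ModelType): Boolean = t match {
    case StringModel(id) => id != DefaultCollationId
    case _ => false
  }

  def isSupported(from: ModelType, to: ModelType): Support =
    if (isStringCollationType(from) || isStringCollationType(to)) {
      // Declared fallback: collated strings never reach the native path.
      Unsupported("non-default string collation, falling back to Spark")
    } else {
      (from, to) match {
        case (StringModel(_), _) | (_, StringModel(_)) => Compatible
        case _ => Unsupported("cast not covered by this model")
      }
    }
}
```

The checks one would want from the second option map onto this model directly: a cast to or from StringModel(1) should yield Unsupported, while a cast involving only StringModel(0) should stay Compatible.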
Additional context
- CometCast.scala, lines around the case (DataTypes.StringType, _) and case (_, DataTypes.StringType) matches.
- Related issue: [Spark 4.0] Add string collation support.