Describe the bug
Spark 4.0 refactored StringDecode from a BinaryExpression to a RuntimeReplaceable whose replacement is StaticInvoke(StringDecode.decode, bin, charset, legacyCharsets, legacyErrorAction). The two new boolean arguments control malformed-character handling: with legacyErrorAction = true, Spark substitutes replacement characters for invalid UTF-8 sequences (matching the Spark 3.x behavior); with legacyErrorAction = false (the default), Spark raises QueryExecutionErrors.malformedCharacterCoding(...).
Comet's Spark 4.0 shim (spark/src/main/spark-4.0/org/apache/comet/shims/CometExprShim.scala) destructures the StaticInvoke arguments and discards both flags, then routes through CommonStringExprs.stringDecode which always lowers to Cast(bin, StringType, TRY). The Cast TRY path produces NULL on invalid UTF-8 in all cases. That means:
- Under Spark 4.0 default mode (
legacyErrorAction = false): Spark raises, Comet returns NULL.
- Under Spark 4.0 legacy mode (
legacyErrorAction = true): Spark substitutes replacement characters, Comet returns NULL.
- Under Spark 3.x: Spark substitutes replacement characters, Comet returns NULL.
Surfaced by the string-expressions audit in #4461.
Steps to reproduce
SET spark.sql.legacy.javaCharsets = true;
SELECT decode(X'FF', 'UTF-8');
Spark 3.x: returns ? (Unicode replacement).
Spark 4.0 (legacy mode): same as 3.x.
Spark 4.0 (default mode): raises MALFORMED_CHARACTER_CODING.
Comet: returns NULL in all three cases.
Expected behavior
Honor legacyCharsets / legacyErrorAction when running under Spark 4.0+. At minimum, the flags should be propagated through the proto so the native impl can choose between the substitute/throw/null modes.
Additional context
Describe the bug
Spark 4.0 refactored
StringDecodefrom aBinaryExpressionto aRuntimeReplaceablewhosereplacementisStaticInvoke(StringDecode.decode, bin, charset, legacyCharsets, legacyErrorAction). The two new boolean arguments control malformed-character handling: withlegacyErrorAction = true, Spark substitutes replacement characters for invalid UTF-8 sequences (matching the Spark 3.x behavior); withlegacyErrorAction = false(the default), Spark raisesQueryExecutionErrors.malformedCharacterCoding(...).Comet's Spark 4.0 shim (
spark/src/main/spark-4.0/org/apache/comet/shims/CometExprShim.scala) destructures theStaticInvokearguments and discards both flags, then routes throughCommonStringExprs.stringDecodewhich always lowers toCast(bin, StringType, TRY). The Cast TRY path produces NULL on invalid UTF-8 in all cases. That means:legacyErrorAction = false): Spark raises, Comet returns NULL.legacyErrorAction = true): Spark substitutes replacement characters, Comet returns NULL.Surfaced by the string-expressions audit in #4461.
Steps to reproduce
Spark 3.x: returns
?(Unicode replacement).Spark 4.0 (legacy mode): same as 3.x.
Spark 4.0 (default mode): raises
MALFORMED_CHARACTER_CODING.Comet: returns NULL in all three cases.
Expected behavior
Honor
legacyCharsets/legacyErrorActionwhen running under Spark 4.0+. At minimum, the flags should be propagated through the proto so the native impl can choose between the substitute/throw/null modes.Additional context
spark/src/main/spark-4.0/org/apache/comet/shims/CometExprShim.scala(andspark-4.1,spark-4.2)CommonStringExprs.stringDecodeinspark/src/main/scala/org/apache/comet/serde/strings.scala