[Bug] decode ignores Spark 4.0 legacyCharsets and legacyErrorAction flags #4465

@andygrove

Describe the bug

Spark 4.0 refactored StringDecode from a BinaryExpression to a RuntimeReplaceable whose replacement is StaticInvoke(StringDecode.decode, bin, charset, legacyCharsets, legacyErrorAction). Of the two new boolean arguments, legacyCharsets controls which charset names are accepted, and legacyErrorAction controls malformed-character handling. With legacyErrorAction = true, Spark substitutes replacement characters for invalid UTF-8 sequences (matching the Spark 3.x behavior); with legacyErrorAction = false (the default), Spark raises QueryExecutionErrors.malformedCharacterCoding(...).

Comet's Spark 4.0 shim (spark/src/main/spark-4.0/org/apache/comet/shims/CometExprShim.scala) destructures the StaticInvoke arguments and discards both flags. It then routes through CommonStringExprs.stringDecode, which always lowers to Cast(bin, StringType, TRY). The Cast TRY path produces NULL on invalid UTF-8 regardless of the flags, so Comet diverges from Spark in every configuration:

  • Under Spark 4.0 default mode (legacyErrorAction = false): Spark raises, Comet returns NULL.
  • Under Spark 4.0 legacy mode (legacyErrorAction = true): Spark substitutes replacement characters, Comet returns NULL.
  • Under Spark 3.x: Spark substitutes replacement characters, Comet returns NULL.
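
The three outcomes listed above (substitute, raise, NULL) can be illustrated with the JDK's own charset decoder. This is a minimal sketch, not Spark's or Comet's code: the class name DecodeModes and both helper methods are hypothetical, and IllegalArgumentException merely stands in for Spark's MALFORMED_CHARACTER_CODING error.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class DecodeModes {
    // Decode bytes as UTF-8 with the given action for malformed input.
    // REPORT failures are rethrown unchecked, standing in for Spark's
    // MALFORMED_CHARACTER_CODING error (illustration only).
    static String decode(byte[] bytes, CodingErrorAction action) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("malformed input", e);
        }
    }

    // NULL on any decoding error: the result the issue reports for Comet.
    static String decodeOrNull(byte[] bytes) {
        try {
            return decode(bytes, CodingErrorAction.REPORT);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        byte[] invalid = {(byte) 0xFF}; // same input as decode(X'FF', 'UTF-8')
        // Substitute mode yields a single U+FFFD replacement character.
        System.out.println("\uFFFD".equals(decode(invalid, CodingErrorAction.REPLACE)));
        // Null mode yields null.
        System.out.println(decodeOrNull(invalid) == null);
        // Throw mode raises.
        try {
            decode(invalid, CodingErrorAction.REPORT);
        } catch (IllegalArgumentException e) {
            System.out.println("raised: " + e.getMessage());
        }
    }
}
```

The point of the sketch is that the three modes differ only in the error action applied to the same byte sequence, so a single extra parameter is enough to distinguish them.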

Surfaced by the string-expressions audit in #4461.

Steps to reproduce

-- Legacy mode only (sets legacyErrorAction); omit for Spark 4.0 default mode.
SET spark.sql.legacy.codingErrorAction = true;
SELECT decode(X'FF', 'UTF-8');

Spark 3.x: returns ? (Unicode replacement).
Spark 4.0 (legacy mode): same as 3.x.
Spark 4.0 (default mode): raises MALFORMED_CHARACTER_CODING.
Comet: returns NULL in all three cases.

Expected behavior

Honor legacyCharsets / legacyErrorAction when running under Spark 4.0+. At minimum, the flags should be propagated through the proto so the native implementation can choose between the substitute, throw, and null modes.
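
One way to picture the requested propagation is a small mode enum derived from the flag. This is only a sketch under stated assumptions: DecodeFlags, ErrorMode, fromLegacyErrorAction, and forSpark3 are hypothetical names, not part of Comet's proto or shim, and how legacyCharsets would be carried is left open.

```java
public class DecodeFlags {
    // Hypothetical modes named after the substitute/throw/null wording above.
    // NULL_ON_ERROR is what Comet does today; neither Spark setting maps to it.
    enum ErrorMode { SUBSTITUTE, THROW, NULL_ON_ERROR }

    // Hypothetical mapping from Spark 4.0's legacyErrorAction flag to a mode.
    static ErrorMode fromLegacyErrorAction(boolean legacyErrorAction) {
        return legacyErrorAction ? ErrorMode.SUBSTITUTE : ErrorMode.THROW;
    }

    // Spark 3.x has no flag; per the issue it always substitutes.
    static ErrorMode forSpark3() {
        return ErrorMode.SUBSTITUTE;
    }

    public static void main(String[] args) {
        System.out.println(fromLegacyErrorAction(true));  // SUBSTITUTE
        System.out.println(fromLegacyErrorAction(false)); // THROW
        System.out.println(forSpark3());                  // SUBSTITUTE
    }
}
```

Under this mapping, the Spark 3.x shim and the Spark 4.0 legacy path would share one mode, and only the Spark 4.0 default path would need the raising behavior.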
