Describe the bug
Spark 4.1.1 added a legacySplitTruncate flag to StringToMap (and several other split-based expressions) that is driven by spark.sql.legacy.truncateForEmptyRegexSplit. When the flag is enabled (legacy mode), CollationAwareUTF8String.splitSQL truncates trailing empty matches from the split result. Comet's native str_to_map does not honour this flag and always behaves as if it were false (the non-legacy default).
Surfaced by the map-expressions audit (collection PR queue).
Steps to reproduce
SET spark.sql.legacy.truncateForEmptyRegexSplit = true;
SELECT str_to_map('a:1,,b:2,', ',', ':');
Spark 4.1.1 with the legacy flag enabled would truncate the trailing empty ("","null") entry; the non-legacy default and Comet keep it.
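The difference can be modelled in plain Scala. This is a minimal sketch, not Spark's or Comet's implementation: `StrToMapSketch`, `entries`, and `truncateTrailingEmpty` are hypothetical names, and Java's `String.split` limit (`0` drops trailing empty strings, `-1` keeps them) is assumed to stand in for the legacy truncation. It returns a sequence of entries rather than a map, so map-key deduplication is not modelled.

```scala
// Illustrative model only: not Spark's or Comet's implementation.
object StrToMapSketch {
  // Split `text` into key/value entries.
  // Assumption: Java's split limit 0 (drops trailing empty strings) stands in
  // for the legacy truncation, and limit -1 (keeps them) for the default.
  def entries(
      text: String,
      pairDelim: String,
      keyValueDelim: String,
      truncateTrailingEmpty: Boolean): Seq[(String, Option[String])] = {
    val limit = if (truncateTrailingEmpty) 0 else -1
    text.split(pairDelim, limit).toSeq.map { pair =>
      // Split each pair into at most two parts: a key and an optional value.
      pair.split(keyValueDelim, 2) match {
        case Array(k, v) => (k, Some(v))
        case Array(k)    => (k, None) // no key/value delimiter: value is absent
      }
    }
  }
}
```

With `truncateTrailingEmpty = false` the input `'a:1,,b:2,'` yields four entries, the last being the trailing empty key with no value; with `truncateTrailingEmpty = true` that last entry is dropped and three remain.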
Expected behavior
Either propagate the flag through the proto so the native impl can branch on it, or downgrade CometStrToMap to Incompatible when spark.sql.legacy.truncateForEmptyRegexSplit=true on Spark 4.1.1+.
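The second option (downgrading) could look roughly like the sketch below. It is self-contained and does not use Comet's real serde types: `SupportLevel`, `Compatible`, `Incompatible`, and `strToMapSupportLevel` are hypothetical stand-ins defined here, and the Spark version and flag value are passed in as plain arguments rather than read from `SQLConf`.

```scala
// Hypothetical stand-ins for a support-level result; not Comet's actual types.
sealed trait SupportLevel
case object Compatible extends SupportLevel
final case class Incompatible(reason: String) extends SupportLevel

object StrToMapCompat {
  // First Spark version that has the legacySplitTruncate flag, per this issue.
  private val FlagIntroducedIn = (4, 1, 1)

  // Decide whether native str_to_map can be trusted for a given session.
  def strToMapSupportLevel(
      sparkVersion: (Int, Int, Int),
      legacyTruncateEnabled: Boolean): SupportLevel = {
    val hasFlag = Ordering[(Int, Int, Int)].gteq(sparkVersion, FlagIntroducedIn)
    if (hasFlag && legacyTruncateEnabled) {
      // The native implementation always behaves as if the flag were false.
      Incompatible(
        "spark.sql.legacy.truncateForEmptyRegexSplit=true is not honoured natively")
    } else {
      Compatible
    }
  }
}
```

Gating on the flag keeps the default (non-legacy) path native and only falls back when a user has explicitly opted in to the legacy behaviour. Propagating the flag through the proto would avoid the fallback, but it needs a matching change on the native side.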
Additional context
- Comet wiring: QueryPlanSerde.scala -> classOf[StringToMap] -> CometStrToMap (a one-liner over CometScalarFunction("str_to_map")).
- Spark 4.1.1 change, complexTypeCreator.scala:604-606:
    private lazy val legacySplitTruncate =
      SQLConf.get.getConf(SQLConf.LEGACY_TRUNCATE_FOR_EMPTY_REGEX_SPLIT)
- The same flag also affects StringSplit (already-tracked behaviour difference there) and other split-based ops in 4.1.1.