Describe the bug
translate is wired as CometScalarFunction("translate") and currently reports Compatible, but DataFusion's translate diverges from Spark in two ways:
- Grapheme vs code-point semantics. DataFusion iterates over Unicode graphemes; Spark uses code points (via Character.charCount). For graphemes that consist of a single code point (BMP or supplementary) these match, but for multi-code-point graphemes (combining marks, ZWJ emoji sequences, regional-indicator flag emoji) the two implementations disagree.
- NUL byte in the to argument. Spark's StringTranslate.buildDict treats any character mapped to U+0000 in to as a deletion. DataFusion substitutes U+0000 instead.
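To make the two Spark-side rules above concrete, here is a minimal Python model of per-code-point translation with U+0000 treated as a deletion marker. It is a sketch of the behavior described in this issue, not Spark's code: the names build_dict and spark_like_translate are hypothetical, and the "first occurrence of a repeated character wins" rule is an assumption.

```python
def build_dict(matching: str, replace: str) -> dict:
    """Map each code point of `matching` to the code point at the same index in `replace`."""
    mapping = {}
    for i, ch in enumerate(matching):  # Python iterates str by code point
        # Positions past the end of `replace` map to U+0000, which means "delete".
        rep = replace[i] if i < len(replace) else "\u0000"
        # Assumption: the first occurrence of a repeated character wins.
        mapping.setdefault(ch, rep)
    return mapping


def spark_like_translate(src: str, matching: str, replace: str) -> str:
    """Translate `src` one code point at a time; a mapping to U+0000 deletes."""
    mapping = build_dict(matching, replace)
    out = []
    for ch in src:
        rep = mapping.get(ch)
        if rep is None:
            out.append(ch)        # unmapped code points pass through unchanged
        elif rep != "\u0000":
            out.append(rep)       # mapped code points are replaced
        # rep == U+0000: the code point is dropped
    return "".join(out)
```

Under this model, translating 'abc' with from 'b' and to U+0000 drops the 'b', and translating 'e' followed by U+0301 with from 'e' and to 'a' replaces only the base letter and keeps the combining mark.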
Surfaced by the string-expressions audit in #4461.
Steps to reproduce
-- (1) grapheme vs code point: combining mark (U+0301 COMBINING ACUTE ACCENT)
SELECT translate('e\u0301', 'e', 'a');
-- (2) U+0000 deletion: expected to delete 'b' under Spark
SELECT translate('abc', 'b', char(0));
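In the first query, the input is 'e' followed by U+0301. Spark translates per code point, so it replaces the 'e' and returns 'a' followed by U+0301. Comet's grapheme-based translation sees a single grapheme that does not match 'e', so it is expected to return the input unchanged. In the second query, Spark deletes the matched character and returns 'ac'; Comet substitutes a NUL character for 'b'.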
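The Comet-side behavior in both queries can be illustrated with a rough Python model of grapheme-based translation. This is not DataFusion's implementation: simple_clusters is a hypothetical helper that only attaches combining marks to the preceding base character (a simplification of full UAX #29 segmentation), and grapheme_like_translate is an illustrative name.

```python
import unicodedata


def simple_clusters(s: str) -> list:
    """Split `s` into base characters with any following combining marks attached.

    Simplification: real grapheme segmentation (UAX #29) also covers ZWJ
    sequences, regional indicators, and more.
    """
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def grapheme_like_translate(src: str, matching: str, replace: str) -> str:
    """Translate cluster by cluster; U+0000 gets no special treatment."""
    from_units = simple_clusters(matching)
    to_units = simple_clusters(replace)
    mapping = {}
    for i, unit in enumerate(from_units):
        # None marks "no counterpart in `replace`", which deletes the unit.
        mapping.setdefault(unit, to_units[i] if i < len(to_units) else None)
    out = []
    for unit in simple_clusters(src):
        if unit not in mapping:
            out.append(unit)           # unmatched clusters pass through
        elif mapping[unit] is not None:
            out.append(mapping[unit])  # U+0000 is substituted like any other value
    return "".join(out)
```

In this model the cluster 'e' plus U+0301 never matches the lone 'e' in from, so query (1) leaves the input unchanged, and query (2) yields 'a', U+0000, 'c' rather than 'ac'.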
Expected behavior
Match Spark behavior, or downgrade translate to Incompatible(Some(...)) so the non-ASCII path falls back unless explicitly enabled.
Additional context
- Comet wiring: QueryPlanSerde.scala -> classOf[StringTranslate] -> CometScalarFunction("translate")
- Spark reference: UTF8String.translate(dict) with StringTranslate.buildDict
- DataFusion impl: datafusion-functions::unicode::translate