[Bug] translate uses graphemes vs Spark code points and ignores U+0000 deletion #4463

@andygrove

Describe the bug

translate is wired as CometScalarFunction("translate") and currently reports Compatible, but DataFusion's translate diverges from Spark in two ways:

  1. Grapheme vs code-point semantics. DataFusion iterates over Unicode grapheme clusters; Spark iterates over code points (via Character.charCount). When every grapheme in the inputs is a single code point (BMP or supplementary), the two agree, but for multi-code-point graphemes (a base character plus combining marks, ZWJ emoji sequences, regional-indicator flag emoji) the two implementations disagree.
  2. U+0000 (NUL) in the to argument. Spark's StringTranslate.buildDict treats any from character whose counterpart in to is U+0000 as a deletion. DataFusion emits a literal U+0000 instead.
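The Spark-side semantics described above can be modeled in a few lines. This is a minimal Python sketch for illustration only, not Spark's actual code: the function name `spark_like_translate` is hypothetical, and it assumes that a `from` character with no counterpart in `to` is handled the same way as one mapped to U+0000 (both are dropped), and that the first mapping for a repeated `from` character wins.

```python
def spark_like_translate(src: str, matching: str, replace: str) -> str:
    """Illustrative model of code-point translate with U+0000 as 'delete'."""
    nul = "\u0000"
    mapping = {}
    # Python strings iterate by code point, which mirrors the per-code-point walk.
    for i, ch in enumerate(matching):
        rep = replace[i] if i < len(replace) else nul
        # Assumption: the first mapping for a repeated character is kept.
        mapping.setdefault(ch, rep)

    out = []
    for ch in src:
        rep = mapping.get(ch)
        if rep is None:
            out.append(ch)       # unmapped: copy through
        elif rep != nul:
            out.append(rep)      # mapped to a real character: substitute
        # mapped to U+0000: emit nothing, i.e. delete
    return "".join(out)
```

Under this model, `spark_like_translate("abc", "b", "\u0000")` drops the `b`, and translating `"e\u0301"` with `'e'` to `'a'` rewrites only the base letter and leaves the combining mark in place.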

Surfaced by the string-expressions audit in #4461.

Steps to reproduce

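-- (1) grapheme vs code point: 'e' followed by combining acute accent U+0301
-- (written as a \u escape because Spark's char(n) wraps modulo 256 and cannot produce U+0301)
SELECT translate('e\u0301', 'e', 'a');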

-- (2) U+0000 deletion: expected to delete 'b' under Spark
SELECT translate('abc', 'b', char(0));

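In the first query, Spark translates per code point, so it rewrites the base letter and keeps the combining mark ('a' followed by U+0301). Comet's grapheme-based translate treats 'e' plus U+0301 as a single grapheme that does not match 'e', so the input comes back unchanged. In the second query, Spark deletes the matched character and returns 'ac'; Comet substitutes a NUL character in its place.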
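For contrast, the grapheme-based behavior reported here can be approximated as follows. This is a rough Python sketch under stated assumptions, not DataFusion's implementation: `approx_graphemes` and `grapheme_like_translate` are hypothetical names, and the segmentation only attaches combining marks to the preceding character, which is far simpler than real grapheme-cluster segmentation (it does not handle ZWJ sequences or flag emoji).

```python
import unicodedata


def approx_graphemes(s: str) -> list[str]:
    """Crude segmentation: attach combining marks to the preceding character."""
    clusters: list[str] = []
    for ch in s:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def grapheme_like_translate(src: str, frm: str, to: str) -> str:
    """Illustrative model: match whole clusters, no special case for U+0000."""
    to_units = approx_graphemes(to)
    index = {}
    for i, unit in enumerate(approx_graphemes(frm)):
        index.setdefault(unit, i)  # assumption: first occurrence wins

    out = []
    for unit in approx_graphemes(src):
        i = index.get(unit)
        if i is None:
            out.append(unit)         # no match for this cluster: copy through
        elif i < len(to_units):
            out.append(to_units[i])  # substitute, even if the target is U+0000
        # no counterpart in `to`: delete
    return "".join(out)
```

With the two reproduction inputs, this model leaves `"e\u0301"` unchanged (the cluster never equals the bare `'e'`) and turns `"abc"` into `'a'`, U+0000, `'c'`, which matches the Comet behavior described in this issue.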

Expected behavior

Match Spark behavior, or downgrade translate to Incompatible(Some(...)) so the non-ASCII path falls back unless explicitly enabled.

Additional context

  • Comet wiring: QueryPlanSerde.scala -> classOf[StringTranslate] -> CometScalarFunction("translate")
  • Spark reference: UTF8String.translate(dict) with StringTranslate.buildDict
  • DataFusion impl: datafusion-functions::unicode::translate
