[Bug] translate uses graphemes vs Spark code points and ignores U+0000 deletion #4463

## Describe the bug

`translate` is wired as `CometScalarFunction("translate")` and currently reports `Compatible`, but DataFusion's `translate` diverges from Spark in two ways:

1. **Grapheme vs code-point semantics.** DataFusion iterates over Unicode grapheme clusters; Spark iterates over code points (via `Character.charCount`). When every grapheme is a single code point, whether in the BMP or a supplementary plane, the two agree; for multi-code-point graphemes (combining marks, ZWJ sequences, flag emoji) the two implementations disagree.
2. **U+0000 in the `to` argument.** Spark's `StringTranslate.buildDict` treats U+0000 as a deletion marker: any `from` character whose counterpart in `to` is U+0000 is removed from the output. DataFusion substitutes a literal U+0000 instead.
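
The Spark-side semantics described above can be modeled in a few lines. This is a minimal Python sketch, not Spark's code: `build_dict` and `translate_code_points` are hypothetical names, Python's per-code-point string iteration stands in for the `Character.charCount` loop, and the "first mapping wins" handling of duplicate `from` characters is an assumption.

```python
def build_dict(from_str: str, to_str: str) -> dict:
    """Map each code point of from_str to the code point at the same index in to_str.

    A code point with no counterpart maps to U+0000, which doubles as the
    deletion marker. Assumption: the first mapping for a repeated key wins.
    """
    mapping = {}
    for i, src in enumerate(from_str):  # Python iterates str by code point
        dst = to_str[i] if i < len(to_str) else "\0"
        mapping.setdefault(src, dst)
    return mapping


def translate_code_points(s: str, from_str: str, to_str: str) -> str:
    """Translate s one code point at a time; a U+0000 target deletes the input."""
    mapping = build_dict(from_str, to_str)
    out = []
    for ch in s:
        dst = mapping.get(ch)
        if dst is None:
            out.append(ch)      # not in the dictionary: keep as-is
        elif dst != "\0":
            out.append(dst)     # ordinary substitution
        # dst == "\0": deletion, emit nothing
    return "".join(out)


print(repr(translate_code_points("abc", "b", "\0")))       # 'ac'
print(repr(translate_code_points("e\u0301", "e", "a")))    # 'á' (a + U+0301)
```

Under this model an explicit U+0000 in `to` and a `to` that is simply shorter than `from` behave identically, which is why substituting a literal U+0000 is observable as a difference.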

Surfaced by the string-expressions audit in apache/datafusion-comet#4461.

## Steps to reproduce

```sql
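-- (1) grapheme vs code point: 'e' followed by combining acute accent U+0301
-- (written as a string escape because char(n) only yields code points below 256)
SELECT translate(concat('e', '\u0301'), 'e', 'a');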

-- (2) U+0000 deletion: expected to delete 'b' under Spark
SELECT translate('abc', 'b', char(0));
```

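In (1), Spark translates code point by code point, so `e` becomes `a` and the combining mark U+0301 is kept; Comet treats `e` + U+0301 as a single grapheme that does not match `e`, so the input comes back unchanged. In (2), Spark deletes the matched `b` and returns `ac`; Comet substitutes a NUL character and returns `a`, U+0000, `c`.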
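
To make the grapheme-side results concrete, here is a simplified Python model of a grapheme-based `translate`. It is a sketch under stated assumptions, not DataFusion's implementation: `simple_graphemes` and `translate_graphemes` are hypothetical names, and clustering only attaches combining marks to the preceding base character, which is a small subset of the real extended grapheme cluster rules (no ZWJ or regional-indicator handling).

```python
import unicodedata


def simple_graphemes(s: str) -> list:
    """Approximate grapheme clusters: glue combining marks onto the previous base."""
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def translate_graphemes(s: str, from_str: str, to_str: str) -> str:
    """Translate s cluster by cluster; U+0000 is treated as an ordinary target."""
    src = simple_graphemes(from_str)
    dst = simple_graphemes(to_str)
    mapping = {}
    for i, g in enumerate(src):
        # None marks "no counterpart in to_str", i.e. delete.
        # Assumption: the first mapping for a repeated key wins.
        mapping.setdefault(g, dst[i] if i < len(dst) else None)
    out = []
    for g in simple_graphemes(s):
        if g not in mapping:
            out.append(g)               # unmapped cluster: keep as-is
        elif mapping[g] is not None:
            out.append(mapping[g])      # substitution, even if the target is U+0000
    return "".join(out)


# Query (1): the cluster 'e' + U+0301 is not equal to 'e', so nothing is replaced.
print(repr(translate_graphemes("e\u0301", "e", "a")))   # 'é' (e + U+0301, unchanged)
# Query (2): U+0000 is substituted rather than treated as a deletion marker.
print(repr(translate_graphemes("abc", "b", "\0")))      # 'a\x00c'
```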

## Expected behavior

Match Spark behavior, or downgrade `translate` to `Incompatible(Some(...))` so the non-ASCII path falls back unless explicitly enabled.

## Additional context

- Comet wiring: `QueryPlanSerde.scala` -> `classOf[StringTranslate] -> CometScalarFunction("translate")`
- Spark reference: `UTF8String.translate(dict)` with `StringTranslate.buildDict`
- DataFusion impl: `datafusion-functions::unicode::translate`

