[Bug] CAST(BinaryType AS StringType) uses unsafe from_utf8_unchecked (undefined behaviour)

## Describe the bug

`CAST(<binary> AS STRING)` in Comet runs through `cast_binary_formatter` in `native/spark-expr/src/conversion_funcs/cast.rs`, which uses `unsafe { String::from_utf8_unchecked(value.to_vec()) }` to convert non-UTF8 bytes into a `String`. This is undefined behaviour in Rust (`String` is a documented invariant: its bytes must be valid UTF-8), and any downstream code that relies on that invariant (e.g. iterating over chars, slicing on char boundaries) can misbehave for inputs that are not valid UTF-8.

Spark's `UTF8String.fromBytes` does not validate either, but it stores the bytes in a non-`String`-typed buffer, so it does not violate any Java-level invariant. The Comet path is the dangerous case.

Surfaced by the cast audit (collection PR queue). Today's `CometCast.isSupported((BinaryType, StringType), ...)` returns `Compatible(None)` so this path runs by default for any binary column whose contents are not strictly valid UTF-8.

## Steps to reproduce

```sql
SELECT CAST(X'FF' AS STRING);
```

The result is byte-for-byte what Spark produces today (a one-byte string holding `0xFF`), but the path through `from_utf8_unchecked` is UB and is therefore not guaranteed to keep producing that result under future Rust compiler / Arrow versions.

## Expected behavior

Replace the `from_utf8_unchecked` call with a safe equivalent. Options:

1. Reinterpret the underlying buffer of `BinaryArray` as a `StringArray` without copying (Arrow stores both as the same byte layout): construct the `StringArray` directly from the buffers without going through `String::from_utf8_unchecked` on a freshly allocated `Vec<u8>`.
2. If a `String` is genuinely needed, use `String::from_utf8_lossy(value).into_owned()` and accept the U+FFFD replacement on invalid sequences (this diverges from Spark for invalid bytes but is safe).

Option 1 matches Spark exactly without UB.

## Additional context

- Native impl: `native/spark-expr/src/conversion_funcs/cast.rs::cast_binary_formatter` (around line 865)
- Comet matrix: `CometCast.canCastFromBinary` and `canCastToString` in `spark/src/main/scala/org/apache/comet/expressions/CometCast.scala`.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug] CAST(BinaryType AS StringType) uses unsafe from_utf8_unchecked (undefined behaviour) #4488

Describe the bug

Steps to reproduce

Expected behavior

Additional context

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[Bug] CAST(BinaryType AS StringType) uses unsafe from_utf8_unchecked (undefined behaviour) #4488

Description

Describe the bug

Steps to reproduce

Expected behavior

Additional context

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions