Describe the bug
CAST(<binary> AS STRING) in Comet runs through cast_binary_formatter in native/spark-expr/src/conversion_funcs/cast.rs, which uses unsafe { String::from_utf8_unchecked(value.to_vec()) } to convert the bytes into a String without validating them. When the input is not valid UTF-8 this is undefined behaviour in Rust (String has a documented invariant: its bytes must be valid UTF-8), and any downstream code that relies on that invariant (e.g. iterating over chars, slicing on char boundaries) can misbehave.
Spark's UTF8String.fromBytes does not validate either, but it stores the bytes in a non-String-typed buffer, so it does not violate any Java-level invariant. The Comet path is the dangerous case.
Surfaced by the cast audit (collection PR queue). Today's CometCast.isSupported((BinaryType, StringType), ...) returns Compatible(None), so this path runs by default for every binary column, including those whose contents are not strictly valid UTF-8.
Steps to reproduce
SELECT CAST(X'FF' AS STRING);
The result is byte-for-byte what Spark produces today (a one-byte string holding 0xFF), but the path through from_utf8_unchecked is UB and is therefore not guaranteed to keep producing that result under future Rust compiler / Arrow versions.
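For illustration, a minimal std-only sketch of how the checked conversions treat the reproduction input. The helper names `is_valid_utf8` and `checked_to_string` are hypothetical and not part of Comet; they only show that the safe APIs reject the byte that `from_utf8_unchecked` accepts silently.

```rust
/// Hypothetical helper (not part of Comet): reports whether `bytes` could be
/// stored in a `String` without breaking the UTF-8 invariant.
fn is_valid_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

/// Hypothetical helper: checked counterpart of `String::from_utf8_unchecked`.
/// On failure it returns the offset of the first byte that is not valid UTF-8.
fn checked_to_string(bytes: &[u8]) -> Result<String, usize> {
    String::from_utf8(bytes.to_vec()).map_err(|e| e.utf8_error().valid_up_to())
}
```

With these helpers, the single byte from `X'FF'` fails validation at offset 0, which is exactly the input the unchecked path turns into an invalid `String`.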
Expected behavior
Replace the from_utf8_unchecked call with a safe equivalent. Options:
- Reinterpret the underlying buffer of BinaryArray as a StringArray without copying (Arrow stores both with the same byte layout): construct the StringArray directly from the buffers without going through String::from_utf8_unchecked on a freshly allocated Vec<u8>.
- If a String is genuinely needed, use String::from_utf8_lossy(value).into_owned() and accept the U+FFFD replacement on invalid sequences (this diverges from Spark for invalid bytes but is safe).
Option 1 matches Spark exactly without UB.
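The lossy fallback (the second option) could look roughly like the sketch below. It uses only the standard library; `binary_to_string_lossy` and `needed_replacement` are hypothetical names chosen for this example, not existing Comet functions.

```rust
use std::borrow::Cow;

/// Hypothetical helper (not Comet's actual code): safe binary-to-string
/// conversion. Invalid sequences are replaced with U+FFFD, so the result is
/// always a valid `String`.
fn binary_to_string_lossy(value: &[u8]) -> String {
    String::from_utf8_lossy(value).into_owned()
}

/// Hypothetical helper: true when the input contained invalid UTF-8, i.e. when
/// the lossy result differs from the input bytes and from Spark's output.
/// `from_utf8_lossy` only allocates (`Cow::Owned`) when it had to replace bytes.
fn needed_replacement(value: &[u8]) -> bool {
    matches!(String::from_utf8_lossy(value), Cow::Owned(_))
}
```

For the reproduction input this yields a three-byte string (U+FFFD encoded as UTF-8) rather than the one-byte string Spark produces, which is the divergence noted for this option. Valid UTF-8 input passes through unchanged.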
Additional context
- Native impl: native/spark-expr/src/conversion_funcs/cast.rs::cast_binary_formatter (around line 865)
- Comet matrix: CometCast.canCastFromBinary and canCastToString in spark/src/main/scala/org/apache/comet/expressions/CometCast.scala.