[Bug] CAST(BinaryType AS StringType) uses unsafe from_utf8_unchecked (undefined behaviour) #4488

@andygrove

Describe the bug

CAST(<binary> AS STRING) in Comet runs through cast_binary_formatter in native/spark-expr/src/conversion_funcs/cast.rs, which uses unsafe { String::from_utf8_unchecked(value.to_vec()) } to convert the bytes into a String without validating them. For input that is not valid UTF-8 this is undefined behaviour in Rust (String has a documented invariant: its bytes must be valid UTF-8), and any downstream code that relies on that invariant (e.g. iterating over chars, slicing on char boundaries) can misbehave.

Spark's UTF8String.fromBytes does not validate either, but it stores the bytes in a non-String-typed buffer, so it does not violate any Java-level invariant. The Comet path is the dangerous case.

Surfaced by the cast audit (collection PR queue). Today's CometCast.isSupported((BinaryType, StringType), ...) returns Compatible(None), so this path runs by default for every binary column, including columns whose contents are not valid UTF-8.

Steps to reproduce

SELECT CAST(X'FF' AS STRING);

The result is byte-for-byte what Spark produces today (a one-byte string holding 0xFF), but the path through from_utf8_unchecked is UB and is therefore not guaranteed to keep producing that result under future Rust compiler / Arrow versions.
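To confirm that the single byte in the repro really falls outside what a Rust String may hold, the checked std API can be used. This is a minimal sketch using only the standard library; first_invalid_offset is a hypothetical helper written for illustration and is not part of Comet.

```rust
/// Hypothetical helper (not part of Comet): returns the offset of the first
/// byte that is not part of a valid UTF-8 sequence, or None if `bytes` is
/// entirely valid UTF-8 and could safely back a `String`.
fn first_invalid_offset(bytes: &[u8]) -> Option<usize> {
    // std::str::from_utf8 validates the input instead of assuming it is valid.
    // On failure, Utf8Error::valid_up_to() reports how many leading bytes
    // were valid, which is the offset of the first offending byte.
    std::str::from_utf8(bytes).err().map(|e| e.valid_up_to())
}
```

For the bytes of X'FF' this returns Some(0), meaning the very first byte is rejected, while plain ASCII input returns None.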

Expected behavior

Replace the from_utf8_unchecked call with a safe equivalent. Options:

  1. Reinterpret the underlying buffer of BinaryArray as a StringArray without copying (Arrow stores both as the same byte layout): construct the StringArray directly from the buffers without going through String::from_utf8_unchecked on a freshly allocated Vec<u8>.
  2. If a String is genuinely needed, use String::from_utf8_lossy(value).into_owned() and accept the U+FFFD replacement on invalid sequences (this diverges from Spark for invalid bytes but is safe).

Option 1 matches Spark exactly without UB.
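For comparison, option 2 could look roughly like the sketch below. It uses only the standard library; binary_to_string_lossy is a hypothetical name, and the real cast_binary_formatter signature may differ.

```rust
/// Hypothetical stand-in for the conversion step (not the actual Comet code):
/// converts arbitrary bytes to a String without any `unsafe`.
fn binary_to_string_lossy(value: &[u8]) -> String {
    // from_utf8_lossy borrows the input when it is already valid UTF-8 and
    // only allocates when it has to substitute U+FFFD for invalid sequences.
    // into_owned() then yields an owned String in both cases.
    String::from_utf8_lossy(value).into_owned()
}
```

With this version the input X'FF' becomes the three-byte encoding of U+FFFD (EF BF BD) rather than the single byte 0xFF, which is the divergence from Spark mentioned in option 2. Valid UTF-8 input passes through unchanged.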

Additional context

  • Native impl: native/spark-expr/src/conversion_funcs/cast.rs::cast_binary_formatter (around line 865)
  • Comet matrix: CometCast.canCastFromBinary and canCastToString in spark/src/main/scala/org/apache/comet/expressions/CometCast.scala.
