Skip to content

Propagation of metadata through casts can strips extension type from the destination field and can result in invalid extension types #22079

@paleolimbot

Description

@paleolimbot

Describe the bug

After #21322 and #21390 the Cast "to" field that was added in #18136 to be able to represent a cast to an extension type (by including field metadata on the cast target) doesn't have its field metadata consistently represented on the output expr.to_field(). Instead, the metadata is propagating from the input which often does not make sense (particularly in the intended usage of the cast field where the metadata is meaningful)

The main pieces of code that affect the ability of the Expr::Cast to represent a cast to an extension type are where the expression field is resolved in the logical expression:

fn cast_output_field(
source_field: &FieldRef,
target_type: &DataType,
force_nullable: bool,
) -> Arc<Field> {
let mut f = source_field
.as_ref()
.clone()
.with_data_type(target_type.clone())
.with_metadata(source_field.metadata().clone());
if force_nullable {
f = f.with_nullable(true);
}
Arc::new(f)
}

...and where it is resolved in a physical expression:

fn resolved_target_field(&self, input_schema: &Schema) -> Result<FieldRef> {
if is_default_target_field(&self.target_field) {
self.expr.return_field(input_schema).map(|field| {
Arc::new(
field
.as_ref()
.clone()
.with_data_type(self.cast_type().clone()),
)
})
} else {
Ok(Arc::clone(&self.target_field))
}
}

An example of where this results in output that is non-sensical/may be rejected by consumers is a cast from a UUID extension type to Utf8 (which will now happily execute but output the arrow.uuid extension name on the Utf8 storage, which is not allowed). Conversely, a cast from Utf8 to UUID may work for strings that are 16 characters long (maybe, I forget the casting rules between string and binary), but definitely would not cast to a UUID (before the two noted changes this would have given a clear error.

There are probably some casts that make sense to execute in this way (e.g., a cast where the data type is identical to the input, while usually simplified away, could in theory propagate extension type metadata safely). I am not sure whether it makes sense for other metadata to be propagated in this way (the usage of non-extension metadata is highly variable). I personally think it is safer to strip metadata in a cast (for the same reason that we strip metadata when executing a scalar function).

To Reproduce

Cast a UUID column to a string and look at the field metadata / try to consume with pyarrow.

Expected behavior

I would have expected a cast to UUID to error (because the result is not a UUID).

Additional context

In #21071 I'm close to a mechanism to actually execute casts to and from extension types, where the output field of a cast to UUID really is a UUID. I can also isolate a smaller change from that PR to fix this behaviour.

cc @adriangb since we discussed this behaviour in one of the above PRs and in the extension types work.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No fields configured for Bug.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions