Describe the bug
After #21322 and #21390 the Cast "to" field that was added in #18136 to be able to represent a cast to an extension type (by including field metadata on the cast target) doesn't have its field metadata consistently represented on the output expr.to_field(). Instead, the metadata is propagating from the input which often does not make sense (particularly in the intended usage of the cast field where the metadata is meaningful)
The main pieces of code that affect the ability of the Expr::Cast to represent a cast to an extension type are where the expression field is resolved in the logical expression:
|
fn cast_output_field( |
|
source_field: &FieldRef, |
|
target_type: &DataType, |
|
force_nullable: bool, |
|
) -> Arc<Field> { |
|
let mut f = source_field |
|
.as_ref() |
|
.clone() |
|
.with_data_type(target_type.clone()) |
|
.with_metadata(source_field.metadata().clone()); |
|
if force_nullable { |
|
f = f.with_nullable(true); |
|
} |
|
Arc::new(f) |
|
} |
...and where it is resolved in a physical expression:
|
fn resolved_target_field(&self, input_schema: &Schema) -> Result<FieldRef> { |
|
if is_default_target_field(&self.target_field) { |
|
self.expr.return_field(input_schema).map(|field| { |
|
Arc::new( |
|
field |
|
.as_ref() |
|
.clone() |
|
.with_data_type(self.cast_type().clone()), |
|
) |
|
}) |
|
} else { |
|
Ok(Arc::clone(&self.target_field)) |
|
} |
|
} |
An example of where this results in output that is non-sensical/may be rejected by consumers is a cast from a UUID extension type to Utf8 (which will now happily execute but output the arrow.uuid extension name on the Utf8 storage, which is not allowed). Conversely, a cast from Utf8 to UUID may work for strings that are 16 characters long (maybe, I forget the casting rules between string and binary), but definitely would not cast to a UUID (before the two noted changes this would have given a clear error.
There are probably some casts that make sense to execute in this way (e.g., a cast where the data type is identical to the input, while usually simplified away, could in theory propagate extension type metadata safely). I am not sure whether it makes sense for other metadata to be propagated in this way (the usage of non-extension metadata is highly variable). I personally think it is safer to strip metadata in a cast (for the same reason that we strip metadata when executing a scalar function).
To Reproduce
Cast a UUID column to a string and look at the field metadata / try to consume with pyarrow.
Expected behavior
I would have expected a cast to UUID to error (because the result is not a UUID).
Additional context
In #21071 I'm close to a mechanism to actually execute casts to and from extension types, where the output field of a cast to UUID really is a UUID. I can also isolate a smaller change from that PR to fix this behaviour.
cc @adriangb since we discussed this behaviour in one of the above PRs and in the extension types work.
Describe the bug
After #21322 and #21390 the
Cast"to" field that was added in #18136 to be able to represent a cast to an extension type (by including field metadata on the cast target) doesn't have its field metadata consistently represented on the outputexpr.to_field(). Instead, the metadata is propagating from the input which often does not make sense (particularly in the intended usage of the castfieldwhere the metadata is meaningful)The main pieces of code that affect the ability of the
Expr::Castto represent a cast to an extension type are where the expression field is resolved in the logical expression:datafusion/datafusion/expr/src/expr_schema.rs
Lines 74 to 88 in 53517af
...and where it is resolved in a physical expression:
datafusion/datafusion/physical-expr/src/expressions/cast.rs
Lines 153 to 166 in 53517af
An example of where this results in output that is non-sensical/may be rejected by consumers is a cast from a UUID extension type to Utf8 (which will now happily execute but output the arrow.uuid extension name on the Utf8 storage, which is not allowed). Conversely, a cast from Utf8 to UUID may work for strings that are 16 characters long (maybe, I forget the casting rules between string and binary), but definitely would not cast to a UUID (before the two noted changes this would have given a clear error.
There are probably some casts that make sense to execute in this way (e.g., a cast where the data type is identical to the input, while usually simplified away, could in theory propagate extension type metadata safely). I am not sure whether it makes sense for other metadata to be propagated in this way (the usage of non-extension metadata is highly variable). I personally think it is safer to strip metadata in a cast (for the same reason that we strip metadata when executing a scalar function).
To Reproduce
Cast a UUID column to a string and look at the field metadata / try to consume with pyarrow.
Expected behavior
I would have expected a cast to UUID to error (because the result is not a UUID).
Additional context
In #21071 I'm close to a mechanism to actually execute casts to and from extension types, where the output field of a cast to UUID really is a UUID. I can also isolate a smaller change from that PR to fix this behaviour.
cc @adriangb since we discussed this behaviour in one of the above PRs and in the extension types work.