-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Add support for FixedSizeList to variant_to_arrow #9663
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
808b1b5
1d55ade
11ffd40
4a84d12
64bf4aa
99c6cc5
47f99c5
384ed29
220427f
acd294a
d2e4dd6
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -68,6 +68,14 @@ use std::sync::Arc; | |
| /// See [`ShreddedSchemaBuilder`] for a convenient way to build the `as_type` | ||
| /// value passed to this function. | ||
| pub fn shred_variant(array: &VariantArray, as_type: &DataType) -> Result<VariantArray> { | ||
| shred_variant_with_options(array, as_type, &CastOptions::default()) | ||
| } | ||
|
|
||
| pub fn shred_variant_with_options( | ||
| array: &VariantArray, | ||
| as_type: &DataType, | ||
| cast_options: &CastOptions, | ||
| ) -> Result<VariantArray> { | ||
| if array.typed_value_field().is_some() { | ||
| return Err(ArrowError::InvalidArgumentError( | ||
| "Input is already shredded".to_string(), | ||
|
|
@@ -79,10 +87,9 @@ pub fn shred_variant(array: &VariantArray, as_type: &DataType) -> Result<Variant | |
| return Ok(array.clone()); | ||
| }; | ||
|
|
||
| let cast_options = CastOptions::default(); | ||
| let mut builder = make_variant_to_shredded_variant_arrow_row_builder( | ||
| as_type, | ||
| &cast_options, | ||
| cast_options, | ||
| array.len(), | ||
| NullValue::TopLevelVariant, | ||
| )?; | ||
|
|
@@ -321,12 +328,19 @@ impl<'a> VariantToShreddedArrayVariantRowBuilder<'a> { | |
| // If the variant is not an array, typed_value must be null. | ||
| // If the variant is an array, value must be null. | ||
| match variant { | ||
| Variant::List(list) => { | ||
| Variant::List(ref list) => { | ||
| self.nulls.append_non_null(); | ||
| self.value_builder.append_null(); | ||
| self.typed_value_builder | ||
| .append_value(&Variant::List(list))?; | ||
| Ok(true) | ||
|
|
||
| // With `safe` cast option set to false, appending list of wrong size to | ||
| // `typed_value_builder` of type `FixedSizeList` will result in an error. In such a | ||
| // case, the provided list should be appended to the `value_builder. | ||
|
Comment on lines
+334
to
+336
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not sure the variant shredding spec allows for shredding as a fixed size list, if the resulting layout differs physically from a normal list?
It looks to me like any attempt to shred as fixed-sized list must either succeed (if the size is correct) or hard-fail (because
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It differs physically on the Arrow side, but once we write it to Parquet it'd be same as other We're keeping
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. When casting from variant to arrow, we can do whatever we want. But this code here is about going from binary variant to shredded variant. And the variant shredding spec directly forbids
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. True. I think the core issue is that Parquet currently has only one logical Btw, there’s ongoing work on this too: apache/parquet-format#241 (recently revived). Given the current spec text: I read “array” as "a value matching the specific list shape we’re shredding into". For
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. From what I understand, the variant spec neither knows nor cares about the intricacies of arrow array types (it also doesn't care about spark or SQL). If we're shredding to a 3-level parquet list, and we encounter a variant array value, the resulting
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @scovich Do you think we should hard-fail on the length check for
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't know... My gut reaction is to forbid it completely because it's preferable to fail early and consistently rather than blow up unpredictably partway through shredding. The spec forbids us to fallback to But I also don't have any intuition of actual use cases for this feature, to weigh its benefits against the downsides. NOTE: The same dilemma applies to fixed length binary and fixed length string, except the spec technically doesn't forbid us to fallback to the Thought experimentSuppose we could use the Overall, enforcing list/binary/string length feels like a logical validation issue while shredding is a physical operation. And I suspect it's best to leave client code to enforce logical validity of all forms, not try to encode it in the type system. Additionally, It feels like fixed-len-xxx types fall in the same category as unsigned or non-zero types. True, arrow has unsigned int types but it lacks non-zero types. And parquet has neither.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. True, the use case should be the first thing we think about before implementing. FSL could be useful for embeddings and tensors but the question is why would anyone put them inside a The fact that we can't store FSL in Parquet is a big reason to not do it.
Overall, I think we'd not gain much by including FSL because it's so frail and the use cause is not defined. Isn't the whole point of Variant is to host semi-structured data, if we have fixed-len-something it'd be better to keep a separate column for it. If fixed-length is a logical constraint rather than a physical one, enforcement belongs downstream.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Agree. I just realized that somebody could create a FSL of [shredded] variant, if the use case just needs flexible typing of N list members. But that also seems somewhat contrived, to have such strongly-typed structure but sufficient uncertainty about the list element type to merit using variant. |
||
| let shredded = self.typed_value_builder.append_value(&variant)?; | ||
| if shredded { | ||
| self.value_builder.append_null(); | ||
| } else { | ||
| self.value_builder.append_value(Variant::List(list.clone())); | ||
| } | ||
| Ok(shredded) | ||
| } | ||
| other => { | ||
| self.nulls.append_non_null(); | ||
|
|
@@ -690,9 +704,9 @@ mod tests { | |
| use super::*; | ||
| use crate::VariantArrayBuilder; | ||
| use arrow::array::{ | ||
| Array, BinaryViewArray, FixedSizeBinaryArray, Float64Array, GenericListArray, | ||
| GenericListViewArray, Int64Array, LargeBinaryArray, LargeStringArray, ListArray, | ||
| ListLikeArray, OffsetSizeTrait, PrimitiveArray, StringArray, | ||
| Array, BinaryViewArray, FixedSizeBinaryArray, FixedSizeListArray, Float64Array, | ||
| GenericListArray, GenericListViewArray, Int64Array, LargeBinaryArray, LargeStringArray, | ||
| ListArray, ListLikeArray, OffsetSizeTrait, PrimitiveArray, StringArray, StructArray, | ||
| }; | ||
| use arrow::datatypes::{ | ||
| ArrowPrimitiveType, DataType, Field, Fields, Int64Type, TimeUnit, UnionFields, UnionMode, | ||
|
|
@@ -1608,17 +1622,105 @@ mod tests { | |
|
|
||
| #[test] | ||
| fn test_array_shredding_as_fixed_size_list() { | ||
| let input = build_variant_array(vec![ | ||
| VariantRow::List(vec![VariantValue::from(1i64), VariantValue::from(2i64)]), | ||
| VariantRow::Value(VariantValue::from("This should not be shredded")), | ||
| VariantRow::List(vec![VariantValue::from(3i64), VariantValue::from(4i64)]), | ||
| ]); | ||
|
|
||
| let list_schema = | ||
| DataType::FixedSizeList(Arc::new(Field::new("item", DataType::Int64, true)), 2); | ||
| let result = shred_variant(&input, &list_schema).unwrap(); | ||
| assert_eq!(result.len(), 3); | ||
|
|
||
| // The first row should be shredded, so the `value` field should be null and the | ||
| // `typed_value` field should contain the list | ||
| assert!(result.is_valid(0)); | ||
| assert!(result.value_field().unwrap().is_null(0)); | ||
| assert!(result.typed_value_field().unwrap().is_valid(0)); | ||
|
|
||
| // The second row should not be shredded because the provided schema for shredding did not | ||
| // match. Hence, the `value` field should contain the raw value and the `typed_value` field | ||
| // should be null. | ||
| assert!(result.is_valid(1)); | ||
| assert!(result.value_field().unwrap().is_valid(1)); | ||
| assert!(result.typed_value_field().unwrap().is_null(1)); | ||
|
|
||
| // The third row should be shredded, so the `value` field should be null and the | ||
| // `typed_value` field should contain the list | ||
| assert!(result.is_valid(2)); | ||
| assert!(result.value_field().unwrap().is_null(2)); | ||
| assert!(result.typed_value_field().unwrap().is_valid(2)); | ||
|
|
||
| let typed_value = result.typed_value_field().unwrap(); | ||
| let fixed_size_list = typed_value | ||
| .as_any() | ||
| .downcast_ref::<FixedSizeListArray>() | ||
| .expect("Expected FixedSizeListArray"); | ||
|
|
||
| // Verify that typed value is `FixedSizeList`. | ||
| assert_eq!(fixed_size_list.len(), 3); | ||
| assert_eq!(fixed_size_list.value_length(), 2); | ||
|
|
||
| // Verify that the first entry in the `FixedSizeList` contains the expected value. | ||
| let val0 = fixed_size_list.value(0); | ||
| let val0_struct = val0.as_any().downcast_ref::<StructArray>().unwrap(); | ||
| let val0_typed = val0_struct.column_by_name("typed_value").unwrap(); | ||
| let val0_ints = val0_typed.as_any().downcast_ref::<Int64Array>().unwrap(); | ||
| assert_eq!(val0_ints.values(), &[1i64, 2i64]); | ||
|
|
||
| // Verify that second entry in the `FixedSizeList` cannot be shredded hence the value is | ||
| // invalid. | ||
| assert!(fixed_size_list.is_null(1)); | ||
|
|
||
| // Verify that the third entry in the `FixedSizeList` contains the expected value. | ||
| let val2 = fixed_size_list.value(2); | ||
| let val2_struct = val2.as_any().downcast_ref::<StructArray>().unwrap(); | ||
| let val2_typed = val2_struct.column_by_name("typed_value").unwrap(); | ||
| let val2_ints = val2_typed.as_any().downcast_ref::<Int64Array>().unwrap(); | ||
| assert_eq!(val2_ints.values(), &[3i64, 4i64]); | ||
| } | ||
|
|
||
| #[test] | ||
| fn test_array_shredding_as_fixed_size_list_wrong_size() { | ||
|
rishvin marked this conversation as resolved.
|
||
| let input = build_variant_array(vec![VariantRow::List(vec![ | ||
| VariantValue::from(1i64), | ||
| VariantValue::from(2i64), | ||
| VariantValue::from(3i64), | ||
| ])]); | ||
| let list_schema = | ||
| DataType::FixedSizeList(Arc::new(Field::new("item", DataType::Int64, true)), 2); | ||
| let err = shred_variant(&input, &list_schema).unwrap_err(); | ||
| assert_eq!( | ||
| err.to_string(), | ||
| "Not yet implemented: Converting unshredded variant arrays to arrow fixed-size lists" | ||
|
|
||
| let result = shred_variant_with_options( | ||
| &input, | ||
| &list_schema, | ||
| &CastOptions { | ||
| safe: true, | ||
| ..Default::default() | ||
| }, | ||
| ) | ||
| .unwrap(); | ||
| assert_eq!(result.len(), 1); | ||
|
|
||
| // With `safe` set to to true, the incorrect size should not raise error. | ||
| assert!(result.is_valid(0)); | ||
| assert!(result.value_field().unwrap().is_valid(0)); | ||
| assert!(result.typed_value_field().unwrap().is_null(0)); | ||
|
|
||
| // With `safe` set to false, the incorrect size should raise error. | ||
| let err = shred_variant_with_options( | ||
| &input, | ||
| &list_schema, | ||
| &CastOptions { | ||
| safe: false, | ||
| ..Default::default() | ||
| }, | ||
| ) | ||
| .unwrap_err(); | ||
| assert!( | ||
| err.to_string() | ||
| .contains("Expected fixed size list of size 2, got size 3"), | ||
| "got: {err}", | ||
| ); | ||
| } | ||
|
|
||
|
|
||
Uh oh!
There was an error while loading. Please reload this page.