Remove parquet arrow_cast dependency #9077
Conversation
| let a = arrow_cast::cast(&array, &ArrowType::Date32)?; | ||
| let a = array | ||
| .as_primitive::<Int32Type>() | ||
| .reinterpret_cast::<Date32Type>(); |
When not performing a widening / truncating conversion, reinterpret_cast will be faster
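To illustrate, a minimal self-contained sketch (the input values are made up) of reinterpreting an Int32 array as Date32 without the cast kernel:
use std::sync::Arc;
use arrow_array::{Array, ArrayRef, Int32Array};
use arrow_array::cast::AsArray;
use arrow_array::types::{Date32Type, Int32Type};

fn main() {
    // Days since the UNIX epoch stored as plain Int32
    let array: ArrayRef = Arc::new(Int32Array::from(vec![Some(19_000), None, Some(0)]));

    // Reinterprets the existing value buffer rather than dispatching through the
    // cast kernel; only valid because Int32 and Date32 share the same native type
    let dates = array.as_primitive::<Int32Type>().reinterpret_cast::<Date32Type>();
    assert_eq!(dates.value(0), 19_000);
    assert!(dates.is_null(1));
}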
| .unary::<_, Int32Type>(|v| v.as_i128() as i32); | ||
| write_primitive(typed, array.values(), levels) | ||
| } | ||
| ArrowDataType::Dictionary(_, value_type) => match value_type.as_ref() { |
We instead lift this into the ArrowColumnWriter::write method which not only cuts down on duplication, but ensures we treat dictionary and non-dictionary values equivalently
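For reference, a rough sketch of that unification (the helper name here is hypothetical, not the actual method), mirroring the hunk further down that materializes dictionary leaves before writing:
use arrow_array::ArrayRef;
use arrow_array::cast::AsArray;
use arrow_schema::ArrowError;
use arrow_select::take::take;

// Hypothetical helper: flatten a dictionary-encoded leaf so the downstream
// column writer only ever sees plain values
fn materialize_if_dictionary(leaf: &ArrayRef) -> Result<ArrayRef, ArrowError> {
    match leaf.as_any_dictionary_opt() {
        Some(dictionary) => take(dictionary.values(), dictionary.keys(), None),
        None => Ok(leaf.clone()),
    }
}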
| } | ||
| #[test] | ||
| fn test_arrow_writer_explicit_schema() { |
I wasn't really sure how to adapt this test; I don't really understand why we are relying on the parquet writer to do type coercion beyond what is necessary for mapping the Arrow data model to Parquet.
@paleolimbot perhaps you can weigh in here?
The motivation for explicitly setting the Parquet schema was so that I could write this test:
arrow-rs/parquet/tests/geospatial.rs
Lines 281 to 287 in 843bee2
let schema = parquet_schema_geometry();
let props = WriterProperties::builder()
    .set_statistics_enabled(EnabledStatistics::Chunk)
    .build();
let options = ArrowWriterOptions::new()
    .with_parquet_schema(schema)
    .with_properties(props);
At the time there was no way to write a Parquet Geometry logical type using the Arrow writer, and the code path for statistics based on Arrow arrays couldn't be tested. Several months later this was added (if the geospatial feature is enabled, the appropriate extension type is translated to Geometry or Geography on write), and we no longer need this feature for the test.
It's potentially useful as an escape hatch if the build-time features for canonical extensions/geospatial/variant aren't fine-grained enough to satisfy an application (e.g., perhaps allowing a runtime config option), or perhaps to more easily create files with very specific Parquet schemas, but I don't have strong feelings.
Do you see any issues with the changes proposed in this PR for this use-case? I think it should still be fine, but I'm a little unfamiliar with how extension types have been wired in.
I think it should be fine as well, given that the geospatial.rs test still passes. The intent was to test that the code path invoked by with_parquet_schema() actually resulted in the Parquet schema being used by the writer. If that code path now results in something other than what I wrote here, it's probably best to keep the test but adjust the expected result (i.e., this test is just checking that the wires are plugged in).
The impact on compile time for a clean release build is rather nice. Current main vs. this PR: just under half the CPU time 🎉 Edit: I'm also pretty chuffed that this is somehow a net reduction in LOC 😅
amazing
run benchmark arrow_reader arrow_reader_clickbench
🤖
🤖: Benchmark completed
🤖
🤖: Benchmark completed
My analysis of the benchmarks is that there is no difference in performance.
Do you think this needs to wait for a breaking release? It is technically a breaking change, but only for a very new API that I suspect few people are using.
I am not sure -- I need to review it in more detail first. I will do so later today |
alamb left a comment
I think this PR makes sense to me -- I had some comments, etc., but I also think it could be merged as is. Thank you @tustvold
I personally think we should wait to merge this for a major release to give it some bake / downstream testing time -- given I hope to release 57.2.0 later this week, that would mean waiting until next week, which might be too long.
Thank you for the help reviewing it @paleolimbot
| impl IntoBuffer for Vec<Int96> { | ||
| fn into_buffer(self, target_type: &ArrowType) -> Buffer { | ||
| let mut builder = Vec::with_capacity(self.len()); |
It probably doesn't matter, but it might be faster to just collect the Vec directly:
let builder: Vec<i64> = match target_type {
    ArrowType::Timestamp(TimeUnit::Second, _) => {
        self.iter().map(|x| x.to_seconds()).collect()
    }
    ArrowType::Timestamp(TimeUnit::Millisecond, _) => {
        self.iter().map(|x| x.to_millis()).collect()
    }
    ArrowType::Timestamp(TimeUnit::Microsecond, _) => {
        self.iter().map(|x| x.to_micros()).collect()
    }
    ArrowType::Timestamp(TimeUnit::Nanosecond, _) => {
        self.iter().map(|x| x.to_nanos()).collect()
    }
    _ => unreachable!("Invalid target_type for Int96."),
};
Buffer::from_vec(builder)
}
| } ||
| _ => unreachable!("INT96 must be a timestamp."), | ||
| }, | ||
| let array: ArrayRef = make_array(array_data); |
this is probably one way your PR ends up removing net lines of code
| /// Coerce the parquet physical type array to the target type | ||
| /// | ||
| /// This should match the logic in schema::primitive::apply_hint |
Can you also please add the rationale for why this doesn't use the cast kernel (and instead reimplements parts of it) as a comment? As I understand it, the issue is that:
- We are trying to keep dependencies down
- The semantics of casting certain types are different
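To make the second point concrete, a small sketch of how the semantics differ for Int64 -> UInt64 (arrow-cast appears here purely for the comparison; the reader itself now avoids it):
use arrow_array::{Array, Int64Array, UInt64Array};
use arrow_schema::DataType;

fn main() {
    let ints = Int64Array::from(vec![-1i64]);

    // Cast kernel semantics: -1 does not fit in u64, so with the default (safe)
    // options the result is null
    let casted = arrow_cast::cast(&ints, &DataType::UInt64).unwrap();
    assert!(casted.is_null(0));

    // Reinterpret semantics, as used by the reader: the i64 bits are kept as-is,
    // so -1 becomes u64::MAX
    let reinterpreted = UInt64Array::new(ints.values().inner().clone().into(), None);
    assert_eq!(reinterpreted.value(0), u64::MAX);
}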
| /// This should match the logic in schema::primitive::apply_hint | ||
| fn coerce_array(array: ArrayRef, target_type: &ArrowType) -> Result<ArrayRef> { | ||
| if let ArrowType::Dictionary(key_type, value_type) = target_type { | ||
| let dictionary = pack_dictionary(key_type, array.as_ref())?; |
Does this imply that we lose the ability to read DictionaryArrays directly (without unpacking them first)? Or is this the fallback in case the data wasn't actually dictionary encoded (e.g. it was encoded using plain encoding but the user asked for Dictionary)?
It is used as a fallback when the user requests dictionary encoding but the data isn't dictionary encoded. This can occur in a number of ways, but the two most common are that the dictionary spilled (so you have plain-encoded pages), or that you are decoding a RecordBatch across a RowGroup boundary.
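As a usage-level sketch of that scenario (the file path and column name are made up), this is roughly what requesting a dictionary type through a schema hint looks like; whether the underlying pages are dictionary or plain encoded is determined by the file:
use std::{fs::File, sync::Arc};
use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};

fn read_as_dictionary(path: &str) -> Result<(), Box<dyn std::error::Error>> {
    // Ask for Dictionary<Int32, Utf8> regardless of how the pages were encoded
    let schema = Arc::new(Schema::new(vec![Field::new(
        "col",
        DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8)),
        true,
    )]));
    let options = ArrowReaderOptions::new().with_schema(schema);
    let reader = ParquetRecordBatchReaderBuilder::try_new_with_options(File::open(path)?, options)?
        .build()?;
    for batch in reader {
        println!("{} rows", batch?.num_rows());
    }
    Ok(())
}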
| // follow C++ implementation and use overflow/reinterpret cast from i64 to u64 which will map | ||
| // `i64::MIN..0` to `(i64::MAX as u64)..u64::MAX` | ||
| ArrowType::UInt64 => Arc::new(UInt64Array::new( | ||
| array.values().inner().clone().into(), |
I double checked that this is a clone of the Buffer (so relatively simple).
I wonder if we can avoid all these Arc::new / clones / etc. by passing in the ArrayDataBuilder directly rather than a &Int64Array
Something like:
let mut builder = array.into_data().into_builder();
...
builder = builder.data_type(ArrowType::Duration(TimeUnit::Second));
...
make_array(builder.build()?)
Calling array.into_data() will actually perform significantly more allocations, not to mention having significant additional dispatch overheads.
I am talking about into_data, which takes an owned object, not the confusingly similarly named to_data, which requires a copy / cloning.
I actually saw the allocations (especially for StringViewArray and StructArray) appear in some profiling, and I have a WIP to remove them (I want to break it into smaller PRs for easier review).
IIRC into_data still allocates, as the buffers are stored in a Vec. In general ArrayData is best avoided; it is not free.
Ah yes, you are right -- in my other PR the ArrayData was already created, and then make_array is copying it again (along with its allocations) in several cases.
What I found is that some arrays also have Vecs in them (like StructArray and BinaryViewArray).
| match leaf.as_any_dictionary_opt() { | ||
| Some(dictionary) => { | ||
| let materialized = | ||
| arrow_select::take::take(dictionary.values(), dictionary.keys(), None)?; |
Not necessarily related to this PR, but this seems to imply we are expanding out previously encoded dictionary data, rather than reusing the dictionary. Is that correct? I realize that we would have to handle the case when there are different dictionaries across batches, but it seems like a potential optimization worth considering.
That is correct; there is likely a fair amount of low-hanging fruit in optimising the parquet writer.
Edit: although it does appear there is something to handle dictionary arrays of ByteArray, let me check I didn't break that.
There is a separate specialized encoder for ByteArray and DictionaryArray of ByteArray - #2221
So this will only materialize dictionaries for primitives, which tbh is probably the most efficient thing to do.
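For context, a minimal sketch of the byte-array dictionary path (hypothetical column name, writing to an in-memory buffer), which goes through that specialized encoder rather than being materialized:
use std::sync::Arc;
use arrow_array::{ArrayRef, DictionaryArray, RecordBatch};
use arrow_array::types::Int32Type;
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // A Dictionary<Int32, Utf8> column; the writer keeps it dictionary encoded
    let dict: DictionaryArray<Int32Type> = vec!["a", "a", "b"].into_iter().collect();
    let batch = RecordBatch::try_from_iter([("col", Arc::new(dict) as ArrayRef)])?;

    let mut out = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut out, batch.schema(), None)?;
    writer.write(&batch)?;
    writer.close()?;
    println!("wrote {} bytes", out.len());
    Ok(())
}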
| }; | ||
| let values = if let ArrowType::FixedSizeBinary(size) = **value_type { | ||
| arrow_cast::cast(&values, &ArrowType::FixedSizeBinary(size)).unwrap() | ||
| let binary = values.as_binary::<i32>(); |
We could avoid these clones here using the into_builder and make_array approach too.
| [dependencies] | ||
| arrow-array = { workspace = true, optional = true } | ||
| arrow-buffer = { workspace = true, optional = true } | ||
| arrow-cast = { workspace = true, optional = true } |
🎉
That's perfectly fine; this isn't blocking anything on my end - I can just use a git pin. I just wanted to avoid this potentially bit-rotting; a week is perfectly fine 😄
I think main is open for breaking changes now, so we can merge this one in.
Which issue does this PR close?
Rationale for this change
arrow-cast is a fairly heavy dependency, especially now that it bundles in arrow-ord for RunEndEncodedArrays (#8708). Removing this dependency has been discussed as far back as 2024 (#4764); let's finally actually do it.
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?
Yes. Unfortunately, #8524 added an API that allows overriding the inferred schema, which in turn allows the coercion machinery to traverse somewhat unintended paths. I personally think this API shouldn't exist, but...