Extension Data Types #6500
Replies: 3 comments
Another thing I should mention is that the Obviously users of Vortex do not have to deal with the internals of how this works. But for maintainability having 3 separate Of course our users should not have to worry about the internals, but if someone wants to add an extension type on their own (without our help), I really would not want to be tracing through all of the dispatch with rust-analyzer if something has gone wrong... After all, the whole reason we want extension types is so that we don't have to write that code!
I have a discussion here on vtables: #6093. I actually quite like what the latest design (ExtDType) gets us, but I agree that the names of each part of the design pattern make things very confusing. We should fix up the names and then unify across all extension points.
More Robust Extension Data Types In Vortex
We would like to build a more robust system for extension data types (or DTypes).

#6081 introduced vtables for extension DTypes. Each extension type (e.g. Timestamp) now implements ExtDTypeVTable, which handles validation, serialization, and metadata. The type-erased ExtDTypeRef carries this vtable with it inside DType::Extension.

The natural next steps would be to add analogous vtables for vortex-scalar (e.g., custom Display and casting) and vortex-array (custom compute kernels). This would give us three traits:

Issues
There is a problem with this. This has come up a few times in the effort to add the scalar extension vtable to scalar extension values: #6477.
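To make the split concrete, here is a minimal sketch of separate per-layer vtables and the lookup they force once the dtype is type-erased. Everything in it is a hypothetical stand-in: the trait names DTypeVTable and ScalarVTable, the Session struct, and all method signatures are invented for illustration and are not the Vortex API. The array layer is left out because it would follow the same pattern as the scalar layer.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Hypothetical stand-ins: none of these signatures are taken from Vortex.
type ExtId = &'static str;

/// Layer 1: the dtype vtable (identity, validation, metadata).
trait DTypeVTable {
    fn id(&self) -> ExtId;
}

/// Layer 2: the scalar vtable (display, casting).
trait ScalarVTable {
    fn display(&self, raw: i64) -> String;
}

/// Type-erased dtype handle: it carries ONLY the dtype vtable.
#[derive(Clone)]
struct ExtDTypeRef(Arc<dyn DTypeVTable>);

/// Stand-in for a session registry keyed by extension id.
#[derive(Default)]
struct Session {
    scalar_vtables: HashMap<ExtId, Arc<dyn ScalarVTable>>,
}

/// Example extension type implementing both layers.
struct Timestamp;

impl DTypeVTable for Timestamp {
    fn id(&self) -> ExtId {
        "example.timestamp"
    }
}

impl ScalarVTable for Timestamp {
    fn display(&self, raw: i64) -> String {
        format!("{raw}ms since epoch")
    }
}

/// Scalar logic needs the session even though the dtype is already in hand:
/// the dtype only yields an id, and the id must be resolved in the registry.
fn display_scalar(session: &Session, dtype: &ExtDTypeRef, raw: i64) -> Option<String> {
    let vtable = session.scalar_vtables.get(dtype.0.id())?;
    Some(vtable.display(raw))
}
```

In this sketch, any caller of display_scalar has to own or be handed a Session, and the lookup can fail at runtime if the scalar vtable was never registered, even though the dtype itself is valid.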
Once an ExtDType is type-erased to ExtDTypeRef, the only thing it carries is the dtype vtable (ONLY the dtype, not the scalar or array vtables). Suppose you have an ExtensionArray and want to call scalar or array logic: you then need to look up the other vtables by ExtID in a session registry. This means threading &VortexSession through every code path that touches extension types (compute kernels, builders, canonicalization, display, etc.).

This is kind of torturous because the dtype vtable is literally right there inside the DType, but the scalar/array vtables require a registry lookup to find. If the vtables were combined we would not have this issue. As long as they stay separate, sessions need to be plumbed through APIs that otherwise have no reason to take one (in other words, many constructors would need to take a session if they could potentially create an extension type).

What we probably want is a single ExtVTable per extension dtype that covers all three layers, so that when you have an ExtDTypeRef you already have everything you need.

Crate Dependency Graph!
The crate dependency graph is:
A unified vtable trait would need to reference types from all three crates, which is impossible when the trait lives in vortex-dtype, which can't depend on vortex-scalar or vortex-array.

Potential Solutions
Here are some potential solutions, some uglier than others...
Make the VortexSession a global static

This is not great for hygiene, but it would mean that everything can access the session and look up vtables without having to pass VortexSession around everywhere.

Of course, whether it is worth adding a global execution context is up for debate.
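For a sense of what this option could look like, here is a minimal sketch using std::sync::OnceLock from the standard library. The Session struct, the ScalarVTable trait, and the functions global_session, register, and display_scalar are all hypothetical names made up for this example; the real VortexSession may look nothing like this.

```rust
use std::collections::HashMap;
use std::sync::{Arc, OnceLock, RwLock};

/// Hypothetical scalar-layer vtable; not a Vortex trait.
trait ScalarVTable: Send + Sync {
    fn display(&self, raw: i64) -> String;
}

/// Hypothetical stand-in for VortexSession: a registry keyed by extension id.
#[derive(Default)]
struct Session {
    scalar_vtables: RwLock<HashMap<&'static str, Arc<dyn ScalarVTable>>>,
}

/// Process-wide session, created lazily on first access.
fn global_session() -> &'static Session {
    static SESSION: OnceLock<Session> = OnceLock::new();
    SESSION.get_or_init(Session::default)
}

/// Registers a vtable in the global session under the given id.
fn register(id: &'static str, vtable: Arc<dyn ScalarVTable>) {
    global_session()
        .scalar_vtables
        .write()
        .unwrap()
        .insert(id, vtable);
}

/// No `&Session` parameter: the lookup goes through the global instead.
fn display_scalar(id: &str, raw: i64) -> Option<String> {
    let registry = global_session().scalar_vtables.read().unwrap();
    registry.get(id).map(|vtable| vtable.display(raw))
}

/// Example extension type used to exercise the registry.
struct Millis;

impl ScalarVTable for Millis {
    fn display(&self, raw: i64) -> String {
        format!("{raw}ms")
    }
}
```

The call sites get simpler, but the hygiene cost mentioned above is visible here too: every test and every embedding of the library in the same process shares one registry, and a missing registration still only shows up at runtime.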
Merge the Crates
In my opinion, this is a better solution than the above.
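As a rough picture of what merging buys, the sketch below shows a single vtable trait that spans all three layers, which is only possible if one crate can see the dtype, scalar, and array types at once. Only the name ExtVTable comes from this proposal; the methods, the placeholder ScalarValue and ArrayData types, and the Millis example are assumptions invented for illustration.

```rust
use std::sync::Arc;

// Hypothetical placeholder types standing in for the scalar and array layers.
struct ScalarValue(i64);
struct ArrayData(Vec<i64>);

/// One trait covering dtype, scalar, and array concerns.
trait ExtVTable: Send + Sync {
    /// dtype layer: identity of the extension type.
    fn id(&self) -> &'static str;
    /// scalar layer: custom display.
    fn display_scalar(&self, value: &ScalarValue) -> String;
    /// array layer: a custom compute kernel.
    fn sum(&self, array: &ArrayData) -> ScalarValue;
}

/// The type-erased handle now carries everything; no registry lookup needed.
#[derive(Clone)]
struct ExtDTypeRef(Arc<dyn ExtVTable>);

/// Example extension type implementing all three layers in one place.
struct Millis;

impl ExtVTable for Millis {
    fn id(&self) -> &'static str {
        "example.millis"
    }

    fn display_scalar(&self, value: &ScalarValue) -> String {
        format!("{}ms", value.0)
    }

    fn sum(&self, array: &ArrayData) -> ScalarValue {
        ScalarValue(array.0.iter().sum())
    }
}

/// Array logic followed by scalar logic, reached from the dtype alone.
fn describe_sum(dtype: &ExtDTypeRef, array: &ArrayData) -> String {
    let total = dtype.0.sum(array);
    dtype.0.display_scalar(&total)
}
```

Note that describe_sum takes no session argument: holding the ExtDTypeRef is enough to reach both the array kernel and the scalar display.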
If vortex-dtype, vortex-scalar, and vortex-array were a single crate (or at least the extension vtable machinery lived in one place that could see all three), we could define:

ScalarValues should store ArrayRef instead of Vec<Option<ScalarValue>>

Another thing that I have yet to mention is that we probably want to have ScalarValues that can hold an ArrayRef directly. Right now, scalar lists are stored as Vec<Option<ScalarValue>>, which is extremely heavyweight. You can imagine that for an extension type like a Tensor, scalars would instantly become a bottleneck for any compute operations like matrix multiplication.

This is impossible with the current crate structure, as vortex-array depends on vortex-scalar, so we cannot store arrays inside scalars.

Arguably, the fact that we cannot do this is the only reason that our scalars are not performant. The list variant is the only one that currently makes an owned allocation on creation (as opposed to shared allocations like ByteBuffer).

Open Questions
CC @gatesn