Extension Data Types #6500

connortsui20 · 2026-02-13T17:48:59Z

connortsui20
Feb 13, 2026
Maintainer

More Robust Extension Data Types In Vortex

We would like to build a more robust system for extension data types (or DTypes).

#6081 introduced vtables for extension DTypes. Each extension type (e.g. Timestamp) now implements ExtDTypeVTable, which handles validation, serialization, and metadata. The type-erased ExtDTypeRef carries this vtable with it inside DType::Extension.

The natural next steps would be to add analogous vtables for vortex-scalar (e.g., custom Display and casting) and vortex-array (custom compute kernels). This would give us three traits:

ExtDTypeVTable   (vortex-dtype)
ExtScalarVTable  (vortex-scalar)
ExtArrayVTable   (vortex-array)

Issues

There is a problem with this approach, and it has come up a few times during the effort to add the scalar extension vtable to scalar extension values (#6477).

Once an ExtDType is type-erased to ExtDTypeRef, the only thing it carries is the dtype vtable (not the scalar or array vtables). Suppose you have an ExtensionArray and want to call scalar or array logic: you then need to look up the other vtables by ExtID in a session registry. This means threading &VortexSession through every code path that touches extension types (compute kernels, builders, canonicalization, display, etc.).

This is kind of torturous because the dtype vtable is literally right there inside the DType, but the scalar/array vtables require a registry lookup to find. As things stand, sessions would need to be plumbed through APIs that otherwise have no reason to take one (in other words, many constructors would need to take a session if they could potentially create an extension type). If the vtables were combined, we would not have this issue.
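The asymmetry described above can be sketched with stand-in types. Every name here (`DTypeVTable`, `ScalarVTable`, `Session`, the `Timestamp` struct and its ID string) is a hypothetical simplification for illustration, not the real Vortex API. The point is only that dtype-level logic needs nothing but the dtype, while scalar-level logic grows a session parameter purely to find the other vtable.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// All names below are hypothetical stand-ins, not the real Vortex types.
type ExtId = &'static str;

trait DTypeVTable: Send + Sync {
    fn id(&self) -> ExtId;
}

trait ScalarVTable: Send + Sync {
    fn display(&self, raw: i64) -> String;
}

/// Type-erased extension dtype: it carries only the dtype vtable.
#[derive(Clone)]
struct ExtDTypeRef {
    vtable: Arc<dyn DTypeVTable>,
}

/// Stand-in for the session registry that maps extension IDs to scalar vtables.
#[derive(Default)]
struct Session {
    scalar_vtables: HashMap<ExtId, Arc<dyn ScalarVTable>>,
}

struct Timestamp;

impl DTypeVTable for Timestamp {
    fn id(&self) -> ExtId {
        "example.timestamp"
    }
}

impl ScalarVTable for Timestamp {
    fn display(&self, raw: i64) -> String {
        format!("{raw}ms since epoch")
    }
}

/// Dtype-level logic needs nothing beyond the dtype itself.
fn ext_id(dtype: &ExtDTypeRef) -> ExtId {
    dtype.vtable.id()
}

/// Scalar-level logic has to take a session, purely to find the other vtable.
fn display_scalar(session: &Session, dtype: &ExtDTypeRef, raw: i64) -> Option<String> {
    let vtable = session.scalar_vtables.get(dtype.vtable.id())?;
    Some(vtable.display(raw))
}
```

In this sketch, `display_scalar` returns `None` when the extension was never registered in the session, a failure mode that `ext_id` cannot have.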

What we probably want is a single ExtVTable per extension dtype that covers all three layers, so that when you have an ExtDTypeRef you already have everything you need.

Crate Dependency Graph!

The crate dependency graph is:

vortex-array --(depends on)--> vortex-scalar --(depends on)--> vortex-dtype

A unified vtable trait would need to reference types from all three crates. That is impossible while the trait lives in vortex-dtype, because vortex-dtype cannot depend on vortex-scalar or vortex-array without creating a dependency cycle.

Potential Solutions

Here are some potential solutions, some uglier than others...

Make the `VortexSession` a global static

This is not great for hygiene, but it would mean that everything can access the session and look up vtables without having to pass VortexSession around everywhere.

Of course, whether this is worth adding a global execution context is up for debate.
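For a sense of what the global-static option might look like, here is a minimal sketch using `std::sync::OnceLock`. The registry shape, the `register`/`lookup` helpers, and the `Millis` type are assumptions made for illustration; the real `VortexSession` holds far more than a single map.

```rust
use std::collections::HashMap;
use std::sync::{Arc, OnceLock, RwLock};

// Hypothetical stand-ins; the real `VortexSession` holds far more than this.
type ExtId = &'static str;

trait ScalarVTable: Send + Sync {
    fn display(&self, raw: i64) -> String;
}

type Registry = RwLock<HashMap<ExtId, Arc<dyn ScalarVTable>>>;

/// Process-wide registry, created lazily on first access.
fn global_registry() -> &'static Registry {
    static REGISTRY: OnceLock<Registry> = OnceLock::new();
    REGISTRY.get_or_init(|| RwLock::new(HashMap::new()))
}

/// Registers (or replaces) the scalar vtable for an extension ID.
fn register(id: ExtId, vtable: Arc<dyn ScalarVTable>) {
    global_registry().write().unwrap().insert(id, vtable);
}

/// No session parameter: any code path can resolve a vtable from an ID alone.
fn lookup(id: ExtId) -> Option<Arc<dyn ScalarVTable>> {
    global_registry().read().unwrap().get(id).cloned()
}

struct Millis;

impl ScalarVTable for Millis {
    fn display(&self, raw: i64) -> String {
        format!("{raw}ms")
    }
}
```

The hygiene cost shows up in the signatures: nothing in `lookup` reveals that it depends on process-wide mutable state, so two tests that register different vtables under the same ID would interfere with each other.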

Merge the Crates

In my opinion, this is a better solution than the above.

If vortex-dtype, vortex-scalar, and vortex-array were a single crate (or at least the extension vtable machinery lived in one place that could see all three), we could define:

pub trait ExtVTable: 'static + Send + Sync + ... {
    type Metadata: ...;

    // `DType`

    fn id(&self) -> ExtID;
    fn validate(&self, metadata: &Self::Metadata, storage: &DType) -> VortexResult<()>;
    fn serialize(&self, metadata: &Self::Metadata) -> VortexResult<Vec<u8>>;
    fn deserialize(&self, data: &[u8]) -> VortexResult<Self::Metadata>;

    // `Scalar`

    // (This is not how it actually would look, but close enough)
    fn display(&self, metadata: &Self::Metadata, value: &ScalarValue, f: &mut fmt::Formatter) -> fmt::Result { ... }
    fn cast(&self, ...) -> VortexResult<Scalar> { ... }

    // `ArrayRef`

    fn cast_array(&self, ...) -> VortexResult<ArrayRef> { ... }
    // <-- Probably a lot more than this -->
}
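To make the shape of such a trait concrete, here is a runnable sketch of a heavily simplified version with one implementation. The `DType` and `ScalarValue` enums are small stand-ins that do not mirror the real Vortex definitions, errors are plain `String`s instead of `VortexResult`, the array layer is faked with a slice of scalars, and the `Uuid` type with its ID string is hypothetical.

```rust
// Simplified stand-ins for `DType` and `ScalarValue`; neither mirrors the real
// Vortex definitions.
#[derive(Debug, Clone, PartialEq)]
enum DType {
    U64,
    FixedBytes(usize),
}

#[derive(Debug, Clone, PartialEq)]
enum ScalarValue {
    U64(u64),
    Bytes(Vec<u8>),
}

/// One trait covering the dtype, scalar, and array layers.
trait ExtVTable: 'static + Send + Sync {
    type Metadata;

    // dtype layer
    fn id(&self) -> &'static str;
    fn validate(&self, metadata: &Self::Metadata, storage: &DType) -> Result<(), String>;

    // scalar layer
    fn display(&self, metadata: &Self::Metadata, value: &ScalarValue) -> String;

    // array layer (faked with a slice): a default body an extension may override
    fn display_array(&self, metadata: &Self::Metadata, values: &[ScalarValue]) -> Vec<String> {
        values.iter().map(|v| self.display(metadata, v)).collect()
    }
}

struct Uuid;

impl ExtVTable for Uuid {
    type Metadata = ();

    fn id(&self) -> &'static str {
        "example.uuid"
    }

    fn validate(&self, _metadata: &(), storage: &DType) -> Result<(), String> {
        match storage {
            DType::FixedBytes(16) => Ok(()),
            other => Err(format!("uuid needs 16-byte storage, got {other:?}")),
        }
    }

    fn display(&self, _metadata: &(), value: &ScalarValue) -> String {
        match value {
            // Render the raw bytes as lowercase hex.
            ScalarValue::Bytes(bytes) => bytes.iter().map(|b| format!("{b:02x}")).collect(),
            other => format!("{other:?}"),
        }
    }
}
```

With everything on one trait, a caller holding the vtable can validate storage, format a scalar, and format a batch without consulting any registry.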

`ScalarValue`s should store `ArrayRef` instead of `Vec<Option<ScalarValue>>`

Another thing that I have yet to mention is that we probably want to have ScalarValues that can hold an ArrayRef directly. Right now, scalar lists are stored as Vec<Option<ScalarValue>>, which is extremely heavyweight. You can imagine for an extension type like a Tensor that scalars would instantly become a bottleneck for any compute operations like matrix multiplication.

This is impossible with the current crate structure as vortex-array depends on vortex-scalar, so we cannot store arrays inside scalars.

Arguably, the fact that we cannot do this is the only reason that our scalars are not performant. The list variant is the only one that currently makes an owned allocation on creation (as opposed to shared allocations like ByteBuffer).
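The difference between the two representations can be sketched with stand-in types. `PrimitiveArray`, the `ArrayRef` alias, and both enums below are hypothetical simplifications (the real `ArrayRef` is not an `Arc<PrimitiveArray>`); the sketch only contrasts an owned per-element `Vec` with a shared, reference-counted array.

```rust
use std::sync::Arc;

// Hypothetical stand-ins: `PrimitiveArray` plays the role of an array behind an `ArrayRef`.
struct PrimitiveArray {
    values: Vec<f32>,
}

type ArrayRef = Arc<PrimitiveArray>;

/// Current shape as described above: one owned, nullable element per list entry.
#[derive(Clone)]
enum OwnedScalarValue {
    F32(f32),
    List(Vec<Option<OwnedScalarValue>>),
}

/// Alternative shape: the list variant shares an existing array allocation.
#[derive(Clone)]
enum SharedScalarValue {
    F32(f32),
    List(ArrayRef),
}

/// Builds the owned form: allocates a new `Vec` and wraps every element.
fn owned_list(values: &[f32]) -> OwnedScalarValue {
    OwnedScalarValue::List(
        values
            .iter()
            .map(|v| Some(OwnedScalarValue::F32(*v)))
            .collect(),
    )
}

/// Builds the shared form from an existing array.
fn shared_list(array: &ArrayRef) -> SharedScalarValue {
    // Cloning an `Arc` bumps a reference count; the element buffer is not copied.
    SharedScalarValue::List(Arc::clone(array))
}
```

Cloning a `SharedScalarValue::List` costs the same regardless of length, whereas cloning the owned form copies every element.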

Open Questions

Are there other approaches that avoid merging the crates or having global static variables? We haven't been able to think of any.
Is the crate split between dtype/scalar/array load-bearing for compile times or other reasons (I strongly doubt this)?
Are there extension-array operations that shouldn't be bundled into the vtable?
Is this overkill or underkill?
It might be a good idea to figure out what exactly we want from extension types. The extension types we know we want are tensors and UUID, but we should work out what kinds of APIs they need and what a clean interface with Vortex might look like.

CC @gatesn

a10y · 2026-02-13T21:10:36Z

a10y
Feb 13, 2026
Maintainer

Another alternative I was mulling over is adding VTable pointers to the ExtDTypeVTable but type-erased, e.g.

And then the appropriate component would be able to downcast as necessary. But that actually doesn't end up working b/c you can't upcast from Any to another DynTrait.
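A hypothetical sketch of the limitation described above (not the snippet the comment refers to; `ExtDTypeVTableSlots`, `ScalarVTable`, and `Timestamp` are invented names): a type-erased slot can be downcast to a concrete type, but not to another trait object.

```rust
use std::any::Any;
use std::sync::Arc;

// Hypothetical names throughout; this is an illustration, not Vortex code.
trait ScalarVTable {
    fn display(&self, raw: i64) -> String;
}

struct Timestamp;

impl ScalarVTable for Timestamp {
    fn display(&self, raw: i64) -> String {
        format!("{raw}ms")
    }
}

/// A dtype-level vtable holding a type-erased slot for the scalar layer.
struct ExtDTypeVTableSlots {
    scalar: Arc<dyn Any + Send + Sync>,
}

/// Downcasting works only when the caller can name the concrete type.
fn display_as_timestamp(slots: &ExtDTypeVTableSlots, raw: i64) -> Option<String> {
    let concrete = slots.scalar.downcast_ref::<Timestamp>()?;
    Some(concrete.display(raw))
}

// There is no equivalent of `slots.scalar.downcast_ref::<dyn ScalarVTable>()`:
// `downcast_ref::<T>` requires a sized `T`, and `dyn Any` exposes only a `TypeId`,
// not the vtable needed to build a `&dyn ScalarVTable`.
```

Generic code in the scalar crate does not know the concrete type of every extension, so a helper like `display_as_timestamp` cannot be written once for all of them.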

The central tension here is that extension types need override behavior across all of DTypes, Scalars and Arrays, and when you're implementing one of these extension types, you generally need to implement all of the behavior in one compilation unit. So I guess it seems fairly natural to merge them.

connortsui20 · 2026-02-13T23:47:21Z

connortsui20
Feb 13, 2026
Maintainer Author

Another thing I should mention is that the VTable pattern we use is very, very difficult to understand without taking a good amount of time in the trenches. (Or maybe I'm just dumb and it is actually easy to understand, but I have a feeling this isn't the case.)

Obviously users of Vortex do not have to deal with the internals of how this works. But for maintainability having 3 separate VTables (that get wrapped up somewhere else) makes it that much more difficult to understand how extension types will work.

Of course our users should not have to worry about the internals, but if someone wants to add an extension type on their own (without our help), I really would not want to be tracing through all of the dispatch with rust-analyzer if something has gone wrong... After all, the whole reason we want extension types is so that we don't have to write that code!

gatesn · 2026-02-14T02:39:21Z

gatesn
Feb 14, 2026
Maintainer

I have a discussion here on vtables: #6093

I actually quite like what the latest design (ExtDType) gets us. But I agree that the names of each part of the design pattern make things very confusing. We should fix up the names and then unify across all extension points.
