Commit c41bb3c

VariantGet RFC
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
1 parent 5cf675e commit c41bb3c

1 file changed

Lines changed: 109 additions & 0 deletions

rfcs/0057-variant-get-expr.md

@@ -0,0 +1,109 @@
1+
- Start Date: 2026-05-05
2+
- Authors: @AdamGS
3+
- RFC PR: [vortex-data/rfcs#57](https://github.com/vortex-data/rfcs/pull/57)
4+
5+
# VariantGet Expression
6+
7+
## Summary
8+
9+
Introduce a new `VariantGet` expression that extracts usable data from variant arrays.
10+
11+
## Motivation
12+
13+
As described in the [Variant RFC](https://github.com/vortex-data/rfcs/blob/develop/rfcs/0015-variant-type.md),
14+
variant arrays are useful for many use cases, but actually using the data requires a fully typed array.
15+
16+
## Design
17+
18+
### Definition
19+
20+
A new `VariantGet` expression is required. It has two inputs:
21+
22+
1. Path to the required child - similar to JSONPath, a combination of names and indexes.
23+
2. Optional dtype - if it is `None`, the expression's return type is `Variant`; otherwise the result is returned as the requested dtype.
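
As a rough illustration of the two inputs, the sketch below models the expression as a path plus an optional dtype. Every name here (`PathElement`, `DType`, `VariantGet`, `parse_path`) is hypothetical and not taken from the Vortex codebase, and the parser only handles a small JSONPath-like subset (`$`, `.name`, `[index]`) that is assumed for the example.

```rust
/// One step of a variant path: a field name or an array index.
#[derive(Debug, Clone, PartialEq)]
pub enum PathElement {
    Field(String),
    Index(usize),
}

/// Hypothetical stand-in for a real dtype; only what the example needs.
#[derive(Debug, Clone, PartialEq)]
pub enum DType {
    Bool,
    I64,
    Utf8,
    Variant,
}

/// Hypothetical shape of the expression: a path plus an optional target dtype.
#[derive(Debug, Clone, PartialEq)]
pub struct VariantGet {
    pub path: Vec<PathElement>,
    pub dtype: Option<DType>,
}

impl VariantGet {
    /// With no dtype the result stays a `Variant`; otherwise it is the requested dtype.
    pub fn return_dtype(&self) -> DType {
        self.dtype.clone().unwrap_or(DType::Variant)
    }
}

/// Parses an assumed JSONPath-like subset: `$`, then `.name` or `[index]` segments.
/// Returns `None` for anything outside that subset.
pub fn parse_path(s: &str) -> Option<Vec<PathElement>> {
    let mut rest = s.strip_prefix('$')?;
    let mut out = Vec::new();
    while !rest.is_empty() {
        if let Some(r) = rest.strip_prefix('.') {
            // A field name runs until the next `.` or `[`.
            let end = r.find(|c: char| c == '.' || c == '[').unwrap_or(r.len());
            if end == 0 {
                return None;
            }
            out.push(PathElement::Field(r[..end].to_string()));
            rest = &r[end..];
        } else if let Some(r) = rest.strip_prefix('[') {
            let end = r.find(']')?;
            out.push(PathElement::Index(r[..end].parse::<usize>().ok()?));
            rest = &r[end + 1..];
        } else {
            return None;
        }
    }
    Some(out)
}
```

Under these assumptions, `VariantGet { path: parse_path("$.a.b")?, dtype: None }` would report `Variant` as its return type, while passing `Some(DType::I64)` would report `I64`.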
24+
25+
### Array
26+
27+
The canonical Variant array will gain an additional child representing optional shredded data. It will now have:
28+
29+
1. Validity
30+
2. Core storage - containing the raw unshredded data, which can be encoded in any way the child array's encoding allows.
31+
3. An optional shredded child.
32+
33+
The "core storage" child may also have its own shredded child.
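
One possible reading of this layout is sketched below. It is not the actual Vortex array definition: `VariantArray`, `Storage`, and `ShreddedChild` are hypothetical names, the shredded child is simplified to a single `i64` column at one path, and the nested case assumes that core storage with its own shredded child is itself a variant array.

```rust
/// Hypothetical stand-in for the core storage child.
#[derive(Debug, Clone, PartialEq)]
pub enum Storage {
    /// Raw unshredded variant bytes, one entry per row.
    Raw(Vec<Vec<u8>>),
    /// Core storage that is itself a variant array with its own shredded child.
    Nested(Box<VariantArray>),
}

/// Simplified shredded child: typed values for a single path.
#[derive(Debug, Clone, PartialEq)]
pub struct ShreddedChild {
    pub path: String,
    pub values: Vec<Option<i64>>,
}

/// The three children described above: validity, core storage, shredded data.
#[derive(Debug, Clone, PartialEq)]
pub struct VariantArray {
    pub validity: Vec<bool>,
    pub core_storage: Storage,
    pub shredded: Option<ShreddedChild>,
}

impl VariantArray {
    /// Counts shredded children reachable by following nested core storage.
    pub fn shredded_levels(&self) -> usize {
        let own = usize::from(self.shredded.is_some());
        match &self.core_storage {
            Storage::Raw(_) => own,
            Storage::Nested(inner) => own + inner.shredded_levels(),
        }
    }
}
```

The recursion in `shredded_levels` mirrors the chain of shredded children that execution has to walk.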
34+
35+
### Execution
36+
37+
When executing the expression on a variant array, it will recursively pull out shredded data until the path is exhausted OR the path reaches a child path that isn't shredded. As we traverse the chain of shredded children along the path, we'll need to keep track of their validity: a row is null at the leaf if it is null at any level along the path, so the leaf child's validity is the AND of all the validity masks (equivalently, an OR of their null masks).
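
A minimal sketch of the validity bookkeeping, assuming each level's validity is a boolean mask where `true` means valid. Under that assumption a row is valid at the leaf only if it is valid at every level, which in null-mask terms is an OR over the levels. The function name `leaf_validity` is hypothetical.

```rust
/// Combines per-level validity masks (true = valid) along a shredded path.
/// Returns `None` if there are no levels or the masks differ in length.
pub fn leaf_validity(levels: &[Vec<bool>]) -> Option<Vec<bool>> {
    let (first, rest) = levels.split_first()?;
    let mut out = first.clone();
    for level in rest {
        if level.len() != out.len() {
            return None;
        }
        // A row stays valid only while every level seen so far is valid.
        for (acc, &valid) in out.iter_mut().zip(level) {
            *acc = *acc && valid;
        }
    }
    Some(out)
}
```

A real implementation would likely work on bitmaps and handle all-valid or all-null levels without materializing a mask; plain `Vec<bool>` is used here only to keep the example short.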
38+
39+
At this point, we have 3 possible cases:
40+
41+
1. Perfectly shredded - there's a fully shredded child at this path. If it matches the expected type or can be cast into it, we can just return it.
42+
2. Partially shredded - data for this path exists in both the shredded child AND in some unshredded values, which we can merge according to the expected type.
43+
3. Unshredded - there is no shredded child at this path, so we try to extract the relevant value from the unshredded values, which are unchanged from the original array.
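
The three cases can be sketched as follows for an `i64` target type. This is an illustration under simplifying assumptions, not the proposed implementation: the names `Shredding`, `classify`, and `variant_get_i64` are hypothetical, the raw data is modelled as already decoded to `Option<i64>` per row, and a `None` entry in the shredded child is taken to mean that the row's value lives only in the raw data.

```rust
/// Which of the three cases applies at the requested path.
#[derive(Debug, PartialEq)]
pub enum Shredding {
    Perfect,
    Partial,
    Unshredded,
}

/// `shredded` is the typed child at the path, if any. `None` entries are rows
/// whose value is assumed to live only in the raw data.
pub fn classify(shredded: Option<&[Option<i64>]>) -> Shredding {
    match shredded {
        None => Shredding::Unshredded,
        Some(rows) if rows.iter().all(Option::is_some) => Shredding::Perfect,
        Some(_) => Shredding::Partial,
    }
}

/// Builds the typed result: prefer the shredded value, otherwise fall back to
/// the value extracted from the raw data. The raw input is only read.
pub fn variant_get_i64(
    shredded: Option<&[Option<i64>]>,
    raw: &[Option<i64>],
) -> Vec<Option<i64>> {
    match shredded {
        None => raw.to_vec(),
        Some(rows) => rows
            .iter()
            .zip(raw)
            .map(|(&s, &r)| s.or(r))
            .collect(),
    }
}
```

In the perfectly shredded case the merge never consults the raw values, so a real implementation could return the shredded child directly instead of building a new array.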
44+
45+
The important invariant is that `VariantGet` changes the typed child selected for the requested
46+
path, but it does not rewrite the raw unshredded data. The raw storage continues to represent the
47+
same original variant values and can still be used by later `VariantGet` expressions for paths that
48+
were not shredded.
49+
50+
```text
51+
Variant array before VariantGet("$.a.b", i64)
52+
53+
+--------------------------------------------------------------+
54+
| validity |
55+
| raw unshredded data --------------------------------------+ |
56+
| shredded children | |
57+
| $.a.b: utf8 / missing / partially materialized | |
58+
| $.x.y: bool | |
59+
+----------------------------------------------------------|---+
60+
|
61+
VariantGet("$.a.b", i64) | unchanged
62+
|
63+
+----------------------------------------------------------|---+
64+
| validity for rows where $.a.b can be read as i64 | |
65+
| raw unshredded data <------------------------------------+ |
66+
| typed child: i64 values for $.a.b |
67+
| built from shredded data, raw data, or a merge of both |
68+
+--------------------------------------------------------------+
69+
```
70+
71+
## Compatibility
72+
73+
TODO: Explain compatibility concerns.
74+
75+
- Does this change the file format or wire format? Is it backward or forward compatible?
76+
- Does this break any public APIs? If so, what is the migration path?
77+
- Are there performance implications?
78+
79+
If there are no compatibility concerns, briefly state why.
80+
81+
## Drawbacks
82+
83+
TODO: Explain the cost of this change.
84+
85+
- Why should we not do this?
86+
- What is the maintenance cost?
87+
- Does this add complexity that could be avoided?
88+
89+
## Alternatives
90+
91+
We could make the `dtype` parameter required, but keeping it optional makes execution more flexible and opens up
92+
opportunities for different usage, which is useful for compute engines that have more flexible type systems or that might want
93+
to process the raw byte data themselves.
94+
95+
## Prior Art
96+
97+
See the [Variant RFC](https://github.com/vortex-data/rfcs/blob/develop/rfcs/0015-variant-type.md).
98+
99+
## Unresolved Questions
100+
101+
TODO: List open questions for the RFC process.
102+
103+
- What parts of the design still need to be resolved?
104+
- What is explicitly out of scope?
105+
- What can be deferred to implementation?
106+
107+
## Future Possibilities
108+
109+
TODO: Capture natural extensions or follow-on work that are out of scope for this RFC.
