| 1 | +- Start Date: 2026-05-05 |
| 2 | +- Authors: @AdamGS |
| 3 | +- RFC PR: [vortex-data/rfcs#57](https://github.com/vortex-data/rfcs/pull/57) |
| 4 | + |
| 5 | +# VariantGet Expression |
| 6 | + |
| 7 | +## Summary |
| 8 | + |
| 9 | +Introduce a new `VariantGet` expression that extracts usable data from variant arrays. |
| 10 | + |
| 11 | +## Motivation |
| 12 | + |
| 13 | +As described in the [Variant RFC](https://github.com/vortex-data/rfcs/blob/develop/rfcs/0015-variant-type.md), |
| 14 | +variant arrays are useful for many use cases, but actually using the data requires a fully typed array. |
| 15 | + |
| 16 | +## Design |
| 17 | + |
| 18 | +### Definition |
| 19 | + |
| 20 | +A new `VariantGet` expression is required. It takes two inputs: |
| 21 | + |
| 22 | +1. Path to the required child - similar to JSONPath, a combination of field names and array indexes. |
| 23 | +2. Optional dtype - if it is `None`, the expression's return type is `Variant`; otherwise the result is returned as the requested dtype. |
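The two inputs above can be sketched as a small value type. This is an illustrative sketch only, not the Vortex API: the names `VariantGet`, `PathSegment`, `parse_path`, and `return_dtype`, the string spelling of dtypes, and the tiny `$`/`.name`/`[index]` path grammar are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Union

# Hypothetical names throughout: this is not the Vortex API.
PathSegment = Union[str, int]  # a field name or an array index


@dataclass(frozen=True)
class VariantGet:
    path: tuple  # tuple of PathSegment values
    dtype: Optional[str] = None  # None means "return a Variant"

    def return_dtype(self) -> str:
        # With no requested dtype the result stays a Variant.
        return "variant" if self.dtype is None else self.dtype


def parse_path(text: str) -> tuple:
    """Parse an assumed JSONPath-like subset: `$`, `.name`, and `[index]`."""
    if not text.startswith("$"):
        raise ValueError("path must start with '$'")
    segments = []
    i = 1
    while i < len(text):
        if text[i] == ".":
            # Field name: read up to the next '.' or '['.
            j = i + 1
            while j < len(text) and text[j] not in ".[":
                j += 1
            if j == i + 1:
                raise ValueError("empty field name")
            segments.append(text[i + 1 : j])
            i = j
        elif text[i] == "[":
            # Array index: read up to the closing ']'.
            j = text.index("]", i)
            segments.append(int(text[i + 1 : j]))
            i = j + 1
        else:
            raise ValueError(f"unexpected character {text[i]!r}")
    return tuple(segments)
```

Under these assumptions, `VariantGet(parse_path("$.a.b"), "i64")` describes a typed read of `$.a.b`, while omitting the dtype leaves the result as a variant.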
| 24 | + |
| 25 | +### Array |
| 26 | + |
| 27 | +The canonical Variant array will gain an additional child representing optional shredded data. It will now have: |
| 28 | + |
| 29 | +1. Validity |
| 30 | +2. Core storage - contains the raw unshredded data, which can be encoded in any way supported by the child array's encoding. |
| 31 | +3. A shredded child. |
| 32 | + |
| 33 | +The "core storage" child may also have its own shredded child. |
| 34 | + |
| 35 | +### Execution |
| 36 | + |
| 37 | +When executing the expression on a variant array, it recursively pulls out shredded data until the path is exhausted OR it reaches a child path that isn't shredded. As we traverse the chain of shredded children along the path, we need to keep track of their validity: a row of the leaf child is null if it is null in any of them, so the leaf's validity is the AND of the validities along the path (equivalently, its null mask is the OR of their null masks). |
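As a rough illustration of the validity bookkeeping, the sketch below combines one validity mask per level of the path. It assumes masks are plain boolean lists where `True` means "valid", so a leaf row is valid only when every level is valid for that row (which is the same as OR-ing the corresponding null masks). The function name `combine_validity` is hypothetical.

```python
def combine_validity(masks):
    """Combine per-level validity masks (True = valid) along a path.

    `masks` is a list of equally sized boolean lists, ordered from the
    root of the path down to the leaf child.
    """
    if not masks:
        raise ValueError("need at least one validity mask")
    length = len(masks[0])
    if any(len(mask) != length for mask in masks):
        raise ValueError("all masks must have the same length")
    # A row is valid at the leaf only if it is valid at every level.
    return [all(level[row] for level in masks) for row in range(length)]
```

A real implementation would presumably operate on bitmaps rather than Python lists; the row-wise logic is the part this sketch is meant to show.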
| 38 | + |
| 39 | +At this point, we have 3 possible cases: |
| 40 | + |
| 41 | +1. Perfectly shredded - there's a fully shredded child at this path. If it matches the expected type or can be cast to it, we can just return it. |
| 42 | +2. Partially shredded - data for this path exists in both the shredded child AND in some unshredded values, which we merge according to the expected type. |
| 43 | +3. Unshredded - there is no shredded child at this path, so we try to extract the relevant value from the unshredded values, which are unchanged from the original array. |
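The three cases can be sketched row by row. This is a simplified model under several assumptions that are not part of the RFC: raw values are already decoded into Python dicts and lists, shredded children are a dict from path to a list of typed values, and `None` in a shredded list means "not shredded for this row" (so it cannot also represent a shredded null). The helpers `lookup_raw`, `to_i64`, and `variant_get` are hypothetical names.

```python
def lookup_raw(value, path):
    """Walk a decoded raw value; return None if the path is missing."""
    for segment in path:
        if isinstance(segment, str) and isinstance(value, dict) and segment in value:
            value = value[segment]
        elif (
            isinstance(segment, int)
            and isinstance(value, list)
            and 0 <= segment < len(value)
        ):
            value = value[segment]
        else:
            return None
    return value


def to_i64(value):
    """Hypothetical cast: None when the value cannot be read as an integer."""
    if isinstance(value, bool):
        return None
    if isinstance(value, int):
        return value
    if isinstance(value, str):
        try:
            return int(value)
        except ValueError:
            return None
    return None


def variant_get(raw, shredded, path, cast):
    """Build the typed child for `path` without modifying `raw`."""
    typed = shredded.get(path)  # None: case 3, nothing shredded at this path
    out = []
    for row, raw_value in enumerate(raw):
        if typed is not None and typed[row] is not None:
            # Cases 1 and 2: this row has a shredded value.
            out.append(cast(typed[row]))
        else:
            # Cases 2 and 3: fall back to the raw unshredded value.
            found = lookup_raw(raw_value, path)
            out.append(None if found is None else cast(found))
    return out
```

In this model a perfectly shredded path never reaches the fallback branch, an unshredded path always does, and a partially shredded path mixes both. The raw list is only read, never rewritten.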
| 44 | + |
| 45 | +The important invariant is that `VariantGet` changes the typed child selected for the requested |
| 46 | +path, but it does not rewrite the raw unshredded data. The raw storage continues to represent the |
| 47 | +same original variant values and can still be used by later `VariantGet` expressions for paths that |
| 48 | +were not shredded. |
| 49 | + |
| 50 | +```text |
| 51 | +Variant array before VariantGet("$.a.b", i64) |
| 52 | + |
| 53 | ++--------------------------------------------------------------+ |
| 54 | +| validity | |
| 55 | +| raw unshredded data --------------------------------------+ | |
| 56 | +| shredded children | | |
| 57 | +| $.a.b: utf8 / missing / partially materialized | | |
| 58 | +| $.x.y: bool | | |
| 59 | ++----------------------------------------------------------|---+ |
| 60 | + | |
| 61 | +VariantGet("$.a.b", i64) | unchanged |
| 62 | + | |
| 63 | ++----------------------------------------------------------|---+ |
| 64 | +| validity for rows where $.a.b can be read as i64 | | |
| 65 | +| raw unshredded data <------------------------------------+ | |
| 66 | +| typed child: i64 values for $.a.b | |
| 67 | +| built from shredded data, raw data, or a merge of both | |
| 68 | ++--------------------------------------------------------------+ |
| 69 | +``` |
| 70 | + |
| 71 | +## Compatibility |
| 72 | + |
| 73 | +TODO: Explain compatibility concerns. |
| 74 | + |
| 75 | +- Does this change the file format or wire format? Is it backward or forward compatible? |
| 76 | +- Does this break any public APIs? If so, what is the migration path? |
| 77 | +- Are there performance implications? |
| 78 | + |
| 79 | +If there are no compatibility concerns, briefly state why. |
| 80 | + |
| 81 | +## Drawbacks |
| 82 | + |
| 83 | +TODO: Explain the cost of this change. |
| 84 | + |
| 85 | +- Why should we not do this? |
| 86 | +- What is the maintenance cost? |
| 87 | +- Does this add complexity that could be avoided? |
| 88 | + |
| 89 | +## Alternatives |
| 90 | + |
| 91 | +We could make the dtype parameter required, but I think keeping it optional makes execution more flexible and opens up |
| 92 | +opportunities for different usage, which is useful for compute engines that have more flexible type systems or that want |
| 93 | +to process the raw byte data themselves. |
| 94 | + |
| 95 | +## Prior Art |
| 96 | + |
| 97 | +See the [Variant RFC](https://github.com/vortex-data/rfcs/blob/develop/rfcs/0015-variant-type.md). |
| 98 | + |
| 99 | +## Unresolved Questions |
| 100 | + |
| 101 | +TODO: List open questions for the RFC process. |
| 102 | + |
| 103 | +- What parts of the design still need to be resolved? |
| 104 | +- What is explicitly out of scope? |
| 105 | +- What can be deferred to implementation? |
| 106 | + |
| 107 | +## Future Possibilities |
| 108 | + |
| 109 | +TODO: Capture natural extensions or follow-on work that are out of scope for this RFC. |