
feat(parquet-variant): add Dictionary and REE variant_to_arrow support#10014

Open
mneetika wants to merge 1 commit into apache:main from mneetika:feature/variant-dictionary-support

Conversation

@mneetika

@mneetika mneetika commented May 24, 2026

Which issue does this PR close?

Rationale for this change

variant_get / variant_to_arrow can already convert Variant values into many native Arrow array layouts, but requesting DataType::Dictionary or DataType::RunEndEncoded was not supported.

This PR adds support for those output encodings without changing Variant shredding semantics. Dictionary and RunEndEncoded are produced as Arrow result arrays only; they are not introduced as valid Parquet Variant shredded typed_value layouts.

What changes are included in this PR?

  1. Adds an encoded output builder in variant_to_arrow for DataType::Dictionary and DataType::RunEndEncoded.
  2. Builds the logical child value array using the existing Variant-to-Arrow builders, then delegates the final Dictionary/REE encoding to Arrow's existing cast kernels.
  3. Adds variant_get regression coverage for string dictionary, numeric dictionary, and run-end encoded outputs.
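The two-step shape in item 2 (materialize the logical values, then encode them) can be illustrated with a std-only sketch. This is not the PR's code: the PR delegates the encoding step to Arrow's cast kernels, whereas `DictEncoded` and `dictionary_encode` below are hypothetical stand-ins that only show what a dictionary encoding of an already-built logical column looks like.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an Arrow dictionary array (not an arrow-rs type):
/// `keys[i]` indexes into `values`, and `None` marks a null row.
#[derive(Debug, PartialEq)]
pub struct DictEncoded {
    pub keys: Vec<Option<u32>>,
    pub values: Vec<String>,
}

/// Dictionary-encode a logical column that has already been materialized.
/// Each distinct string is stored once, in first-seen order.
pub fn dictionary_encode(logical: &[Option<&str>]) -> DictEncoded {
    let mut index: HashMap<&str, u32> = HashMap::new();
    let mut values: Vec<String> = Vec::new();
    let mut keys: Vec<Option<u32>> = Vec::with_capacity(logical.len());

    for item in logical {
        match item {
            // Nulls are carried by the keys, not by the dictionary values.
            None => keys.push(None),
            Some(s) => {
                // Key a new distinct value would receive.
                let next = values.len() as u32;
                let key = *index.entry(*s).or_insert_with(|| {
                    values.push(s.to_string());
                    next
                });
                keys.push(Some(key));
            }
        }
    }

    DictEncoded { keys, values }
}
```

For example, encoding `[Some("a"), Some("b"), None, Some("a")]` stores the values `"a"` and `"b"` once each and gives the first and last rows the same key.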

Are these changes tested?

Yes:

  • cargo fmt --check
  • cargo test -p parquet-variant-compute
  • cargo test -p parquet-variant
  • cargo clippy --workspace --all-targets

Are there any user-facing changes?

Yes. variant_get with as_type set to DataType::Dictionary or DataType::RunEndEncoded can now return those Arrow array encodings.
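For readers less familiar with the run-end encoded layout a caller would get back, here is a minimal std-only sketch. It assumes the Arrow convention that each run end is the exclusive logical index at which its run stops; `RunEndEncoded`, `run_end_encode`, and `value_at` are hypothetical names for illustration, not arrow-rs or parquet-variant APIs.

```rust
/// Hypothetical stand-in for an Arrow run-end encoded array (not an arrow-rs type).
/// `run_ends[k]` is the exclusive logical end index of run `k`;
/// `values[k]` is the value repeated throughout that run (possibly null).
#[derive(Debug, PartialEq)]
pub struct RunEndEncoded {
    pub run_ends: Vec<i32>,
    pub values: Vec<Option<i64>>,
}

/// Run-end encode an already materialized logical column.
pub fn run_end_encode(logical: &[Option<i64>]) -> RunEndEncoded {
    let mut run_ends: Vec<i32> = Vec::new();
    let mut values: Vec<Option<i64>> = Vec::new();

    for (i, item) in logical.iter().enumerate() {
        if values.last() == Some(item) {
            // Same value as the current run: push its end one row further.
            *run_ends.last_mut().unwrap() = i as i32 + 1;
        } else {
            // Value changed (or first row): start a new run.
            values.push(*item);
            run_ends.push(i as i32 + 1);
        }
    }

    RunEndEncoded { run_ends, values }
}

impl RunEndEncoded {
    /// Logical value at row `i`; panics if `i` is past the last run end.
    pub fn value_at(&self, i: usize) -> Option<i64> {
        // The first run whose end is greater than `i` contains row `i`.
        let run = self.run_ends.partition_point(|&end| (end as usize) <= i);
        self.values[run]
    }
}
```

Nulls live in the values child here, so a run of consecutive nulls collapses into a single run like any other repeated value.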

@github-actions github-actions Bot added the parquet-variant parquet-variant* crates label May 24, 2026
Contributor

@scovich scovich left a comment


LGTM, but I'd love a second review from @sdf-jkl or @codephage2020 as a sanity check

@sdf-jkl
Contributor

sdf-jkl commented May 26, 2026

Hey @scovich, the PR introduces unshred support for Variants where typed_value is Dict/REE, which is not permitted by the spec.

The issue I created is for variant_to_arrow (kind of opposite of unshredding, but not fully).

Seems like AI slop to me.

@scovich
Contributor

scovich commented May 26, 2026

> the PR introduces unshred support for Variants where typed_value is Dict/REE, which is not permitted by the spec

🤦 I keep forgetting that, thanks for catching it.

@mneetika mneetika force-pushed the feature/variant-dictionary-support branch from 7fcd57f to 87fbf4b Compare May 26, 2026 22:47
@mneetika mneetika changed the title feat(parquet-variant): add dictionary and run-end encoded support to … feat(parquet-variant): add Dictionary and REE variant_to_arrow support May 26, 2026
@mneetika
Author

@scovich Thanks for catching this, and apologies for the incorrect update.

You were right that Dictionary / RunEndEncoded are not valid Parquet Variant shredded typed_value layouts, so adding support in unshred_variant was wrong.

I have updated the PR to target the actual issue instead: variant_to_arrow / variant_get(as_type=...) output support for DataType::Dictionary and DataType::RunEndEncoded. The implementation now builds the logical value array first and delegates the final Dictionary/REE encoding to Arrow’s existing cast kernels.

I also updated the PR title/body and added regression tests for string dictionary, numeric dictionary, and run-end encoded outputs.

Again, apologies for the incorrect PR.

@mneetika mneetika requested a review from scovich May 26, 2026 23:07
Contributor

@sdf-jkl sdf-jkl left a comment


Thanks @mneetika, LGTM. Sorry about the rash comment above.



Development

Successfully merging this pull request may close these issues.

[Variant] Add variant_to_arrow Dictionary/REE type support

3 participants