Support for importing Arrow extension arrays into core vortex arrays and ParquetVariant support.#8125
Support for importing Arrow extension arrays into core vortex arrays and ParquetVariant support.#8125AdamGS wants to merge 1 commit into
Conversation
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Merging this PR will not alter performance
|
| Mode | Benchmark | BASE |
HEAD |
Efficiency | |
|---|---|---|---|---|---|
| ⚡ | Simulation | chunked_varbinview_opt_into_canonical[(1000, 10)] |
240.1 µs | 203.8 µs | +17.84% |
| ❌ | Simulation | chunked_varbinview_opt_canonical_into[(1000, 10)] |
187.8 µs | 224.8 µs | -16.44% |
Tip
Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.
Comparing adamg/parquet-arrow-import-export (27fbac5) with develop (8265aa1)
Polar Signals Profiling ResultsLatest Run
Powered by Polar Signals Cloud |
Benchmarks: PolarSignals ProfilingVortex (geomean): 1.111x ❌ datafusion / vortex-file-compressed (1.111x ❌, 0↑ 6↓)
|
File Sizes: PolarSignals ProfilingNo file size changes detected. |
Benchmarks: FineWeb NVMeVerdict: No clear signal (low confidence) datafusion / vortex-file-compressed (1.014x ➖, 0↑ 0↓)
datafusion / vortex-compact (1.006x ➖, 1↑ 0↓)
datafusion / parquet (0.997x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.965x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.976x ➖, 0↑ 0↓)
duckdb / parquet (1.063x ➖, 0↑ 2↓)
Full attributed analysis
|
File Sizes: FineWeb NVMeNo file size changes detected. |
Benchmarks: TPC-H SF=1 on NVMEVerdict: No clear signal (environment too noisy confidence) datafusion / vortex-file-compressed (1.005x ➖, 0↑ 0↓)
datafusion / vortex-compact (1.016x ➖, 0↑ 0↓)
datafusion / parquet (0.992x ➖, 1↑ 2↓)
datafusion / arrow (1.018x ➖, 0↑ 2↓)
duckdb / vortex-file-compressed (1.009x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.999x ➖, 0↑ 0↓)
duckdb / parquet (0.987x ➖, 2↑ 1↓)
duckdb / duckdb (0.997x ➖, 0↑ 0↓)
Full attributed analysis
|
File Sizes: TPC-H SF=1 on NVMENo file size changes detected. |
Benchmarks: TPC-DS SF=1 on NVMEVerdict: No clear signal (low confidence) datafusion / vortex-file-compressed (0.938x ➖, 18↑ 0↓)
datafusion / vortex-compact (0.947x ➖, 14↑ 0↓)
datafusion / parquet (0.946x ➖, 13↑ 1↓)
duckdb / vortex-file-compressed (0.934x ➖, 18↑ 0↓)
duckdb / vortex-compact (0.939x ➖, 14↑ 0↓)
duckdb / parquet (0.959x ➖, 2↑ 0↓)
duckdb / duckdb (0.963x ➖, 10↑ 4↓)
Full attributed analysis
|
File Sizes: TPC-DS SF=1 on NVMENo file size changes detected. |
Benchmarks: FineWeb S3Verdict: No clear signal (environment too noisy confidence) datafusion / vortex-file-compressed (0.993x ➖, 1↑ 0↓)
datafusion / vortex-compact (1.121x ➖, 0↑ 1↓)
datafusion / parquet (1.031x ➖, 0↑ 1↓)
duckdb / vortex-file-compressed (1.064x ➖, 0↑ 2↓)
duckdb / vortex-compact (0.927x ➖, 1↑ 0↓)
duckdb / parquet (1.004x ➖, 0↑ 0↓)
Full attributed analysis
|
Benchmarks: Statistical and Population GeneticsVerdict: No clear signal (low confidence) duckdb / vortex-file-compressed (1.052x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.029x ➖, 0↑ 1↓)
duckdb / parquet (1.043x ➖, 0↑ 0↓)
Full attributed analysis
|
File Sizes: Statistical and Population GeneticsNo file size changes detected. |
Benchmarks: Random AccessVortex (geomean): 1.100x ➖ unknown / unknown (1.119x ❌, 3↑ 27↓)
|
Benchmarks: TPC-H SF=10 on NVMEVerdict: No clear signal (low confidence) datafusion / vortex-file-compressed (1.076x ➖, 0↑ 8↓)
datafusion / vortex-compact (1.097x ➖, 0↑ 7↓)
datafusion / parquet (1.097x ➖, 0↑ 9↓)
datafusion / arrow (1.087x ➖, 0↑ 8↓)
duckdb / vortex-file-compressed (1.157x ❌, 0↑ 13↓)
duckdb / vortex-compact (1.072x ➖, 0↑ 9↓)
duckdb / parquet (1.020x ➖, 0↑ 0↓)
duckdb / duckdb (1.018x ➖, 0↑ 1↓)
Full attributed analysis
|
File Sizes: TPC-H SF=10 on NVMENo file size changes detected. |
Benchmarks: Clickbench on NVMEVerdict: No clear signal (low confidence) datafusion / vortex-file-compressed (1.015x ➖, 0↑ 3↓)
datafusion / parquet (0.989x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.993x ➖, 4↑ 1↓)
duckdb / parquet (1.003x ➖, 0↑ 0↓)
duckdb / duckdb (0.993x ➖, 0↑ 1↓)
Full attributed analysis
|
File Sizes: Clickbench on NVMEFile Size Changes (1 files changed, -0.0% overall, 0↑ 1↓)
Totals:
|
Benchmarks: TPC-H SF=1 on S3Verdict: No clear signal (environment too noisy confidence) datafusion / vortex-file-compressed (0.882x ➖, 6↑ 4↓)
datafusion / vortex-compact (1.016x ➖, 0↑ 1↓)
datafusion / parquet (1.039x ➖, 2↑ 5↓)
duckdb / vortex-file-compressed (0.897x ➖, 1↑ 0↓)
duckdb / vortex-compact (0.868x ➖, 1↑ 0↓)
duckdb / parquet (0.887x ➖, 0↑ 0↓)
Full attributed analysis
|
|
do we know all of the possible arrow extension types that are considered not extension by Vortex? Is it just Variant? |
Benchmarks: CompressionVortex (geomean): 0.990x ➖ unknown / unknown (0.989x ➖, 0↑ 0↓)
|
| let int_array: ArrowArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3])); | ||
| let result = <Vector as ArrowImportVTable>::from_arrow_array(&Vector, int_array, &ext)?; | ||
| let result = | ||
| <Vector as ArrowImportVTable>::from_arrow_array(&Vector, int_array, &field, &dtype)?; |
There was a problem hiding this comment.
It does feel a bit strange that we pass in both the Vector vtable as well as the extension dtype here? Maybe we don't care
Benchmarks: TPC-H SF=10 on S3Verdict: No clear signal (environment too noisy confidence) datafusion / vortex-file-compressed (0.936x ➖, 2↑ 2↓)
datafusion / vortex-compact (0.914x ➖, 1↑ 1↓)
datafusion / parquet (1.023x ➖, 2↑ 5↓)
duckdb / vortex-file-compressed (0.984x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.065x ➖, 0↑ 1↓)
duckdb / parquet (0.946x ➖, 1↑ 0↓)
Full attributed analysis
|
|
@connortsui20 8-bit boolean too. |
| ParquetVariant.id() | ||
| } | ||
|
|
||
| fn to_arrow_field( |
There was a problem hiding this comment.
Don’t you need to fix this as well to handle non ext dtypes?
Summary
Allows importing/exporting vortex and arrow arrays in the case where the arrow side is an extension but Vortex isn't.
This PR also includes the motivating change - support for ParquetVariant import/export.
API Changes
Changes the signature of
ArrowImportVTable::from_arrow_arrayto include both theField(which has the full arrow-side type info) and a full vortexDType.