ClickHouse native format integration (vortex-clickhouse crate) #6425
Replies: 1 comment 4 replies
Hello, thank you for looking into this! I think the high-level architecture is reasonable.
One thing to note is that I am experimenting with porting our existing engine integrations over to the new Scan API. Both the producer and consumer side of this API live on some branches of mine, but I do think it is worth you looking at this API and flagging anything you think may prevent good integration with ClickHouse. The API is in flux so now's a good time! See: https://docs.vortex.dev/concepts/scanning
On the extensions questions: yes, Vortex should ship with a default extension type registry. @connortsui20 is working to overhaul the APIs for doing this to make them more expressive. We may well be missing FixedSizeBinary as a native DType to support e.g. UUID. Depends on how philosophical we want to get about "logical" types.
It will obviously take some time to look through the PR!
Proposal
I'd like to contribute a new vortex-clickhouse crate that enables ClickHouse to natively read and write Vortex files via FORMAT Vortex.
Motivation
ClickHouse is one of the most widely deployed OLAP databases. Native Vortex format support
would allow ClickHouse users to directly query and produce Vortex files, benefiting from
Vortex's adaptive encoding without requiring external conversion pipelines.
This is similar in spirit to the existing vortex-datafusion integration but targets ClickHouse's format system instead of DataFusion's TableProvider.
Integrating Vortex as a native format in ClickHouse benefits the Vortex ecosystem in several ways:
Broad adoption. ClickHouse has a massive user base across ad-tech, observability, and analytics. Native support lets users query Vortex files with zero setup via SELECT * FROM file('data.vortex', 'Vortex'), lowering the adoption barrier significantly.
Interchange format. Combined with the existing vortex-datafusion integration, ClickHouse support positions Vortex as a viable interchange format between analytical engines, with data flowing end-to-end without decoding/re-encoding.
Production-scale stress testing. ClickHouse deployments handle petabyte-scale data with high concurrency, exercising Vortex's reader/writer and type system under real workloads that benchmarks alone cannot replicate.
Extension type validation. ClickHouse's rich type system (Int128/256, UUID, IPv4/6, Geo, Enum, DateTime with timezone, LowCardinality, etc.) serves as a comprehensive real-world test case for Vortex's extension type design.
Comparative benchmarking. ClickHouse's mature Parquet support provides a natural baseline for apples-to-apples Vortex vs Parquet comparisons on compression, scan throughput, and query latency.
Design
The crate compiles to a C static library and exposes an opaque-handle-based C FFI.
On the ClickHouse side, a thin C++ shim implements IInputFormat/IOutputFormat by calling into Rust through this FFI, the same pattern used by other Rust integrations in ClickHouse (BLAKE3, skim, etc.).
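As a rough illustration of the opaque-handle pattern described above, the sketch below shows how a Rust type can be handed to C++ as a raw pointer and released through a matching free function. The names `Scanner`, `vx_scanner_open`, `vx_scanner_num_columns`, and `vx_scanner_free` are hypothetical and are not taken from the crate; the schema is a placeholder, and the export attribute that a real static library would need for link-by-name is left out to keep the sketch independent of the Rust edition.

```rust
use std::ffi::{c_char, CStr};
use std::ptr;

/// Hypothetical scanner state. A real implementation would hold an open
/// Vortex file; here it only carries placeholder column names.
pub struct Scanner {
    columns: Vec<String>,
}

/// Returns an opaque handle, or null if `path` is null or not valid UTF-8.
/// The caller owns the handle and must release it with `vx_scanner_free`.
/// (A real export would also carry the no_mangle attribute.)
pub extern "C" fn vx_scanner_open(path: *const c_char) -> *mut Scanner {
    if path.is_null() {
        return ptr::null_mut();
    }
    // SAFETY: the caller promises `path` points to a NUL-terminated string.
    if unsafe { CStr::from_ptr(path) }.to_str().is_err() {
        return ptr::null_mut();
    }
    // Assumption: the file would be opened here; the schema is faked.
    let columns = vec!["id".to_owned(), "ts".to_owned()];
    Box::into_raw(Box::new(Scanner { columns }))
}

/// Reads through the handle without taking ownership; a null handle yields 0.
pub extern "C" fn vx_scanner_num_columns(scanner: *const Scanner) -> usize {
    // SAFETY: the handle is either null or came from `vx_scanner_open`.
    match unsafe { scanner.as_ref() } {
        Some(s) => s.columns.len(),
        None => 0,
    }
}

/// Takes ownership back from the C side and drops the scanner.
pub extern "C" fn vx_scanner_free(scanner: *mut Scanner) {
    if !scanner.is_null() {
        // SAFETY: the handle came from `Box::into_raw` and is freed only once.
        drop(unsafe { Box::from_raw(scanner) });
    }
}
```

The C++ side never sees the layout of `Scanner`; it only stores the pointer and passes it back, which is what keeps the shim thin.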
Key components:
VortexScanner: Opens a Vortex file (local or remote via object_store), exposes schema metadata, supports column projection pushdown, and yields batches for zero-copy export into ClickHouse columns.
VortexWriter: Streaming writes via a bounded channel to a background async task, keeping memory usage bounded.
Types without native Vortex equivalents (Int128/256, UUID, IPv4/6, Geo, Enum, DateTime
with timezone, FixedString, LowCardinality) are modeled as Vortex extension types with
custom metadata for lossless round-trip.
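The bounded-channel write path mentioned for VortexWriter can be sketched with the standard library alone. This is not the crate's implementation: it substitutes an OS thread and `std::sync::mpsc::sync_channel` for the async task, and `StreamingWriter` and `Batch` are hypothetical names. What it shows is the backpressure idea, where the producer blocks once `capacity` batches are queued, so memory stays bounded.

```rust
use std::io::{self, Write};
use std::sync::mpsc::{sync_channel, SyncSender};
use std::thread::{self, JoinHandle};

/// Hypothetical batch type standing in for an encoded chunk of columns.
type Batch = Vec<u8>;

pub struct StreamingWriter {
    tx: SyncSender<Batch>,
    worker: JoinHandle<io::Result<u64>>,
}

impl StreamingWriter {
    /// `capacity` is the maximum number of batches queued but not yet written.
    pub fn new<W: Write + Send + 'static>(mut sink: W, capacity: usize) -> Self {
        let (tx, rx) = sync_channel::<Batch>(capacity);
        let worker = thread::spawn(move || -> io::Result<u64> {
            let mut written = 0u64;
            // The loop ends once every sender has been dropped.
            for batch in rx {
                sink.write_all(&batch)?;
                written += batch.len() as u64;
            }
            sink.flush()?;
            Ok(written)
        });
        Self { tx, worker }
    }

    /// Blocks while the queue is full, which applies backpressure to the caller.
    pub fn write(&self, batch: Batch) -> io::Result<()> {
        self.tx
            .send(batch)
            .map_err(|_| io::Error::new(io::ErrorKind::BrokenPipe, "writer stopped"))
    }

    /// Closes the channel, waits for the worker, and returns bytes written.
    pub fn finish(self) -> io::Result<u64> {
        let StreamingWriter { tx, worker } = self;
        drop(tx);
        worker
            .join()
            .map_err(|_| io::Error::new(io::ErrorKind::Other, "writer panicked"))?
    }
}
```

With a capacity of 2, for example, at most two batches sit in the queue while a third is being written, regardless of how fast the producer runs.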
Current Status
I have a working implementation with 225 passing tests covering type conversion, data
export, extension types, and end-to-end read/write cycles. The crate is ready for
initial review.
Questions for the Community
Most OLAP systems have specialized types beyond the primitive/string/struct basics —
Map, Array, Geo, UUID, IPv4/6, Decimal variants, etc. — but each system defines them
differently (e.g., ClickHouse's IPv6 is 16-byte fixed, DuckDB stores it as HUGEINT, Spark uses binary).
In this integration, I modeled ClickHouse-specific types as Vortex extension types
with custom metadata. However, this means a Vortex file written by ClickHouse with IPv6 or UUID columns uses ClickHouse-flavored extensions that other systems wouldn't recognize out of the box.
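To make the layout mismatch concrete, here is a small sketch converting one IPv6 address between a 16-byte fixed representation and a 128-bit integer representation. The helper names are hypothetical, and the byte order (network order, big-endian) is an assumption made for illustration; the exact in-memory conventions of each engine would need to be checked against its own documentation.

```rust
use std::net::Ipv6Addr;

/// 16-byte fixed layout: the address octets in network byte order.
fn to_fixed16(addr: Ipv6Addr) -> [u8; 16] {
    addr.octets()
}

/// 128-bit integer layout: the same octets read as a big-endian integer.
fn to_u128(addr: Ipv6Addr) -> u128 {
    u128::from(addr)
}

/// Converts the fixed-width bytes into the integer layout.
fn fixed16_to_u128(bytes: [u8; 16]) -> u128 {
    u128::from_be_bytes(bytes)
}

/// Converts the integer layout back into fixed-width bytes.
fn u128_to_fixed16(value: u128) -> [u8; 16] {
    value.to_be_bytes()
}
```

The conversion is lossless in both directions, but a reader can only apply it if the file says which layout and byte order was used, which is the role the extension metadata plays.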
Would the Vortex project consider establishing a common extension type registry —
a set of standardized extension types (UUID, IPv4/6, Geo, JSON, etc.) with canonical
storage layouts that all integrations map to? This way:
DuckDB, or any future integration without special handling.
rather than each defining its own incompatible variants.
across systems.
I'm happy to refactor the ClickHouse extension types to align with such a registry
if one exists or is planned.
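One possible shape for such a registry, purely as a discussion aid: each engine maps its own type name to a shared canonical identifier with a fixed storage layout. Everything here is hypothetical, including the `Registry`, `Storage`, and `Entry` types, the `example.*` identifiers, and the layouts chosen; none of it reflects an existing Vortex API.

```rust
/// Illustrative storage layouts a canonical extension type could declare.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Storage {
    FixedBytes(usize),
    UInt(u8),
}

/// One engine-specific type name mapped to a canonical extension type.
struct Entry {
    engine: &'static str,
    engine_type: &'static str,
    canonical: &'static str,
    storage: Storage,
}

pub struct Registry {
    entries: Vec<Entry>,
}

impl Registry {
    /// Hypothetical defaults; identifiers and layouts are placeholders.
    pub fn with_defaults() -> Self {
        let entry = |engine, engine_type, canonical, storage| Entry {
            engine,
            engine_type,
            canonical,
            storage,
        };
        Self {
            entries: vec![
                entry("clickhouse", "UUID", "example.uuid", Storage::FixedBytes(16)),
                entry("duckdb", "UUID", "example.uuid", Storage::FixedBytes(16)),
                entry("clickhouse", "IPv6", "example.ipv6", Storage::FixedBytes(16)),
                entry("clickhouse", "IPv4", "example.ipv4", Storage::UInt(32)),
            ],
        }
    }

    /// Looks up the canonical id and storage for an engine's type name.
    pub fn resolve(&self, engine: &str, engine_type: &str) -> Option<(&'static str, Storage)> {
        self.entries
            .iter()
            .find(|e| e.engine == engine && e.engine_type == engine_type)
            .map(|e| (e.canonical, e.storage))
    }
}
```

Under this scheme two integrations that resolve to the same canonical identifier can read each other's columns without engine-specific handling, and an unknown type name simply resolves to nothing.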
AI Disclosure
AI tools were used during this contribution for: understanding the existing Vortex codebase,
designing the integration architecture, implementing parts of the function-level code, and
generating test cases. All output has been reviewed, understood, and validated by me.
Happy to share the draft PR for early feedback. Looking forward to your thoughts!