Tracking Issue: cuDF support #8100

@0ax1

- LLM generated -

Current state:

Relevant code:

  • vortex-cuda/src/arrow/mod.rs
  • vortex-cuda/src/arrow/canonical.rs
  • vortex-test/e2e-cuda/src/lib.rs

Current support:

  • exports some canonical arrays as ArrowDeviceArray
  • partially supports primitive, bool, decimal, temporal, string-view, and struct export paths
  • has an existing cuDF e2e harness

Current gaps:

  • no nullable/null-mask support
  • incomplete dtype coverage
  • no ArrowDeviceArrayStream
  • no PyVortex Arrow Device PyCapsule support
  • no to_cudf() API

Plan

1. cuDF compatibility baseline

  • Track the first cuDF release that contains rapidsai/cudf#22620.
  • Gate string-view tests on that release or newer.
  • Keep regression coverage for producer-owned ArrowArray.private_data.
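
A minimal sketch of how the version gate could look on the Python test side. The helper names are hypothetical, and the version strings in the usage note are placeholders: the issue does not say which cuDF release contains rapidsai/cudf#22620.

```python
def version_tuple(version: str, width: int = 3) -> tuple:
    """Parse the leading numeric components of a dotted version string.

    Non-numeric suffixes (for example a nightly tag) are ignored, and the
    result is zero-padded to `width` components so tuples compare cleanly.
    """
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    while len(parts) < width:
        parts.append(0)
    return tuple(parts)


def at_least(installed: str, required: str) -> bool:
    """Return True if `installed` is the same as or newer than `required`."""
    return version_tuple(installed) >= version_tuple(required)
```

A test could then be skipped when `at_least(cudf.__version__, MIN_CUDF)` is false, where `MIN_CUDF` is whatever release is eventually identified (for illustration only, `at_least("25.08.00", "25.06")` is true).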

2. Harden existing Arrow Device export

  • Audit device_type, device_id, sync_event, reserved, release callbacks, child ownership, and private_data.
  • Add tests for release idempotency, nested children, sliced arrays, and producer-owned private data.
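
To make the audited fields concrete, here is a sketch that mirrors `ArrowArray` and `ArrowDeviceArray` from the Arrow C Data and C Device interfaces with `ctypes` and runs a few basic checks. The `audit_device_array` helper and its messages are hypothetical and not part of vortex-cuda; the zeroed-`reserved` check reflects my reading of the Arrow spec.

```python
import ctypes

ARROW_DEVICE_CUDA = 2  # value of ARROW_DEVICE_CUDA in Arrow's C device ABI


class ArrowArray(ctypes.Structure):
    """Mirror of struct ArrowArray (Arrow C Data Interface)."""


RELEASE_FN = ctypes.CFUNCTYPE(None, ctypes.POINTER(ArrowArray))

ArrowArray._fields_ = [
    ("length", ctypes.c_int64),
    ("null_count", ctypes.c_int64),
    ("offset", ctypes.c_int64),
    ("n_buffers", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("buffers", ctypes.POINTER(ctypes.c_void_p)),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowArray))),
    ("dictionary", ctypes.POINTER(ArrowArray)),
    ("release", RELEASE_FN),
    ("private_data", ctypes.c_void_p),
]


class ArrowDeviceArray(ctypes.Structure):
    """Mirror of struct ArrowDeviceArray (Arrow C Device Data Interface)."""

    _fields_ = [
        ("array", ArrowArray),
        ("device_id", ctypes.c_int64),
        ("device_type", ctypes.c_int32),
        ("sync_event", ctypes.c_void_p),
        ("reserved", ctypes.c_int64 * 3),
    ]


def audit_device_array(dev: ArrowDeviceArray) -> list:
    """Hypothetical audit helper: return a list of problems found."""
    problems = []
    if not dev.array.release:
        # A NULL release callback marks the structure as already released.
        problems.append("array is released (release is NULL)")
    if dev.device_type != ARROW_DEVICE_CUDA:
        problems.append("device_type is not ARROW_DEVICE_CUDA")
    if dev.device_id < 0:
        problems.append("device_id must be non-negative for CUDA")
    if any(dev.reserved):
        problems.append("reserved must be zeroed")
    if dev.array.null_count < -1:
        problems.append("null_count must be -1 (unknown) or >= 0")
    return problems
```

The same mirror structs could back a release-idempotency test: after the producer's callback runs, `release` should read as NULL and a consumer must not call it again.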

3. Add null-mask support

  • Export Vortex validity bitmaps as CUDA buffers.
  • Populate Arrow null bitmap slots and null_count.
  • Support nulls for primitives, bools, decimals, temporals, strings, structs, and nested types.
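
For reference, Arrow validity bitmaps are bit-packed with least-significant-bit numbering, where a set bit means "valid". The host-side sketch below shows the layout and the matching `null_count`; the function names are hypothetical, and the real export would produce the same bytes in a CUDA buffer rather than a Python `bytes` object.

```python
from typing import Sequence, Tuple


def pack_validity(valid: Sequence[bool]) -> Tuple[bytes, int]:
    """Pack per-element validity into an Arrow-style bitmap.

    Bit i lives in byte i // 8 at bit position i % 8 (LSB first).
    Returns the bitmap and the null count.
    """
    bitmap = bytearray((len(valid) + 7) // 8)
    null_count = 0
    for i, is_valid in enumerate(valid):
        if is_valid:
            bitmap[i // 8] |= 1 << (i % 8)
        else:
            null_count += 1
    return bytes(bitmap), null_count


def is_valid_at(bitmap: bytes, index: int, offset: int = 0) -> bool:
    """Read one validity bit; `offset` models a sliced array."""
    bit = index + offset
    return bool((bitmap[bit // 8] >> (bit % 8)) & 1)
```

The `offset` parameter matters for sliced arrays: a slice either re-packs the bitmap or carries a bit offset that every consumer has to honor.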

4. Complete cuDF-compatible dtype coverage

  • Verify all primitive widths.
  • Finish decimal support.
  • Expand Utf8View / BinaryView tests:
    • inline/out-of-line values
    • multiple variadic buffers
    • non-ASCII UTF-8
    • sliced arrays
  • Add list/fixed-size-list export where cuDF supports the Arrow layout.
  • Return clear errors for unsupported dtypes.
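
The inline/out-of-line cases above follow from the 16-byte view struct used by Arrow's Utf8View/BinaryView layout: values of at most 12 bytes are stored inline, longer values store a 4-byte prefix plus a variadic buffer index and offset. A sketch of the encoding for building test fixtures, with hypothetical helper names:

```python
import struct

INLINE_LIMIT = 12  # maximum value length (in bytes) stored inside the view


def encode_view(value: bytes, buffer_index: int = 0, offset: int = 0) -> bytes:
    """Encode one 16-byte Arrow binary/string view (little-endian)."""
    length = len(value)
    if length <= INLINE_LIMIT:
        # int32 length followed by the data, zero-padded to 12 bytes.
        return struct.pack("<i12s", length, value)
    # int32 length, 4-byte prefix, int32 buffer index, int32 offset.
    return struct.pack("<i4sii", length, value[:4], buffer_index, offset)


def is_inline(view: bytes) -> bool:
    """Return True if the view stores its value inline."""
    (length,) = struct.unpack_from("<i", view)
    return length <= INLINE_LIMIT
```

Non-ASCII input is a useful edge case because the limit is in bytes, not characters: `"日本語のテキスト"` is 8 characters but 24 UTF-8 bytes, so it lands in a variadic buffer.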

5. Implement ArrowDeviceArrayStream

  • Add bindings for ArrowDeviceArrayStream.
  • Implement get_schema, get_next, get_last_error, and release.
  • Expose scan-to-device-stream from vortex-cuda.
  • Ensure stream batches stay on one CUDA device.
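
The one-device requirement could be enforced with a guard like the sketch below. It is a pure-Python model: `DeviceBatch` and `single_device` are hypothetical stand-ins, and the actual check would live in the Rust `get_next` implementation and compare the `device_type` and `device_id` fields of each exported `ArrowDeviceArray`.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class DeviceBatch:
    """Hypothetical stand-in for one exported ArrowDeviceArray."""

    device_type: int
    device_id: int
    num_rows: int


def single_device(batches: Iterable[DeviceBatch]) -> Iterator[DeviceBatch]:
    """Yield batches, failing as soon as one is on a different device."""
    expected = None
    for index, batch in enumerate(batches):
        key = (batch.device_type, batch.device_id)
        if expected is None:
            expected = key
        elif key != expected:
            raise ValueError(
                f"batch {index} is on device {key}, expected {expected}"
            )
        yield batch
```

Failing at the offending batch, rather than validating up front, keeps the stream lazy, which matches how `get_next` hands out one batch at a time.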

6. Expand cuDF e2e tests

  • Cover nullable primitives/strings.
  • Cover bools.
  • Cover decimals.
  • Cover temporals.
  • Cover structs/nested structs.
  • Cover lists/fixed-size lists.
  • Cover sliced arrays.
  • Cover multi-batch streams.
  • Cover string-view private-data regression.

7. Add optional PyVortex CUDA support

  • Add optional cuda feature to vortex-python.
  • Keep default Python wheels CUDA-free.
  • Expose CUDA-only APIs only when built with CUDA support.

8. Implement Python Arrow Device protocols

  • Add array.__arrow_c_device_array__(requested_schema=None, **kwargs).
  • Add scan.__arrow_c_device_stream__(requested_schema=None, **kwargs).
  • Return arrow_schema PyCapsules.
  • Return arrow_device_array PyCapsules.
  • Return arrow_device_array_stream PyCapsules.
  • Ensure correct capsule destructors and release semantics.
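
A consumer-side check of the array protocol can be sketched with `ctypes` and the CPython capsule API. Per the Arrow PyCapsule interface, `__arrow_c_device_array__` returns a pair of capsules named `arrow_schema` and `arrow_device_array`. The `check_device_array_export` helper and the `FakeProducer` class are hypothetical; the fake wraps a dummy pointer only so that the capsule names can be exercised without a GPU.

```python
import ctypes

SCHEMA_NAME = b"arrow_schema"
DEVICE_ARRAY_NAME = b"arrow_device_array"

_api = ctypes.pythonapi
_api.PyCapsule_IsValid.restype = ctypes.c_int
_api.PyCapsule_IsValid.argtypes = [ctypes.py_object, ctypes.c_char_p]
_api.PyCapsule_New.restype = ctypes.py_object
_api.PyCapsule_New.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]


def is_capsule(obj, name: bytes) -> bool:
    """Return True if `obj` is a PyCapsule with exactly this name."""
    return bool(_api.PyCapsule_IsValid(obj, name))


def check_device_array_export(obj):
    """Call the protocol method and validate the shape of its result."""
    export = getattr(obj, "__arrow_c_device_array__", None)
    if export is None:
        raise TypeError("object does not implement __arrow_c_device_array__")
    result = export(requested_schema=None)
    if not (isinstance(result, tuple) and len(result) == 2):
        raise TypeError("expected a (schema, device_array) capsule pair")
    schema, array = result
    if not is_capsule(schema, SCHEMA_NAME):
        raise TypeError("first element is not an 'arrow_schema' capsule")
    if not is_capsule(array, DEVICE_ARRAY_NAME):
        raise TypeError("second element is not an 'arrow_device_array' capsule")
    return schema, array


class FakeProducer:
    """Hypothetical producer: capsules wrap a dummy buffer, not real structs."""

    def __init__(self):
        self._storage = ctypes.create_string_buffer(8)  # keeps pointer alive

    def __arrow_c_device_array__(self, requested_schema=None, **kwargs):
        ptr = ctypes.addressof(self._storage)
        # No destructor here; a real producer must install one that calls
        # the struct's release callback if the capsule was never consumed.
        return (
            _api.PyCapsule_New(ptr, SCHEMA_NAME, None),
            _api.PyCapsule_New(ptr, DEVICE_ARRAY_NAME, None),
        )
```

The stream protocol would be checked the same way, except that `__arrow_c_device_stream__` returns a single capsule named `arrow_device_array_stream`.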

9. Add to_cudf() APIs

Potential APIs:

vx.array([1, 2, 3]).to_cudf()
vx.open("data.vortex").to_cudf()
vx.open("data.vortex").prepare(...).to_cudf()

Tasks:

  • Lazy-import cuDF.
  • Do not make cuDF a hard dependency.
  • Do not silently fall back to host Arrow by default.
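
The three tasks above can be sketched as two small helpers. Both names and the `allow_host_fallback` flag are hypothetical; whether host fallback needs an explicit flag is still listed as an open question, and the call that hands data to cuDF is deliberately left out because the consuming cuDF API has not been chosen yet.

```python
import importlib


def lazy_import(module_name: str, feature: str):
    """Import an optional dependency only when the feature is used."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{feature} requires the optional '{module_name}' package"
        ) from exc


def choose_export_path(has_device_export: bool,
                       allow_host_fallback: bool = False) -> str:
    """Pick 'device' or 'host', never falling back to host silently."""
    if has_device_export:
        return "device"
    if allow_host_fallback:
        return "host"
    raise RuntimeError(
        "Arrow Device export is unavailable; pass allow_host_fallback=True "
        "to copy through host Arrow instead"
    )
```

Under these assumptions, `to_cudf()` would call `lazy_import("cudf", "to_cudf()")` at call time, so importing PyVortex never requires cuDF to be installed.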

10. Docs and benchmarks

  • Document supported dtypes, limitations, and required cuDF version.
  • Add examples for Vortex file/scan → cuDF.
  • Benchmark host Arrow path vs Arrow Device path.

Open questions

  • What stable cuDF Python API should consume Arrow Device PyCapsules?
  • Does cuDF support Arrow ListView, or should Vortex export standard list offsets?
  • Should non-struct arrays return cudf.Series or one-column DataFrame?
  • Should CUDA PyVortex ship as a separate wheel/package?
  • Should host fallback require an explicit flag?

Definition of done

  • cuDF imports Vortex Arrow Device arrays for all supported dtypes.
  • Nulls and nested columns work.
  • String/binary views work with producer-owned private_data.
  • Vortex scans export as ArrowDeviceArrayStream.
  • PyVortex supports Arrow Device PyCapsules.
  • PyVortex exposes to_cudf().
  • CI validates real Vortex → Arrow Device → cuDF paths.

Labels: tracking-issue (shared implementation context for work likely to span multiple PRs)
