`- LLM generated -`

Current state:

Relevant code:

- `vortex-cuda/src/arrow/mod.rs`
- `vortex-cuda/src/arrow/canonical.rs`
- `vortex-test/e2e-cuda/src/lib.rs`

Current support:

- exports some canonical arrays as `ArrowDeviceArray`
- supports primitive/bool/decimal/temporal/string-view/struct paths partially
- has an existing cuDF e2e harness

Current gaps:

- no nullable/null-mask support
- incomplete dtype coverage
- no `ArrowDeviceArrayStream`
- no PyVortex Arrow Device PyCapsule support
- no `to_cudf()` API

---

## Plan

### 1. cuDF compatibility baseline

- [ ] Track cuDF version containing `rapidsai/cudf#22620`.
- [ ] Gate string-view tests on that version.
- [ ] Keep regression coverage for producer-owned `ArrowArray.private_data`.

### 2. Harden existing Arrow Device export

- [ ] Audit `device_type`, `device_id`, `sync_event`, `reserved`, release callbacks, child ownership, and `private_data`.
- [ ] Add tests for release idempotency, nested children, sliced arrays, and producer-owned private data.

### 3. Add null-mask support

- [ ] Export Vortex validity bitmaps as CUDA buffers.
- [ ] Populate Arrow null bitmap slots and `null_count`.
- [ ] Support nulls for primitives, bools, decimals, temporals, strings, structs, and nested types.

### 4. Complete cuDF-compatible dtype coverage

- [ ] Verify all primitive widths.
- [ ] Finish decimal support.
- [ ] Expand `Utf8View` / `BinaryView` tests:
  - [ ] inline/out-of-line values
  - [ ] multiple variadic buffers
  - [ ] non-ASCII UTF-8
  - [ ] sliced arrays
- [ ] Add list/fixed-size-list export where cuDF supports the Arrow layout.
- [ ] Return clear errors for unsupported dtypes.

### 5. Implement `ArrowDeviceArrayStream`

- [ ] Add bindings for `ArrowDeviceArrayStream`.
- [ ] Implement `get_schema`, `get_next`, `get_last_error`, and `release`.
- [ ] Expose scan-to-device-stream from `vortex-cuda`.
- [ ] Ensure stream batches stay on one CUDA device.

### 6. Expand cuDF e2e tests

- [ ] Cover nullable primitives/strings.
- [ ] Cover bools.
- [ ] Cover decimals.
- [ ] Cover temporals.
- [ ] Cover structs/nested structs.
- [ ] Cover lists/fixed-size lists.
- [ ] Cover sliced arrays.
- [ ] Cover multi-batch streams.
- [ ] Cover string-view private-data regression.

### 7. Add optional PyVortex CUDA support

- [ ] Add optional `cuda` feature to `vortex-python`.
- [ ] Keep default Python wheels CUDA-free.
- [ ] Expose CUDA-only APIs only when built with CUDA support.

### 8. Implement Python Arrow Device protocols

- [ ] Add `array.__arrow_c_device_array__(requested_schema=None, **kwargs)`.
- [ ] Add `scan.__arrow_c_device_stream__(requested_schema=None, **kwargs)`.
- [ ] Return `arrow_schema` PyCapsules.
- [ ] Return `arrow_device_array` PyCapsules.
- [ ] Return `arrow_device_array_stream` PyCapsules.
- [ ] Ensure correct capsule destructors and release semantics.

### 9. Add `to_cudf()` APIs

Potential APIs:

    vx.array([1, 2, 3]).to_cudf()
    vx.open("data.vortex").to_cudf()
    vx.open("data.vortex").prepare(...).to_cudf()

Tasks:

- [ ] Lazy-import cuDF.
- [ ] Do not make cuDF a hard dependency.
- [ ] Do not silently fall back to host Arrow by default.

### 10. Docs and benchmarks

- [ ] Document supported dtypes, limitations, and required cuDF version.
- [ ] Add examples for Vortex file/scan → cuDF.
- [ ] Benchmark host Arrow path vs Arrow Device path.

---

## Open questions

- [ ] What stable cuDF Python API should consume Arrow Device PyCapsules?
- [ ] Does cuDF support Arrow `ListView`, or should Vortex export standard list offsets?
- [ ] Should non-struct arrays return `cudf.Series` or one-column `DataFrame`?
- [ ] Should CUDA PyVortex ship as a separate wheel/package?
- [ ] Should host fallback require an explicit flag?

---

## Definition of done

- [ ] cuDF imports Vortex Arrow Device arrays for all supported dtypes.
- [ ] Nulls and nested columns work.
- [ ] String/binary views work with producer-owned `private_data`.
- [ ] Vortex scans export as `ArrowDeviceArrayStream`.
- [ ] PyVortex supports Arrow Device PyCapsules.
- [ ] PyVortex exposes `to_cudf()`.
- [ ] CI validates real Vortex → Arrow Device → cuDF paths.
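
---

## Sketch: lazy cuDF import and explicit host fallback

The `to_cudf()` tasks in step 9 (lazy import, no hard dependency, no silent host fallback) could be prototyped roughly as below. This is only a sketch under assumptions: `lazy_import_cudf`, `select_export_path`, and the `allow_host_fallback` flag are hypothetical names, not existing PyVortex APIs. The four dunder names come from the Arrow PyCapsule interface. Which cuDF entry point finally consumes the capsule is left out on purpose, since that is still an open question above.

```python
import importlib
from typing import Any, Tuple

# Method names defined by the Arrow PyCapsule interface.
DEVICE_PROTOCOLS = ("__arrow_c_device_array__", "__arrow_c_device_stream__")
HOST_PROTOCOLS = ("__arrow_c_array__", "__arrow_c_stream__")


def lazy_import_cudf() -> Any:
    """Import cuDF only when a caller asks for it (hypothetical helper)."""
    try:
        return importlib.import_module("cudf")
    except ImportError as exc:
        raise ImportError("to_cudf() requires cuDF, which is not installed") from exc


def select_export_path(obj: Any, *, allow_host_fallback: bool = False) -> Tuple[str, str]:
    """Pick the export protocol for obj (hypothetical helper).

    Returns ("device", name) when obj implements an Arrow Device protocol.
    Returns ("host", name) only when the caller opted in explicitly.
    Raises TypeError otherwise, so host copies never happen silently.
    """
    for name in DEVICE_PROTOCOLS:
        if callable(getattr(obj, name, None)):
            return "device", name
    if allow_host_fallback:
        for name in HOST_PROTOCOLS:
            if callable(getattr(obj, name, None)):
                return "host", name
        raise TypeError("object exports neither Arrow Device nor host Arrow data")
    raise TypeError(
        "object does not export Arrow Device data; "
        "pass allow_host_fallback=True to accept a host Arrow copy"
    )
```

A `to_cudf()` implementation would call `select_export_path` first and `lazy_import_cudf` only after a path has been chosen, so importing PyVortex never imports cuDF.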
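
## Sketch: validity bitmap and `null_count`

For the null-mask work in step 3, the relationship between an Arrow validity bitmap and `null_count` can be written down as a small host-side reference. It assumes the standard Arrow layout (least-significant bit first, a set bit means valid). `pack_validity` is a hypothetical helper for tests, not Vortex code, and it does not cover copying the bitmap to a CUDA buffer or any buffer padding.

```python
from typing import Sequence, Tuple


def pack_validity(valid: Sequence[bool]) -> Tuple[bytes, int]:
    """Pack per-element validity into an Arrow-style bitmap.

    Bit i of the bitmap lives in byte i // 8 at position i % 8
    (LSB first); 1 means valid, 0 means null.
    Returns the bitmap bytes and the number of nulls.
    """
    bitmap = bytearray((len(valid) + 7) // 8)
    null_count = 0
    for i, is_valid in enumerate(valid):
        if is_valid:
            bitmap[i // 8] |= 1 << (i % 8)
        else:
            null_count += 1
    return bytes(bitmap), null_count
```

A reference like this could serve as the expected value when checking exported device bitmaps in the e2e tests.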