feat(raster): transparent OutDb byte loading via a GDAL loader hook#849
feat(raster): transparent OutDb byte loading via a GDAL loader hook#849james-willis wants to merge 4 commits into
Conversation
9695f7e to
05e1af8
Compare
cb05272 to
392ea09
Compare
Adds a process-wide OutDb loader hook (OnceLock<fn>) in sedona-raster so
non-GDAL consumers (Arrow FFI, numeric kernels) can resolve pixels from
schema-OutDb bands transparently. BandRefImpl's three byte-access methods
(data, nd_buffer, contiguous_data) route through one private
source_bytes() helper that yields either the zero-copy Arrow column
(schema-InDb) or loader-resolved bytes anchored in a per-band OnceCell
(schema-OutDb). No new trait, no caller-visible API change.
Reframes is_indb() as a schema-level discriminator ("does the Arrow data
column carry these bytes inline?") rather than a runtime byte-availability
check; trait-default doc updated and the structural fix to drop the
default is tracked as DB-21. BandRefImpl already overrides correctly by
inspecting the column directly.
The post-load check_view_buffer_bounds invocation against loader output
mirrors the construction-time check the InDb path runs; a mismatched
source_shape vs bytes-length combination errors cleanly instead of
panicking inside materialize_strided.
9 new unit tests in outdb_loader.rs use a single URI-dispatched mock
loader (OnceLock allows exactly one registration per process) to cover:
loader bytes flowing through nd_buffer / contiguous_data / data, is_indb
staying false post-load, OnceCell caching across calls, missing outdb_uri,
loader errors, undersized loader output (bound-check trip), and the
data() infallible shim collapsing errors to &[].
Implements the GDAL backend for the OutDb loader hook introduced in the previous commit. `register_outdb_loader()` installs `gdal_load` as the process-wide handler; SedonaContext::new() calls this once per process via the next commit. gdal_load parses the `#band=N` URI fragment with the existing `parse_outdb_source`, retrieves the source dataset through the thread-local GDALDatasetCache (so the underlying GDAL Dataset is opened at most once per (thread, URI)), and reads the requested band as row-major bytes via `read_as_bytes`. 2-D only for v1; higher-rank reads need MDArray support and are out of scope. Required visibility bumps: - `GDALDatasetCache::get_or_create_outdb_source` → pub(crate) - `mod source_uri;` un-gated from `#[cfg(test)]` - new `arrow-schema` dep so the loader can build `ArrowError`s `#[allow(dead_code)]` on `gdal_common` / `gdal_dataset_provider` is kept; the loader uses a subset (with_gdal, convert_gdal_err, thread_local_cache, get_or_create_outdb_source) but the broader VRT machinery still awaits its consumers (issue apache#804). Three integration tests using temp-file GeoTIFFs cover the happy path, intra-band cache reuse on repeated calls, and `#band=N` selection on multi-band sources.
Wires sedona_raster_gdal::register_outdb_loader() into SedonaContext::new_from_context() after the existing raster-functions registration so every session has the OutDb byte path available without explicit setup. The registration is idempotent (`OnceLock::set` is first-write-wins), so repeated SedonaContext::new() calls in one process are safe.
`BandRef::nd_buffer()` and `BandRef::contiguous_data()` now take a caller-provided `&mut Vec<u8>` instead of allocating internally. Schema-InDb bands return a zero-copy slice of the Arrow column and leave the scratch untouched; schema-OutDb bands invoke the process-wide loader, which writes into the scratch, and return a slice of the scratch. The pre-N-D `BandRef::data() -> &[u8]` infallible accessor is removed: it had no production callers (every existing call site was inside `#[cfg(test)]` or in `sedona-testing`'s test helpers), and its silent-`&[]`-on-error behaviour was the kind of trap the new scratch contract eliminates. The `BandRefImpl::outdb_loaded: OnceCell<Vec<u8>>` per-band anchor is gone along with `data()`: scratch lifetime now belongs to the caller, so the cell that previously anchored loader bytes for the band's lifetime is unnecessary. `OutDbBandLoader` fn-pointer signature changes accordingly: the loader appends its bytes to a caller-provided `&mut Vec<u8>` rather than returning `Result<Vec<u8>, _>`. The internal dispatcher `load_outdb` clears scratch on entry and validates the loader's written length on return, so backend implementations need only write — they don't have to inspect pre-call state or zero the buffer. This matches the project's standing convention (the same reason geometry writers serialise WKB directly into the Arrow output instead of returning `Vec<u8>`): a single scratch buffer can be reused across iterations of a scalar-function loop, and on the day the planner reserves raster scratch through DataFusion's memory pool, the path is already in shape — caller controls the allocation, kernels just borrow.
392ea09 to
b297bb4
Compare
|
Converting back to draft while the design pivots. Three motivations driving the rewrite: No magic loading. Today byte fetching happens transparently inside Async UDF perf gains. DataFusion 52.5's Plugin story for additional formats. The single global function-pointer registry becomes a format-keyed trait-object registry living in Survives from this PR as-is: the GDAL backend impl and the session-bootstrap registration scaffolding. Goes away: the transparent dispatch hook on the |
|
Superseded by #886. The design pivoted from a transparent-sync hook on
The GDAL byte-read core from this PR survives as Closing this PR rather than rebasing because the trait surface and registry shape differ enough that a fresh history reviews better than a morph. |
Summary
Make
BandRef::nd_buffer()andBandRef::contiguous_data()Just Work for OutDb bands (bands whose Arrowdatacolumn is empty and whoseoutdb_uripoints elsewhere). Today those methods returnNotYetImplemented, blocking every UDF and downstream consumer from reading OutDb pixel bytes — the entire reason OutDb references exist as a schema feature is moot without a working byte path.Approach: a single statically-typed function-pointer hook in
sedona-raster, populated bysedona-raster-gdalat session bootstrap. No trait, no HashMap-keyed registry, no async — "compiled in, not pluggable". The byte-access surface is caller-managed scratch: methods take a&mut Vec<u8>rather than allocating internally, so consumers in scalar-function loops can reuse a single buffer across rasters and (eventually) reserve memory through DataFusion's memory pool. The only abstraction introduced is the one the crate-boundary forces (sedona-raster-gdaldepends onsedona-raster, not the reverse, so a direct call fromBandRefImplto GDAL is impossible).Byte-access surface
InDb identity-view bands return a zero-copy slice of the Arrow column and leave
scratchuntouched. OutDb bands invoke the process-wide loader, which writes intoscratch; the returned slice borrows fromscratch. Callers don't branch on InDb-vs-OutDb — they pass the sameVec<u8>for every band, capacity grows to the largest band's source size and then stabilises.The pre-N-D infallible accessor
BandRef::data() -> &[u8]is removed: every existing call site was inside#[cfg(test)]or insedona-testing's test helpers (no production callers), and its silent-&[]-on-error behaviour was the kind of trap the new scratch contract eliminates.OutDb loader hook
The loader appends exactly
Π source_shape × data_type.byte_size()bytes to the caller-providedscratch. The internalload_outdbdispatcher clears the scratch on entry and validates the loader's written length on return, so backend implementations only need to write — they don't have to inspect pre-call state or zero the buffer.sedona-raster-gdal::register_outdb_loader()is called fromSedonaContext::new_from_context()next to the existing function-set registration. Registration is idempotent (OnceLock::setis first-write-wins) so repeatedSedonaContext::new()calls in one process are safe.Changes
rust/sedona-raster/src/outdb_loader.rs—OutDbLoadRequest,OutDbBandLoaderfn-pointer type,OnceLock,set_outdb_band_loader, internalload_outdbdispatcher.rust/sedona-raster/src/array.rs—BandRefImplbyte-access methods rewritten to take scratch; newsource_bytes(&mut scratch)helper unifies the InDb (zero-copy column) and OutDb (loader writes into scratch) paths. Overflow-safe expected-size check (checked_mul+try_into::<usize>). No per-band cache field — the lifetime anchor is the caller's scratch.rust/sedona-raster/src/traits.rs—BandRef::is_indb()is now a required method (no default impl, since the previous default derived fromnd_buffer()emptiness which misclassifies OutDb bands as InDb once a loader is installed). New scratch-awarend_buffer/contiguous_datasignatures;data()method removed.rust/sedona-raster/src/lib.rs— re-exportsset_outdb_band_loader,OutDbBandLoader,OutDbLoadRequest.rust/sedona-raster-gdal/src/outdb_loader.rs—gdal_loadfn +register_outdb_loader. 2-D[y,x]enforcement, pre-read pixel-type check (errors cleanly if the file's GDAL pixel type differs from the band metadata's claim), usesGDALDatasetCache::get_or_create_outdb_source. Writes the read bytes into the caller's scratch.rust/sedona-raster-gdal/src/gdal_common.rs—raster_ref_to_gdal_memmigrated tocontiguous_data(&mut scratch). The function still rejects OutDb above the loop, so the scratch path is never exercised — but a defensiveif !scratch.is_empty()assert after each iteration catches any future regression loudly: if a relaxed OutDb check ever lets the loader write into scratch, the captureddata_ptrwould alias scratch and could dangle next iteration. The assert fires before that's possible.rust/sedona/src/context.rs— callssedona_raster_gdal::register_outdb_loader()fromSedonaContext::new_from_context().~40test sites insedona-raster,sedona-raster-gdal,sedona-raster-functions, andsedona-testingfromband.data()/band.contiguous_data().unwrap()to the scratch-aware shape.Open question — hard GDAL dependency on
sedonaHeads-up for reviewers: this PR adds
sedona-raster-gdalas a direct dependency ofsedona(rust/sedona/Cargo.toml:84), soSedonaContext::new()can register the GDAL OutDb loader at bootstrap. Before this PR, thesedonacrate did not linksedona-raster-gdalat all — GDAL was strictly opt-in. After this PR, buildingsedona(and anything that depends on it, e.g. the Python wheel and R bindings) requireslibgdal-devat link time, unconditionally.The cleanest fix is to feature-gate the registration:
That keeps the default behaviour (GDAL linked, OutDb works out of the box for SQL users) while letting downstream consumers
--no-default-featuresbuild a slim sedona without GDAL — same flexibility they had before this PR.Calling this out for explicit reviewer judgment rather than fixing it inline, because the answer depends on what level of optionality the project wants to preserve. Happy to apply the feature gate (or any variant) in a follow-up commit on this branch.
Tests
sedona-raster(mock loader, 9 unit tests): loader registration, round-trip viand_buffer()/contiguous_data(), scratch reuse across bands, missing-uri error, undersized-loader-output error, loader-failure propagation,is_indbinvariant after loader runs.sedona-raster(process-isolated integration):tests/no_loader_registered.rs— separate test binary soOUTDB_BAND_LOADERstays unset; asserts the "no OutDb loader registered" diagnostic message verbatim. Catches regressions inSedonaContext::new_from_context()'s registration call without needing GDAL or a SQL engine to reproduce. Reviewers: see the "Hard GDAL dependency" note above — if we feature-gate, this test stays useful for the slim-build path.sedona-raster-gdal(real GDAL, 7 integration tests): write real GeoTIFFs to temp dirs and verify end-to-end byte path, scratch reuse across calls, multi-band#band=Nselection, non-[y,x]dim-name rejection, dtype mismatch rejection, missing-file error, band-index-out-of-range error.End-to-end SQL coverage
A SQL-layer test along the lines of
SELECT RS_Value(raster, x, y) FROM outdb_geotiff_tablewould also exercise this path, but no byte-readingRS_*function exists yet — the currentRS_*surface (RS_BandPath,RS_NumBands,RS_BandPixelType,RS_Envelope, …) is all metadata, none of them callband.nd_buffer(). SQL-layer coverage is deferred until a byte-reading kernel lands. The 7 real-GDAL integration tests are the right test layer until then.Relationship to PR #813 (view machinery)
This PR is independent of PR-D (#813 — view machinery /
materializedcell / strided walk) and based directly onmain. The view-composition path remains rejected at construction inRasterRefImpl::band(), so the OutDb byte path here only needs to handle the identity-view case.The scratch-buffer convention here aligns with the review feedback on #813 about avoiding hidden allocations on band structures — the OutDb byte path no longer hides allocation in an
OnceCell<Vec<u8>>on the band, it threads the caller's scratch through the byte-access methods.Test plan
cargo test -p sedona-raster --lib(72 passing)cargo test -p sedona-raster --tests no_loader_registered(process-isolated, 1 passing)cargo test -p sedona-raster-gdal --lib outdb_loader(7 passing)cargo test -p sedona-raster-functions --lib(143 passing)cargo test -p sedona-testing --lib(78 passing)cargo clippy --all-targets -p sedona-raster -p sedona-raster-gdal -p sedona-raster-functions -p sedona-testing -- -D warningscargo fmt --all -- --check