
feat(sedona-raster-zarr): cloud storage backends (s3, gcs, azure, http) via object_store #888

Draft
james-willis wants to merge 1 commit into apache:main from james-willis:jw/zarr-cloud-reads

@james-willis james-willis commented May 29, 2026

Summary

Adds cloud storage backends to sedona-raster-zarr. ZarrChunkReader now accepts file://, bare paths, s3://, gs:// / gcs://, az:// / abfs:// / abfss://, and http:// / https://.

import sedonadb
import sedonadb_zarr

con = sedonadb.connect()
con.read_format(
    sedonadb_zarr.ZarrFormatSpec(),
    "s3://its-live-data/datacubes/v2/N40W120/ITS_LIVE_vel_EPSG32610_G0120_X250000_Y5450000.zarr",
).show()
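
To make the scheme list above concrete, here is a minimal Python sketch of how a URI could be mapped to a backend family. The `classify_uri` function and the family names are hypothetical illustrations, not the crate's actual dispatch code, and the sketch assumes that anything without a scheme is a bare local path.

```python
from urllib.parse import urlsplit

# Scheme -> backend family, following the schemes listed in the summary.
# The family names on the right are illustrative labels only.
_BACKENDS = {
    "file": "local",
    "s3": "s3",
    "gs": "gcs",
    "gcs": "gcs",
    "az": "azure",
    "abfs": "azure",
    "abfss": "azure",
    "http": "http",
    "https": "http",
}


def classify_uri(uri: str) -> str:
    """Return the backend family for a URI (hypothetical helper)."""
    scheme = urlsplit(uri).scheme.lower()
    if not scheme:
        # Assumption: no scheme means a bare filesystem path.
        return "local"
    if scheme not in _BACKENDS:
        raise ValueError(f"unsupported scheme: {scheme!r}")
    return _BACKENDS[scheme]
```

For example, `classify_uri("s3://its-live-data/datacubes")` yields `"s3"`, while `classify_uri("/data/cube.zarr")` yields `"local"`.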

All Async

This package now uses async throughout for record loading and byte retrieval. Cloud access was always going to be async, while the filesystem path was originally sync; going fully async gives a single approach and avoids shimming sync filesystem code into the async RS_EnsureLoaded UDF, or async cloud access code into the sync record-loading machinery.
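
As a rough analogy for putting a blocking filesystem backend behind the same async surface as the cloud backends, the Python sketch below wraps a sync store in an adapter. The class names `LocalStore` and `SyncToAsync` are hypothetical, and `asyncio.to_thread` is a stand-in chosen for illustration; none of this describes how the Rust adapter works internally.

```python
import asyncio
import tempfile
from pathlib import Path


class LocalStore:
    """Hypothetical blocking store: reads keys relative to a root directory."""

    def __init__(self, root):
        self.root = Path(root)

    def get(self, key):
        return (self.root / key).read_bytes()


class SyncToAsync:
    """Hypothetical adapter exposing a blocking store through an async get()."""

    def __init__(self, inner):
        self.inner = inner

    async def get(self, key):
        # Run the blocking read on a worker thread so the event loop stays free.
        return await asyncio.to_thread(self.inner.get, key)


async def load_chunk(store, key):
    # The caller only sees the async interface, whatever the backend is.
    return await store.get(key)


def demo():
    # Write one fake chunk to a temp directory and read it back asynchronously.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "chunk0").write_bytes(b"\x01\x02\x03")
        return asyncio.run(load_chunk(SyncToAsync(LocalStore(tmp)), "chunk0"))
```

The point of the design is that `load_chunk` needs no branch for local versus remote storage; only the construction of the store differs.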

Version pinning

zarrs_object_store is pinned to 0.5 in the workspace. The latest release, 0.6, depends on object_store 0.13, which is semver-incompatible with the object_store 0.12.x that the rest of the workspace uses through DataFusion 52.
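
The incompatibility follows from how Cargo treats pre-1.0 versions: the leftmost non-zero component acts as the breaking-change boundary, so 0.12.x and 0.13.x do not unify. A small Python sketch of that rule is shown here; `compat_key` and `semver_compatible` are hypothetical helpers written for illustration, and they only handle plain `major.minor.patch` strings.

```python
def compat_key(version: str):
    """Return the part of a version that must match for caret compatibility."""
    major, minor, patch = (int(part) for part in version.split("."))
    if major > 0:
        return (major,)          # 1.x.y is compatible within the same major
    if minor > 0:
        return (0, minor)        # 0.x.y is compatible within the same minor
    return (0, 0, patch)         # 0.0.x is compatible only with itself


def semver_compatible(a: str, b: str) -> bool:
    """True if two versions fall in the same Cargo-compatible range."""
    return compat_key(a) == compat_key(b)
```

Under this rule `semver_compatible("0.12.0", "0.13.0")` is false, which is why two crates depending on those two object_store lines cannot share one `ObjectStore` trait.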


ZarrChunkReader now accepts file://, bare paths, s3://, gs:// / gcs://,
az:// / abfs:// / abfss://, and http:// / https://. The reader's
public API takes storage as its input: hand it an
Arc<dyn AsyncReadableListableStorageTraits> and a group URI (for
chunk-anchor URI formatting) and it walks the chunk grid. Where the
storage came from (registry lookup, env-var-built builder, an
in-memory test fixture) is the caller's concern, not the reader's.

* `open_storage_from_uri(uri, store_override: Option<Arc<dyn ObjectStore>>)`
  is the helper for callers that don't already hold a credentialed
  store. `Some(store)` uses it directly (with PrefixStore for
  s3/gs/az; http(s) is rooted at the URL by its builder). `None`
  falls back to per-scheme *Builder::from_env().with_url(uri).build()
  — the same env-var credential discovery that read_format's
  orchestrated path gets via ensure_object_store_registered_with_options.
  file:// and bare paths always go through zarrs_filesystem's
  FilesystemStore wrapped in SyncToAsyncStorageAdapter, so the local
  backend shows up on the same async storage surface.
* ZarrChunkReader::try_new is async; the sync RecordBatchReader
  streaming surface is unchanged since next() is pure CPU.
* No dedicated tokio runtime; async tasks run on whatever runtime
  the caller is on (DataFusion's executor for the future SQL UDTF,
  an ad-hoc current-thread runtime block_on'd in PyZarrChunkReader::new
  for the Python FFI, #[tokio::test] in tests).
* PyZarrChunkReader::new builds storage via open_storage_from_uri
  (env-var credentials by default), then drives the async try_new
  through a local current-thread tokio runtime. No synthetic
  RuntimeEnv, no datafusion-execution dep on the FFI cdylib.
* Replaces Group::child_arrays (which opens every child up front and
  errors hard on any per-array metadata failure) with two purpose-
  built paths driven by the caller's arrays filter:
  - arrays: Some([...]) opens each by name with Array::async_open.
    No listing — usable against backends that can't list (plain HTTPS
    without WebDAV, S3-via-HttpStore).
  - arrays: None lists direct children with storage.list_dir, then
    Array::async_open each. Per-array open failures are logged at
    warn! and skipped, so a single malformed sibling (e.g. an
    xarray-style fixed-length-Unicode coord variable with a null
    fill_value that zarrs 0.23 can't open) no longer poisons the
    rest of the group.
* zarrs_object_store pinned to 0.5 in the workspace; 0.6 depends on
  object_store 0.13, semver-incompatible with DataFusion 52's
  object_store 0.12.x.
* Cloud smoke tests pass strictly against the public anonymous
  ITS_LIVE v2 ice-velocity datacubes (s3://its-live-data/...). Same
  bucket via s3:// (AmazonS3Builder) and via the virtual-hosted
  HTTPS URL (HttpStore), with an explicit M11/M12 filter to avoid
  listing on the HTTPS path.
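
The two array-opening paths from the list above (explicit names without listing, versus list-then-open with per-array failures skipped) can be sketched in Python as follows. `InMemoryStorage` and `open_arrays` are hypothetical stand-ins for the Rust storage traits and reader logic. The sketch also assumes that failures in the explicit-names path propagate to the caller, which the description does not state.

```python
import asyncio
import logging

log = logging.getLogger("zarr_sketch")


class InMemoryStorage:
    """Hypothetical fixture: maps array name -> metadata, or an Exception
    to simulate an array whose metadata cannot be opened."""

    def __init__(self, arrays, listable=True):
        self._arrays = arrays
        self._listable = listable

    async def list_dir(self):
        if not self._listable:
            raise NotImplementedError("backend cannot list")
        return sorted(self._arrays)

    async def open_array(self, name):
        meta = self._arrays[name]
        if isinstance(meta, Exception):
            raise meta
        return meta


async def open_arrays(storage, names=None):
    if names is not None:
        # Explicit filter: open by name only, never list.
        # Assumption: errors here propagate to the caller.
        return {name: await storage.open_array(name) for name in names}
    opened = {}
    for name in await storage.list_dir():
        try:
            opened[name] = await storage.open_array(name)
        except Exception as err:
            # One malformed sibling is logged and skipped, not fatal.
            log.warning("skipping array %r: %s", name, err)
    return opened
```

With a fixture containing `M11`, `M12`, and one unopenable sibling, `open_arrays(storage)` returns only the two good arrays, while `open_arrays(storage, ["M11", "M12"])` also works against a fixture built with `listable=False`.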
@james-willis james-willis force-pushed the jw/zarr-cloud-reads branch from 4d3de71 to 3fbe051 May 29, 2026 18:50