Support glob patterns in open_datatree(group=...) for selective group loading#11197

Open
aladinor wants to merge 35 commits into pydata:main from aladinor:glob-group-filtering

Conversation

@aladinor
Contributor

Note: This PR depends on #10742 (async DataTree open) being merged first. It will have merge conflicts until then.

When the group parameter contains glob metacharacters (*, ?, [), filter which groups are opened instead of re-rooting the tree. This avoids loading the entire hierarchy when only a subset is needed.

Use cases

  • Radar data: xr.open_datatree("radar.nc", group="*/sweep_0") — load only the lowest elevation sweep from each volume scan
  • CMIP archives: xr.open_datatree("cmip.zarr", group="*/historical/tas") — load only temperature across all models

Changes

  • Added shared utilities _is_glob_pattern, _filter_group_paths, and _resolve_group_and_filter in common.py
  • Updated NetCDF4, H5NetCDF, and Zarr backends to use a discover → filter → open pipeline
  • Uses the same matching engine as DataTree.match() (PurePosixPath.match)
  • Root (/) and all ancestors of matched nodes are always included to form a valid tree
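The discover → filter step described above can be sketched with the standard library alone. This is a minimal illustration, not the PR's code: the names `is_glob_pattern` and `filter_group_paths` are hypothetical stand-ins for the private helpers, and the input is assumed to be a flat list of absolute group paths. Note that `PurePosixPath.match` anchors a relative pattern at the right-hand end of the path, so `*/sweep_0` also matches a `sweep_0` node nested deeper in the tree.

```python
from pathlib import PurePosixPath

# Characters that switch `group` from root selection to filter mode.
GLOB_CHARS = frozenset("*?[")


def is_glob_pattern(group: str) -> bool:
    """Return True if the string contains any glob metacharacter."""
    return any(ch in GLOB_CHARS for ch in group)


def filter_group_paths(paths: list[str], pattern: str) -> list[str]:
    """Keep matched groups plus the root and every ancestor of a match."""
    keep = {"/"}  # the root is always kept so the result forms a valid tree
    for path in paths:
        node = PurePosixPath(path)
        if node.match(pattern):
            keep.add(str(node))
            # parents of "/a/b" are "/a" and "/"
            keep.update(str(parent) for parent in node.parents)
    return sorted(keep)


discovered = [
    "/",
    "/VCP-34",
    "/VCP-34/sweep_0",
    "/VCP-34/sweep_1",
    "/VCP-212",
    "/VCP-212/sweep_0",
]
selected = filter_group_paths(discovered, "*/sweep_0")
```

Here `selected` contains the root, both volume groups, and the two `sweep_0` groups, while `/VCP-34/sweep_1` is dropped before anything is opened.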

Behavior summary

| group value | Behavior |
| --- | --- |
| None | Load all groups (unchanged) |
| "VCP-34" (no glob chars) | Root selection (unchanged) |
| "*/sweep_0" (glob chars) | Filter mode: only matched groups + ancestors |
| Pattern matches nothing | Root-only tree |
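The three-way dispatch in the summary could look roughly like the sketch below. The function name and the `(root_group, filter_pattern)` return shape are assumptions made for illustration; the PR's actual `_resolve_group_and_filter` may return something different.

```python
from typing import Optional, Tuple


def resolve_group_and_filter(
    group: Optional[str],
) -> Tuple[Optional[str], Optional[str]]:
    """Split `group` into (root_group, filter_pattern); hypothetical shape.

    - None          -> (None, None): open the whole hierarchy
    - plain name    -> (name, None): re-root the tree at that group
    - glob pattern  -> (None, pattern): keep the root, filter the groups
    """
    if group is None:
        return None, None
    if any(ch in group for ch in "*?["):
        return None, group
    return group, None
```

A backend would then discover all group paths, apply the pattern only when the second element is set, and open just the surviving groups.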

@github-actions bot added the topic-backends, topic-zarr, and io labels Feb 22, 2026
aladinor and others added 27 commits March 25, 2026 11:30
Changes:
- Refactor open_datatree() to use zarr_sync() with async implementation
  for concurrent dataset and index creation across groups
- Add _open_datatree_from_stores_async() helper that opens datasets and
  creates indexes concurrently using asyncio.gather with a semaphore
  to limit concurrency (avoids deadlocks with stores like Icechunk)
- Add open_datatree_async() method for explicit async API
- Remove duplicate _maybe_create_default_indexes_async from zarr.py,
  now imports from api.py (single source of truth)

This significantly improves performance when opening DataTrees from
high-latency storage backends (e.g., ~2 seconds vs sequential loading).

Remove the asyncio.Semaphore that was limiting concurrency to 10
concurrent operations. Investigation showed:

- Zarr already has built-in concurrency control (async.concurrency=10)
- The semaphore only applied to asyncio.to_thread() calls, not zarr I/O
- Removing it improves performance by ~30-40% (~2s -> ~1.2-1.4s)

The semaphore was defensive code for a problem that doesn't exist -
zarr and icechunk handle their own concurrency limits internally.

The async implementation uses zarr.core.sync which only exists in
zarr v3. Add a conditional check using _zarr_v3() to:
- Use async path with zarr_sync() for zarr v3 (concurrent loading)
- Fall back to sequential loading for zarr v2

This fixes CI failures on min-versions environment which uses zarr v2.

The previous commit (0ee2a73) removed the semaphore thinking zarr handles
its own concurrency, but icechunk can deadlock when too many asyncio.to_thread()
calls try to access it simultaneously. This was discovered when testing with
larger stores (23+ groups) where all threads would start but never complete.

The semaphore limits concurrent to_thread calls to 10, which prevents the
deadlock while still providing significant performance benefits over sequential
loading.
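The pattern this commit message describes, a semaphore capping how many `asyncio.to_thread()` calls run at once, can be sketched as follows. This is a generic illustration and not the PR's code: `open_groups`, `ConcurrencyProbe`, and the fake blocking opener are hypothetical, and the probe exists only to show that the cap holds.

```python
import asyncio
import threading
import time


class ConcurrencyProbe:
    """Fake blocking opener that records how many calls overlap."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0

    def blocking_open(self, path: str) -> str:
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)
        time.sleep(0.01)  # stand-in for blocking store I/O
        with self._lock:
            self.active -= 1
        return f"dataset:{path}"


async def open_groups(paths, opener, max_concurrency: int = 10) -> dict:
    """Run `opener` in worker threads, at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def open_one(path):
        # The semaphore is taken before handing work to the thread pool,
        # so no more than `max_concurrency` threads touch the store at once.
        async with semaphore:
            return path, await asyncio.to_thread(opener, path)

    pairs = await asyncio.gather(*(open_one(p) for p in paths))
    return dict(pairs)


probe = ConcurrencyProbe()
group_paths = [f"/group_{i}" for i in range(23)]
opened = asyncio.run(open_groups(group_paths, probe.blocking_open, max_concurrency=4))
```

After the run, `probe.peak` never exceeds the limit of 4 even though 23 groups were requested.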
- Add helper methods _build_group_members and _create_stores_from_members
  to reduce code duplication between sync and async store opening
- Use zarr_sync() to run async index creation in _datatree_from_backend_datatree
  for zarr engine, making open_datatree fully async behind the scenes
- Fix missing chunks validation and source encoding in open_datatree_async
- Add tests for chunks validation, source encoding, and chunks parameter
- Add type annotations to nested async functions in _datatree_from_backend_datatree
  to fix mypy annotation-unchecked notes breaking pytest-mypy-plugins tests
- Use os.path.join and os.path.normpath in test_async_source_encoding
  for cross-platform compatibility on Windows
Add type annotations to _maybe_create_default_indexes_async and its
nested functions (load_var, create_index, _create) to satisfy mypy's
annotation-unchecked checks. Also add Variable and Hashable imports
to the TYPE_CHECKING block.

This fixes pytest-mypy-plugins tests that were failing due to mypy
emitting annotation-unchecked notes for untyped nested functions.
- Remove open_datatree_async() from api.py (public API)
- Remove open_datatree_async() from zarr.py (backend method)
- Keep internal async optimization in _datatree_from_backend_datatree()
- Use _zarr_v3() for proper zarr version check instead of ImportError
- Update tests to only test internal async functionality
- Add test to verify sync open_datatree uses async internally for zarr v3

The async optimization is now internal only - users call the sync
open_datatree() which automatically uses async index creation for
zarr v3 backends.
Co-authored-by: Justus Magin <keewis@users.noreply.github.com>
Benchmarking showed async index creation provides no measurable benefit
since it's CPU-bound work. Simplified to sync loop per reviewer feedback.
Co-authored-by: Justus Magin <keewis@users.noreply.github.com>
Co-authored-by: Justus Magin <keewis@users.noreply.github.com>
- Replace asyncio.gather with asyncio.TaskGroup for better error handling
  (cancels outstanding tasks on error)
- Add max_concurrency parameter to open_datatree for controlling parallel
  I/O operations (defaults to 10)
- Add StoreBackendEntrypoint.open_dataset_async method
- Add test for open_dataset_async equivalence
aladinor and others added 7 commits March 25, 2026 11:30
…r safety

- Replace custom _iter_zarr_groups_async (~90 lines) with zarr's
  AsyncGroup.members(max_depth=None) to avoid sync fallback deadlock
- Wrap _get_open_params and _build_group_members in run_in_executor
  in open_store_async to prevent reentrant sync() deadlock
- Add dedicated executor to _maybe_create_default_indexes_async and
  create_indexes_async to avoid thread pool exhaustion on zarr's IO loop
Add try/except around zarr.core.group import so zarr v2 (where
zarr.core is a module, not a package) falls back to sync discovery.
The zarr v3 async backend was creating indexes inside
_open_datatree_from_stores_async, but the framework layer
(_datatree_from_backend_datatree in api.py) already handles
index creation for all backends. This caused indexes to be
created twice per node and made create_default_indexes=False
silently ignored.

Remove the index creation from the backend to restore the
clean main architecture: backends open datasets without
indexes, the framework creates them once.
Replace the per-group thread+zarr_sync pattern in
_open_datatree_from_stores_async with a two-phase approach:

Phase 1: Single async_root.members(max_depth=None) call discovers
all groups AND their array members in one pass, replacing both
_iter_zarr_groups_async and per-group _fetch_members calls.

Phase 2: Wrap AsyncArray/AsyncGroup in sync Array/Group (zero-cost),
inject pre-fetched members into ZarrStore, run only CPU-bound
decode_cf_variables in thread pool.

Results (laptop, 60-node OSN store):
- zarr_sync calls: 122 → 2
- members() calls: 61 → 1
- ~15-20% faster open_datatree (2.08s → 1.77s with indexes)
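Phase 1 yields one flat listing of every group and array, which then has to be regrouped so each group knows its own arrays without further store calls. A minimal sketch of that regrouping step is shown below. It assumes the listing has already been reduced to plain `(relative_path, kind)` tuples; that input shape and the name `partition_members` are illustrative assumptions, not zarr's or the PR's API.

```python
from pathlib import PurePosixPath


def partition_members(members):
    """Map each group path to the names of the arrays directly inside it.

    `members` is an iterable of (relative_path, kind) pairs, where kind
    is "group" or "array" (a simplified stand-in for a flat listing).
    """
    arrays_by_group = {"/": []}  # the root group always exists
    for rel_path, kind in members:
        node = PurePosixPath("/") / rel_path
        if kind == "group":
            # Register the group even if it holds no arrays.
            arrays_by_group.setdefault(str(node), [])
        else:
            # An array belongs to its immediate parent group.
            arrays_by_group.setdefault(str(node.parent), []).append(node.name)
    return arrays_by_group


listing = [
    ("a", "group"),
    ("a/temp", "array"),
    ("time", "array"),
    ("a/b", "group"),
    ("a/b/pressure", "array"),
    ("empty", "group"),
]
grouped = partition_members(listing)
```

With the members pre-sorted this way, each per-group store can be handed its arrays directly, which is how a single listing call can replace one call per group.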
… loading

When the group parameter contains glob metacharacters (*, ?, [), filter
which groups are opened instead of re-rooting the tree. This avoids
loading the entire hierarchy when only a subset is needed.

Adds shared utilities _is_glob_pattern, _filter_group_paths, and
_resolve_group_and_filter in common.py. Updates NetCDF4, H5NetCDF, and
Zarr backends to use the discover-filter-open pipeline. Includes unit
tests for the utilities and integration tests across all backends.

Closes pydata#11196
@aladinor force-pushed the glob-group-filtering branch from 9779596 to ebf203e March 25, 2026 16:39

Labels

io, topic-backends, topic-zarr

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support glob patterns in open_datatree(group=...) for selective group loading

1 participant