Support glob patterns in `open_datatree(group=...)` for selective group loading by aladinor · Pull Request #11197 · pydata/xarray

aladinor · 2026-02-22T23:56:15Z

Note: This PR depends on #10742 (async DataTree open) being merged first. It will have merge conflicts until then.

When the group parameter contains glob metacharacters (*, ?, [), filter which groups are opened instead of re-rooting the tree. This avoids loading the entire hierarchy when only a subset is needed.

Use cases

Radar data: xr.open_datatree("radar.nc", group="*/sweep_0") — load only the lowest elevation sweep from each volume scan
CMIP archives: xr.open_datatree("cmip.zarr", group="*/historical/tas") — load only temperature across all models

Changes

Added shared utilities _is_glob_pattern, _filter_group_paths, and _resolve_group_and_filter in common.py
Updated NetCDF4, H5NetCDF, and Zarr backends to use a discover → filter → open pipeline
Uses the same matching engine as DataTree.match() (PurePosixPath.match)
Root (/) and all ancestors of matched nodes are always included to form a valid tree
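
To make the filter step concrete, here is a minimal stdlib-only sketch of the matching rule described above. The function names mirror the utilities listed in this PR, but the bodies are an assumption written for illustration, not the code in common.py. It relies only on `PurePosixPath.match`, which matches relative patterns from the right.

```python
from pathlib import PurePosixPath

GLOB_CHARS = "*?["


def is_glob_pattern(group: str) -> bool:
    """Return True if the string contains a glob metacharacter."""
    return any(ch in group for ch in GLOB_CHARS)


def filter_group_paths(paths: list[str], pattern: str) -> list[str]:
    """Keep matching paths plus the root and every ancestor of a match.

    Illustrative only: the real helper in the PR may differ.
    """
    keep = {"/"}  # the root is always part of the resulting tree
    for path in paths:
        p = PurePosixPath(path)
        if p.match(pattern):
            keep.add(str(p))
            # Ancestors are needed so the matched nodes stay connected.
            keep.update(str(parent) for parent in p.parents)
    # Preserve the discovery order of the input paths.
    return [path for path in paths if str(PurePosixPath(path)) in keep]


# Hypothetical group layout, loosely based on the radar use case.
example_paths = [
    "/",
    "/VCP-34",
    "/VCP-34/sweep_0",
    "/VCP-34/sweep_1",
    "/VCP-212",
    "/VCP-212/sweep_0",
]
selected = filter_group_paths(example_paths, "*/sweep_0")
```

With this layout, `selected` keeps the root, both volume nodes, and the two `sweep_0` groups, and drops `/VCP-34/sweep_1`.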

Behavior summary

| `group` value | Behavior |
|---|---|
| `None` | Load all groups (unchanged) |
| `"VCP-34"` (no glob chars) | Root selection (unchanged) |
| `"*/sweep_0"` (glob chars) | Filter mode: only matched groups + ancestors |
| Pattern matches nothing | Root-only tree |
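
The summary above amounts to a three-way decision on the `group` argument. A rough sketch of that decision follows; `classify_group` and its return labels are hypothetical names chosen for this example and do not appear in the PR.

```python
from typing import Optional


def classify_group(group: Optional[str]) -> str:
    """Map a `group` argument to one of the behaviors in the summary.

    Hypothetical helper: the labels "all", "reroot", and "filter" are
    illustrative, not identifiers from xarray.
    """
    if group is None:
        return "all"  # load every group, as before
    if any(ch in group for ch in "*?["):
        return "filter"  # keep matched groups plus their ancestors
    return "reroot"  # plain path: select that group as the new root
```

A pattern that matches nothing still falls under "filter"; per the summary, the result in that case is a tree containing only the root.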

Closes #11196 (Support glob patterns in open_datatree(group=...) for selective group loading)
Tests added
Depends on #10742 (Implement async support for open_datatree) being merged first
User visible changes (including notable bug fixes) are documented in whats-new.rst
New functions/methods are listed in api.rst

… async

Changes:
- Refactor open_datatree() to use zarr_sync() with async implementation for concurrent dataset and index creation across groups
- Add _open_datatree_from_stores_async() helper that opens datasets and creates indexes concurrently using asyncio.gather with a semaphore to limit concurrency (avoids deadlocks with stores like Icechunk)
- Add open_datatree_async() method for explicit async API
- Remove duplicate _maybe_create_default_indexes_async from zarr.py, now imports from api.py (single source of truth)

This significantly improves performance when opening DataTrees from high-latency storage backends (e.g., ~2 seconds vs sequential loading).

Remove the asyncio.Semaphore that was limiting concurrency to 10 concurrent operations. Investigation showed:
- Zarr already has built-in concurrency control (async.concurrency=10)
- The semaphore only applied to asyncio.to_thread() calls, not zarr I/O
- Removing it improves performance by ~30-40% (~2s -> ~1.2-1.4s)

The semaphore was defensive code for a problem that doesn't exist - zarr and icechunk handle their own concurrency limits internally.

The async implementation uses zarr.core.sync which only exists in zarr v3. Add a conditional check using _zarr_v3() to:
- Use async path with zarr_sync() for zarr v3 (concurrent loading)
- Fall back to sequential loading for zarr v2

This fixes CI failures on min-versions environment which uses zarr v2.

The previous commit (0ee2a73) removed the semaphore thinking zarr handles its own concurrency, but icechunk can deadlock when too many asyncio.to_thread() calls try to access it simultaneously. This was discovered when testing with larger stores (23+ groups) where all threads would start but never complete. The semaphore limits concurrent to_thread calls to 10, which prevents the deadlock while still providing significant performance benefits over sequential loading.
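
As a rough illustration of the pattern this commit restores, the sketch below bounds the number of simultaneous `asyncio.to_thread()` calls with a semaphore. `open_group_blocking` is a hypothetical stand-in for the per-group open call; nothing here touches zarr or icechunk.

```python
import asyncio
import time

MAX_CONCURRENCY = 10  # the limit mentioned in the commit message


def open_group_blocking(name: str) -> str:
    """Hypothetical stand-in for a blocking per-group open call."""
    time.sleep(0.01)
    return f"opened {name}"


async def open_all(names: list[str], limit: int = MAX_CONCURRENCY) -> list[str]:
    semaphore = asyncio.Semaphore(limit)

    async def open_one(name: str) -> str:
        # Only `limit` coroutines can hold the semaphore at once, so at
        # most `limit` worker threads are busy at any time.
        async with semaphore:
            return await asyncio.to_thread(open_group_blocking, name)

    # gather preserves the order of the inputs in its result list.
    return await asyncio.gather(*(open_one(n) for n in names))


results = asyncio.run(open_all([f"group_{i}" for i in range(23)]))
```

All 23 groups are still opened; the semaphore only caps how many are in flight at the same time.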

- Add helper methods _build_group_members and _create_stores_from_members to reduce code duplication between sync and async store opening
- Use zarr_sync() to run async index creation in _datatree_from_backend_datatree for zarr engine, making open_datatree fully async behind the scenes
- Fix missing chunks validation and source encoding in open_datatree_async
- Add tests for chunks validation, source encoding, and chunks parameter

- Add type annotations to nested async functions in _datatree_from_backend_datatree to fix mypy annotation-unchecked notes breaking pytest-mypy-plugins tests
- Use os.path.join and os.path.normpath in test_async_source_encoding for cross-platform compatibility on Windows

Add type annotations to _maybe_create_default_indexes_async and its nested functions (load_var, create_index, _create) to satisfy mypy's annotation-unchecked checks. Also add Variable and Hashable imports to the TYPE_CHECKING block. This fixes pytest-mypy-plugins tests that were failing due to mypy emitting annotation-unchecked notes for untyped nested functions.

- Remove open_datatree_async() from api.py (public API)
- Remove open_datatree_async() from zarr.py (backend method)
- Keep internal async optimization in _datatree_from_backend_datatree()
- Use _zarr_v3() for proper zarr version check instead of ImportError
- Update tests to only test internal async functionality
- Add test to verify sync open_datatree uses async internally for zarr v3

The async optimization is now internal only - users call the sync open_datatree() which automatically uses async index creation for zarr v3 backends.

Co-authored-by: Justus Magin <keewis@users.noreply.github.com>

Benchmarking showed async index creation provides no measurable benefit since it's CPU-bound work. Simplified to sync loop per reviewer feedback.

for more information, see https://pre-commit.ci


- Replace asyncio.gather with asyncio.TaskGroup for better error handling (cancels outstanding tasks on error)
- Add max_concurrency parameter to open_datatree for controlling parallel I/O operations (defaults to 10)
- Add StoreBackendEntrypoint.open_dataset_async method
- Add test for open_dataset_async equivalence

…r safety

- Replace custom _iter_zarr_groups_async (~90 lines) with zarr's AsyncGroup.members(max_depth=None) to avoid sync fallback deadlock
- Wrap _get_open_params and _build_group_members in run_in_executor in open_store_async to prevent reentrant sync() deadlock
- Add dedicated executor to _maybe_create_default_indexes_async and create_indexes_async to avoid thread pool exhaustion on zarr's IO loop


Add try/except around zarr.core.group import so zarr v2 (where zarr.core is a module, not a package) falls back to sync discovery.
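
The fallback described here can be sketched as a guarded import. `async_discovery_available` is a hypothetical helper name used only for this example; the import path `zarr.core.group.AsyncGroup` is the zarr v3 location, and the guard also covers the case where zarr is not installed at all.

```python
def async_discovery_available() -> bool:
    """Return True when zarr's v3 async group API can be imported.

    Hypothetical helper for illustration. Under zarr v2, `zarr.core` is a
    plain module, so importing `zarr.core.group` raises
    ModuleNotFoundError, which is a subclass of ImportError.
    """
    try:
        from zarr.core.group import AsyncGroup  # noqa: F401
    except ImportError:
        return False
    return True


# Callers can branch on the result: async discovery for zarr v3,
# sequential (sync) discovery otherwise.
use_async = async_discovery_available()
```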

The zarr v3 async backend was creating indexes inside _open_datatree_from_stores_async, but the framework layer (_datatree_from_backend_datatree in api.py) already handles index creation for all backends. This caused indexes to be created twice per node and made create_default_indexes=False silently ignored. Remove the index creation from the backend to restore the clean main architecture: backends open datasets without indexes, the framework creates them once.

Replace the per-group thread+zarr_sync pattern in _open_datatree_from_stores_async with a two-phase approach:

Phase 1: Single async_root.members(max_depth=None) call discovers all groups AND their array members in one pass, replacing both _iter_zarr_groups_async and per-group _fetch_members calls.

Phase 2: Wrap AsyncArray/AsyncGroup in sync Array/Group (zero-cost), inject pre-fetched members into ZarrStore, run only CPU-bound decode_cf_variables in thread pool.

Results (laptop, 60-node OSN store):
- zarr_sync calls: 122 → 2
- members() calls: 61 → 1
- ~15-20% faster open_datatree (2.08s → 1.77s with indexes)
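
The core idea of Phase 1 is that one flat listing is enough to reconstruct each group's members without further store round trips. The sketch below shows that bucketing step on plain `(path, kind)` tuples; these tuples and the `bucket_members` name are assumptions standing in for the objects returned by zarr, not its actual API.

```python
from pathlib import PurePosixPath


def bucket_members(members: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group a flat (relative_path, kind) listing by parent group.

    `kind` is "group" or "array". Hypothetical data model for illustration.
    """
    groups: dict[str, list[str]] = {"/": []}
    for path, kind in members:
        full = PurePosixPath("/") / path
        if kind == "group":
            # Register the group even if it holds no arrays.
            groups.setdefault(str(full), [])
        else:
            # Attach the array name to its parent group.
            groups.setdefault(str(full.parent), []).append(full.name)
    return groups


flat_listing = [
    ("time", "array"),
    ("sweep_0", "group"),
    ("sweep_0/DBZ", "array"),
    ("sweep_0/VEL", "array"),
    ("sweep_1", "group"),
]
by_group = bucket_members(flat_listing)
```

Once the members are bucketed this way, each group's store can be populated from memory, which is consistent with the drop in members() calls reported above.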

… loading

When the group parameter contains glob metacharacters (*, ?, [), filter which groups are opened instead of re-rooting the tree. This avoids loading the entire hierarchy when only a subset is needed.

Adds shared utilities _is_glob_pattern, _filter_group_paths, and _resolve_group_and_filter in common.py. Updates NetCDF4, H5NetCDF, and Zarr backends to use the discover-filter-open pipeline. Includes unit tests for the utilities and integration tests across all backends.

Closes pydata#11196


github-actions bot added the topic-backends, topic-zarr, and io labels on Feb 22, 2026

aladinor and others added 27 commits March 25, 2026 11:30

adding async for datatrees

8f6d4a7

adding async method to _maybe_create_index

77d4357

using async as complete instead of gathering results

abb5f8d

adding tests for open_group, open_dtree and _maybe_create_index using…

0c9c66c

… async

ensuring _maybe_create_default_indexes_async is compatible with zarr v2

7e174bf

resolving the mypy type errors

182c794

attempt 2: resolving mypy type errors

db10454

updating whats-new.rst file

c2cb527

fix: add type ignore for mypy arg-type error in open_datatree_async

b0a1e5f

refactor: convert _build_group_members to module-level helper function

d68d44c

fix: add cast for mypy type checking in _build_group_members

6f78621

Update xarray/backends/api.py

16a5558

Co-authored-by: Justus Magin <keewis@users.noreply.github.com>

refactor: use sync index creation in _maybe_create_default_indexes_async

df8b61f

Benchmarking showed async index creation provides no measurable benefit since it's CPU-bound work. Simplified to sync loop per reviewer feedback.

[pre-commit.ci] auto fixes from pre-commit.com hooks

05afdae

for more information, see https://pre-commit.ci

Update xarray/backends/api.py

d4880b5

Co-authored-by: Justus Magin <keewis@users.noreply.github.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

6a4e87a

for more information, see https://pre-commit.ci

Update xarray/backends/api.py

89d9721

Co-authored-by: Justus Magin <keewis@users.noreply.github.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

980c882

for more information, see https://pre-commit.ci

aladinor and others added 7 commits March 25, 2026 11:30

[pre-commit.ci] auto fixes from pre-commit.com hooks

f5cb80c

for more information, see https://pre-commit.ci

Fix zarr v2 fallback in _iter_zarr_groups_async

cd7bd60

Add try/except around zarr.core.group import so zarr v2 (where zarr.core is a module, not a package) falls back to sync discovery.

Trim verbose comments to match xarray style

15203d9

aladinor force-pushed the glob-group-filtering branch from 9779596 to ebf203e on March 25, 2026 16:39

[pre-commit.ci] auto fixes from pre-commit.com hooks

31174a7

for more information, see https://pre-commit.ci

aladinor wants to merge 35 commits into pydata:main from aladinor:glob-group-filtering
