Skip to content

BF: Make Zarr validation/digest tests work with zarr-python 2.x and 3.x#1858

Merged
yarikoptic-gitmate merged 5 commits into
masterfrom
bf-zarr
May 21, 2026
Merged

BF: Make Zarr validation/digest tests work with zarr-python 2.x and 3.x#1858
yarikoptic-gitmate merged 5 commits into
masterfrom
bf-zarr

Conversation

@yarikoptic
Copy link
Copy Markdown
Member

We (me with @kabilar ) did raise upper version for zarr but it was not in effect due to hdmf-zarr blocking it via nwbexpector dependency for us. That was addressed and thus we started to upgrade into that zarr 3.

zarr-python 3.x changed both the public API and on-disk format:

  • zarr.errors.PathNotFoundError was removed; missing-metadata errors are now GroupNotFoundError / ArrayNotFoundError / NodeNotFoundError.
  • bool(group) / len(group) eagerly iterate members and raise on non-Zarr files (e.g. an arbitrary a/b/c/... tree) or arrays with metadata the new library cannot parse (e.g. legacy >u1 dtypes).
  • zarr.open() can partially succeed on malformed V3 stores where root metadata is valid but child metadata is not, which would bypass the tensorstore-based V3 validator.
  • Default serialisation is now V3 (zarr.json, c/<chunk> layout, different default compressor), so on-disk digests of "the same" data differ between V2 and V3 outputs.

dandi/files/zarr.py:

  • Detect the Zarr format version from disk up front and always route V3 stores through _ts_validate_zarr3, regardless of zarr-python version.
  • Add _is_empty_group helper that swallows exceptions from zarr-python 3.x's eager member iteration — there is some content, even if it's not a valid Zarr member, so don't spuriously report dandi_zarr.empty_group.
  • Drop the explicit zarr.errors.PathNotFoundError clause (which only exists in 2.x) in favour of a single generic except Exception after the V3 fast-path.

Tests:

  • test_digest_zarr, test_digest_zarr_with_excluded_dotfiles and test_zarr_properties now key their expected digests/sizes on the Zarr serialisation format actually produced (get_zarr_format_version on the written tree) rather than on zarr.__version__. This stays correct if zarr-python 3.x is later configured to write zarr_format=2.

Verified against zarr 2.18.7 and zarr 3.1.5.

zarr-python 3.x changed both the public API and on-disk format:

* `zarr.errors.PathNotFoundError` was removed; missing-metadata errors are
  now `GroupNotFoundError` / `ArrayNotFoundError` / `NodeNotFoundError`.
* `bool(group)` / `len(group)` eagerly iterate members and raise on
  non-Zarr files (e.g. an arbitrary `a/b/c/...` tree) or arrays with
  metadata the new library cannot parse (e.g. legacy `>u1` dtypes).
* `zarr.open()` can partially succeed on malformed V3 stores where root
  metadata is valid but child metadata is not, which would bypass the
  tensorstore-based V3 validator.
* Default serialisation is now V3 (`zarr.json`, `c/<chunk>` layout,
  different default compressor), so on-disk digests of "the same" data
  differ between V2 and V3 outputs.

`dandi/files/zarr.py`:

* Detect the Zarr format version from disk up front and always route V3
  stores through `_ts_validate_zarr3`, regardless of zarr-python version.
* Add `_is_empty_group` helper that swallows exceptions from zarr-python
  3.x's eager member iteration — there is *some* content, even if it's
  not a valid Zarr member, so don't spuriously report
  `dandi_zarr.empty_group`.
* Drop the explicit `zarr.errors.PathNotFoundError` clause (which only
  exists in 2.x) in favour of a single generic `except Exception` after
  the V3 fast-path.

Tests:

* `test_digest_zarr`, `test_digest_zarr_with_excluded_dotfiles` and
  `test_zarr_properties` now key their expected digests/sizes on the
  Zarr serialisation format actually produced (`get_zarr_format_version`
  on the written tree) rather than on `zarr.__version__`. This stays
  correct if zarr-python 3.x is later configured to write `zarr_format=2`.

Verified against zarr 2.18.7 and zarr 3.1.5.

Co-Authored-By: Claude Code 2.1.144 / Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@yarikoptic yarikoptic added release Create a release when this pr is merged zarr labels May 19, 2026
Comment thread dandi/files/zarr.py Dismissed
@codecov
Copy link
Copy Markdown

codecov Bot commented May 19, 2026

Codecov Report

❌ Patch coverage is 98.50746% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 76.38%. Comparing base (d69a282) to head (f1aac97).

Files with missing lines Patch % Lines
dandi/files/zarr.py 94.11% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1858      +/-   ##
==========================================
+ Coverage   76.26%   76.38%   +0.12%     
==========================================
  Files          87       87              
  Lines       12512    12543      +31     
==========================================
+ Hits         9542     9581      +39     
+ Misses       2970     2962       -8     
Flag Coverage Δ
unittests 76.38% <98.50%> (+0.12%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@yarikoptic yarikoptic added the patch Increment the patch version when merged label May 20, 2026
yarikoptic and others added 4 commits May 19, 2026 22:40
Follow-up to eba4884. Two issues remained when running against zarr-python 3.x:

1. **typing**: `_ZARR_PROPERTIES_EXPECTED[fmt]` failed mypy because
   `get_zarr_format_version()` returns `Optional[str]`.

2. **upload/download tests**: hardcoded the V2 on-disk layout
   (``.zgroup`` / ``arr_X/.zarray`` / ``arr_X/0``), so they failed on
   zarr-python 3.x which writes V3 (``zarr.json`` / ``arr_X/zarr.json``
   / ``arr_X/c/0``).

`dandi/tests/test_files.py`:

* Add a `zarr_format_of(path) -> str` test helper that wraps
  `get_zarr_format_version` and asserts the result is non-None (fixes
  the mypy index error in `test_zarr_properties`).
* Add `_TWO_ARRAY_ZARR_LAYOUT` — a `TypedDict`-keyed mapping from format
  version ("2" / "3") to the expected on-disk layout of the canonical
  ``zarr.save(p, arr_0, arr_1)`` two-array sample. Used by all
  upload/download tests below.
* Rewrite `test_upload_zarr`, `test_upload_zarr_with_excluded_dotfiles`,
  and `test_upload_zarr_entry_content_type` to derive expected entry
  paths and the root-metadata filename from the layout dict rather than
  hardcoding V2-only paths.

`dandi/tests/test_download.py`:

* Reuse `_TWO_ARRAY_ZARR_LAYOUT` / `zarr_format_of` for
  `test_download_different_zarr_onto_excluded_dotfiles` and
  `test_download_multiple_urls`, both of which check exact post-download
  trees of the two-array sample fixture.
* In `test_download_different_zarr_delete_dir` gate the "no subdirs"
  sanity check on V2 — the ``np.eye(5)`` Zarr is flat in V2 but has a
  ``c/`` chunk directory in V3. The downstream `assert_dirtrees_eq`
  still exercises the actual download-replaces-tree behaviour in both.

`dandi/cli/tests/test_digest.py`: same `zarr_format_of` helper applied
to the digest tests so they share one source of truth for format
detection.

Verified against zarr 2.18.3 and 3.1.5 locally; mypy clean (no new
errors); flake8 clean.

Co-Authored-By: Claude Code 2.1.144 / Claude Opus 4.7 (1M context) <noreply@anthropic.com>
With zarr-python 3.x installed, mypy picks up the bundled type stubs
and the `NDArrayLike` protocol rejects `numpy.ndarray` because of minor
divergence in `ndarray.reshape` overloads. This surfaces as ~30
`arg-type` errors on `zarr.save(path, np.arange(...))` calls across the
test suite (none of them new — the same calls have existed for years).

zarr-python 2.x ships no stubs so the same calls are silently untyped,
which is why CI typing (currently resolving to zarr 2.18) passes while
running `tox -e typing` locally against zarr 3.x does not.

Add a `[[tool.mypy.overrides]]` entry pinning `follow_imports = "skip"`
for `zarr.*`, so zarr is treated as untyped regardless of which
zarr-python version is installed. This matches the effective CI
behaviour and removes the version-dependent typing noise.

After this change, `mypy dandi` reports zero errors locally against
zarr 3.1.5.

Co-Authored-By: Claude Code 2.1.144 / Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ent-Type

``UploadItem.from_entry`` only recognised the Zarr V2 metadata filenames
(``.zarray`` / ``.zattrs`` / ``.zgroup`` / ``.zmetadata``) as JSON and
set ``Content-Type: application/json`` on the upload accordingly. For
V3 stores written by zarr-python 3.x the per-node metadata lives in a
single ``zarr.json`` file at each group/array root; the previous code
left ``content_type = None`` for those, so the server stored them as
``binary/octet-stream`` — visible in ``test_upload_zarr_entry_content_type``
on zarr-python 3.x:

    AssertionError: assert 'binary/octet-stream' == 'application/json'

Add ``"zarr.json"`` to the JSON-metadata set so V3 metadata files get
the correct Content-Type on upload, matching long-standing V2 behaviour.

Co-Authored-By: Claude Code 2.1.144 / Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Code review (eba4884..5f87b6a) flagged a few rough edges. None
change behaviour:

* Move `zarr_format_of` and the `TWO_ARRAY_ZARR_LAYOUT` mapping from
  `dandi/tests/test_files.py` to `dandi/tests/test_helpers.py` and drop
  the leading underscore from the now genuinely cross-module name. The
  previous arrangement cross-imported between test modules (including
  across packages from `dandi/cli/tests/test_digest.py`), which would
  silently break on rename or pytest collection-order changes. The
  underscore-prefixed name was also misleading since the symbol was
  imported elsewhere. `test_helpers.py` is the existing pattern for
  shared test utilities — keeping the `test_*.py` filename so pytest
  preserves assertion rewriting on the `assert` inside `zarr_format_of`.

* Collapse the two byte-identical `ValidationResult` arms in
  `ZarrAsset.get_validation_errors` into one append, with the message
  chosen via a conditional. The two arms differed only in the message
  string and risked drift over time.

* Add a `lgr.debug` line in `_is_empty_group`'s broad-`except` so that
  a real bug (as opposed to the expected zarr-python 3.x eager-iteration
  raise) is at least discoverable in debug logs rather than silently
  swallowed.

* Tighten `test_download_different_zarr_delete_dir` to assert the V3
  invariant (`initial_subdirs == {"c"}`) instead of just skipping the
  no-subdirs check on V3. The test still exercises the contrast
  between the smaller single-array Zarr and the bigger two-array one
  that the download must reconcile.

Verified mypy clean against zarr 3.1.5; pytest passes against zarr
2.18.7 and 3.1.5; pre-commit hooks all pass.

Co-Authored-By: Claude Code 2.1.144 / Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@yarikoptic-gitmate yarikoptic-gitmate merged commit 27caa41 into master May 21, 2026
40 checks passed
@yarikoptic-gitmate yarikoptic-gitmate deleted the bf-zarr branch May 21, 2026 01:51
@github-actions
Copy link
Copy Markdown

🚀 PR was released in 1.0.0 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

patch Increment the patch version when merged release Create a release when this pr is merged released zarr

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants