DM-55041: Add Zarr file I/O#45
Open
timj wants to merge 68 commits into
Open
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #45 +/- ##
==========================================
+ Coverage 73.58% 76.27% +2.69%
==========================================
Files 89 107 +18
Lines 10816 12938 +2122
==========================================
+ Hits 7959 9869 +1910
- Misses 2857 3069 +212 ☔ View full report in Codecov by Sentry. |
99fa1f2 to
5b085eb
Compare
Captures the agreed-on architecture for adding a zarr backend alongside the existing FITS/JSON/NDF backends: cloud-first OME-Zarr v0.5 layout with namespaced LSST extensions, IR-driven two-pass writes mirroring the NDF approach, and recursive composition for image-shaped sub-archives. Generated with AI Co-Authored-By: SLAC AI
Generated with AI Co-Authored-By: SLAC AI
Pivots the on-disk layout from the v1 OME-multiscale-image-with-lsst- companions structure to an xarray/CF-shaped root with image, variance, mask as siblings; OME multiscales metadata points at the same image array (no byte duplication). Mask becomes a 2-D packed-integer array with CF flag_masks/flag_meanings/flag_descriptions for native geospatial-tool interop. ColorImage stops stacking and writes its channels as recursive sub-archives. WCS is affine-only OME plus authoritative AST string in v1; RFC-5 nonlinear transformations are a follow-up blocked on writing an AST JSON channel. Adds an 11x11-grid residual validator that drops the simplified affine when the max pixel-equivalent error exceeds 1 pixel. Generated with AI Co-Authored-By: SLAC AI
NCZarr is purely additive on top of the v1 layout (_NCZARR_GROUP / _NCZARR_ARRAY markers + optional 1-D coordinate variables). Held out of v1 because the zarr-v3 mapping is still evolving; recorded so we don't lose the conclusion. Generated with AI Co-Authored-By: SLAC AI
Replaces the v1 plan (commit 51a1e3a) with a plan that matches the revised spec (commits 83b9064 + a34369d): xarray/CF-shaped root with OME-NGFF as a discoverability layer, 2-D packed-integer mask with CF flag attrs, ColorImage as recursive sub-archives (no stacking), CellCoadd with native 4-D PSF (no fixup pass), affine residual validator with 1-pixel threshold, and the AST string as the WCS round-trip authority. Six phases, ~37 bite-sized TDD tasks, every critical invariant pinned by a failing-then-passing test. Generated with AI Co-Authored-By: SLAC AI
Generated with AI Co-Authored-By: SLAC AI
Generated with AI Co-Authored-By: SLAC AI
Generated with AI Co-Authored-By: SLAC AI
…ations Also extends the AST bridge in _transforms/_ast.py with ZoomMap and PolyMap so the validator's tests can construct synthetic linear and distorted FrameSets. Generated with AI Co-Authored-By: SLAC AI
Following project convention, imports go at the top of the module rather than inside function bodies. PolyMap's numpy use and affine_check's numpy use are hoisted; tests likewise hoist their _ast imports. Generated with AI Co-Authored-By: SLAC AI
…ame_set Generated with AI Co-Authored-By: SLAC AI
…ag attrs Generated with AI Co-Authored-By: SLAC AI
Generated with AI Co-Authored-By: SLAC AI
Generated with AI Co-Authored-By: SLAC AI
…age on-disk shape Mask.serialize emits the byte axis first when the archive opts in via _prefer_native_mask_arrays. The output archive undoes that swap so the on-disk layout is the natural xarray (y, x) shape with a 2-D packed wide-integer per pixel. Adds an end-to-end on-disk shape test for MaskedImage that pins this contract (Tasks 2.8 + 3.0 from the plan). Generated with AI Co-Authored-By: SLAC AI
Generated with AI Co-Authored-By: SLAC AI
…unpack Generated with AI Co-Authored-By: SLAC AI
… public read() Generated with AI Co-Authored-By: SLAC AI
…mage Round-trips Image, MaskedImage (3-plane), and MaskedImage (40-plane) through the zarr backend via the new RoundtripZarr helper. RoundtripZarr overrides _run_without_butler to use a TemporaryDirectory since zarr archives are directories, not single files. Generated with AI Co-Authored-By: SLAC AI
…gle-cell chunks - Add decorate_sub_archives that walks the IR and adds lsst.archive_class + ome.multiscales to any sub-group containing an image array (Phase 4.1). - Pin ColorImage write/round-trip behaviour: each channel becomes a 2-D array at root with no nested sub-archive, since ColorImage.serialize produces flat per-channel arrays (Phase 4.2). - Default a 4-D psf array to single-cell chunks (1, 1, Py, Px) for CellCoadd-style PSF storage (Phase 4.3). CellCoadd-specific layout and round-trip tests (Phase 4.4, 4.5) are deferred as the plan's _make_minimal_cell_coadd factories are implementer-supplied placeholders. Generated with AI Co-Authored-By: SLAC AI
Seven-task TDD plan covering: lower chunk default 1024->256, add DEFAULT_TARGET_SHARD_BYTES env-var-driven constant, add default_shards helper with byte-budget rule, wire it into ZarrOutputArchive.add_array, integration tests for image / PSF / zip-store round-trips, and a final regression sweep. Generated with AI Co-Authored-By: SLAC AI
Add @unittest.skipUnless(HAVE_ZARR, "zarr is not installed") to TargetShardBytesEnvVarTestCase to match the pattern used by CommonTestCase. Without this guard the subprocess tests fail for the wrong reason in environments where zarr is not installed, because the imported module raises ImportError before the env-var logic runs. Also add a second assertion in test_garbage_value_fails_at_import to check for "is not a base-10 integer" in stderr, ensuring the error originates from _read_target_shard_bytes rather than an unrelated future warning that happens to mention the env-var name. Generated with AI Co-Authored-By: SLAC AI
The max(1, ...) guard in default_shards was unreachable: the function returns None before reaching that line when chunk_bytes >= target_bytes, so ratio > 1 always holds and round(ratio**(1/n)) >= 1 is guaranteed. The clamp was misleading — it implied k could be 0 or negative, which undermined the subsequent if k <= 1 guard. Drop the clamp and add an inline comment on the guard explaining the scenario it actually catches (rounding to a 1x no-op multiplier). Also expand the Returns-None paragraph in the docstring to mention the dtype.itemsize == 0 branch that was previously omitted. Generated with AI Co-Authored-By: SLAC AI
The `chunks` description referred to a "~1024 per axis" default that has been 256 since Task 1. The `shards` description claimed `to_zarr` derives a 4x-chunk default, but shard defaulting actually lives in `ZarrOutputArchive.add_array` via `default_shards`. Replace both parameter docstrings with accurate descriptions. Generated with AI Co-Authored-By: SLAC AI
Update the "Cloud-friendly defaults" docstring bullet in the zarr __init__ to reflect the 256 chunk cap (was 1024), reference the DEFAULT_CHUNK_AXIS_LIMIT constant, and add a new bullet describing automatic v3 sharding with the byte-budget rule and env-var escape hatch. Also drop the now-redundant max(1, ...) guard from the default_shards code sketch in the sharding design spec. Generated with AI Co-Authored-By: SLAC AI
Two compounding bugs caused every group's zarr.json to be written twice into a .zarr.zip, triggering zipfile UserWarning on every duplicate entry: 1. _group_to_zarr called create_group(name) followed by update_attributes(dumped). The IR is fully populated before write, so route attributes through create_group(attributes=...) and create_array(attributes=...) instead of a follow-up update. 2. The post-write consolidate_metadata call was wrapped in a stale except TypeError that no longer fires on modern zarr. Skip consolidation for ZipStore: a zip is read end-to-end locally so consolidation buys nothing, and rewriting the root zarr.json into an append-only zip emits a spurious "Duplicate name" warning plus the spec-not-finalized warning. Generated with AI Co-Authored-By: SLAC AI
The pybind11 v3 bindings shipping in astshim with lsst_distrib are stricter than the v2 bindings the layout code was written against: - PolyMap(coeff_f, ...) no longer auto-converts a nested Python list to numpy.ndarray; the test now passes the coefficients as an ndarray. - Mapping.applyForward(xy) rejects non-C-contiguous arrays; _frame_set_apply was passing pixels.T (F-contiguous view) and now wraps it in np.ascontiguousarray. While clearing those, affine_check also tripped over a backend parity gap: astshim's linearApprox returns a (1 + Nout, Nin) ndarray and raises RuntimeError on no-fit, but the starlink-pyast fallback wrapper returned a flat array or None. The shim's job is to make every backend look like astshim, so: - The pyast linearApprox wrapper now reshapes the AST flat buffer to the astshim (1 + Nout, Nin) layout and raises RuntimeError instead of returning None. - affine_check is rewritten against the astshim contract: catch RuntimeError for the drop case and read offsets from row 0 and the Jacobian from rows 1..Nout of the 2-D return. That rewrite also corrects a latent bug in the previous column-major-flat unpacking: pyast actually returns coefficients in row-major-per-output order, so the old c0, c1, j00, j10, j01, j11 = coeffs was silently transposing the off-diagonal Jacobian terms. The pure-ZoomMap test has a diagonal Jacobian and so did not catch it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Up to now the only ''do two backends agree?'' tests were `test_fits_ndf_consistency` in test_image, test_masked_image, and test_visit_image -- so FITS-vs-zarr and FITS-vs-JSON drift was not covered for any of the image types. Add ten new tests in five files that close that gap by reading the same object back through two backends and comparing the results to each other and to the original: * test_image: `test_fits_zarr_consistency`, `test_fits_json_consistency` * test_masked_image: same two * test_color_image: same two (using `self.assert_color_images_equal`) * test_visit_image: same two (using `assert_visit_images_equal`) * test_cell_coadd: same two (using `assert_cell_coadds_equal`); CellCoadd is excluded from NDF because NDF cannot represent it. All ten tests run on the non-butler path on purpose -- the goal is to compare two I/O layers head-to-head, not to also test the butler wrapper. RoundtripJson / RoundtripZarr imports and a HAVE_ZARR skip gate are added in the files that did not already have them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add direct unit tests for ZarrOutputArchive.add_structured_array mirroring the NDF coverage, and extend test_piff to round-trip the Piff PSF through RoundtripNdf and RoundtripZarr in addition to the existing FITS/JSON cases. Closes the coverage gap on the only production caller of add_structured_array (the Piff writer). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This uses xarray / CF / OME conventions where appropriate.
Checklist
doc/changes