Skip to content

[lazy indexing] transforms #3956

Open
d-v-b wants to merge 26 commits intozarr-developers:mainfrom
d-v-b:worktree-lazy-indexing-pr1-transforms
Open

[lazy indexing] transforms #3956
d-v-b wants to merge 26 commits intozarr-developers:mainfrom
d-v-b:worktree-lazy-indexing-pr1-transforms

Conversation

@d-v-b
Copy link
Copy Markdown
Contributor

@d-v-b d-v-b commented May 8, 2026

This is a claude-assissted PR derived from #3906 that breaks out the index transformation functionality into its own module.

Ultimately we will model indexing an array as a mapping from the source indices to a set of output indices via an index transformation. The data structures here define the pieces of that model.

  • IndexDomain is a model of a contiguous domain of n-dimensional array indices. This tracks the domain of an array that is being indexed. It's basically the set of valid coordinates for an indexable array, represented implicitly.
  • OutputIndexMap is a union of types that model where, in an output (like a chunk), the indexing operation will send array elements.
  • An IndexTransform is a mapping from an IndexDomain to a collection of OutputIndexMap elements. In a later PR, every Zarr array will be associated with an IndexTransform, which tracks how the array's indices are "projected" to stored indices.

d-v-b added 20 commits May 6, 2026 17:46
Empty package, no exports yet. Subsequent commits port modules
from PR zarr-developers#3906's transforms package and rewire imports to the
private name.
Defines ConstantMap, DimensionMap, ArrayMap, and the OutputIndexMap
union. No in-package dependencies.
Defines IndexDomain (a rectangular integer-coordinate region) plus
the _normalize_selection helper used by transform construction.
Defines IndexTransform plus selection_to_transform and the per-mode
application functions (_apply_basic_indexing, _apply_oindex,
_apply_vindex). This is the core of the package.
Defines compose(outer, inner) -> IndexTransform and the per-output-map
dispatch helpers.
Defines iter_chunk_transforms and sub_transform_to_selections, which
bridge a transform to per-chunk selections. No callers in this PR;
included so the package is internally complete and ready for future
internal-rewiring work.
Re-export IndexDomain, IndexTransform, the OutputIndexMap variants,
and compose from the package root. JSON helpers are intentionally
excluded; chunk_resolution exports are intentionally not re-exported
(no callers in this PR).
…ssor reference

The docstrings inherited RST-style double-backticks from the reference
branch. Convert to markdown single-backticks throughout (zarr-python's
docs are markdown-rendered, so the doubles render as literal `` text).

Also drop the stale "Every Array holds a transform" / "Array.z[...]"
comment from transform.py — those statements describe an earlier design
where the transform lived on Array; the actual lazy view design parks
the transform on a separate wrapper type.
Replace class-based test grouping with top-level parametrized
functions. Each test covers either all success branches or all
error branches of a single concern.

Adds tests/test_transforms/conftest.py with project-style
`Expect[TIn, TOut]` and `ExpectErr[TIn]` test-case helpers.

The new mutation_errors test actually exercises FrozenInstanceError
(the previous test_frozen variants only checked isinstance, which
the @DataClass decorator guarantees regardless of frozen=True).
One success test and (where applicable) one error test per public
function on IndexDomain. Adds direct tests for the non-trivial private
helper _normalize_selection, including the previously-untested
double-ellipsis branch.

Coverage: 34 tests → 53 tests.
The function bridged NumPy-style negative indices to TensorStore-style
absolute coordinates. It was originally used by the eager-path rewiring
in PR zarr-developers#3906 (4 callsites in src/zarr/core/array.py). This PR has no
such rewiring — PR #2's lazy view materializes via the existing eager
indexing path, which already handles negatives — so the helper has no
callers in our shipping plan.

Re-add when (and if) the internal-rewiring follow-up arrives. Removing
it now keeps the package surface lean and the diff focused.
One success test (and where applicable one error test) per public
method on IndexTransform plus the public selection_to_transform
function.

Adds direct tests for _intersect_vectorized — the correlated
multi-ArrayMap intersection path. Public `intersect` only reaches it
when the transform has 2+ ArrayMap outputs; all the public-surface
intersect cases use a single ArrayMap, so this branch was previously
unreached by tests.
Six (inner_kind, outer_kind) pairs for the compose dispatch matrix
collapse from six per-case methods into one parametrized test. The
non-trivial private helpers (_compose_single, _compose_dimension,
_compose_array) are reached transitively via this matrix and need
no direct tests.
iter_chunk_transforms cases parametrized over (transform, grid)
inputs; sub_transform_to_selections cases parametrized over the
output-map-kind matrix.

Adds 3 cases that exercise the previously-untested out_indices
parameter on sub_transform_to_selections: orthogonal-array-with-
out_indices, vectorized-with-out_indices, and the implicit
fallback when out_indices is None.
When mypy runs over the rewritten test files together (rather than
one at a time), it tightens up two type checks that were lax in
isolation:

- selection_to_transform's mode parameter is a Literal, not str.
- assert_array_equal expects ndarray, but out_sel[0] is typed as
  slice | ndarray. Add an isinstance check.

These were latent issues in the rewrites; they only surfaced when
mypy saw all the test files at once.
…main.py

Missed by the earlier polish commit (fca2926): the docstring at the
top of domain.py used an RST `::` indented code block. Convert to a
fenced markdown block to match the rest of the package's docstrings.
…nches

Code review surfaced six public functions/methods with reachable error
branches and no test coverage. Add parametrized error tests (or
standalone tests, where there is one trivial input) to close the gaps:

- IndexTransform.oindex: negative slice step raises IndexError.
- IndexTransform.vindex: negative slice step raises IndexError.
- IndexTransform.intersect: rank-mismatched output_domain raises ValueError.
- IndexTransform.translate: wrong-length shift raises ValueError.
- selection_to_transform: unknown mode string raises ValueError.
- compose: 2D ArrayMap inner with non-constant outer raises NotImplementedError.

Test count: 145 → 152.
…vity

An IndexTransform models a function from input coords to output coords;
function composition is associative by definition. This test verifies
that the implementation preserves that algebraic property by sampling
random affine triples (a, b, c) with compatible ranks and checking that

    compose(compose(a, b), c)

evaluates the same as

    compose(a, compose(b, c))

at random points in a's domain. 200 hypothesis examples; passes cleanly.

Restricted to DimensionMap + ConstantMap outputs (the affine subset).
ArrayMap composition has implementation-level branching that depends on
outer structure and would need a more careful generator to avoid the
NotImplementedError path; the affine case is the algebraic core, the
ArrayMap case is deferred to a follow-up.
Previously, ArrayMap relied on an implicit convention to encode which input
dims it was parameterized by: the array shape was matched against the input
domain's shape by length, and "correlation" between two ArrayMaps (vectorized
indexing) was inferred from "two ArrayMaps exist in the same transform."

This refactor makes parameterization explicit:

- ArrayMap gains `input_dimensions: tuple[int, ...]` as a required field.
  Each axis of the index_array corresponds to one entry in input_dimensions,
  in order.
- IndexTransform.__post_init__ enforces that ArrayMap.index_array.shape
  equals the input domain's extent on input_dimensions. Out-of-range and
  duplicate input dims are also rejected.
- Two ArrayMaps are correlated iff their input_dimensions overlap. The
  vectorized-indexing detection in `intersect` and `chunk_resolution`
  uses this rule rather than the brittle "count >= 2" heuristic.
- _intersect_vectorized correctly preserves preserved (non-correlated)
  input dims when collapsing the correlated dims into a single
  surviving-points dim.
- _intersect on an orthogonal ArrayMap now correctly shrinks the input dim
  it parameterizes when filtering. (Previous code mutated the array
  without updating the input domain, which would now fail __post_init__.)
- _compose_array uses input_dimensions to determine parameterization
  uniformly; the previous arr.ndim-and-len(outer.output) heuristic is
  replaced with a clearer single-input-dim case.
- _apply_basic_indexing rebuilds ArrayMap input_dimensions correctly for
  newaxis (preserves the array's parameterization, just shifts the
  referenced dim index).
- _apply_oindex and _apply_vindex set input_dimensions on every newly-
  constructed ArrayMap.
- Drop unused helpers _reindex_array and _reindex_array_oindex.
- Test fixtures and assertions updated to use the new ArrayMap shape.

Why: the implicit-correlation-by-position convention was a footgun for
anyone constructing transforms directly. Making input_dimensions explicit
lets the type system enforce shape and correlation invariants, simplifies
the composition and intersect implementations, and produces a model that
can be coherently explained.

All 153 transforms tests pass (including the 200-example associativity
property test); 150 baseline indexing tests unchanged.
@github-actions github-actions Bot added the needs release notes Automatically applied to PRs which haven't added release notes label May 8, 2026
d-v-b added 2 commits May 8, 2026 10:05
Function-level `import itertools` inside iter_chunk_transforms moved
to module top. The TYPE_CHECKING-guarded imports are intentionally
inside their guard and remain there.
@codecov
Copy link
Copy Markdown

codecov Bot commented May 8, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.67%. Comparing base (7c2f372) to head (542036f).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3956      +/-   ##
==========================================
+ Coverage   93.26%   93.67%   +0.40%     
==========================================
  Files          87       93       +6     
  Lines       11721    12481     +760     
==========================================
+ Hits        10932    11692     +760     
  Misses        789      789              
Files with missing lines Coverage Δ
src/zarr/core/_transforms/__init__.py 100.00% <100.00%> (ø)
src/zarr/core/_transforms/chunk_resolution.py 100.00% <100.00%> (ø)
src/zarr/core/_transforms/composition.py 100.00% <100.00%> (ø)
src/zarr/core/_transforms/domain.py 100.00% <100.00%> (ø)
src/zarr/core/_transforms/output_map.py 100.00% <100.00%> (ø)
src/zarr/core/_transforms/transform.py 100.00% <100.00%> (ø)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

d-v-b added 4 commits May 8, 2026 10:56
…view

Code review on the post-refactor branch found one real bug and three
minor follow-ups. All addressed:

1. **`_apply_oindex` on a multi-dim ArrayMap with 2+ axes selected by
   integer arrays** previously did `m.index_array[a, b]` which performs
   NumPy vectorized broadcasting (wrong) instead of orthogonal outer
   product (right). The shape check in `__post_init__` would catch the
   resulting wrong shape, but the user-facing error was opaque. Replace
   with an explicit `NotImplementedError`. The 1-D case (one axis) and
   the all-slices case (zero axes) are correct and remain unchanged.

   Adds three test cases:
   - test_oindex_on_1d_array_map_with_int_array (1-D path works)
   - test_oindex_on_2d_array_map_all_slices (multi-dim, all slices, works)
   - test_oindex_on_multi_dim_array_map_with_two_array_axes_errors
     (the now-guarded case raises NotImplementedError)

2. Reachability comments on the two remaining NotImplementedErrors in
   `_intersect` (multi-dim orthogonal ArrayMap) and `_intersect_vectorized`
   (DimensionMap on a correlated input dim). Both are reachable only via
   direct manual transform construction; the public oindex/vindex/basic
   API never produces these states.

3. Strengthen the hypothesis associativity property test to sample 5
   coords per generated triple instead of 1. Negligible cost (still
   <0.5s), better probabilistic coverage of any coord-dependent bugs.

Test count: 153 → 156. mypy clean.
Coverage was 83% before this commit. Audit identified meaningful gaps
across three tiers:

**Tier 1 - real gaps in production paths:**
- selection_repr / __repr__: zero tests (covers all OutputIndexMap kinds).
- IndexTransform.intersect() vectorized dispatch: tests called the
  internal _intersect_vectorized directly, never the public dispatch.
- __post_init__ validation paths from the input_dimensions refactor:
  out-of-range input_dim, duplicate input_dimensions, shape mismatch.
- _intersect_vectorized DimensionMap-on-non-correlated-dim path
  (the post-refactor remap of preserved input dims).
- _apply_basic_indexing rejects negative slice step.
- compose dim-outer / arr-inner case (the one untested cell of the
  composition matrix).
- vindex on a transform that already has an ArrayMap output
  (now-explicit NotImplementedError).
- _validate_basic_selection / _validate_array_selection reject
  unsupported types (e.g., float).
- ConstantMap survives basic / oindex / vindex unchanged
  (each apply function had this branch but no test exercised it).
- _normalize_basic_selection error paths: too-many-indices,
  double-ellipsis, unsupported type.

**Tier 2 - dead negative-stride DimensionMap paths:**

DimensionMap.stride was previously unconstrained, but the public API
(slice.indices(), oindex/vindex normalize, _apply_basic_indexing)
never produces a negative-stride DimensionMap, and composition
preserves sign so multiplication of positives stays positive. Every
negative-stride code path was unreachable. Two options were:
test those paths or delete them. Chose deletion + add a positive-stride
constraint to __post_init__.

Removed:
- chunk_resolution._iter_chunk_transforms negative-stride branch
- chunk_resolution.sub_transform_to_selections negative-stride branch
- transform._intersect DimensionMap negative-stride and zero-stride
  branches (zero was already unreachable; both deleted)

Added: __post_init__ rejects DimensionMap with stride <= 0.

**Tier 3 - defensive RuntimeErrors and unreachable guards:**

Added `# pragma: no cover` to four defensive RuntimeErrors and two
NotImplementedErrors that are reachable only via direct manual
construction of malformed transforms (the public oindex/vindex/basic
API never produces these states). Also marked the silent fallthrough
in _normalize_oindex_selection / _apply_vindex parsing (the upstream
_validate_array_selection rejects the unsupported types first).

**Other test additions:**
- iter_chunk_transforms over-estimate skip path (when intersect
  returns None for a candidate chunk in the chunk-range).
- iter_chunk_transforms empty-domain path.
- oindex with ellipsis and trailing-dim implicit fill.
- oindex/vindex with bare int and Python list selections.
- vindex with ellipsis, 2-D bool mask (consumes 2 dims), trailing
  slice padding.

Adjusted hypothesis associativity strategy to only generate positive
strides (matching the new __post_init__ constraint).

Test count: 156 → 191. Coverage: 83% → 99% (line+branch).
The remaining 1% is partial-branch artifacts from loop fall-through
and `else: continue` constructs that fire incidentally based on test
case order; not real semantic gaps.
Two complementary changes:

1. Add tests that exercise real production paths the previous tests
   missed:
   - selection_repr / __repr__ / translate / basic / oindex / intersect
     output loops with an ArrayMap-then-DimensionMap transform (covers
     the ArrayMap branch's loop-continuation path).
   - intersect with two uncorrelated ArrayMaps (disjoint
     input_dimensions): exercises the orthogonal path's vectorized-
     detection no-correlation-found exit.
   - iter_chunk_transforms / sub_transform_to_selections with
     ArrayMap-followed-by-DimensionMap output (loop continuation).
   - sub_transform_to_selections with out_indices and a non-ArrayMap
     output preceding an ArrayMap (correlation-detection skip).
   - sub_transform_to_selections with two uncorrelated ArrayMaps under
     out_indices (orthogonal path with two arraymap entries).
   - compose: outer with mixed types (ConstantMap + DimensionMap), inner
     ArrayMap pointing at the ConstantMap dim. Exercises the path
     where outer.output[dim_i] is ConstantMap (not all-constant outer)
     and falls through to NotImplementedError.

2. Add `# pragma: no branch` to elifs that close exhaustive type unions:
   `elif isinstance(m, ArrayMap)` after ConstantMap and DimensionMap
   (OutputIndexMap = ConstantMap | DimensionMap | ArrayMap, so the
   False outcome is structurally unreachable), and `elif isinstance(sel,
   slice)` after the corresponding earlier elifs in
   _apply_basic_indexing and _apply_oindex (for sel ∈ int|slice|None
   and sel ∈ ndarray|slice respectively). 11 sites total — coverage
   tools can't detect type-union exhaustiveness.

Test count: 191 → 202. Coverage: 99% → 100% (0 missed lines, 0 partial
branches).
@d-v-b
Copy link
Copy Markdown
Contributor Author

d-v-b commented May 8, 2026

these data structures are not wired up to indexing yet, so we could safely merge this without changing any behavior.

@d-v-b d-v-b requested a review from ilan-gold May 8, 2026 15:56
@ilan-gold ilan-gold requested review from ilan-gold and removed request for ilan-gold May 8, 2026 17:15
@ilan-gold ilan-gold self-assigned this May 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

needs release notes Automatically applied to PRs which haven't added release notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants