[lazy indexing] transforms #3956
Open
d-v-b wants to merge 26 commits intozarr-developers:mainfrom
Open
Conversation
Empty package, no exports yet. Subsequent commits port modules from PR zarr-developers#3906's transforms package and rewire imports to the private name.
Defines ConstantMap, DimensionMap, ArrayMap, and the OutputIndexMap union. No in-package dependencies.
Defines IndexDomain (a rectangular integer-coordinate region) plus the _normalize_selection helper used by transform construction.
Defines IndexTransform plus selection_to_transform and the per-mode application functions (_apply_basic_indexing, _apply_oindex, _apply_vindex). This is the core of the package.
Defines compose(outer, inner) -> IndexTransform and the per-output-map dispatch helpers.
Defines iter_chunk_transforms and sub_transform_to_selections, which bridge a transform to per-chunk selections. No callers in this PR; included so the package is internally complete and ready for future internal-rewiring work.
Re-export IndexDomain, IndexTransform, the OutputIndexMap variants, and compose from the package root. JSON helpers are intentionally excluded; chunk_resolution exports are intentionally not re-exported (no callers in this PR).
…ssor reference The docstrings inherited RST-style double-backticks from the reference branch. Convert to markdown single-backticks throughout (zarr-python's docs are markdown-rendered, so the doubles render as literal `` text). Also drop the stale "Every Array holds a transform" / "Array.z[...]" comment from transform.py — those statements describe an earlier design where the transform lived on Array; the actual lazy view design parks the transform on a separate wrapper type.
Replace class-based test grouping with top-level parametrized functions. Each test covers either all success branches or all error branches of a single concern. Adds tests/test_transforms/conftest.py with project-style `Expect[TIn, TOut]` and `ExpectErr[TIn]` test-case helpers. The new mutation_errors test actually exercises FrozenInstanceError (the previous test_frozen variants only checked isinstance, which the @DataClass decorator guarantees regardless of frozen=True).
One success test and (where applicable) one error test per public function on IndexDomain. Adds direct tests for the non-trivial private helper _normalize_selection, including the previously-untested double-ellipsis branch. Coverage: 34 tests → 53 tests.
The function bridged NumPy-style negative indices to TensorStore-style absolute coordinates. It was originally used by the eager-path rewiring in PR zarr-developers#3906 (4 callsites in src/zarr/core/array.py). This PR has no such rewiring — PR #2's lazy view materializes via the existing eager indexing path, which already handles negatives — so the helper has no callers in our shipping plan. Re-add when (and if) the internal-rewiring follow-up arrives. Removing it now keeps the package surface lean and the diff focused.
One success test (and where applicable one error test) per public method on IndexTransform plus the public selection_to_transform function. Adds direct tests for _intersect_vectorized — the correlated multi-ArrayMap intersection path. Public `intersect` only reaches it when the transform has 2+ ArrayMap outputs; all the public-surface intersect cases use a single ArrayMap, so this branch was previously unreached by tests.
Six (inner_kind, outer_kind) pairs for the compose dispatch matrix collapse from six per-case methods into one parametrized test. The non-trivial private helpers (_compose_single, _compose_dimension, _compose_array) are reached transitively via this matrix and need no direct tests.
iter_chunk_transforms cases parametrized over (transform, grid) inputs; sub_transform_to_selections cases parametrized over the output-map-kind matrix. Adds 3 cases that exercise the previously-untested out_indices parameter on sub_transform_to_selections: orthogonal-array-with- out_indices, vectorized-with-out_indices, and the implicit fallback when out_indices is None.
When mypy runs over the rewritten test files together (rather than one at a time), it tightens up two type checks that were lax in isolation: - selection_to_transform's mode parameter is a Literal, not str. - assert_array_equal expects ndarray, but out_sel[0] is typed as slice | ndarray. Add an isinstance check. These were latent issues in the rewrites; they only surfaced when mypy saw all the test files at once.
…main.py Missed by the earlier polish commit (fca2926): the docstring at the top of domain.py used an RST `::` indented code block. Convert to a fenced markdown block to match the rest of the package's docstrings.
…nches Code review surfaced six public functions/methods with reachable error branches and no test coverage. Add parametrized error tests (or standalone tests, where there is one trivial input) to close the gaps: - IndexTransform.oindex: negative slice step raises IndexError. - IndexTransform.vindex: negative slice step raises IndexError. - IndexTransform.intersect: rank-mismatched output_domain raises ValueError. - IndexTransform.translate: wrong-length shift raises ValueError. - selection_to_transform: unknown mode string raises ValueError. - compose: 2D ArrayMap inner with non-constant outer raises NotImplementedError. Test count: 145 → 152.
…vity
An IndexTransform models a function from input coords to output coords;
function composition is associative by definition. This test verifies
that the implementation preserves that algebraic property by sampling
random affine triples (a, b, c) with compatible ranks and checking that
compose(compose(a, b), c)
evaluates the same as
compose(a, compose(b, c))
at random points in a's domain. 200 hypothesis examples; passes cleanly.
Restricted to DimensionMap + ConstantMap outputs (the affine subset).
ArrayMap composition has implementation-level branching that depends on
outer structure and would need a more careful generator to avoid the
NotImplementedError path; the affine case is the algebraic core, the
ArrayMap case is deferred to a follow-up.
Previously, ArrayMap relied on an implicit convention to encode which input dims it was parameterized by: the array shape was matched against the input domain's shape by length, and "correlation" between two ArrayMaps (vectorized indexing) was inferred from "two ArrayMaps exist in the same transform." This refactor makes parameterization explicit: - ArrayMap gains `input_dimensions: tuple[int, ...]` as a required field. Each axis of the index_array corresponds to one entry in input_dimensions, in order. - IndexTransform.__post_init__ enforces that ArrayMap.index_array.shape equals the input domain's extent on input_dimensions. Out-of-range and duplicate input dims are also rejected. - Two ArrayMaps are correlated iff their input_dimensions overlap. The vectorized-indexing detection in `intersect` and `chunk_resolution` uses this rule rather than the brittle "count >= 2" heuristic. - _intersect_vectorized correctly preserves preserved (non-correlated) input dims when collapsing the correlated dims into a single surviving-points dim. - _intersect on an orthogonal ArrayMap now correctly shrinks the input dim it parameterizes when filtering. (Previous code mutated the array without updating the input domain, which would now fail __post_init__.) - _compose_array uses input_dimensions to determine parameterization uniformly; the previous arr.ndim-and-len(outer.output) heuristic is replaced with a clearer single-input-dim case. - _apply_basic_indexing rebuilds ArrayMap input_dimensions correctly for newaxis (preserves the array's parameterization, just shifts the referenced dim index). - _apply_oindex and _apply_vindex set input_dimensions on every newly- constructed ArrayMap. - Drop unused helpers _reindex_array and _reindex_array_oindex. - Test fixtures and assertions updated to use the new ArrayMap shape. Why: the implicit-correlation-by-position convention was a footgun for anyone constructing transforms directly. Making input_dimensions explicit lets the type system enforce shape and correlation invariants, simplifies the composition and intersect implementations, and produces a model that can be coherently explained. All 153 transforms tests pass (including the 200-example associativity property test); 150 baseline indexing tests unchanged.
Function-level `import itertools` inside iter_chunk_transforms moved to module top. The TYPE_CHECKING-guarded imports are intentionally inside their guard and remain there.
…b.com/d-v-b/zarr-python into worktree-lazy-indexing-pr1-transforms
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3956 +/- ##
==========================================
+ Coverage 93.26% 93.67% +0.40%
==========================================
Files 87 93 +6
Lines 11721 12481 +760
==========================================
+ Hits 10932 11692 +760
Misses 789 789
🚀 New features to boost your workflow:
|
…view
Code review on the post-refactor branch found one real bug and three
minor follow-ups. All addressed:
1. **`_apply_oindex` on a multi-dim ArrayMap with 2+ axes selected by
integer arrays** previously did `m.index_array[a, b]` which performs
NumPy vectorized broadcasting (wrong) instead of orthogonal outer
product (right). The shape check in `__post_init__` would catch the
resulting wrong shape, but the user-facing error was opaque. Replace
with an explicit `NotImplementedError`. The 1-D case (one axis) and
the all-slices case (zero axes) are correct and remain unchanged.
Adds three test cases:
- test_oindex_on_1d_array_map_with_int_array (1-D path works)
- test_oindex_on_2d_array_map_all_slices (multi-dim, all slices, works)
- test_oindex_on_multi_dim_array_map_with_two_array_axes_errors
(the now-guarded case raises NotImplementedError)
2. Reachability comments on the two remaining NotImplementedErrors in
`_intersect` (multi-dim orthogonal ArrayMap) and `_intersect_vectorized`
(DimensionMap on a correlated input dim). Both are reachable only via
direct manual transform construction; the public oindex/vindex/basic
API never produces these states.
3. Strengthen the hypothesis associativity property test to sample 5
coords per generated triple instead of 1. Negligible cost (still
<0.5s), better probabilistic coverage of any coord-dependent bugs.
Test count: 153 → 156. mypy clean.
Coverage was 83% before this commit. Audit identified meaningful gaps across three tiers: **Tier 1 - real gaps in production paths:** - selection_repr / __repr__: zero tests (covers all OutputIndexMap kinds). - IndexTransform.intersect() vectorized dispatch: tests called the internal _intersect_vectorized directly, never the public dispatch. - __post_init__ validation paths from the input_dimensions refactor: out-of-range input_dim, duplicate input_dimensions, shape mismatch. - _intersect_vectorized DimensionMap-on-non-correlated-dim path (the post-refactor remap of preserved input dims). - _apply_basic_indexing rejects negative slice step. - compose dim-outer / arr-inner case (the one untested cell of the composition matrix). - vindex on a transform that already has an ArrayMap output (now-explicit NotImplementedError). - _validate_basic_selection / _validate_array_selection reject unsupported types (e.g., float). - ConstantMap survives basic / oindex / vindex unchanged (each apply function had this branch but no test exercised it). - _normalize_basic_selection error paths: too-many-indices, double-ellipsis, unsupported type. **Tier 2 - dead negative-stride DimensionMap paths:** DimensionMap.stride was previously unconstrained, but the public API (slice.indices(), oindex/vindex normalize, _apply_basic_indexing) never produces a negative-stride DimensionMap, and composition preserves sign so multiplication of positives stays positive. Every negative-stride code path was unreachable. Two options were: test those paths or delete them. Chose deletion + add a positive-stride constraint to __post_init__. Removed: - chunk_resolution._iter_chunk_transforms negative-stride branch - chunk_resolution.sub_transform_to_selections negative-stride branch - transform._intersect DimensionMap negative-stride and zero-stride branches (zero was already unreachable; both deleted) Added: __post_init__ rejects DimensionMap with stride <= 0. **Tier 3 - defensive RuntimeErrors and unreachable guards:** Added `# pragma: no cover` to four defensive RuntimeErrors and two NotImplementedErrors that are reachable only via direct manual construction of malformed transforms (the public oindex/vindex/basic API never produces these states). Also marked the silent fallthrough in _normalize_oindex_selection / _apply_vindex parsing (the upstream _validate_array_selection rejects the unsupported types first). **Other test additions:** - iter_chunk_transforms over-estimate skip path (when intersect returns None for a candidate chunk in the chunk-range). - iter_chunk_transforms empty-domain path. - oindex with ellipsis and trailing-dim implicit fill. - oindex/vindex with bare int and Python list selections. - vindex with ellipsis, 2-D bool mask (consumes 2 dims), trailing slice padding. Adjusted hypothesis associativity strategy to only generate positive strides (matching the new __post_init__ constraint). Test count: 156 → 191. Coverage: 83% → 99% (line+branch). The remaining 1% is partial-branch artifacts from loop fall-through and `else: continue` constructs that fire incidentally based on test case order; not real semantic gaps.
Two complementary changes:
1. Add tests that exercise real production paths the previous tests
missed:
- selection_repr / __repr__ / translate / basic / oindex / intersect
output loops with an ArrayMap-then-DimensionMap transform (covers
the ArrayMap branch's loop-continuation path).
- intersect with two uncorrelated ArrayMaps (disjoint
input_dimensions): exercises the orthogonal path's vectorized-
detection no-correlation-found exit.
- iter_chunk_transforms / sub_transform_to_selections with
ArrayMap-followed-by-DimensionMap output (loop continuation).
- sub_transform_to_selections with out_indices and a non-ArrayMap
output preceding an ArrayMap (correlation-detection skip).
- sub_transform_to_selections with two uncorrelated ArrayMaps under
out_indices (orthogonal path with two arraymap entries).
- compose: outer with mixed types (ConstantMap + DimensionMap), inner
ArrayMap pointing at the ConstantMap dim. Exercises the path
where outer.output[dim_i] is ConstantMap (not all-constant outer)
and falls through to NotImplementedError.
2. Add `# pragma: no branch` to elifs that close exhaustive type unions:
`elif isinstance(m, ArrayMap)` after ConstantMap and DimensionMap
(OutputIndexMap = ConstantMap | DimensionMap | ArrayMap, so the
False outcome is structurally unreachable), and `elif isinstance(sel,
slice)` after the corresponding earlier elifs in
_apply_basic_indexing and _apply_oindex (for sel ∈ int|slice|None
and sel ∈ ndarray|slice respectively). 11 sites total — coverage
tools can't detect type-union exhaustiveness.
Test count: 191 → 202. Coverage: 99% → 100% (0 missed lines, 0 partial
branches).
Contributor
Author
|
these data structures are not wired up to indexing yet, so we could safely merge this without changing any behavior. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This is a claude-assissted PR derived from #3906 that breaks out the index transformation functionality into its own module.
Ultimately we will model indexing an array as a mapping from the source indices to a set of output indices via an index transformation. The data structures here define the pieces of that model.
IndexDomainis a model of a contiguous domain of n-dimensional array indices. This tracks the domain of an array that is being indexed. It's basically the set of valid coordinates for an indexable array, represented implicitly.OutputIndexMapis a union of types that model where, in an output (like a chunk), the indexing operation will send array elements.IndexTransformis a mapping from anIndexDomainto a collection ofOutputIndexMapelements. In a later PR, every Zarr array will be associated with anIndexTransform, which tracks how the array's indices are "projected" to stored indices.