Draft
1 change: 1 addition & 0 deletions changes/3573.feature.md
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
Added ``Array.chunk_slices`` and ``Array.shard_slices`` properties, which yield slices aligned with an array's chunks and shards, respectively.
20 changes: 20 additions & 0 deletions docs/user-guide/arrays.md
@@ -566,6 +566,26 @@ In this example a shard shape of (1000, 1000) and a chunk shape of (100, 100) is
This means that `10*10` chunks are stored in each shard, and there are `10*10` shards in total.
Without the `shards` argument, there would be 10,000 chunks stored as individual files.
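
The counts above can be checked with a short sketch. The array shape of (10000, 10000) is an assumption, inferred from the stated `10*10` shards of shape (1000, 1000); the variable names are illustrative only.

```python
from math import prod

array_shape = (10000, 10000)  # assumed: inferred from 10*10 shards of shape (1000, 1000)
shard_shape = (1000, 1000)
chunk_shape = (100, 100)

# Chunks per shard: how many chunks fit along each axis of one shard, multiplied together.
chunks_per_shard = prod(s // c for s, c in zip(shard_shape, chunk_shape))

# Shards in the array: how many shards fit along each axis of the array.
num_shards = prod(a // s for a, s in zip(array_shape, shard_shape))

# Without sharding, every chunk would be stored as its own object.
num_chunks = chunks_per_shard * num_shards

print(chunks_per_shard, num_shards, num_chunks)
```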

## Accessing chunks and shards

Arrays expose the `chunk_slices` and `shard_slices` properties, which yield one tuple of slices for each chunk or shard.
These slices can be used to write to shards in parallel, or to read from chunks in parallel.

```python exec="true" session="arrays" source="above" result="ansi"
a = zarr.create_array(store={}, shape=(100, 50), shards=(50, 40), chunks=(25, 20), dtype='uint8')

print("All shard slices:")
for shard_slice in a.shard_slices:
    print(shard_slice)
    # shard_data = a[shard_slice]

print("All chunk slices:")
for chunk_slice in a.chunk_slices:
    print(chunk_slice)
    # chunk_data = a[chunk_slice]
```
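
A minimal sketch of the parallel-write pattern, using only the standard library. Here `iter_regions` is a hypothetical stand-in for `shard_slices` (it is not zarr's implementation), and a nested list stands in for the array. With a real array, each task would instead perform an assignment such as `a[shard_slice] = ...`; whether concurrent writes are safe depends on the store in use.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def iter_regions(shape, block_shape):
    """Hypothetical stand-in for `shard_slices`: yield one tuple of slices per block."""
    starts = [range(0, size, step) for size, step in zip(shape, block_shape)]
    for origin in product(*starts):
        # Blocks at the edge of the array are clipped to the array bounds.
        yield tuple(
            slice(start, min(start + step, size), 1)
            for start, step, size in zip(origin, block_shape, shape)
        )


def fill_region(grid, region, value):
    """Write `value` into every cell of a 2D list-of-lists covered by `region`."""
    rows, cols = region
    for i in range(rows.start, rows.stop):
        for j in range(cols.start, cols.stop):
            grid[i][j] = value


shape, shard_shape = (100, 50), (50, 40)
grid = [[None] * shape[1] for _ in range(shape[0])]  # stand-in for a zarr array
regions = list(iter_regions(shape, shard_shape))


def write_task(item):
    """Fill one region with its position in the list of regions."""
    index, region = item
    fill_region(grid, region, index)


# Shard-aligned regions do not overlap, so each task touches a disjoint set of cells.
with ThreadPoolExecutor() as pool:
    list(pool.map(write_task, enumerate(regions)))
```

Because the regions are aligned to shard boundaries, no two tasks write to the same shard.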


## Missing features in 3.0

The following features have not been ported to 3.0 yet.
56 changes: 55 additions & 1 deletion src/zarr/core/array.py
@@ -3,7 +3,7 @@
import json
import warnings
from asyncio import gather
-from collections.abc import Iterable, Mapping
+from collections.abc import Generator, Iterable, Mapping
from dataclasses import dataclass, field, replace
from itertools import starmap
from logging import getLogger
@@ -1381,6 +1381,32 @@ async def example():
    async def nbytes_stored(self) -> int:
        return await self.store_path.store.getsize_prefix(self.store_path.path)

    @property
    def chunk_slices(self) -> Generator[tuple[slice, ...]]:
Contributor

nitpicking the name: I find "slice" to be kind of ambiguous between a verb and a noun, and it also locks us into returning slice objects, so what if we use the word "region" instead?

and also I think it's helpful if the name of this routine makes it clear that it's an iterator. So what if we call it iter_chunk_regions, i.e. the name of the routine it wraps 😜

Contributor Author

re. name, I think "slices" is unambiguously a noun because it's plural? I looked at NumPy (https://numpy.org/doc/stable/user/basics.indexing.html#slicing-and-striding), and they use the terms "index", "selection tuple", or "slicing tuple". If we try and stay consistent with NumPy, how about "chunk_indices"?

I like "regions", but to me it's ambiguous whether that means the index, or the array data at that index. If we settle on it that could be fixed by documentation and consistent use though.

re. iterator, do you mean generator? A list/string etc. are also iterators, but using a yield makes this into more specifically a generator.

So perhaps generate_chunk_indices? I find that a bit clunky though,

`for chunk_index in arr.chunk_indices:`

is much nicer than

`for chunk_index in arr.generate_chunk_indices:`

(or `iter_chunk_indices`)

which is why I prefer simply chunk_indices. I don't know if there's prior art in other libraries for naming generators?

Contributor

generators are iterators, and my thinking was that putting "iter" in the name conveys that users should expect to iterate over the value returned by calling this method.

when I think of iterating over indices, I think of iterating over tuples of coordinates, e.g., (0, 0, 0), (0, 0, 1), ..., which is what _iter_chunk_coords / _iter_shard_coords do right now.

Contributor

> I like "regions", but to me it's ambiguous whether that means the index, or the array data at that index. If we settle on it that could be fixed by documentation and consistent use though.

I worry that this ambiguity will hold for any name we pick :)
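
For reference on the terminology in this thread, here is a short standalone sketch of how Python distinguishes iterables, iterators, and generators, and of how a property whose body uses `yield` behaves. The `Demo` class and its hard-coded slices are hypothetical and not part of zarr.

```python
from collections.abc import Generator, Iterable, Iterator


class Demo:
    """Hypothetical stand-in for an array class; not zarr code."""

    @property
    def chunk_slices(self) -> Iterator[tuple[slice, ...]]:
        # A property body containing `yield` returns a new generator on each access.
        yield (slice(0, 2, 1),)
        yield (slice(2, 4, 1),)


demo = Demo()
regions = demo.chunk_slices

# A list is iterable, but it is not itself an iterator.
list_is_iterable = isinstance([1, 2], Iterable)
list_is_iterator = isinstance([1, 2], Iterator)

# A generator is both a Generator and an Iterator.
gen_is_generator = isinstance(regions, Generator)
gen_is_iterator = isinstance(regions, Iterator)

first_pass = list(regions)
second_pass = list(regions)  # the same generator object is now exhausted
fresh_pass = list(demo.chunk_slices)  # a new property access starts over
```

Since each property access creates a fresh generator, a loop over `demo.chunk_slices` can be repeated, whereas a generator stored in a variable can be consumed only once.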

        """
        Iterator over all chunks.

        Yields
        ------
        chunk_slice :
            Slice for each chunk in this array.
        """
        yield from self._iter_chunk_regions()

    @property
    def shard_slices(self) -> Generator[tuple[slice, ...]]:
        """
        Iterator over all shards.

        This can be used to loop through and index every shard of an array.

        Yields
        ------
        shard_slice :
            Slice for each shard in this array.
        """
        yield from self._iter_shard_regions()

    def _iter_chunk_coords(
        self, *, origin: Sequence[int] | None = None, selection_shape: Sequence[int] | None = None
    ) -> Iterator[tuple[int, ...]]:
@@ -2355,6 +2381,34 @@ def shards(self) -> tuple[int, ...] | None:
        """
        return self._async_array.shards

    @property
    def chunk_slices(self) -> Generator[tuple[slice, ...]]:
        """
        Iterator over all chunks.

        This can be used to loop through and index every chunk of an array.

        Yields
        ------
        chunk_slice :
            Slice for each chunk in this array.
        """
        yield from self._async_array.chunk_slices

    @property
    def shard_slices(self) -> Generator[tuple[slice, ...]]:
        """
        Iterator over all shards.

        This can be used to loop through and index every shard of an array.

        Yields
        ------
        shard_slice :
            Slice for each shard in this array.
        """
        yield from self._async_array.shard_slices

    @property
    def size(self) -> int:
        """Returns the total number of elements in the array.
21 changes: 21 additions & 0 deletions tests/test_array.py
@@ -2153,3 +2153,24 @@ def test_create_array_with_data_num_gets(
    # one get for the metadata and one per shard.
    # Note: we don't actually need one get per shard, but this is the current behavior
    assert store.counter["get"] == 1 + num_shards


@pytest.mark.parametrize("shards", [None, (4, 6)])
def test_chunk_slices(shards: None | tuple[int, ...]) -> None:
    arr = zarr.create_array(store={}, shape=(4, 8), dtype="uint8", chunks=(2, 3), shards=shards)
    assert list(arr.chunk_slices) == [
        (slice(0, 2, 1), slice(0, 3, 1)),
        (slice(0, 2, 1), slice(3, 6, 1)),
        (slice(0, 2, 1), slice(6, 8, 1)),
        (slice(2, 4, 1), slice(0, 3, 1)),
        (slice(2, 4, 1), slice(3, 6, 1)),
        (slice(2, 4, 1), slice(6, 8, 1)),
    ]


def test_shard_slices() -> None:
    arr = zarr.create_array(store={}, shape=(4, 8), dtype="uint8", chunks=(2, 3), shards=(4, 6))
    assert list(arr.shard_slices) == [
        (slice(0, 4, 1), slice(0, 6, 1)),
        (slice(0, 4, 1), slice(6, 8, 1)),
    ]
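
The expected values in these two tests can be cross-checked without zarr. The sketch below copies the expected regions from the tests and confirms that the chunk regions cover the (4, 8) array exactly once, and that each chunk region falls inside exactly one shard region. The helper `cells` is hypothetical and only serves this illustration.

```python
from itertools import product

# Expected regions, copied from the tests above.
chunk_regions = [
    (slice(0, 2, 1), slice(0, 3, 1)),
    (slice(0, 2, 1), slice(3, 6, 1)),
    (slice(0, 2, 1), slice(6, 8, 1)),
    (slice(2, 4, 1), slice(0, 3, 1)),
    (slice(2, 4, 1), slice(3, 6, 1)),
    (slice(2, 4, 1), slice(6, 8, 1)),
]
shard_regions = [
    (slice(0, 4, 1), slice(0, 6, 1)),
    (slice(0, 4, 1), slice(6, 8, 1)),
]


def cells(region):
    """Return the set of coordinates covered by a tuple of slices."""
    return set(product(*(range(s.start, s.stop, s.step) for s in region)))


all_cells = set(product(range(4), range(8)))
chunk_cells = [cells(region) for region in chunk_regions]

# The chunks tile the array: together they cover every cell, and no cell is counted twice.
covers_everything = set().union(*chunk_cells) == all_cells
no_overlap = sum(len(c) for c in chunk_cells) == len(all_cells)

# For each chunk, list the indices of the shards that fully contain it.
parents = [
    [i for i, shard in enumerate(shard_regions) if cells(chunk) <= cells(shard)]
    for chunk in chunk_regions
]
```

The chunks in the last column are clipped to `slice(6, 8, 1)` because the array width of 8 is not a multiple of the chunk width of 3.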