
zarr.array from an existing zarr.Array #2622

Draft
wants to merge 17 commits into base: main

Conversation

brokkoli71
Member

@brokkoli71 brokkoli71 commented Jan 2, 2025

added concurrent streaming of source array into new array

Restriction

  • Only allow concurrent streaming if the chunk shape of the existing and new array match. Otherwise, while streaming the existing array and writing to the new one, we could be writing to the same file in parallel, resulting in a race condition. (Is there some measure to prevent this that I am not aware of?)
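The race this restriction guards against can be illustrated in 1-D with a small hypothetical helper (not part of this PR): when source and destination chunk lengths differ, two source chunks can overlap the same destination chunk, so writing them in parallel would collide on the same file.

```python
def dst_chunks_touched(src_index: int, src_len: int, dst_len: int) -> set[int]:
    """Which destination chunks does source chunk `src_index` overlap (1-D)?"""
    start = src_index * src_len
    stop = start + src_len  # exclusive
    return set(range(start // dst_len, (stop - 1) // dst_len + 1))

# Matching chunk lengths: each source chunk maps to exactly one destination chunk.
assert dst_chunks_touched(0, 4, 4) == {0}
assert dst_chunks_touched(1, 4, 4) == {1}

# Mismatched lengths: source chunks 0 and 1 both touch destination chunk 0,
# so writing them concurrently would race on the same stored chunk.
assert dst_chunks_touched(0, 3, 4) == {0}
assert dst_chunks_touched(1, 3, 4) == {0, 1}
```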

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@brokkoli71 brokkoli71 marked this pull request as draft January 2, 2025 16:54
@brokkoli71
Member Author

Do we also want concurrency for different chunk sizes?

@normanrz
Member

normanrz commented Jan 8, 2025

Do we also want concurrency for different chunk sizes?

That would be nice, if the chunk sizes are somewhat compatible, i.e. one is a multiple of the other.


```python
# fill missing arguments with metadata of data Array
if chunks == "auto":
    chunks = data.chunks
```
Contributor

Is the intention for this to work with numpy arrays? Plain numpy arrays don't have a chunks attribute. By contrast, dask arrays do have a chunks attribute, but it's a tuple of tuples of ints (because dask chunks can be irregularly sized). So a bit more parsing is needed here.
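One way to handle all three cases could look like this (a sketch, name hypothetical; it assumes regular dask chunking and falls back to automatic chunking for plain numpy arrays):

```python
import numpy as np

def normalize_chunks(data, chunks):
    """Resolve chunks == "auto" against whatever kind of array `data` is."""
    if chunks != "auto":
        return chunks
    data_chunks = getattr(data, "chunks", None)
    if data_chunks is None:
        return "auto"  # plain numpy arrays: fall back to automatic chunking
    if isinstance(data_chunks[0], tuple):
        # dask-style tuple of tuples of ints; assume regular chunking
        # and take the first chunk length along each axis
        return tuple(axis_chunks[0] for axis_chunks in data_chunks)
    return tuple(data_chunks)  # zarr-style tuple of ints
```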

Member Author

I was only thinking of zarr arrays as data, but I can generalize it for numpy and dask.

Member

But I can generalize it for numpy and dask

I think that would be great! In the tests, we could use something like zarr.from_array(store=store, data=np.arange(10)) in many places!

Contributor

a lot of my inspiration comes from trying to make tests easier to write :)

it might also make sense to put in a keyword argument that controls whether data is written or not. some users might want to only create the array, and write data to it later with a different method. I think the default should be to avoid IO (i.e., don't write).

Member

I would do it the other way around: add an optional metadata_only kwarg.
In any case, there would be IO to write the zarr.json.

Contributor

yes creating an array will always involve writing metadata, but users with TB scale datasets generally create the array first, then schedule writing to the array in a separate step. IMO attempting to write TB of data by default is not scalable for large datasets, and we should design around that case.

Member

I know it is not scalable for large arrays. But from a from_array method, I would expect it to actually create the array with the data from the source array. Opting out is fine. Maybe what you are proposing should be called something else. No strong opinion here, though.

Contributor

I would expect from_array to create a zarr array. I would not expect it to fill the array with data, because I routinely use zarr with huge arrays. write-on-default would also be problematic for people with dask arrays, because zarr would not be in a position to decide which dask scheduler to use.

Member

Ok.
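The behavior settled in this thread can be sketched with a toy dict-backed store (function body and key names are purely illustrative, not the PR's implementation): creating the array always writes metadata, while copying the data is opt-out via a flag.

```python
import numpy as np

def from_array(data, *, store: dict, write_data: bool = True):
    """Toy sketch: metadata is always written; data copying is opt-out."""
    # Creating the array always involves IO for the metadata document.
    store["zarr.json"] = {"shape": data.shape, "dtype": str(data.dtype)}
    if write_data:
        store["c/0"] = np.asarray(data).copy()  # stand-in for per-chunk writes
    return store
```

With `write_data=False` only the metadata lands in the store, matching the TB-scale workflow of creating the array first and scheduling the writes separately.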

@d-v-b
Contributor

d-v-b commented Jan 8, 2025

  • (Is there some measure to prevent this that I am not aware of?)

If you are trying to write K input chunks into M output chunks, you can partition the K chunks into sets such that, within each set, elements can be written independently of all the others. Then you write each set one after another. In the worst case there will be one set per chunk, but you are guaranteed to avoid write collisions this way.
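The partitioning idea above can be sketched greedily (helper names hypothetical): given a function mapping each source chunk to the destination chunks it touches, group source chunks into batches whose destination sets are pairwise disjoint. Batches run one after another; chunks within a batch can be written concurrently.

```python
def partition_writes(src_chunks, dst_for):
    """Greedy partition: within a batch, no two source chunks share a destination."""
    batches: list[tuple[set, list]] = []  # (destinations used, source chunks)
    for src in src_chunks:
        dsts = set(dst_for(src))
        for used, batch in batches:
            if used.isdisjoint(dsts):  # no collision: join this batch
                used |= dsts
                batch.append(src)
                break
        else:  # collides with every existing batch: start a new one
            batches.append((dsts, [src]))
    return [batch for _, batch in batches]

# Source chunks of length 2 written into destination chunks of length 4:
# sources 2k and 2k+1 both touch destination k and must not run concurrently.
batches = partition_writes(range(4), lambda s: {s // 2})
# → [[0, 2], [1, 3]]: two batches, no shared destination within a batch
```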

@dstansby dstansby added the needs release notes Automatically applied to PRs which haven't added release notes label Jan 9, 2025
```diff
@@ -3734,6 +3735,174 @@ class ShardsConfigParam(TypedDict):
 ShardsLike: TypeAlias = ChunkCoords | ShardsConfigParam | Literal["auto"]


 async def from_array(
     data: Array,
```
Member

Suggested change
```diff
-    data: Array,
+    data: Array | npt.ArrayLike,
```

As discussed, this function should also work with numpy arrays.
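Accepting both could be done with a small coercion step (a duck-typed sketch, name and attribute check are assumptions, not the PR's code): pass zarr Arrays through untouched and funnel everything else through numpy.

```python
import numpy as np

def coerce_data(data):
    """Pass zarr Arrays through; coerce array-likes (lists, numpy, ...) via numpy."""
    if hasattr(data, "store"):  # crude duck-type check for zarr.Array (assumption)
        return data
    return np.asarray(data)
```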

Successfully merging this pull request may close these issues.

[v3] zarr.array from an existing zarr.Array
4 participants