
zarr.array from an existing zarr.Array #2622

Draft
wants to merge 17 commits into base: main

Conversation

brokkoli71
Member

@brokkoli71 brokkoli71 commented Jan 2, 2025

added concurrent streaming of source array into new array

Restriction

  • Only allow concurrent streaming if the chunk shape of the existing and new array match. Otherwise, while streaming the existing array and writing to the new one, we could be writing to the same file in parallel, resulting in a race condition. (Is there some measure to prevent this that I am not aware of?)
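The race this restriction guards against can be illustrated in 1-D with a small hypothetical helper (not part of this PR): when source and destination chunk lengths differ, two source chunks can overlap the same destination chunk, so writing them in parallel would collide on the same file.

```python
def dst_chunks_touched(src_index: int, src_len: int, dst_len: int) -> set[int]:
    """Which destination chunks does source chunk `src_index` overlap (1-D)?"""
    start = src_index * src_len
    stop = start + src_len  # exclusive
    return set(range(start // dst_len, (stop - 1) // dst_len + 1))

# Matching chunk lengths: each source chunk maps to exactly one destination chunk.
assert dst_chunks_touched(0, 4, 4) == {0}
assert dst_chunks_touched(1, 4, 4) == {1}

# Mismatched lengths: source chunks 0 and 1 both touch destination chunk 0,
# so writing them concurrently would race on the same stored chunk.
assert dst_chunks_touched(0, 3, 4) == {0}
assert dst_chunks_touched(1, 3, 4) == {0, 1}
```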

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@brokkoli71 brokkoli71 marked this pull request as draft January 2, 2025 16:54
@brokkoli71
Member Author

Do we also want concurrency for different chunk sizes?

@normanrz
Member

normanrz commented Jan 8, 2025

Do we also want concurrency for different chunk sizes?

That would be nice, if the chunk sizes are somewhat compatible, i.e. one is a multiple of the other.


```python
# fill missing arguments with metadata of data Array
if chunks == "auto":
    chunks = data.chunks
```
Contributor

Is the intention for this to work with numpy arrays? Plain numpy arrays don't have a chunks attribute. By contrast, dask arrays do have a chunks attribute, but it's a tuple of tuples of ints (because dask chunks can be irregularly sized). So a bit more parsing is needed here.
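One way to handle all three cases could look like this (a sketch, name hypothetical; it assumes regular dask chunking and falls back to automatic chunking for plain numpy arrays):

```python
import numpy as np

def normalize_chunks(data, chunks):
    """Resolve chunks == "auto" against whatever kind of array `data` is."""
    if chunks != "auto":
        return chunks
    data_chunks = getattr(data, "chunks", None)
    if data_chunks is None:
        return "auto"  # plain numpy arrays: fall back to automatic chunking
    if isinstance(data_chunks[0], tuple):
        # dask-style tuple of tuples of ints; assume regular chunking
        # and take the first chunk length along each axis
        return tuple(axis_chunks[0] for axis_chunks in data_chunks)
    return tuple(data_chunks)  # zarr-style tuple of ints
```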

Member Author

I was only thinking of zarr arrays as data, but I can generalize it for numpy and dask.

Member

But I can generalize it for numpy and dask

I think that would be great! In the tests, we could use something like zarr.from_array(store=store, data=np.arange(10)) in many places!

Contributor

a lot of my inspiration comes from trying to make tests easier to write :)

it might also make sense to put in a keyword argument that controls whether data is written or not. some users might want to only create the array, and write data to it later with a different method. I think the default should be to avoid IO (i.e., don't write).

Member

I would do it the other way around: add an optional metadata_only kwarg.
In any case, there would be IO to write the zarr.json.

Contributor

yes creating an array will always involve writing metadata, but users with TB scale datasets generally create the array first, then schedule writing to the array in a separate step. IMO attempting to write TB of data by default is not scalable for large datasets, and we should design around that case.

Member

I know it is not scalable for large arrays. But from a from_array method, I would expect it to actually create the array with the data from the source array. Opting out is fine. Maybe what you are proposing should be called something else. No strong opinion here, though.

Contributor

I would expect from_array to create a zarr array. I would not expect it to fill the array with data, because I routinely use zarr with huge arrays. write-on-default would also be problematic for people with dask arrays, because zarr would not be in a position to decide which dask scheduler to use.

Member

Ok.
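The behavior settled in this thread can be sketched with a toy dict-backed store (function body and key names are purely illustrative, not the PR's implementation): creating the array always writes metadata, while copying the data is opt-out via a flag.

```python
import numpy as np

def from_array(data, *, store: dict, write_data: bool = True):
    """Toy sketch: metadata is always written; data copying is opt-out."""
    # Creating the array always involves IO for the metadata document.
    store["zarr.json"] = {"shape": data.shape, "dtype": str(data.dtype)}
    if write_data:
        store["c/0"] = np.asarray(data).copy()  # stand-in for per-chunk writes
    return store
```

With `write_data=False` only the metadata lands in the store, matching the TB-scale workflow of creating the array first and scheduling the writes separately.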

@d-v-b
Contributor

d-v-b commented Jan 8, 2025

  • (Is there some measure to prevent this that I am not aware of?)

If you are trying to write K input chunks into M output chunks, you can partition the K chunks into sets such that, within each set, elements can be written independently of all the others. Then you write each set one after another. In the worst case there will be one set per chunk, but you are guaranteed to avoid write collisions this way.
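The partitioning idea above can be sketched greedily (helper names hypothetical): given a function mapping each source chunk to the destination chunks it touches, group source chunks into batches whose destination sets are pairwise disjoint. Batches run one after another; chunks within a batch can be written concurrently.

```python
def partition_writes(src_chunks, dst_for):
    """Greedy partition: within a batch, no two source chunks share a destination."""
    batches: list[tuple[set, list]] = []  # (destinations used, source chunks)
    for src in src_chunks:
        dsts = set(dst_for(src))
        for used, batch in batches:
            if used.isdisjoint(dsts):  # no collision: join this batch
                used |= dsts
                batch.append(src)
                break
        else:  # collides with every existing batch: start a new one
            batches.append((dsts, [src]))
    return [batch for _, batch in batches]

# Source chunks of length 2 written into destination chunks of length 4:
# sources 2k and 2k+1 both touch destination k and must not run concurrently.
batches = partition_writes(range(4), lambda s: {s // 2})
# → [[0, 2], [1, 3]]: two batches, no shared destination within a batch
```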

@dstansby dstansby added the needs release notes Automatically applied to PRs which haven't added release notes label Jan 9, 2025
```diff
@@ -3734,6 +3735,174 @@ class ShardsConfigParam(TypedDict):
 ShardsLike: TypeAlias = ChunkCoords | ShardsConfigParam | Literal["auto"]


 async def from_array(
     data: Array,
```
Member

Suggested change
```diff
-    data: Array,
+    data: Array | npt.ArrayLike,
```

As discussed, this function should also work with numpy arrays.
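Accepting both could be done with a small coercion step (a duck-typed sketch, name and attribute check are assumptions, not the PR's code): pass zarr Arrays through untouched and funnel everything else through numpy.

```python
import numpy as np

def coerce_data(data):
    """Pass zarr Arrays through; coerce array-likes (lists, numpy, ...) via numpy."""
    if hasattr(data, "store"):  # crude duck-type check for zarr.Array (assumption)
        return data
    return np.asarray(data)
```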

Successfully merging this pull request may close these issues.

[v3] zarr.array from an existing zarr.Array
4 participants