zarr.array from an existing zarr.Array #2622
Conversation
# Conflicts: # tests/test_array.py
Do we also want concurrency for different chunk sizes?
That would be nice, if the chunk sizes are somewhat compatible, i.e. one is a multiple of the other.
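One way to read "somewhat compatible" is sketched below: along every axis, one chunk size must divide the other evenly. The helper name `chunks_compatible` is hypothetical and is not part of zarr or of this PR; it only illustrates the check being discussed.

```python
from collections.abc import Sequence


def chunks_compatible(src: Sequence[int], dst: Sequence[int]) -> bool:
    """Return True if, along every axis, one chunk size is a multiple of the other.

    Hypothetical helper for illustration only; not part of the zarr API.
    """
    if len(src) != len(dst):
        return False
    for s, d in zip(src, dst):
        if s <= 0 or d <= 0:
            return False
        # Neither size divides the other: chunk boundaries do not line up.
        if s % d != 0 and d % s != 0:
            return False
    return True


print(chunks_compatible((100, 100), (50, 200)))  # True
print(chunks_compatible((100, 100), (30, 100)))  # False
```

When the check passes, each chunk of the coarser grid covers a whole number of chunks of the finer grid, so concurrent copy tasks would not need to write to the same destination chunk.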
# fill missing arguments with metadata of data Array
if chunks == "auto":
    chunks = data.chunks
is the intention for this to work with numpy arrays? because they don't have a chunks attribute. by contrast, dask arrays do have a chunks attribute, but it's a tuple of tuples of ints (because dask chunks can be irregularly sized). So maybe a bit more parsing is needed here.
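The extra parsing mentioned here could look roughly like the sketch below. It relies only on duck typing: no `chunks` attribute (NumPy), a flat tuple of ints (zarr), or a tuple of tuples of ints (dask). The function name `regular_chunks` and the rule for deciding when a dask-style layout counts as regular are assumptions made for illustration, not the PR's implementation. The `SimpleNamespace` objects are stand-ins for real arrays.

```python
from __future__ import annotations

from types import SimpleNamespace
from typing import Any


def regular_chunks(data: Any) -> tuple[int, ...] | None:
    """Return a regular chunk shape for `data`, or None if none can be derived.

    Hypothetical helper for illustration only; not part of the zarr API.
    """
    chunks = getattr(data, "chunks", None)
    if chunks is None:
        return None  # e.g. a NumPy array: the caller falls back to "auto"
    if all(isinstance(c, int) for c in chunks):
        return tuple(chunks)  # zarr-style: one int per axis
    result = []
    for sizes in chunks:  # dask-style: one tuple of sizes per axis
        sizes = tuple(sizes)
        if not sizes:
            return None
        first = sizes[0]
        # Regular means: all chunks equal, except a possibly smaller last one.
        if any(s != first for s in sizes[:-1]) or sizes[-1] > first:
            return None
        result.append(first)
    return tuple(result)


zarr_like = SimpleNamespace(chunks=(10, 20))
dask_like = SimpleNamespace(chunks=((3, 3, 3, 1), (5, 5)))
irregular = SimpleNamespace(chunks=((3, 1, 3), (5,)))

print(regular_chunks(zarr_like))  # (10, 20)
print(regular_chunks(dask_like))  # (3, 5)
print(regular_chunks(irregular))  # None
print(regular_chunks(object()))   # None
```

Returning None for irregular layouts leaves the decision to the caller, which could then use the "auto" chunking path instead of guessing.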
I was only thinking of zarr arrays as data. But I can generalize it for numpy and dask
> But I can generalize it for numpy and dask
I think that would be great! In the tests, we could use something like zarr.from_array(store=store, data=np.arange(10)) in many places!
a lot of my inspiration comes from trying to make tests easier to write :)
it might also make sense to put in a keyword argument that controls whether data is written or not. some users might want to only create the array, and write data to it later with a different method. I think the default should be to avoid IO (i.e., don't write).
I would do it the other way around: add an optional metadata_only kwarg.
In any case, there would be IO to write the zarr.json.
yes creating an array will always involve writing metadata, but users with TB scale datasets generally create the array first, then schedule writing to the array in a separate step. IMO attempting to write TB of data by default is not scalable for large datasets, and we should design around that case.
I know it is not scalable for large arrays. But from a from_array method, I would expect it to actually create the array with the data from the array. Opt out is fine. Maybe what you are proposing should be called something else. No strong opinion here, though.
I would expect from_array to create a zarr array. I would not expect it to fill the array with data, because I routinely use zarr with huge arrays. write-on-default would also be problematic for people with dask arrays, because zarr would not be in a position to decide which dask scheduler to use.
Ok.
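To make the two positions in this thread concrete, here is a toy model of the behaviour being debated. It is a sketch under loud assumptions: the store is a plain dict, the data is a 1-D list, and `from_sequence_sketch`, its `write_data` keyword, and the `c/<i>` chunk keys are all hypothetical names chosen for illustration. It is not zarr's implementation and not the signature proposed in the PR.

```python
import json


def from_sequence_sketch(
    store: dict, data: list, chunk_size: int, write_data: bool = False
) -> None:
    """Toy model: always write metadata, write chunks only on request.

    Hypothetical function for illustration only; not part of the zarr API.
    """
    # Creating the array always involves IO for the metadata document.
    store["zarr.json"] = json.dumps({"shape": [len(data)], "chunks": [chunk_size]})
    if not write_data:
        return  # default: no chunk IO, the caller schedules writes later
    for i, start in enumerate(range(0, len(data), chunk_size)):
        store[f"c/{i}"] = list(data[start : start + chunk_size])


metadata_only: dict = {}
from_sequence_sketch(metadata_only, list(range(10)), chunk_size=4)
print(sorted(metadata_only))  # ['zarr.json']

with_data: dict = {}
from_sequence_sketch(with_data, list(range(10)), chunk_size=4, write_data=True)
print(sorted(with_data))  # ['c/0', 'c/1', 'c/2', 'zarr.json']
```

With the default, the only cost of calling the function is the metadata write, which matches the workflow described for TB-scale datasets. Flipping the default would give the behaviour expected in the opposing comments, with an opt-out instead of an opt-in.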
if you are trying to write
@@ -3734,6 +3735,174 @@ class ShardsConfigParam(TypedDict):
ShardsLike: TypeAlias = ChunkCoords | ShardsConfigParam | Literal["auto"]


async def from_array(
    data: Array,
Suggested change:
-    data: Array,
+    data: Array | npt.ArrayLike,
As discussed, this function should also work with numpy arrays.
zarr.Array
#2410
added concurrent streaming of source array into new array
Restriction
TODO: