idea: point on-disk representation using OME-NGFF #789
Comments
Could you elaborate on that?
Part of what I proposed is to store x- and y-coordinates of points relative to the edge of the Zarr chunk, as opposed to "absolute" (i.e., relative to the edge of the full field-of-view of the image), in order to reduce storage size by allowing smaller numpy dtypes to hold the values. This would mean the Zarr array values depend on the chunk shape of the array. Using a tool like rechunker to modify the chunk shape would therefore need to be accompanied by changes to all array values (in addition to their reorganization on-disk). Note that it would also be possible to discard the relative-coordinates part of the proposal yet still use the other aspects (which would alleviate concerns about array values depending on chunk shape, if that would be a blocker).
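To make the dependence on chunk shape concrete, here is a minimal NumPy sketch. It assumes integer coordinates and square chunks along one axis; the helper names (`to_chunk_relative`, `to_absolute`, `rechunk_offsets`) are hypothetical and not part of Zarr, rechunker, or any existing library.

```python
import numpy as np

def to_chunk_relative(coords, chunk_size):
    """Split absolute integer coordinates into (chunk index, offset in chunk)."""
    coords = np.asarray(coords, dtype=np.int64)
    chunk_index = coords // chunk_size
    # Smallest unsigned dtype able to hold offsets 0..chunk_size-1
    # (uint8 for chunk sizes up to 256).
    small_dtype = np.min_scalar_type(chunk_size - 1)
    offset = (coords % chunk_size).astype(small_dtype)
    return chunk_index, offset

def to_absolute(chunk_index, offset, chunk_size):
    """Recover absolute coordinates from chunk index and in-chunk offset."""
    return chunk_index * chunk_size + offset.astype(np.int64)

def rechunk_offsets(chunk_index, offset, old_size, new_size):
    """Changing the chunk size means recomputing every stored value."""
    absolute = to_absolute(chunk_index, offset, old_size)
    return to_chunk_relative(absolute, new_size)

x = np.array([5, 300, 1023, 70000])            # absolute x-coordinates
idx256, off256 = to_chunk_relative(x, 256)     # offsets fit in uint8
idx128, off128 = rechunk_offsets(idx256, off256, 256, 128)
```

Both encodings decode to the same absolute coordinates, but the stored offsets (and chunk assignments) differ between the two chunk sizes, which is why a plain rechunking step would not be enough on its own.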
Thanks @keller-mark for sharing, very clear explanation! (just a minor thing, I think there is a typo on the scatterplot, it should be I like:
Three aspects that I find a bit less ideal are that:
Practically, we could benchmark IO, spatial queries and selection by a value column (e.g. gene id), against alternatives such as:
I think that the main limitation could be selection by gene id, while I think that spatial queries and IO would be fast (unless there is a problem with a large number of files that is still present even after sharding).
An alternative approach, also Zarr-based, would be to spatially partition the points into a grid of chunks. Then for each chunk store the list of points in that chunk (we could also use a small dtype relative to the chunk position). The resulting storage would be a hierarchical folder structure identical to NGFF (same consideration as you made for padding the last dimension). The difference from your approach would be that the size of the chunk in physical units would be unrelated to the size of the chunk in pixels. I expect my approach to suffer on selection by a value column, but not to suffer from the number of chunks, which can be chosen arbitrarily since it is independent of the physical units.
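A rough in-memory sketch of this grid-partition alternative, assuming 2D points in physical units and a square cell of side `cell_size`. The function names and the dict-of-arrays layout are illustrative stand-ins for per-chunk storage, not an existing API; cells are keyed as (y, x) to mirror the chunk naming used in the diagram.

```python
import numpy as np

def partition_points(xy, cell_size):
    """Group points by grid cell; store offsets relative to the cell origin."""
    xy = np.asarray(xy, dtype=np.float64)              # columns are (x, y)
    cells = np.floor(xy / cell_size).astype(np.int64)  # cell index per point
    chunks = {}
    for key in np.unique(cells, axis=0):
        mask = np.all(cells == key, axis=1)
        origin = key * cell_size
        # Offsets are small, so a narrower dtype than float64 may suffice.
        chunks[(int(key[1]), int(key[0]))] = (xy[mask] - origin).astype(np.float32)
    return chunks

def query_box(chunks, cell_size, x_min, x_max, y_min, y_max):
    """Read only the cells overlapping the box, then filter their points."""
    hits = []
    for cy in range(int(np.floor(y_min / cell_size)), int(np.floor(y_max / cell_size)) + 1):
        for cx in range(int(np.floor(x_min / cell_size)), int(np.floor(x_max / cell_size)) + 1):
            rel = chunks.get((cy, cx))
            if rel is None:
                continue
            pts = rel.astype(np.float64) + np.array([cx, cy]) * cell_size
            inside = ((pts[:, 0] >= x_min) & (pts[:, 0] <= x_max)
                      & (pts[:, 1] >= y_min) & (pts[:, 1] <= y_max))
            hits.append(pts[inside])
    return np.concatenate(hits) if hits else np.empty((0, 2))

points = np.array([[1.5, 2.0], [12.0, 3.5], [11.0, 18.0], [1.0, 2.5]])
chunks = partition_points(points, cell_size=10.0)
found = query_box(chunks, 10.0, x_min=0.0, x_max=13.0, y_min=0.0, y_max=5.0)
```

The spatial query touches only the overlapping cells, whereas selecting by a value column such as gene id would still need to visit every cell, which matches the expected weakness noted above.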
I made this diagram to try to illustrate the point storage idea I was attempting to explain on Wednesday at the Basel Hackathon.
The idea would be to use OME-NGFF for spatially-arranged point storage.
Diagram corresponding to a single chunk in space 0.0.0 (z,y,x)
Notes
One assumption here is that the Zarr array chunks will not be reshaped frequently. This would allow the X-Y coordinates to be stored as relative offsets from the chunk edges using a small dtype (e.g., `uint8` for 256x256 chunk shapes), making them very efficient to load (e.g., over a network) for a large image. Another assumption is that Zarr's built-in compression will largely negate the on-disk impact of the null-value padding when chunks are not filled completely (and that there could be a mechanism to annotate somewhere how far into the chunk the values are non-null).
With a MERFISH dataset, one idea would be for the point NGFF image to use X and Y dimensions at least as large as those of the underlying microscope image that the points originated from (i.e., the image prior to point detection), so that point x/y coordinates can be stored as integers without loss of information, but this does not necessarily need to be the case.
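The padding assumption in the notes above can be checked informally with a small sketch. It uses the standard-library `zlib` as a stand-in for whichever compressor a Zarr array would be configured with, and the capacity, fill value, and point count are made-up illustration values. Because a fill value of 0 is also a valid offset, the number of real points is kept separately, playing the role of the "how far into the chunk is non-null" annotation.

```python
import zlib
import numpy as np

CHUNK_CAPACITY = 1024   # hypothetical maximum number of points per chunk
FILL = 0                # padding value; ambiguous with a real offset of 0

rng = np.random.default_rng(0)
n_points = 100          # number of real points in this chunk (stored as metadata)
offsets = rng.integers(0, 256, size=(n_points, 2), dtype=np.uint8)

# Pad the chunk up to its fixed capacity.
padded = np.full((CHUNK_CAPACITY, 2), FILL, dtype=np.uint8)
padded[:n_points] = offsets

raw_size = padded.nbytes
compressed_padded = len(zlib.compress(padded.tobytes()))
compressed_unpadded = len(zlib.compress(offsets.tobytes()))

# Reading back: the stored count says where the padding starts.
recovered = padded[:n_points]

print(raw_size, compressed_padded, compressed_unpadded)
```

Comparing `compressed_padded` with `compressed_unpadded` gives a rough sense of how much the padding costs after compression; a real benchmark would need actual point data and the actual codec.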
This is currently just an idea and could be benchmarked against conceptually simpler formats to check whether there is a performance benefit to justify it. I am not sure whether there would be drawbacks or benefits for operations such as querying.