
idea: point on-disk representation using OME-NGFF #789

Open
keller-mark opened this issue Nov 14, 2024 · 4 comments

@keller-mark

keller-mark commented Nov 14, 2024

I made this diagram to try to illustrate the point storage idea I was attempting to explain on Wednesday at the Basel Hackathon.

The idea would be to use OME-NGFF for spatially-arranged point storage.

Diagram corresponding to a single chunk in space 0.0.0 (z,y,x)


Notes

One assumption here is that the Zarr array chunks will not be reshaped frequently. This would enable storing the X-Y coordinates as relative offsets from the chunk edges using a small dtype (e.g., uint8 for 256x256 chunk shapes), making them very efficient to load (e.g., over a network) for a large image. Another assumption is that Zarr's built-in compression will negate the on-disk impact of the null-value padding when chunks are not completely filled (and that there could be a mechanism to annotate, somewhere, how far into the chunk the values are non-null).
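For concreteness, a minimal sketch of the relative-offset idea (the function name and layout are made up for illustration, not part of any existing spatialdata/NGFF API), assuming 256x256 (y, x) chunks so that the offsets fit into uint8:

```python
import numpy as np

CHUNK = 256  # assumed chunk edge length in pixels

def to_chunk_relative(x: np.ndarray, y: np.ndarray):
    """Split absolute integer pixel coordinates into chunk indices plus
    small-dtype offsets from the chunk edges (hypothetical helper)."""
    chunk_ix = (x // CHUNK).astype(np.int32)
    chunk_iy = (y // CHUNK).astype(np.int32)
    off_x = (x % CHUNK).astype(np.uint8)  # 0..255, so uint8 suffices
    off_y = (y % CHUNK).astype(np.uint8)
    return chunk_iy, chunk_ix, off_y, off_x

# e.g., the two points from the diagram, both landing in chunk (0, 0)
chunk_iy, chunk_ix, off_y, off_x = to_chunk_relative(np.array([1, 4]), np.array([5, 4]))
```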

With a MERFISH dataset, one idea would be for the point NGFF image to use X and Y dimensions at least as large as those of the underlying microscope image from which the points originated (i.e., the image prior to point detection), so that point x/y coordinates can be stored as integers without loss of information; however, this does not necessarily need to be the case.

This is currently just an idea and could be benchmarked against conceptually simpler formats to check whether there is a performance benefit to justify it. I am not sure whether there would be drawbacks or benefits for operations such as querying.

@timtreis
Member

timtreis commented Dec 9, 2024

> One assumption here is that the Zarr array chunks will not be reshaped frequently.

Could you elaborate on that?

@keller-mark
Author

Part of what I proposed is to store the x- and y-coordinates of points relative to the edge of the Zarr chunk -- as opposed to "absolute" coordinates (i.e., relative to the edge of the full field-of-view of the image) -- in order to reduce storage requirements by allowing smaller numpy dtypes to be used for the values.

This would mean the Zarr array values are dependent on the chunk shape of the array. Using a tool like rechunker to modify the chunk shape of the array would therefore need to be accompanied by changes to all array values (in addition to their reorganization on-disk).
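To make this concrete, a hedged sketch of the decoding direction (illustrative names and a fixed chunk edge): absolute coordinates are reconstructed from the chunk index plus the stored offset, so decoding with a different chunk edge would give different positions.

```python
import numpy as np

def to_absolute(chunk_iy, chunk_ix, off_y, off_x, chunk_edge=256):
    # absolute coordinate = chunk origin + stored offset, where the chunk
    # origin is derived from the chunk index and the chunk edge length
    y = chunk_iy.astype(np.int64) * chunk_edge + off_y
    x = chunk_ix.astype(np.int64) * chunk_edge + off_x
    return y, x

# Decoding the same offsets with chunk_edge=128 would yield different absolute
# coordinates, which is why rechunking must rewrite the values, not just move them.
```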


Note that it would also be possible to discard the relative-coordinates part of the proposal yet still use the other aspects (which would alleviate concerns about array values being dependent on chunk shape, if that would be a blocker).

@LucaMarconato
Member

Thanks @keller-mark for sharing, very clear explanation! (just a minor thing, I think there is a typo on the scatterplot, it should be P1(x:1, y:5, gene 0) and P2(x:4, y:4, gene 1)).

I like:

  1. the idea of being able to use a small dtype thanks to the relative positioning within the chunk;
  2. the separation of the various "columns" into different channels, which means that one would have the benefit of a columnar data format.

Three aspects that I find a bit less ideal are that:

  1. to use a small dtype one would need a small chunk size (but maybe this is not a problem now that sharding is available);
  2. one would not be able to represent more than 256*256 points in a 256x256 chunk (but in practice I don't think one would encounter this limitation);
  3. one would aim at having integer coordinates, but I think this can indeed be worked around (see the quantization sketch after this list). Also, I think that TileDB actually also assumes integer coordinates; IIRC a point cloud is interpreted as an extremely sparse matrix (1 matrix entry = 1 point).
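Regarding point 3, a hypothetical workaround sketch: quantize float coordinates onto a finer fixed grid and record the scale factor in metadata, so decoding is a single division. The subpixel factor below is an arbitrary illustration, not a proposed standard.

```python
import numpy as np

SUBPIXEL = 16  # assumed: store coordinates at 1/16-pixel resolution

def quantize(coords: np.ndarray) -> np.ndarray:
    # map float pixel coordinates to integers on the finer grid
    return np.round(coords * SUBPIXEL).astype(np.uint16)

def dequantize(q: np.ndarray) -> np.ndarray:
    return q.astype(np.float64) / SUBPIXEL

xy = np.array([[1.23, 5.07], [4.5, 4.0]])
assert np.allclose(dequantize(quantize(xy)), xy, atol=0.5 / SUBPIXEL)
```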

Practically, we could benchmark IO, spatial queries and selection by a value column (e.g. gene id), against alternatives such as:

  1. geoparquet with spatial partitioning and geoarrow-encoded geometries (rough sketch below)
  2. TileDB?
  3. something I will explain in my next message.

I think that the main limitation could be selection by gene id, while I think that spatial queries and IO would be fast (unless there is a problem with a large number of files that persists even after sharding).
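A rough sketch of what the geoparquet baseline in point 1 could look like, assuming geopandas >= 1.0 for the GeoArrow geometry encoding; the hand-rolled grid-cell column is only a stand-in for proper spatial partitioning, and all names and paths are illustrative:

```python
from pathlib import Path

import geopandas as gpd
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.uniform(0, 4096, n)
y = rng.uniform(0, 4096, n)
gdf = gpd.GeoDataFrame(
    {"gene_id": rng.integers(0, 500, n)},
    geometry=gpd.points_from_xy(x, y),
)

# naive spatial partitioning: one file per 512x512 grid cell
gdf["cell"] = (y // 512).astype(int) * 8 + (x // 512).astype(int)
Path("points_geoparquet").mkdir(exist_ok=True)
for cell, part in gdf.groupby("cell"):
    part.drop(columns="cell").to_parquet(
        f"points_geoparquet/cell={cell}.parquet",
        geometry_encoding="geoarrow",  # assumption: available in geopandas >= 1.0
    )
```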

@LucaMarconato
Member

An alternative approach, also Zarr-based, would be to spatially partition the points into a grid of chunks. Then, for each chunk, store the list of points falling in that chunk (we could also use a small dtype for coordinates relative to the chunk position). The resulting storage would be a hierarchical folder structure identical to NGFF (with the same consideration you made about padding the last dimension). The difference with your approach would be that the size of a chunk in physical units would be unrelated to its size in pixels.

I expect my approach to suffer from selection by a value column, but not from the number of chunks, which can be kept arbitrary since it is independent of the physical units.
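A minimal sketch of this grid-partitioning variant, assuming a fixed physical cell size and NaN padding of each cell's point list (the array layout and names are illustrative, not an existing NGFF convention):

```python
import numpy as np
import zarr

CELL = 100.0       # assumed physical edge length of one grid cell
MAX_POINTS = 1024  # per-cell capacity; unused slots keep the fill value

def write_grid(xy: np.ndarray, n_cells: int = 8, path: str = "points_grid.zarr"):
    # layout: (cell_y, cell_x, point_slot, coord), one Zarr chunk per grid cell,
    # so a spatial query only touches the chunks overlapping the query region
    z = zarr.open(
        path, mode="w",
        shape=(n_cells, n_cells, MAX_POINTS, 2),
        chunks=(1, 1, MAX_POINTS, 2),
        dtype="float32", fill_value=np.nan,
    )
    iy = np.minimum((xy[:, 1] // CELL).astype(int), n_cells - 1)
    ix = np.minimum((xy[:, 0] // CELL).astype(int), n_cells - 1)
    for cy in range(n_cells):
        for cx in range(n_cells):
            pts = xy[(iy == cy) & (ix == cx)][:MAX_POINTS]
            if len(pts):
                z[cy, cx, : len(pts)] = pts  # remaining slots stay NaN-padded
    return z
```

Storing offsets relative to each cell's origin with a small integer dtype (as in the original proposal) would be a further refinement; floats and NaN padding are used above only to keep the sketch short.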
