
idea: point on-disk representation using OME-NGFF #789

Open
keller-mark opened this issue Nov 14, 2024 · 4 comments

@keller-mark

keller-mark commented Nov 14, 2024

I made this diagram to try to illustrate the point storage idea I was attempting to explain on Wednesday at the Basel Hackathon.

The idea would be to use OME-NGFF for spatially-arranged point storage.

Diagram corresponding to a single chunk in space 0.0.0 (z,y,x)


Notes

One assumption here is that the Zarr array chunks will not be reshaped frequently. This would enable storing the X-Y coordinates as relative offsets from the chunk edges using a small dtype (e.g., uint8 for 256x256 chunk shapes), making them very efficient to load (e.g., over a network) for a large image. Another assumption is that Zarr's built-in compression will negate the on-disk impact of the null-value padding when chunks are not completely filled (and that there could be a mechanism to annotate, somewhere, how far into the chunk the values are non-null).
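For concreteness, a minimal sketch of the relative-offset idea (the function name and layout are made up for illustration, not part of any existing spatialdata/NGFF API), assuming 256x256 (y, x) chunks so that the offsets fit into uint8:

```python
import numpy as np

CHUNK = 256  # assumed chunk edge length in pixels

def to_chunk_relative(x: np.ndarray, y: np.ndarray):
    """Split absolute integer pixel coordinates into chunk indices plus
    small-dtype offsets from the chunk edges (hypothetical helper)."""
    chunk_ix = (x // CHUNK).astype(np.int32)
    chunk_iy = (y // CHUNK).astype(np.int32)
    off_x = (x % CHUNK).astype(np.uint8)  # 0..255, so uint8 suffices
    off_y = (y % CHUNK).astype(np.uint8)
    return chunk_iy, chunk_ix, off_y, off_x

# e.g., the two points from the diagram, both landing in chunk (0, 0)
chunk_iy, chunk_ix, off_y, off_x = to_chunk_relative(np.array([1, 4]), np.array([5, 4]))
```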

With a MERFISH dataset, one idea would be for the point NGFF image to use X and Y dimensions at least as large as those of the underlying microscope image from which the points originated (i.e., the image prior to point detection), so that point x/y coordinates can be stored as integers without loss of information; however, this does not necessarily need to be the case.

This is currently just an idea and could be benchmarked against conceptually simpler formats to check whether there is a performance benefit to justify it. I am not sure whether there would be drawbacks or benefits for operations such as querying.

@timtreis
Member

timtreis commented Dec 9, 2024

> One assumption here is that the Zarr array chunks will not be reshaped frequently.

Could you elaborate on that?

@keller-mark
Author

Part of what I proposed is to store the x- and y-coordinates of points relative to the edge of the Zarr chunk -- as opposed to "absolute" coordinates (i.e., relative to the edge of the full field-of-view of the image) -- in order to reduce storage requirements by allowing smaller numpy dtypes to be used for the values.

This would mean the Zarr array values are dependent on the chunk shape of the array. Using a tool like rechunker to modify the chunk shape of the array would therefore need to be accompanied by changes to all array values (in addition to their reorganization on-disk).
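To make this concrete, a hedged sketch of the decoding direction (illustrative names and a fixed chunk edge): absolute coordinates are reconstructed from the chunk index plus the stored offset, so decoding with a different chunk edge would give different positions.

```python
import numpy as np

def to_absolute(chunk_iy, chunk_ix, off_y, off_x, chunk_edge=256):
    # absolute coordinate = chunk origin + stored offset, where the chunk
    # origin is derived from the chunk index and the chunk edge length
    y = chunk_iy.astype(np.int64) * chunk_edge + off_y
    x = chunk_ix.astype(np.int64) * chunk_edge + off_x
    return y, x

# Decoding the same offsets with chunk_edge=128 would yield different absolute
# coordinates, which is why rechunking must rewrite the values, not just move them.
```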


Note that it would also be possible to discard the relative-coordinates part of the proposal yet still use the other aspects (which would alleviate concerns about array values being dependent on chunk shape, if that would be a blocker).

@LucaMarconato
Member

Thanks @keller-mark for sharing, very clear explanation! (just a minor thing, I think there is a typo on the scatterplot, it should be P1(x:1, y:5, gene 0) and P2(x:4, y:4, gene 1)).

I like:

  1. the idea of being able to use a small dtype thanks to the relative positioning within the chunk;
  2. the separation of the various "columns" into different channels, which means that one would have the benefit of a columnar data format.

Three aspects that I find a bit less ideal are that:

  1. to use a small dtype one would need a small chunk size (but maybe this is not a problem now that sharding is available);
  2. one would not be able to represent more than 256*256 points in a 256x256 chunk (but in practice I don't think one would encounter this limitation);
  3. one would aim at having integer coordinates, but I think this can indeed be worked around (see the quantization sketch after this list). Also, I think that TileDB actually also assumes integer coordinates; IIRC a point cloud is interpreted as an extremely sparse matrix (1 matrix entry = 1 point).
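Regarding point 3, a hypothetical workaround sketch: quantize float coordinates onto a finer fixed grid and record the scale factor in metadata, so decoding is a single division. The subpixel factor below is an arbitrary illustration, not a proposed standard.

```python
import numpy as np

SUBPIXEL = 16  # assumed: store coordinates at 1/16-pixel resolution

def quantize(coords: np.ndarray) -> np.ndarray:
    # map float pixel coordinates to integers on the finer grid
    return np.round(coords * SUBPIXEL).astype(np.uint16)

def dequantize(q: np.ndarray) -> np.ndarray:
    return q.astype(np.float64) / SUBPIXEL

xy = np.array([[1.23, 5.07], [4.5, 4.0]])
assert np.allclose(dequantize(quantize(xy)), xy, atol=0.5 / SUBPIXEL)
```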

Practically, we could benchmark IO, spatial queries and selection by a value column (e.g. gene id), against alternatives such as:

  1. geoparquet with spatial partitioning and geoarrow-encoded geometries (rough sketch below)
  2. TileDB?
  3. something I will explain in my next message.

I think that the main limitation could be selection by gene id, while I think that spatial queries and IO would be fast (unless there is a problem with a large number of files that persists even after sharding).
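A rough sketch of what the geoparquet baseline in point 1 could look like, assuming geopandas >= 1.0 for the GeoArrow geometry encoding; the hand-rolled grid-cell column is only a stand-in for proper spatial partitioning, and all names and paths are illustrative:

```python
from pathlib import Path

import geopandas as gpd
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.uniform(0, 4096, n)
y = rng.uniform(0, 4096, n)
gdf = gpd.GeoDataFrame(
    {"gene_id": rng.integers(0, 500, n)},
    geometry=gpd.points_from_xy(x, y),
)

# naive spatial partitioning: one file per 512x512 grid cell
gdf["cell"] = (y // 512).astype(int) * 8 + (x // 512).astype(int)
Path("points_geoparquet").mkdir(exist_ok=True)
for cell, part in gdf.groupby("cell"):
    part.drop(columns="cell").to_parquet(
        f"points_geoparquet/cell={cell}.parquet",
        geometry_encoding="geoarrow",  # assumption: available in geopandas >= 1.0
    )
```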

@LucaMarconato
Member

An alternative approach, also Zarr-based, would be to spatially partition the points into a grid of chunks. Then, for each chunk, store the list of points falling in that chunk (we could also use a small dtype for coordinates relative to the chunk position). The resulting storage would be a hierarchical folder structure identical to NGFF (with the same consideration you made about padding the last dimension). The difference with your approach would be that the size of a chunk in physical units would be unrelated to its size in pixels.

I expect my approach to suffer from selection by a value column, but not from the number of chunks, which can be kept arbitrary since it is independent of the physical units.
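A minimal sketch of this grid-partitioning variant, assuming a fixed physical cell size and NaN padding of each cell's point list (the array layout and names are illustrative, not an existing NGFF convention):

```python
import numpy as np
import zarr

CELL = 100.0       # assumed physical edge length of one grid cell
MAX_POINTS = 1024  # per-cell capacity; unused slots keep the fill value

def write_grid(xy: np.ndarray, n_cells: int = 8, path: str = "points_grid.zarr"):
    # layout: (cell_y, cell_x, point_slot, coord), one Zarr chunk per grid cell,
    # so a spatial query only touches the chunks overlapping the query region
    z = zarr.open(
        path, mode="w",
        shape=(n_cells, n_cells, MAX_POINTS, 2),
        chunks=(1, 1, MAX_POINTS, 2),
        dtype="float32", fill_value=np.nan,
    )
    iy = np.minimum((xy[:, 1] // CELL).astype(int), n_cells - 1)
    ix = np.minimum((xy[:, 0] // CELL).astype(int), n_cells - 1)
    for cy in range(n_cells):
        for cx in range(n_cells):
            pts = xy[(iy == cy) & (ix == cx)][:MAX_POINTS]
            if len(pts):
                z[cy, cx, : len(pts)] = pts  # remaining slots stay NaN-padded
    return z
```

Storing offsets relative to each cell's origin with a small integer dtype (as in the original proposal) would be a further refinement; floats and NaN padding are used above only to keep the sketch short.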
