Represent (coordinate) variables "symbolically" #361

Open
TomAugspurger opened this issue Sep 21, 2023 · 9 comments

Comments

@TomAugspurger
Contributor

TomAugspurger commented Sep 21, 2023

I'm working with a GRIB2 file, and am interested in minimizing the size of the references file. Currently, the largest values in the references come from the base64-encoded coordinates that were inlined in the references:

  'latitude/0': 'base64:AAAAAACAVkBmZmZ...',

This specific variable (and longitude, step, and perhaps time) can be represented "symbolically" (maybe not the right name), with something like a range(90, -90.1, -0.4).
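To make the size difference concrete, here is a minimal sketch. It assumes the latitude chunk is stored as uncompressed little-endian float64 and that the grid is 451 evenly spaced points from 90 to -90. The `symbolic` layout and the `expand` helper are hypothetical, not an existing Kerchunk format.

```python
import base64
import json

import numpy as np

# Assumed grid: 451 evenly spaced latitudes from 90 to -90 (a step of -0.4).
num = 451
lat = np.linspace(90.0, -90.0, num).astype("<f8")

# What an inlined reference looks like: base64 of the raw chunk bytes.
inlined = "base64:" + base64.b64encode(lat.tobytes()).decode("ascii")

# Hypothetical symbolic form: only the parameters needed to rebuild the array.
symbolic = json.dumps({"start": 90.0, "stop": -90.0, "num": num})


def expand(spec):
    """Rebuild the coordinate array from the hypothetical symbolic spec."""
    return np.linspace(spec["start"], spec["stop"], spec["num"])


restored = expand(json.loads(symbolic))
print(len(inlined), len(symbolic))  # thousands of characters vs. a few dozen
```

The saving grows with the length of the coordinate, since the symbolic form stays the same size regardless of `num`.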

My questions:

  1. Does something like this make sense to try?
  2. Does this belong in Zarr instead? It seems more generally useful to compress the size of the data, beyond just what Kerchunk inlines (though I'd still want it in Kerchunk, so that inlined references can benefit from it).

Somewhat annoyingly, there are floating point inaccuracies between what I get from np.arange and what's coming out of cfgrib. But hopefully those can be solved.
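One way to live with those inaccuracies is to compare within a tolerance rather than bit for bit. The sketch below is an assumption about how that check might look: `as_linear_spec` is a hypothetical helper and the `atol` default is arbitrary. Note that an array regenerated from the returned parameters may differ from the source in its last few bits.

```python
import numpy as np


def as_linear_spec(values, atol=1e-9):
    """Return linspace parameters if `values` is evenly spaced within `atol`.

    Returns None when the array is not 1-D, is too short, or is not close
    to a straight line between its first and last value.
    """
    values = np.asarray(values, dtype="f8")
    if values.ndim != 1 or values.size < 2:
        return None
    candidate = np.linspace(values[0], values[-1], values.size)
    if not np.allclose(values, candidate, rtol=0.0, atol=atol):
        return None
    return {
        "start": float(values[0]),
        "stop": float(values[-1]),
        "num": int(values.size),
    }


from_arange = np.arange(90, -90.1, -0.4)
spec = as_linear_spec(from_arange)  # accepted despite rounding drift
irregular = as_linear_spec(np.array([0.0, 1.0, 3.0]))  # not evenly spaced
```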

@martindurant
Member

This is certainly something that kerchunk could do, with effectively our own codec to expand whatever representation into an array at read time. That would be simple for linear coordinates, but GRIB allows for many complex coordinate definitions. I suppose it's possible to extract the parameters of whatever the coordinate system is, but we probably don't want to implement the coordinate generation algorithms ourselves; ideally we would call the appropriate functions in eccodes itself, if we can.
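For the simple linear case, such a codec might look roughly like the sketch below. Everything here is hypothetical: the class, its `codec_id`, and the JSON layout are made up for illustration. It only mirrors the `encode`/`decode` shape that numcodecs codecs use, and a real one would presumably subclass `numcodecs.abc.Codec` and be registered.

```python
import json

import numpy as np


class LinearCoordCodec:
    """Hypothetical codec: stored bytes are JSON parameters, decoded bytes are the array."""

    codec_id = "linear_coord"  # made-up identifier for illustration

    def __init__(self, dtype="<f8"):
        self.dtype = np.dtype(dtype)

    def encode(self, buf):
        # Keep only start/stop/num; refuse arrays that are not evenly spaced.
        arr = np.frombuffer(buf, dtype=self.dtype)
        rebuilt = np.linspace(arr[0], arr[-1], arr.size, dtype=self.dtype)
        if not np.allclose(arr, rebuilt, rtol=0.0, atol=1e-9):
            raise ValueError("array is not evenly spaced; cannot encode")
        spec = {"start": float(arr[0]), "stop": float(arr[-1]), "num": int(arr.size)}
        return json.dumps(spec).encode("utf-8")

    def decode(self, buf, out=None):
        # `out` is accepted for interface compatibility but ignored in this sketch.
        spec = json.loads(bytes(buf).decode("utf-8"))
        arr = np.linspace(spec["start"], spec["stop"], spec["num"], dtype=self.dtype)
        return arr.tobytes()


codec = LinearCoordCodec()
lon = np.linspace(0.0, 359.6, 900)
stored = codec.encode(lon.tobytes())  # a short JSON payload
decoded = np.frombuffer(codec.decode(stored), dtype="<f8")
```

The more complex GRIB grids would need a different `decode`, which is where delegating to eccodes would come in.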

This all connects to the possibility of analytical coordinates in xarray. Perhaps we shouldn't be making arrays even at read time but making xarray indexes.

@dcherian
Contributor

There's a CF convention for that!

We could totally interpret those as a "functional xarray index" too.

@martindurant
Member

> There's a CF convention for that

(plus also the FITS WCS ways to define the same; you won't get these from geo-datasets, but I think they may be more general)

@martindurant
Member

People on this thread might be interested in the intake-stac sprint intake/intake-stac#159

@TomAugspurger
Contributor Author

Thanks @dcherian. IIUC, the coordinate subsampling you linked to is essentially the same as range(0, 10, 1)? We just have two "tie points" (the first and last point) and then linearly interpolate between them?
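As a rough illustration of that reading, the two-tie-point case reduces to linear interpolation over the index positions. This is only a sketch of the 1-D linear case described here, not an implementation of the CF convention; `expand_tie_points` is a hypothetical name.

```python
import numpy as np


def expand_tie_points(first, last, num):
    """Linearly interpolate `num` values between two tie points."""
    positions = np.arange(num)    # index of every full-resolution point
    tie_positions = [0, num - 1]  # the tie points sit at both ends
    return np.interp(positions, tie_positions, [first, last])


coords = expand_tie_points(0.0, 9.0, 10)  # same values as range(0, 10, 1)
```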

Do you know if this decoding is implemented in cf-xarray or xarray.conventions.decode_cf_variable? I didn't see it at https://cf-xarray.readthedocs.io/en/latest/coding.html or in a glance at decode_cf_variable.

@dcherian
Contributor

It has not been implemented.

@dcherian
Contributor

> We just have two "tie points" (the first and last point) and then linearly interpolate between them?

Yes, I think so; that's why it clicked in my head. I don't know what you would do for all the other GRIB coordinate systems.

@martindurant
Member

> We just have two "tie points"

This is also essentially the case in standard TIFF, but of course more complex geometries are possible in practice, and GRIB has many models.
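For comparison, a sketch of the simple case, assuming GeoTIFF-style metadata (a single `ModelTiepointTag` plus a `ModelPixelScaleTag`) on a north-up raster. The function name and the tag values below are made up for illustration, and the result is the corner coordinate of each pixel column and row.

```python
import numpy as np


def geotiff_axis_coords(tiepoint, pixel_scale, width, height):
    """Corner coordinates of each pixel column/row for a north-up raster.

    tiepoint: (i, j, k, x, y, z), a raster position tied to a model position
    pixel_scale: (scale_x, scale_y, scale_z), all non-negative
    """
    i, j, _, x0, y0, _ = tiepoint
    scale_x, scale_y, _ = pixel_scale
    x = x0 + (np.arange(width) - i) * scale_x
    y = y0 - (np.arange(height) - j) * scale_y  # y decreases as rows increase
    return x, y


# Illustrative values: a global half-degree grid tied at its upper-left corner.
x, y = geotiff_axis_coords((0, 0, 0, -180.0, 90.0, 0.0), (0.5, 0.5, 0.0), 720, 360)
```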

@TomNicholas

> Does this belong in Zarr instead?

This seems somewhat similar to my FunctionalStore suggestion in zarr-developers/VirtualiZarr#238 (comment).
