Validate Geospatial Queries work in Data Commons environment #226

Open
HeatherAck opened this issue Oct 31, 2022 · 4 comments

Comments

@HeatherAck
Contributor

Need to verify that Trino and other OS-C components will support geospatial queries.

@caldeirav
Contributor

The suggested initial approach is to look at leveraging TileDB and its Trino connector, as discussed with @joemoorhouse.

Data format used: Zarr
https://zarr.readthedocs.io/en/stable/getting_started.html
https://pangeo-data.github.io/pangeo-cmip6-cloud/overview.html

Location of test data: Zarr data can be found in S3 bucket redhat-osc-physical-landing-647521352890, object keys starting hazard/hazard.zarr
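
A minimal sketch of reading that test data from Python, assuming the `zarr`, `fsspec` and `s3fs` packages are installed and that AWS credentials with read access to the bucket are configured. Only the bucket name and key prefix come from the comment above; the helper names `zarr_root_url` and `open_hazard_group` are hypothetical.

```python
from typing import Optional

BUCKET = "redhat-osc-physical-landing-647521352890"  # bucket named in this issue
ROOT_KEY = "hazard/hazard.zarr"                      # key prefix named in this issue


def zarr_root_url(bucket: str = BUCKET, key: str = ROOT_KEY) -> str:
    """Build the s3:// URL of the Zarr root (hypothetical helper)."""
    return f"s3://{bucket}/{key.strip('/')}"


def open_hazard_group(url: Optional[str] = None):
    """Open the Zarr hierarchy read-only.

    Assumption: zarr, fsspec and s3fs are installed and credentials are set up.
    Imports are deferred so this module loads even without those packages.
    """
    import fsspec
    import zarr

    # fsspec.get_mapper exposes the S3 prefix as a key/value mapping of objects.
    store = fsspec.get_mapper(url or zarr_root_url())
    return zarr.open(store, mode="r")
```

Calling `open_hazard_group()` needs network access and credentials; `zarr_root_url()` on its own just returns the URL string.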

TileDB:
https://github.com/TileDB-Inc/TileDB
@erikerlandson, is the way forward to build the TileDB libraries into a PhysRisk developer image with all the required tooling?

TileDB Trino Connector:
https://github.com/TileDB-Inc/TileDB-Trino

@MichaelTiemannOSC
Contributor

As mentioned on Slack, TileDB is aiming to provide a major update in the March timeframe. That should not slow us down in terms of evaluating connector/tooling readiness, but should inform resourcing so that when it is available, people have the best available version for most of the year.

@joemoorhouse

I guess there are two approaches:

  1. Create a Trino connector for Zarr
  2. Try out the Trino connector for TileDB and check that TileDB can be made to fit our use case (see "Allow arbitrary multi-range subarray slicing", TileDB-Inc/TileDB#3076)
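
To make the second approach concrete, here is a rough sketch of what a bounding-box query through Trino might look like. It assumes the connector is registered as catalog `tiledb` with schema `tiledb` and that arrays are addressed by quoted URI, as in the TileDB-Trino README examples. The dimension names `latitude` and `longitude`, the array URI and both function names are hypothetical, and `run_query` additionally assumes the `trino` Python client is installed.

```python
def tiledb_slice_query(array_uri, lat_range, lon_range,
                       lat_dim="latitude", lon_dim="longitude"):
    """Build a SQL range query for a TileDB array exposed through Trino."""
    if '"' in array_uri:
        raise ValueError("array URI must not contain double quotes")
    if not (lat_dim.isidentifier() and lon_dim.isidentifier()):
        raise ValueError("dimension names must be plain identifiers")
    # Coerce bounds to float so only numeric literals reach the SQL text.
    lat_min, lat_max = (float(v) for v in lat_range)
    lon_min, lon_max = (float(v) for v in lon_range)
    return (
        f'SELECT * FROM tiledb.tiledb."{array_uri}" '
        f"WHERE {lat_dim} BETWEEN {lat_min!r} AND {lat_max!r} "
        f"AND {lon_dim} BETWEEN {lon_min!r} AND {lon_max!r}"
    )


def run_query(sql, host, port=8080, user="physrisk"):
    """Execute the query; host, port and user are placeholders (assumptions)."""
    import trino  # deferred: only needed when actually querying

    conn = trino.dbapi.connect(host=host, port=port, user=user,
                               catalog="tiledb", schema="tiledb")
    cur = conn.cursor()
    cur.execute(sql)
    return cur.fetchall()
```

Whether the range predicates are pushed down to TileDB as a subarray read, rather than filtered after a full scan, is exactly what would need checking against our use case.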

In terms of common formats offering chunked, compressed storage there is also (Cloud Optimised) GeoTIFF and NetCDF - but both Zarr and TileDB split chunks into separate objects, which is desirable. Both seem good choices for us. Between the two, Zarr is an endorsed(?) community standard (https://latlong.blog/2022/07/ogc-endorses-zarr-2-0-community-standard.html) and seems more widely used (subjective). But then TileDB already has a Trino connector!
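
As an illustration of why one object per chunk is desirable, the sketch below lists which chunk objects a rectangular window would touch. It assumes the Zarr v2 layout, where chunk keys are the chunk grid indices joined by `.`, and non-negative half-open index ranges; the function name is hypothetical.

```python
from itertools import product


def chunk_keys_for_slice(slices, chunks, separator="."):
    """Return the Zarr v2 chunk keys overlapped by half-open (start, stop) ranges.

    slices: one (start, stop) pair per dimension, in array index units.
    chunks: chunk size per dimension.
    """
    per_dim = []
    for (start, stop), size in zip(slices, chunks):
        if start < 0 or stop <= start:
            raise ValueError("each range must satisfy 0 <= start < stop")
        first = start // size        # chunk holding the first index
        last = (stop - 1) // size    # chunk holding the last index
        per_dim.append(range(first, last + 1))
    return [separator.join(map(str, idx)) for idx in product(*per_dim)]
```

For example, with 1000 x 1000 chunks, `chunk_keys_for_slice([(900, 1100), (0, 500)], (1000, 1000))` gives `["0.0", "1.0"]`, so only two objects need to be fetched from S3 regardless of the total array size.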

Support-wise, xarray has great support for Zarr, but on the other hand I see that TileDB has also added support https://github.com/TileDB-Inc/TileDB-CF-Py, so less of a difference there I think. They both play nicely with Dask/xarray.

There are Java libraries for Zarr reading/writing, but I would think it is still a decent amount of work to wrap these into a Trino connector.
https://github.com/zarr-developers/zarr_implementations

@joemoorhouse

By the way, honourable mention for H3. RiskThinking showed that this works (and looks) great, certainly for moderately-sized data sets and I see folks working on approaches to optimise for high-resolution data sets:
https://medium.com/foursquare-direct/hex-tiles-building-a-new-data-tiling-system-with-h3-61eb33fed4cb
I personally prefer regular rectangular-grid rasters for the rather prosaic reason that our inputs tend to be in that form and we don't have to re-interpolate (no additional approximation needed).

@maknop removed their assignment on May 25, 2023

6 participants