Validate Geospatial Queries work in Data Commons environment #226

HeatherAck · 2022-10-31T17:59:36Z

Need to verify that Trino and other OS-C components will support geospatial queries.

caldeirav · 2023-02-14T02:11:04Z

The suggested initial approach is to look at leveraging TileDB and its Trino connector, as discussed with @joemoorhouse.

Data format used: Zarr
https://zarr.readthedocs.io/en/stable/getting_started.html
https://pangeo-data.github.io/pangeo-cmip6-cloud/overview.html

Location of test data: Zarr data can be found in S3 bucket redhat-osc-physical-landing-647521352890, object keys starting hazard/hazard.zarr

TileDB:
https://github.com/TileDB-Inc/TileDB
@erikerlandson, is the way forward to build the TileDB libs into a PhysRisk developer image with all the required tooling?

TileDB Trino Connector:
https://github.com/TileDB-Inc/TileDB-Trino

MichaelTiemannOSC · 2023-02-14T09:21:45Z

As mentioned on Slack, TileDB is aiming to provide a major update in the March timeframe. That should not slow us down in terms of evaluating connector/tooling readiness, but should inform resourcing so that when it is available, people have the best available version for most of the year.

joemoorhouse · 2023-02-14T15:50:11Z

I guess there are two approaches:

1. Create a Trino connector for Zarr.
2. Try out the Trino connector for TileDB and check that TileDB can be made to fit our use case (see "Allow arbitrary multi-range subarray slicing", TileDB-Inc/TileDB#3076).
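
To make the multi-range slicing concern concrete, here is a minimal sketch of how a set of scattered grid indices collapses into a handful of disjoint ranges, which is the kind of request a multi-range subarray query has to serve. The helper name `coalesce_ranges`, the sample indices, and the choice of inclusive ranges are all assumptions for illustration; none of this is taken from the TileDB API or from the linked issue.

```python
def coalesce_ranges(indices):
    """Merge integer indices into sorted, inclusive (start, end) ranges."""
    ranges = []
    for i in sorted(set(indices)):
        if ranges and i == ranges[-1][1] + 1:
            # i directly follows the last range, so extend that range
            ranges[-1] = (ranges[-1][0], i)
        else:
            # gap before i, so start a new single-element range
            ranges.append((i, i))
    return ranges


# Hypothetical row indices of the grid cells we want to read
rows = [12, 13, 14, 40, 41, 97]
print(coalesce_ranges(rows))  # [(12, 14), (40, 41), (97, 97)]
```

Reading only these ranges, rather than the bounding box from row 12 to row 97, is what keeps a sparse lookup cheap on a large array.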

In terms of common formats offering chunked, compressed storage there is also (Cloud Optimised) GeoTIFF and NetCDF - but both Zarr and TileDB split chunks into separate objects, which is desirable. Both seem good choices for us. Between the two, Zarr is an endorsed(?) community standard (https://latlong.blog/2022/07/ogc-endorses-zarr-2-0-community-standard.html) and seems more widely used (subjective). But then TileDB already has a Trino connector!
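
As a rough illustration of why one object per chunk is desirable, the sketch below works out which chunk objects a rectangular index window touches. It assumes the Zarr v2 default of naming chunks by their grid indices joined with "."; the function name `chunk_keys_for_window` and the 1000 x 1000 chunk shape are hypothetical and not taken from the test data.

```python
from itertools import product


def chunk_keys_for_window(window, chunk_shape, separator="."):
    """Return the chunk keys touched by a half-open index window.

    window: one (start, stop) pair per dimension, stop exclusive.
    chunk_shape: chunk length along each dimension.
    """
    per_dim = []
    for (start, stop), size in zip(window, chunk_shape):
        if stop <= start:
            return []  # an empty window touches no chunks
        first = start // size        # chunk holding the first index
        last = (stop - 1) // size    # chunk holding the last index
        per_dim.append(range(first, last + 1))
    return [separator.join(str(i) for i in idx) for idx in product(*per_dim)]


# Hypothetical 2-D raster chunked 1000 x 1000:
# the window covers rows 950..1049 and columns 0..499
print(chunk_keys_for_window([(950, 1050), (0, 500)], (1000, 1000)))
# ['0.0', '1.0']
```

Only those two objects need to be fetched from the bucket, however large the full array is.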

Support-wise, xarray has great support for Zarr, but on the other hand I see that TileDB has also added support (https://github.com/TileDB-Inc/TileDB-CF-Py), so there is less of a difference there, I think. They both play nicely with Dask/xarray.

There are Java libraries for Zarr reading/writing, but still a decent amount of work to wrap these into a Trino connector I would think.
https://github.com/zarr-developers/zarr_implementations

joemoorhouse · 2023-02-14T16:02:20Z

By the way, honourable mention for H3. RiskThinking showed that this works (and looks) great, certainly for moderately-sized data sets and I see folks working on approaches to optimise for high-resolution data sets:
https://medium.com/foursquare-direct/hex-tiles-building-a-new-data-tiling-system-with-h3-61eb33fed4cb
I personally prefer regular rectangular-grid rasters for the rather prosaic reason that our inputs tend to be in that form and we don't have to re-interpolate (no additional approximation needed).
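
To illustrate the "no re-interpolation" point: on a regular rectangular grid, a point lookup is plain index arithmetic against the grid origin and cell size. This is a minimal sketch assuming a north-up grid in degrees; the function name `grid_index`, the 0.5-degree resolution and the global extent are hypothetical and not taken from our hazard data.

```python
import math


def grid_index(lat, lon, north, west, cell_deg, n_rows, n_cols):
    """Map a lat/lon point to (row, col) on a north-up regular grid.

    north, west: coordinates of the outer top-left corner of the grid.
    cell_deg: cell size in degrees, the same in both directions.
    """
    row = math.floor((north - lat) / cell_deg)  # rows count down from the north edge
    col = math.floor((lon - west) / cell_deg)   # columns count east from the west edge
    if not (0 <= row < n_rows and 0 <= col < n_cols):
        raise ValueError("point lies outside the grid")
    return row, col


# Hypothetical global 0.5-degree grid: 360 rows x 720 columns
print(grid_index(51.5, -0.1, north=90.0, west=-180.0,
                 cell_deg=0.5, n_rows=360, n_cols=720))
# (77, 359)
```

The source values are read exactly as stored, whereas moving the same data onto H3 cells would need a resampling step first.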

HeatherAck assigned erikerlandson Nov 14, 2022

caldeirav assigned maknop Feb 10, 2023

caldeirav added geospatial data exchange labels Feb 14, 2023

maknop removed their assignment May 25, 2023
