Validate Geospatial Queries work in Data Commons environment #226
Suggested initial approach is to leverage TileDB and the Trino connector, as discussed with @joemoorhouse.
- Data format used: Zarr
- Location of test data: Zarr data can be found in the S3 bucket redhat-osc-physical-landing-647521352890, under object keys starting with hazard/hazard.zarr
- TileDB:
- TileDB Trino Connector:
As mentioned on Slack, TileDB is aiming to provide a major update in the March timeframe. That should not slow us down in terms of evaluating connector/tooling readiness, but it should inform resourcing, so that when the update is available, people have the best available version for most of the year.
I guess there are two approaches:
In terms of common formats offering chunked, compressed storage there is also (Cloud Optimised) GeoTIFF and NetCDF, but both Zarr and TileDB split chunks into separate objects, which is desirable. Both seem good choices for us.

Between the two, Zarr is an OGC-endorsed community standard (https://latlong.blog/2022/07/ogc-endorses-zarr-2-0-community-standard.html) and seems more widely used (subjective). But then TileDB already has a Trino connector!

Support-wise, xarray has great support for Zarr, but on the other hand I see that TileDB has also added support (https://github.com/TileDB-Inc/TileDB-CF-Py), so less of a difference there, I think. Both play nicely with Dask/xarray. There are Java libraries for reading/writing Zarr, but wrapping these into a Trino connector would still be a decent amount of work, I would think.
By the way, honourable mention for H3. RiskThinking showed that this works (and looks) great, certainly for moderately-sized data sets, and I see folks working on approaches to optimise for high-resolution data sets:
Need to verify that Trino and other OS-C components will support this.