Geometry in daft? #3017
Replies: 4 comments
-
Definitely agree that to be truly multimodal we need geospatial support! The fastest way to do this would be to bring in functionality via a third-party library like GDAL or JTS. A slightly less ergonomic approach, which users should already be able to take, is to pull in these libraries and use them in user-defined functions. In that case, geospatial workloads would work, but would not be particularly fast or stable (lots of potential memory blowups unless the system is aware of the workload and plans accordingly). What I think would be a competitive advantage in terms of performance and stability, but which would happen much further down the road, is native geospatial support: native datatypes and internal representations, native kernels for geospatial expressions, and native joins and indices on top of geospatial data. From experience (I was one of two engineers at Databricks working on geospatial), it's a whole can of worms that I don't think we can dedicate the effort to for the time being.
-
@desmondcheongzx, thanks! I totally agree that native support for geometry types is the ultimate goal, but it is indeed a much heavier lift. I thought that GeoArrow might eventually be part of the solution there, given Daft's integration with Arrow. However, even then I wonder if generic geometry type support might still be useful: its advantage is flexibility in representing multiple geometries, and implementation-wise it will probably have more coverage in algorithms for a while. Until such time as there is full native support, would you see this type as a stop-gap? It seems more ergonomic, reusable, and probably more performant than Python UDFs, with simpler integration with SQL. The way I've done it (demonstrated above) is relatively low friction, by defining a
-
Oh!! I somehow misread your post and didn't realize you already had a working version up and running. This is awesome @amitschang! Backing Any interest in opening up a PR and collaborating further?
Definitely agree with this. I also wouldn't be opposed to further integration with other external geospatial libraries. For example, we already have a user using H3 indices, and while a custom H3 implementation > an external H3 library, it's also true that an external H3 library > H3 in UDFs.
-
@desmondcheongzx, yeah, that's right! The working version I have is as you describe: basically the Geometry type is a "validated" WKB. To me this represents step 1: reading geometries from text/WKB, plus some operations. Then later we can build out more operations and expanded I/O, e.g. adding GeoJSON, shapefiles, etc. That is probably where more spatial library support could come in as well. I'd definitely be into making a PR; I wanted to make sure that this was even a reasonable approach and fits within the broader vision. Before a PR I should do some cleanup in terms of naming etc. If you are interested in working together on this, it would be great to get feedback 😄 BTW, here is where I have my draft version: https://github.com/amitschang/Daft/tree/geotype
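To make the "validated WKB" idea concrete, here is a minimal sketch in plain Python of what such a check could look like. It is an illustration only, not the code in the branch: `validate_wkb` is a hypothetical helper, and it assumes plain 2D OGC WKB limited to Point, LineString and Polygon (no Z/M or EWKB variants).

```python
import struct

# OGC WKB geometry type codes for the 2D cases handled in this sketch.
WKB_POINT, WKB_LINESTRING, WKB_POLYGON = 1, 2, 3


def validate_wkb(buf: bytes) -> bytes:
    """Return buf unchanged if it is well-formed 2D WKB, else raise ValueError.

    Hypothetical helper for illustration; not part of Daft.
    """
    if len(buf) < 5:
        raise ValueError("WKB too short for a header")
    if buf[0] not in (0, 1):
        raise ValueError(f"invalid byte-order flag: {buf[0]}")
    # Byte 0 selects the byte order: 1 is little-endian, 0 is big-endian.
    endian = "<" if buf[0] == 1 else ">"
    (geom_type,) = struct.unpack_from(endian + "I", buf, 1)

    def read_count(offset):
        # Read a uint32 element count and return it with the new offset.
        if offset + 4 > len(buf):
            raise ValueError("truncated element count")
        (count,) = struct.unpack_from(endian + "I", buf, offset)
        return count, offset + 4

    def skip_points(count, offset):
        # Each 2D point is two float64 values, i.e. 16 bytes.
        end = offset + 16 * count
        if end > len(buf):
            raise ValueError("truncated coordinates")
        return end

    offset = 5
    if geom_type == WKB_POINT:
        offset = skip_points(1, offset)
    elif geom_type == WKB_LINESTRING:
        count, offset = read_count(offset)
        offset = skip_points(count, offset)
    elif geom_type == WKB_POLYGON:
        rings, offset = read_count(offset)
        for _ in range(rings):
            count, offset = read_count(offset)
            offset = skip_points(count, offset)
    else:
        raise ValueError(f"unsupported geometry type: {geom_type}")
    if offset != len(buf):
        raise ValueError("trailing bytes after geometry")
    return buf


# Usage: a little-endian Point(1.0, 2.0) passes through unchanged.
point = struct.pack("<BIdd", 1, WKB_POINT, 1.0, 2.0)
validated = validate_wkb(point)
```

The point of validating once, at construction time, is that later operations can assume the bytes are well-formed and skip repeated checks.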
-
Hey all! As someone who's occasionally had a use case for geometry in dataframes (geospark/geopandas), it seems to me that Geometry types are also a good fit for Daft, given its existing support for multimodal data and its expression API in Rust enabling efficient operations. A lot of projects seem to have separate packages or extensions for geometry support, but I wonder if it would be nice to have "out-of-the-box" in Daft.
As it turns out, there are decent-looking crates available for geometry (with geographic functions as well), both in pure Rust and as bindings for GEOS and PROJ (at https://github.com/georust). Since I'm also interested in learning more about the internals of Daft, and in contributing, I thought I'd go about adding a geometry type and enough operations to showcase how it works, and learn something along the way.
Is this of interest?
Here is a real session as an example, starting with a Parquet file with binary-encoded geometry in a specific column (adapted from data from https://github.com/geopandas/geodatasets):
The way this works is similar (I think) to how DuckDB does its non-native (simple features) geometry (https://github.com/duckdb/duckdb_spatial/blob/main/docs/internals.md), i.e. storing the well-known binary (WKB) in the physical backing and doing serde on each operation. This comes with a cost, but presumably this kind of geometry is flexible and would sit alongside native types (which would support only one geometry type per column, instead of mixed types like this does).
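As a rough illustration of "WKB in the physical backing, serde on operation", here is a small plain-Python sketch. It is not the Rust code from my branch and not DuckDB's implementation; `decode`, `encode` and `translate` are hypothetical names, a Python list stands in for the binary column, and only 2D Point and LineString are handled.

```python
import struct


def decode(buf: bytes):
    """Deserialize 2D Point/LineString WKB into (kind, [(x, y), ...])."""
    endian = "<" if buf[0] == 1 else ">"
    (geom_type,) = struct.unpack_from(endian + "I", buf, 1)
    if geom_type == 1:
        return "Point", [struct.unpack_from(endian + "dd", buf, 5)]
    if geom_type == 2:
        (n,) = struct.unpack_from(endian + "I", buf, 5)
        flat = struct.unpack_from(f"{endian}{2 * n}d", buf, 9)
        return "LineString", list(zip(flat[0::2], flat[1::2]))
    raise ValueError(f"unsupported geometry type: {geom_type}")


def encode(kind: str, coords) -> bytes:
    """Serialize back to little-endian WKB."""
    if kind == "Point":
        return struct.pack("<BIdd", 1, 1, *coords[0])
    if kind == "LineString":
        flat = [value for xy in coords for value in xy]
        return struct.pack(f"<BII{len(flat)}d", 1, 2, len(coords), *flat)
    raise ValueError(f"unsupported geometry kind: {kind}")


def translate(buf: bytes, dx: float, dy: float) -> bytes:
    """One operation: deserialize, shift every vertex, serialize again."""
    kind, coords = decode(buf)
    moved = [(x + dx, y + dy) for x, y in coords]
    return encode(kind, moved)


# A single "column" can mix geometry types, since every row is just bytes.
column = [
    encode("Point", [(1.0, 2.0)]),
    encode("LineString", [(0.0, 0.0), (3.0, 4.0)]),
]
shifted = [translate(wkb, 10.0, 0.0) for wkb in column]
```

Every call to `translate` pays for a full decode and encode, which is the cost mentioned above. The flexibility comes from the column only ever seeing opaque bytes, so rows of different geometry types can sit side by side.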
Sorry for the long post! Thanks for looking 😄