feat(datasets): Replace geopandas.GeoJSONDataset with geopandas.GenericDataset #812
If the implementation is okay, I can add the feather dataset in the same PR or open a new one.
Thanks for this contribution @harm-matthias-harms! One question: do you think we could have some sort of geopandas.GenericDataset, like we do for pandas and Polars? (Just so that we avoid proliferation of datasets)
To be honest, that would be my preferred solution as well. The current approach feels like a lot of duplicated boilerplate. I really like the Polars dataset and will try to modify the GeoJSON dataset tomorrow to make it work more generally.
def __init__(  # noqa: PLR0913
    self,
    *,
    filepath: str,
    file_format: str = "file",
I think it's fine to set a default value, especially because this ensures backward compatibility
We don't need this anymore because of backward compatibility, but it's more or less the generic method of geopandas, and it saves some boilerplate in most cases.
Thanks a lot @harm-matthias-harms! This looks like a step in the right direction.
My worry is that GeoJSONDataset is no longer GeoJSON-exclusive. So I'm wondering if we should more or less take the code you created here, but put it in a geopandas.GenericGeoDataset, and leave geopandas.GeoJSONDataset alone.
@astrojuanlu I see two options:
Please tell me how you would like to proceed here.
Hah, true... I see this wasn't discussed in kedro-org/kedro#190 originally. I'm actually in favour of having a new dataset with a more generic name and deprecating the old one, but I'm wondering if the churn is worth the advantages. Any thoughts @merelcht @noklam?
We can just remove the old one and add a new one with a generic name in a breaking release. I don't think it's worth the effort of doing a non-breaking release with a deprecation warning first and then removing the old one in the next breaking release just for this one dataset. We could do a TSC vote if we feel the nature of the dataset changes too much.
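For comparison, the two-step route this comment argues against (warn first, remove later) might look roughly like the sketch below. The class bodies are hypothetical stand-ins, not the kedro-datasets code; only the keyword-only filepath and file_format arguments mirror the diff fragment quoted earlier in the thread.

```python
import warnings


class GenericDataset:
    """Hypothetical stand-in for the new, generically named dataset."""

    def __init__(self, *, filepath: str, file_format: str = "file"):
        self.filepath = filepath
        self.file_format = file_format


class GeoJSONDataset(GenericDataset):
    """Old name kept for one release cycle; warns on every instantiation."""

    def __init__(self, *args, **kwargs):
        warnings.warn(
            "GeoJSONDataset is deprecated; use GenericDataset instead.",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller's line
        )
        super().__init__(*args, **kwargs)
```

Removing the old class outright in a breaking release, as proposed here, avoids maintaining this shim for a single dataset.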
Let's proceed that way then 👍🏼 @harm-matthias-harms are you willing to update the PR accordingly? Namely:
@astrojuanlu I updated the PR. Since I merged main, the tests have been failing. It looks like an fsspec problem. I don't know much about fsspec; maybe you have a good idea of how to fix this. Last successful run: https://github.com/kedro-org/kedro-plugins/actions/runs/10593320929/job/29354563096?pr=812
I do not know. Does geopandas handle remote filepaths? Like, if we do will it work?
@astrojuanlu This was also a thought I had, but that currently makes more tests fail, and it may be better to have all datasets similar. I don't know why the CI fails.
Thanks for your patience! Let us know how it goes.
It has to do with the docstring part and seems to affect Python 3.10 specifically. I'm able to replicate the error locally, but haven't found what's wrong yet.
@astrojuanlu I figured out the problem... I also bumped geopandas to v1. They use
Thank you for the contribution, @harm-matthias-harms!
Left a few nit comments, but overall it looks good!
Signed-off-by: Harm Matthias Harms <[email protected]>
Co-authored-by: ElenaKhaustova <[email protected]>
LGTM! Thanks @harm-matthias-harms ⭐
Thank you for addressing the comments!
@astrojuanlu, do you wanna add anything before we merge?
Signed-off-by: Ankita Katiyar <[email protected]>
Description
Geopandas supports reading parquet and feather files since version 0.8.0. These offer performance improvements over classical file types, such as .geojson or .shp.zip (supported by geopandas.GeoJSONDataset). We have used a private implementation for some time, but want to contribute them to kedro-datasets. This PR closes #196.
Development notes
Extended the implementation of geopandas.GeoJSONDataset with an optional file_format parameter which is needed to use custom read_* and to_* methods. Replaced the geopandas.GeoJSONDataset with geopandas.GenericDataset.
Checklist
RELEASE.md file
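The file_format mechanism in the development notes can be sketched as a name lookup: the format string selects a read_<format> function for loading and a to_<format> method for saving. This is a minimal illustration, not the merged implementation. The helpers resolve_reader and resolve_writer and the stand-in objects are hypothetical, and stand-ins are used so the snippet runs without geopandas installed. With the real library, one would pass the geopandas module and a GeoDataFrame, which provide read_file/read_parquet/read_feather and to_file/to_parquet/to_feather.

```python
from types import SimpleNamespace


def resolve_reader(module, file_format: str = "file"):
    """Return module.read_<file_format>, or raise if no such reader exists."""
    name = f"read_{file_format}"
    reader = getattr(module, name, None)
    if not callable(reader):
        raise ValueError(f"Unsupported file_format {file_format!r}: no {name}()")
    return reader


def resolve_writer(frame, file_format: str = "file"):
    """Return frame.to_<file_format>, or raise if no such writer exists."""
    name = f"to_{file_format}"
    writer = getattr(frame, name, None)
    if not callable(writer):
        raise ValueError(f"Unsupported file_format {file_format!r}: no {name}()")
    return writer


# Stand-ins for the geopandas module and a GeoDataFrame (illustration only).
fake_gpd = SimpleNamespace(
    read_file=lambda path, **kwargs: ("read_file", path),
    read_parquet=lambda path, **kwargs: ("read_parquet", path),
)
fake_frame = SimpleNamespace(
    to_file=lambda path, **kwargs: ("to_file", path),
    to_parquet=lambda path, **kwargs: ("to_parquet", path),
)

# The default "file" resolves to the generic read_file / to_file pair.
default_result = resolve_reader(fake_gpd)("regions.geojson")
parquet_result = resolve_writer(fake_frame, "parquet")("regions.parquet")
```

Keeping "file" as the default means existing GeoJSON-style catalog entries keep working without naming a format, which matches the backward-compatibility point raised in the review.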