Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(datasets): Replace geopandas.GeoJSONDataset with geopandas.GenericDataset #812

Conversation

harm-matthias-harms
Copy link
Contributor

@harm-matthias-harms harm-matthias-harms commented Aug 21, 2024

Description

Geopandas supports reading parquet and feather files since version 0.8.0. These offer performance improvements over classical file types, such as .geojson or .shp.zip (supported by geopandas.GeoJSONDataset). We have used a private implementation for some time, but want to contribute them to kedro-datasets. This PR closes #196.

Development notes

Extended the implementation of geopandas.GeoJSONDataset with an optional file_formatparameter which is needed to use custom read_* and to_* methods. Replaced the geopandas.GeoJSONDataset with geopandas.GenericDataset.

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes

@harm-matthias-harms harm-matthias-harms force-pushed the feature/add_geopandas_parquet_dataset branch from 711093b to af6a1d3 Compare August 21, 2024 13:41
@harm-matthias-harms
Copy link
Contributor Author

If the implementation is okay, I can add the feather dataset in the same PR or open a new one.

@harm-matthias-harms harm-matthias-harms marked this pull request as ready for review August 21, 2024 14:34
Copy link
Member

@astrojuanlu astrojuanlu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this contribution @harm-matthias-harms ! One question: do you think we could have some sort of geopandas.GenericDataset, like we do for pandas and Polars? (Just so that we avoid proliferation of datasets)

@harm-matthias-harms
Copy link
Contributor Author

Thanks for this contribution @harm-matthias-harms ! One question: do you think we could have some sort of geopandas.GenericDataset, like we do for pandas and Polars? (Just so that we avoid proliferation of datasets)

To be honest, that would be my preferred solution as well. The current approach feels like a lot of duplicated boilerplate. I really like the Polars dataset and will try to modify the GeoJSON dataset tomorrow to make it work more generally.

@harm-matthias-harms harm-matthias-harms marked this pull request as draft August 21, 2024 15:53
@harm-matthias-harms harm-matthias-harms changed the title feat(datasets): Add geopandas ParquetDataset feat(datasets): Add file_format to geopandas.GeoJSONDataset Aug 22, 2024
@harm-matthias-harms harm-matthias-harms force-pushed the feature/add_geopandas_parquet_dataset branch 2 times, most recently from 8e79972 to 04ff6a2 Compare August 22, 2024 10:10
kedro-datasets/kedro_datasets/geopandas/geojson_dataset.py Outdated Show resolved Hide resolved

def __init__( # noqa: PLR0913
self,
*,
filepath: str,
file_format: str = "file",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine to set a default value, especially because this ensures backward compatibility

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need this anymore because of backward compatibility, but it's more or less the generic method of geopandas, and it saves some boilerplate in most cases.

@harm-matthias-harms harm-matthias-harms marked this pull request as ready for review August 22, 2024 10:45
Copy link
Member

@astrojuanlu astrojuanlu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @harm-matthias-harms! This looks like a step in the right direction.

My worry is that GeoJSONDataset is no longer GeoJSON-exclusive. So I'm wondering if we should more or less take the code you created here, but put it in a geopandas.GenericGeoDataset, and leave geopandas.GeoJSONDataset alone.

@harm-matthias-harms
Copy link
Contributor Author

My worry is that GeoJSONDataset is no longer GeoJSON-exclusive. So I'm wondering if we should more or less take the code you created here, but put it in a geopandas.GenericGeoDataset, and leave geopandas.GeoJSONDataset alone.

@astrojuanlu I see two options:

  1. GeoJSON is also not GeoJSON exclusive today because I can read many more formats than geojson with read_file.
  2. We create a GenericDataset (this fixes the naming and offers the same functionality and more ) and deprecate the GeoJSONDataset to remove it in some future major versions. Then we don't have to maintain the same dataset twice in the future.

Please tell me how you would like to proceed here.

@astrojuanlu
Copy link
Member

GeoJSON is also not GeoJSON exclusive today because I can read many more formats than geojson with read_file.

Hah, true... I see this wasn't discussed in kedro-org/kedro#190 originally.

I'm actually in favour of having a new dataset with a more generic name and deprecate the old one, but wondering if the churn is worth the advantages. Any thoughts @merelcht @noklam ?

@merelcht
Copy link
Member

merelcht commented Sep 3, 2024

I'm actually in favour of having a new dataset with a more generic name and deprecate the old one, but wondering if the churn is worth the advantages. Any thoughts @merelcht @noklam ?

We can just remove the old one and add a new one with a generic name in a breaking release. I don't think it's worth the effort of doing a non-breaking release with a deprecation warning first and then remove the old one in the next breaking release just for this one dataset.

We could do a TSC vote if we feel the nature of the dataset changes too much.

@astrojuanlu
Copy link
Member

Let's proceed that way then 👍🏼 @harm-matthias-harms are you willing to update the PR accordingly? Namely:

just remove the old one and add a new one with a generic name in a breaking release.

@harm-matthias-harms
Copy link
Contributor Author

@astrojuanlu I updated the PR. Since I merged main the tests have been failing. It looks like a fsspec problem. I don't know much about ffspec, maybe you have a good idea of how to fix this.

Last successful run: https://github.com/kedro-org/kedro-plugins/actions/runs/10593320929/job/29354563096?pr=812
First run that failed: https://github.com/kedro-org/kedro-plugins/actions/runs/10719091250/job/29722424013?pr=812

@harm-matthias-harms harm-matthias-harms changed the title feat(datasets): Add file_format to geopandas.GeoJSONDataset feat(datasets): Replace geopandas.GeoJSONDataset with geopandas.GenericDataset Sep 9, 2024
@astrojuanlu
Copy link
Member

I do not know.

Does geopandas handle remote filepaths? Like, if we do geopandas.read_{self._file_format}("s3://...") directly without

https://github.com/harm-matthias-harms/kedro-plugins/blob/9e88c5ee0284865b1256a5667bd6f7b691ddb3f7/kedro-datasets/kedro_datasets/geopandas/generic_dataset.py#L155-L156

will it work?

@harm-matthias-harms
Copy link
Contributor Author

@astrojuanlu This was also a thought I had, but that currently makes more tests fail, and it may be better to have all datasets similar. I don't know why the CI fails.

pytest kedro_datasets/geopandas/generic_dataset.py --doctest-modules --doctest-continue-on-failure and pytest tests/geopandas/test_generic_dataset.py both pass locally with python 3.12. It looks like I have to investigate the CI a bit more. I hope I find some time at the end of the week for that.

@astrojuanlu
Copy link
Member

Thanks for your patience! Let us know how it goes

@merelcht
Copy link
Member

@astrojuanlu This was also a thought I had, but that currently makes more tests fail, and it may be better to have all datasets similar. I don't know why the CI fails.

pytest kedro_datasets/geopandas/generic_dataset.py --doctest-modules --doctest-continue-on-failure and pytest tests/geopandas/test_generic_dataset.py both pass locally with python 3.12. It looks like I have to investigate the CI a bit more. I hope I find some time at the end of the week for that.

It has to do with the docstring part and seems to affect python 3.10 specifically. I'm able to replicate the error locally, but haven't found what's wrong yet.

@harm-matthias-harms
Copy link
Contributor Author

@astrojuanlu I figured out the problem... I also bumped geopandas to v1. They use pyogrio as the new default over fiona. pyogrio does not yet support an alternate file system like fsspec (geopandas/pyogrio#430). There are workarounds using buffers, but the simplest way would be to change the used engine to fiona until pyogrio supports fsspec.

Copy link
Contributor

@ElenaKhaustova ElenaKhaustova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the contribution, @harm-matthias-harms!

Left a few nit comments, but overall it looks good!

kedro-datasets/kedro_datasets/geopandas/generic_dataset.py Outdated Show resolved Hide resolved
kedro-datasets/kedro_datasets/geopandas/__init__.py Outdated Show resolved Hide resolved
@merelcht merelcht mentioned this pull request Sep 30, 2024
@harm-matthias-harms harm-matthias-harms force-pushed the feature/add_geopandas_parquet_dataset branch 3 times, most recently from 4fad046 to ba2b3f5 Compare October 1, 2024 09:10
harm-matthias-harms and others added 13 commits October 7, 2024 08:27
Signed-off-by: Harm Matthias Harms <[email protected]>
Signed-off-by: Harm Matthias Harms <[email protected]>
Signed-off-by: Harm Matthias Harms <[email protected]>
Signed-off-by: Harm Matthias Harms <[email protected]>
Signed-off-by: Harm Matthias Harms <[email protected]>
Signed-off-by: Harm Matthias Harms <[email protected]>
Signed-off-by: Harm Matthias Harms <[email protected]>
Signed-off-by: Harm Matthias Harms <[email protected]>
Co-authored-by: ElenaKhaustova <[email protected]>
Signed-off-by: Harm Matthias Harms <[email protected]>
Signed-off-by: Harm Matthias Harms <[email protected]>
Signed-off-by: Harm Matthias Harms <[email protected]>
@harm-matthias-harms harm-matthias-harms force-pushed the feature/add_geopandas_parquet_dataset branch from b789b01 to aba6a28 Compare October 7, 2024 06:31
Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks @harm-matthias-harms

Copy link
Contributor

@ElenaKhaustova ElenaKhaustova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for addressing the comments!

@ElenaKhaustova
Copy link
Contributor

@astrojuanlu, do you wanna add anything before we merge?

@ElenaKhaustova ElenaKhaustova enabled auto-merge (squash) October 10, 2024 13:57
Signed-off-by: Ankita Katiyar <[email protected]>
@ankatiyar ankatiyar enabled auto-merge (squash) October 10, 2024 15:49
@ankatiyar ankatiyar merged commit f13dd7a into kedro-org:main Oct 10, 2024
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support parquet and feather datasets from GeoPandas
7 participants