feat: add automatic OSM extracts download #33

Merged
merged 38 commits into from
Jan 31, 2024
38 commits
7c5cdcf
feat: copy OSM indexes parsing logic and add BBBike
RaczeQ Jan 21, 2024
75254c4
chore: added license info
RaczeQ Jan 23, 2024
1cc368f
feat: add info about caching and geometries validation
RaczeQ Jan 23, 2024
67d3cf7
chore: add changelog entries
RaczeQ Jan 23, 2024
d3c2e61
fix: readme typo
RaczeQ Jan 24, 2024
a9713b6
feat: add osmnx geocoding
RaczeQ Jan 25, 2024
eb93971
feat: change CLI parameters parsing flow
RaczeQ Jan 27, 2024
5dfbd56
chore: modified osm extracts index naming
RaczeQ Jan 27, 2024
fd0d4ea
feat: add option to use any geometry in filtering and downloading osm…
RaczeQ Jan 27, 2024
6d6a655
refactor: moved OsmExtractSource enum
RaczeQ Jan 27, 2024
e98d4f2
feat: added function for automatic pbf files downloading and returnin…
RaczeQ Jan 27, 2024
f95a4df
chore: added required libraries
RaczeQ Jan 27, 2024
c9adf62
feat: add missing doctests for new function
RaczeQ Jan 29, 2024
8f3fb45
feat_ add new function to the cli
RaczeQ Jan 29, 2024
26aa926
ci: add OSM data caching
RaczeQ Jan 29, 2024
3ee1d62
fix: sort combined OSM indexes
RaczeQ Jan 29, 2024
a92eaa2
chore: add missing tests and option to pass osm ids separated by comma
RaczeQ Jan 29, 2024
9e18b72
ci: change cache names
RaczeQ Jan 29, 2024
36e3ea0
feat: add option to filter data using global grid indexes
RaczeQ Jan 29, 2024
1fc1c78
chore: suppress refurb suggestions
RaczeQ Jan 29, 2024
4187334
chore: add changelog entries
RaczeQ Jan 29, 2024
289b035
fix: add condition check for empty pyarrow table
RaczeQ Jan 29, 2024
2bf9f02
chore: add test for duplicated features
RaczeQ Jan 29, 2024
72d9acd
chore: add changelog entry
RaczeQ Jan 29, 2024
c3154f4
fix: remove trailing slash from BBBike url
RaczeQ Jan 30, 2024
a3039be
feat: add wrapper function to return gdf from geometry filter
RaczeQ Jan 30, 2024
6d85bf1
chore: add andorra parsing extract example
RaczeQ Jan 30, 2024
7128c42
chore: add osm extracts tests
RaczeQ Jan 30, 2024
89c1009
fix: add geometry orienting to simplify geometry hashing
RaczeQ Jan 30, 2024
a72c4f8
chore: add changelog entry
RaczeQ Jan 30, 2024
f627a70
refactor: apply refurb suggestion
RaczeQ Jan 30, 2024
c66fe06
chore: replace values in doctests
RaczeQ Jan 30, 2024
565da9c
chore: add rounding to geometry hashing
RaczeQ Jan 30, 2024
85f5ed8
fix: change hashes in tests
RaczeQ Jan 30, 2024
fb09251
chore: remove progress bars for extracts searching
RaczeQ Jan 30, 2024
9739d10
chore: change cli help image
RaczeQ Jan 30, 2024
83e7ecd
chore: suppress warnings and logger in CLI
RaczeQ Jan 30, 2024
73f4c37
chore: update readme file
RaczeQ Jan 30, 2024
5 changes: 5 additions & 0 deletions .github/workflows/_tests.yml
Expand Up @@ -44,6 +44,11 @@ jobs:
cache-dependency-path: "**/pdm.lock"
- name: Install dependencies
run: pdm install -d -G test --skip=post_install
- name: Cache OSM data
uses: actions/cache@v3
with:
path: cache
key: osm-cache-${{ matrix.os }}-${{ matrix.python-version }}
- name: Cache tox runner
uses: actions/cache@v3
with:
5 changes: 5 additions & 0 deletions .github/workflows/ci-dev.yml
Expand Up @@ -80,6 +80,11 @@ jobs:
run: |
git config --local user.email "github-actions[bot]@users.noreply.github.com"
git config --local user.name "github-actions[bot]"
- name: Cache OSM data
uses: actions/cache@v3
with:
path: "**/cache"
key: mkdocs-osm-dev-cache-${{ runner.os }}
- name: Execute jupyter notebooks
run: |
jupyter nbconvert --to notebook --inplace --execute $(find examples/ -type f -name "*.ipynb") --ExecutePreprocessor.kernel_name='python3'
5 changes: 5 additions & 0 deletions .github/workflows/ci-prod.yml
Expand Up @@ -70,6 +70,11 @@ jobs:
run: |
git config --local user.email "github-actions[bot]@users.noreply.github.com"
git config --local user.name "github-actions[bot]"
- name: Cache OSM data
uses: actions/cache@v3
with:
path: "**/cache"
key: mkdocs-osm-dev-cache-${{ runner.os }}
- name: Execute jupyter notebooks
run: |
jupyter nbconvert --to notebook --inplace --execute $(find examples/ -type f -name "*.ipynb") --ExecutePreprocessor.kernel_name='python3'
14 changes: 14 additions & 0 deletions CHANGELOG.md
Expand Up @@ -7,6 +7,20 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Added

- Option to automatically download PBF files for geometries [#32](https://github.com/kraina-ai/quackosm/issues/32)
- Filtering data using 3 global grid systems: Geohash, H3 and S2 [#30](https://github.com/kraina-ai/quackosm/issues/30)

### Changed

- Filter OSM IDs are now expected to be passed as a single comma-separated value instead of repeating `--filter-osm-id` for each ID [#30](https://github.com/kraina-ai/quackosm/issues/30)

### Fixed

- Remove duplicated features when parsing multiple PBF files
- Geometry orienting to eliminate hash differences between operating systems and between different but equal representations of the same geometry

## [0.3.3] - 2024-01-16

### Added
171 changes: 166 additions & 5 deletions README.md
Expand Up @@ -52,7 +52,10 @@ Optional:
- typer[all] (click, colorama, rich, shellingham)

## Usage
### Load data as a GeoDataFrame

### If you have already downloaded the PBF file 📁🗺️

#### Load data as a GeoDataFrame
```python
>>> import quackosm as qosm
>>> qosm.get_features_gdf(monaco_pbf_path)
Expand All @@ -72,14 +75,14 @@ way/993121275 {'building': 'yes', 'name': ... POLYGON ((7.43214 43.7481...

[7906 rows x 2 columns]
```
### Just convert PBF to GeoParquet
#### Just convert PBF to GeoParquet
```python
>>> import quackosm as qosm
>>> gpq_path = qosm.convert_pbf_to_gpq(monaco_pbf_path)
>>> gpq_path.as_posix()
'files/monaco_nofilter_noclip_compact.geoparquet'
```
### Inspect the file with duckdb
#### Inspect the file with duckdb
```python
>>> import duckdb
>>> duckdb.load_extension('spatial')
Expand Down Expand Up @@ -117,7 +120,7 @@ way/993121275 {'building': 'yes', 'name': ... POLYGON ((7.43214 43.7481...
│ 7906 rows (20 shown) 3 columns │
└────────────────────────────────────────────────────────────────────────────────────────┘
```
### Use as CLI
#### Use as CLI
```console
$ quackosm monaco.osm.pbf
⠏ [ 1/33] Reading nodes • 0:00:00
Expand Down Expand Up @@ -157,7 +160,122 @@ $ quackosm monaco.osm.pbf
⠙ [ 33/33] Saving final geoparquet file • 0:00:00
files/monaco_nofilter_noclip_compact.geoparquet
```
CLI Help output:

### Let QuackOSM automatically download the required OSM PBF files for you 🔎🌍

#### Load data as a GeoDataFrame
```python
>>> import quackosm as qosm
>>> import osmnx as ox
>>> geometry = ox.geocode_to_gdf("Vatican City").unary_union
>>> qosm.get_features_gdf_from_geometry(geometry)
tags geometry
feature_id
node/10253371713 {'crossing': 'uncontrolled',... POINT (12.45603 41.90454)
node/10253371714 {'highway': 'stop'} POINT (12.45705 41.90400)
node/10253371715 {'highway': 'stop'} POINT (12.45242 41.90164)
node/10253371720 {'artwork_type': 'statue',... POINT (12.45147 41.90484)
node/10253371738 {'natural': 'tree'} POINT (12.45595 41.90609)
... ... ...
way/983015528 {'barrier': 'hedge', 'height'... POLYGON ((12.45027 41.901...
way/983015529 {'barrier': 'hedge', 'height'... POLYGON ((12.45028 41.901...
way/983015530 {'barrier': 'hedge', 'height'... POLYGON ((12.45023 41.901...
way/998561138 {'barrier': 'bollard', 'bicyc... LINESTRING (12.45821 41.9...
way/998561139 {'barrier': 'bollard', 'bicyc... LINESTRING (12.45828 41.9...

[3286 rows x 2 columns]
```
#### Just convert geometry to GeoParquet
```python
>>> import quackosm as qosm
>>> from shapely import from_wkt
>>> geometry = from_wkt(
... "POLYGON ((14.4861 35.9107, 14.4861 35.8811, 14.5331 35.8811, 14.5331 35.9107, 14.4861 35.9107))"
... )
>>> gpq_path = qosm.convert_geometry_to_gpq(geometry)
>>> gpq_path.as_posix()
'files/4b2967088a8fe31cdc15401e29bff9b7b882565cd8143e90443f39f2dc5fe6de_nofilter_compact.geoparquet'
```
#### Inspect the file with duckdb
```python
>>> import duckdb
>>> duckdb.load_extension('spatial')
>>> duckdb.read_parquet(str(gpq_path)).project(
... "* REPLACE (ST_GeomFromWKB(geometry) AS geometry)"
... ).order("feature_id")
┌──────────────────┬──────────────────────┬──────────────────────────────────────────────┐
│ feature_id │ tags │ geometry │
│ varchar │ map(varchar, varch… │ geometry │
├──────────────────┼──────────────────────┼──────────────────────────────────────────────┤
│ node/10001388317 │ {amenity=bench, ba… │ POINT (14.5093988 35.8936881) │
│ node/10001388417 │ {amenity=bench, ba… │ POINT (14.5094635 35.8937135) │
│ node/10001388517 │ {amenity=bench, ba… │ POINT (14.5095215 35.8937305) │
│ node/10018287160 │ {opening_hours=Mo-… │ POINT (14.5184916 35.8915925) │
│ node/10018287161 │ {defensive_works=b… │ POINT (14.5190093 35.8909471) │
│ node/10018287162 │ {defensive_works=h… │ POINT (14.5250094 35.8883199) │
│ node/10018742746 │ {defibrillator:loc… │ POINT (14.5094082 35.8965151) │
│ node/10018742747 │ {amenity=bank, nam… │ POINT (14.51329 35.8991614) │
│ node/10032244899 │ {amenity=restauran… │ POINT (14.4946298 35.8986226) │
│ node/10034853491 │ {amenity=pharmacy} │ POINT (14.4945884 35.9012758) │
│ · │ · │ · │
│ · │ · │ · │
│ · │ · │ · │
│ way/884730763 │ {highway=footway, … │ LINESTRING (14.5218277 35.8896022, 14.5218… │
│ way/884730764 │ {bridge=yes, highw… │ LINESTRING (14.5218054 35.8896015, 14.5218… │
│ way/884730765 │ {highway=footway, … │ LINESTRING (14.5204069 35.889924, 14.52044… │
│ way/884730766 │ {handrail=yes, hig… │ LINESTRING (14.5204375 35.8898663, 14.5204… │
│ way/884730767 │ {access=yes, handr… │ LINESTRING (14.5196113 35.8906142, 14.5196… │
│ way/884730768 │ {highway=steps, la… │ LINESTRING (14.5197226 35.890676, 14.51972… │
│ way/884730769 │ {access=yes, handr… │ LINESTRING (14.5197184 35.8906707, 14.5197… │
│ way/884738591 │ {highway=pedestria… │ LINESTRING (14.5204163 35.8897296, 14.5204… │
│ way/884744870 │ {highway=residenti… │ LINESTRING (14.5218931 35.8864046, 14.5221… │
│ way/884744871 │ {access=yes, handr… │ LINESTRING (14.5221083 35.8864287, 14.5221… │
├──────────────────┴──────────────────────┴──────────────────────────────────────────────┤
│ ? rows (>9999 rows, 20 shown) 3 columns │
└────────────────────────────────────────────────────────────────────────────────────────┘
```
#### Use as CLI
```console
$ quackosm --geom-filter-geocode "Shibuya, Tokyo"
100%|██████████████████████████████████████| 45.7M/45.7M [00:00<00:00, 259GB/s]
⠹ [ 1/33] Reading nodes • 0:00:03
⠇ [ 2/33] Filtering nodes - intersection • 0:00:01
⠇ [ 3/33] Filtering nodes - tags • 0:00:01
⠙ [ 4/33] Calculating distinct filtered nodes ids • 0:00:00
⠋ [ 5/33] Reading ways • 0:00:04
⠧ [ 6/33] Unnesting ways • 0:00:02
⠹ [ 7/33] Filtering ways - valid refs • 0:00:03
⠴ [ 8/33] Filtering ways - intersection • 0:00:02
⠼ [ 9/33] Filtering ways - tags • 0:00:00
⠋ [ 10/33] Calculating distinct filtered ways ids • 0:00:00
⠦ [ 11/33] Reading relations • 0:00:00
⠴ [ 12/33] Unnesting relations • 0:00:00
⠼ [ 13/33] Filtering relations - valid refs • 0:00:00
⠸ [ 14/33] Filtering relations - intersection • 0:00:00
⠙ [ 15/33] Filtering relations - tags • 0:00:00
⠋ [ 16/33] Calculating distinct filtered relations ids • 0:00:00
⠹ [ 17/33] Loading required ways - by relations • 0:00:00
⠋ [ 18/33] Calculating distinct required ways ids • 0:00:00
⠇ [ 19/33] Saving filtered nodes with geometries • 0:00:02
⠋ [ 20/33] Saving required nodes with structs • 0:00:05
⠧ [ 21/33] Grouping filtered ways • 0:00:03
[ 22/33] Saving filtered ways with linestrings 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/1 • 0:00:00 < 0:00:00 •
⠴ [ 23/33] Grouping required ways • 0:00:02
[ 24/33] Saving required ways with linestrings 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/1 • 0:00:00 < 0:00:00 •
⠹ [ 25/33] Saving filtered ways with geometries • 0:00:01
⠦ [ 26/33] Saving valid relations parts • 0:00:00
⠙ [27.1/33] Saving relations inner parts - valid geometries • 0:00:00
⠙ [27.2/33] Saving relations inner parts - invalid geometries • 0:00:00
⠙ [28.1/33] Saving relations outer parts - valid geometries • 0:00:00
⠙ [28.2/33] Saving relations outer parts - invalid geometries • 0:00:00
⠋ [ 29/33] Saving relations outer parts with holes • 0:00:00
⠙ [ 30/33] Saving relations outer parts without holes • 0:00:00
⠙ [ 31/33] Saving filtered relations with geometries • 0:00:00
⠼ [32.1/33] Saving valid features • 0:00:00
⠋ [ 33/33] Saving final geoparquet file • 0:00:00
files/9ae2b160eb7556991148f5a2693aaf4b38bbb225c3700a6bfe9e5e54f48b987e_nofilter_compact.geoparquet
```
CLI Help output (`QuackOSM -h`):
![CLI Help output](https://raw.githubusercontent.com/kraina-ai/quackosm/main/docs/assets/images/cli_help.png)

You can find full API + more examples in the [docs](https://kraina-ai.github.io/quackosm/).
Expand Down Expand Up @@ -194,6 +312,42 @@ Library contains a logic to construct geometries (points, linestrings, polygons)
11. Fix invalid geometries
12. Return final GeoParquet file

### Geometry validation

You might ask: _How do I know that these geometries are reconstructed correctly?_

To answer this question, `QuackOSM` includes dedicated tests that compare its reconstructed geometries against those produced by `GDAL`.
This might come as a surprise, but since OSM geometries aren't always perfectly defined (especially relations), `QuackOSM` can even fix geometries that `GDAL` loads with odd artifacts.

You can inspect the comparison algorithm in the `test_gdal_parity` function in the `tests/base/test_pbf_file_reader.py` file.

### Caching

The library uses a caching system to avoid repeating computations.

By default, the library saves results in the `files` directory created in the working directory. The result file name is generated from the original `*.osm.pbf` file name.

Original file name to be converted: `example.osm.pbf`.

Default output without any filtering: `example_nofilter_noclip_compact.geoparquet`.

The `nofilter` part can be replaced by the hash of the OSM tags provided for filtering.
`example_a9dd1c3c2e3d6a94354464e9a1a536ef44cca77eebbd882f48ca52799eb4ca91_noclip_exploded.geoparquet`

The `noclip` part can be replaced by the hash of the geometry used for filtering.
`example_nofilter_430020b6b1ba7bef8ea919b2fb4472dab2972c70a2abae253760a56c29f449c4_compact.geoparquet`
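For a geometry hash to be usable as a cache key, the same shape should always produce the same hash, regardless of vertex order or tiny floating-point differences. The sketch below illustrates that idea with the standard library only. It is not QuackOSM's implementation: `normalize_ring` and `ring_hash` are hypothetical names, and the choice of SHA-256 and 7 decimal places is an assumption made for the example.

```python
import hashlib


def signed_area(ring):
    """Shoelace formula: positive for counter-clockwise rings."""
    return 0.5 * sum(
        x1 * y2 - x2 * y1
        for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1])
    )


def normalize_ring(ring, precision=7):
    """Round coordinates, force counter-clockwise order, start at the smallest vertex."""
    # Adding 0.0 turns a possible -0.0 into 0.0 so both format identically.
    points = [(round(x, precision) + 0.0, round(y, precision) + 0.0) for x, y in ring]
    if len(points) > 1 and points[0] == points[-1]:
        points = points[:-1]  # drop the repeated closing vertex
    if signed_area(points) < 0:
        points.reverse()  # clockwise -> counter-clockwise
    start = points.index(min(points))
    return points[start:] + points[:start]


def ring_hash(ring, precision=7):
    """Hash the normalized ring (assumption: SHA-256 over a plain text form)."""
    normalized = normalize_ring(ring, precision)
    text = ",".join(f"{x:.{precision}f} {y:.{precision}f}" for x, y in normalized)
    return hashlib.sha256(text.encode()).hexdigest()
```

With this normalization, a clockwise and a counter-clockwise description of the same square, or two copies that differ only by noise below the rounding precision, map to the same hash and therefore to the same cached file name.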

The `compact` part can also take the form of `exploded`. It represents the form of the OSM tags: either kept together in a single dictionary or split into columns.

When filtering by selecting individual feature IDs, an additional hash based on those IDs is appended to the file name.
`example_nofilter_noclip_compact_c740a1597e53ae8c5e98c5119eaa1893ddc177161afe8642addcbe54a6dc089d.geoparquet`

When the `keep_all_tags` parameter is passed while filtering by OSM tags, an additional `alltags` component is added after the OSM filter hash part.
`example_a9dd1c3c2e3d6a94354464e9a1a536ef44cca77eebbd882f48ca52799eb4ca91_alltags_noclip_compact.geoparquet`

General schema of multiple segments that are concatenated together:
`pbf_file_name`\_(`osm_filter_tags_hash_part`/`nofilter`)(\_`alltags`)\_(`clipping_geometry_hash_part`/`noclip`)\_(`compact`/`exploded`)(\_`filter_osm_ids_hash_part`).geoparquet
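The schema above can be sketched as a small helper that joins the segments in order. This is only an illustration of the naming rules described in this section: `build_result_file_name` and its parameter names are hypothetical and are not part of the QuackOSM API, and the hashes are assumed to be computed elsewhere and passed in as strings.

```python
from typing import Optional


def build_result_file_name(
    pbf_file_name: str,
    osm_filter_tags_hash: Optional[str] = None,
    keep_all_tags: bool = False,
    clipping_geometry_hash: Optional[str] = None,
    explode_tags: bool = False,
    filter_osm_ids_hash: Optional[str] = None,
) -> str:
    """Concatenate the file name segments in the order given by the schema."""
    parts = [pbf_file_name, osm_filter_tags_hash or "nofilter"]
    # `alltags` only appears when a tags filter is used together with keep_all_tags.
    if keep_all_tags and osm_filter_tags_hash:
        parts.append("alltags")
    parts.append(clipping_geometry_hash or "noclip")
    parts.append("exploded" if explode_tags else "compact")
    # The OSM IDs hash is optional and always comes last.
    if filter_osm_ids_hash:
        parts.append(filter_osm_ids_hash)
    return "_".join(parts) + ".geoparquet"


print(build_result_file_name("example"))
print(build_result_file_name("example", osm_filter_tags_hash="a9dd1c3c", keep_all_tags=True))
```

Because every option that changes the result also changes the file name, a repeated call with the same inputs can reuse the existing file instead of recomputing it.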

### Memory usage

DuckDB queries requiring `JOIN`, `GROUP` and `ORDER BY` operations are very memory intensive. Because of that, some steps are divided into chunks (groups) with a set number of rows per chunk.
Expand Down Expand Up @@ -236,3 +390,10 @@ PBF file size: 1.7 GB
[Geofabrik link](https://download.geofabrik.de/europe/poland.html)

![Poland PBF file result](https://raw.githubusercontent.com/kraina-ai/quackosm/main/docs/assets/images/poland_disk_spillage.png)


## License

The library is distributed under the Apache-2.0 License.

The free [OpenStreetMap](https://www.openstreetmap.org/) data, which is used for the development of QuackOSM, is licensed under the [Open Data Commons Open Database License](https://opendatacommons.org/licenses/odbl/) (ODbL) by the [OpenStreetMap Foundation](https://osmfoundation.org/) (OSMF).
Binary file modified docs/assets/images/cli_help.png