Fix remaining memory issues during the last 0.5% of processing planet.osm with the dense memory cache #108

Open. Wants to merge 4 commits into base: master.

patrickbr (Member) commented Dec 20, 2024

This is an unfinished draft PR to fix the remaining memory issues in the final 0.5% of phase 2 (dump) when processing planet.osm with the dense RAM cache.

The current version attempts to further reduce the memory and disk space overhead of calculating the geospatial relations.

libspatialjoin requires string IDs to identify geometries. Previously, the full prefixed IRI (e.g. osmnode:3454354544) was used as the ID. With 492069c, we use the more compact scheme

[type][integer ID]

where [type] is a single-byte identifier (1 for nodes, 2 for ways, 3 for way areas, 4 for relation areas, and 5 for relations) and [integer ID] is the OSM ID encoded as a variable-length integer. These bytes are stored directly in a std::string, which is used as the ID for libspatialjoin. libspatialjoin never interprets these IDs, so they can contain arbitrary byte sequences. In particular, note that libspatialjoin stores them as std::strings, not C strings, so zero bytes are allowed and are returned from the cache correctly (otherwise, IDs like 128 would not work).
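
To illustrate the point about zero bytes, here is a minimal sketch of why a length-aware std::string keeps an embedded zero byte while a C-string view of the same data is cut short. The helper names `makeBinaryId` and `cStringLength` are hypothetical and are not part of libspatialjoin or this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: build an ID from raw bytes. Passing the length
// explicitly keeps embedded zero bytes inside the std::string.
inline std::string makeBinaryId(const char* bytes, std::size_t len) {
  return std::string(bytes, len);
}

// Hypothetical helper: the length a NUL-terminated C-string API would see.
// It stops at the first zero byte, so it may be shorter than id.size().
inline std::size_t cStringLength(const std::string& id) {
  return std::strlen(id.c_str());
}
```

For a three-byte ID such as {0x01, 0x00, 0x2A}, `size()` reports 3, whereas the C-string length is 1. Any code path that treated the ID as a C string would therefore truncate it.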

This has the following benefits:

  1. the size of a typical OSM ID goes down from around 18 bytes to around 5 bytes
  2. the disk space required for the geometry cache (in which these IDs are stored) is reduced, especially for nodes (where only the ID is stored on disk)
  3. the memory overhead of the relation tracking (via libspatialjoin's "collection of geometry refs" functionality) is greatly reduced.
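
The [type][integer ID] scheme described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code from 492069c: the exact variable-length encoding used there is not shown in this description, so the sketch assumes an unsigned LEB128-style varint (7 payload bits per byte, high bit as continuation flag) and non-negative OSM IDs. The names `GeomType`, `encodeId`, and `decodeId` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Type bytes as listed in the PR description.
enum class GeomType : std::uint8_t {
  Node = 1, Way = 2, WayArea = 3, RelationArea = 4, Relation = 5
};

// Hypothetical encoder: one type byte followed by the OSM ID as an
// unsigned LEB128-style varint (an assumption, see lead-in).
inline std::string encodeId(GeomType type, std::uint64_t osmId) {
  std::string out;
  out.push_back(static_cast<char>(type));
  do {
    std::uint8_t byte = osmId & 0x7F;  // low 7 bits
    osmId >>= 7;
    if (osmId != 0) byte |= 0x80;      // more bytes follow
    out.push_back(static_cast<char>(byte));
  } while (osmId != 0);
  return out;
}

// Hypothetical decoder: returns false if the byte string is malformed.
inline bool decodeId(const std::string& id, GeomType* type,
                     std::uint64_t* osmId) {
  if (id.size() < 2) return false;
  *type = static_cast<GeomType>(static_cast<std::uint8_t>(id[0]));
  std::uint64_t value = 0;
  int shift = 0;
  for (std::size_t i = 1; i < id.size(); ++i) {
    if (shift >= 64) return false;     // too many bytes for 64 bits
    const std::uint8_t byte = static_cast<std::uint8_t>(id[i]);
    value |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
    shift += 7;
    if ((byte & 0x80) == 0) {          // last varint byte
      *osmId = value;
      return i + 1 == id.size();       // no trailing bytes allowed
    }
  }
  return false;                        // continuation bit never cleared
}
```

Under this assumed encoding, the example node ID 3454354544 takes 6 bytes (1 type byte plus 5 varint bytes), and IDs below 2^28 take 5 bytes or fewer, compared with the 18 characters of osmnode:3454354544.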

For a further reduction of the memory footprint in the final 0.5%, I am waiting for profiling results from massif.

… nodes during computation of the geometric relations
…le-length integer, store this integer in a string, and use this string as an ID for libspatialjoin. This makes use of the fact that our IDs are always pairs of <type> and <int> and has the following benefits: (1) the size of a typical OSM ID goes down from around 18 bytes to around 5 bytes, (2) the disk space required for the geometry cache (in which these IDs are stored) is reduced, especially for nodes (where only the ID is stored on disk), (3) the memory overhead of the relation tracking (via libspatialjoin's "collection of geometry refs" functionality) is greatly reduced. Note: for further optimizations regarding the memory problem with the dense RAM cache, I am waiting for profiling results from massif