Replies: 3 comments
-
Good point, although this could be one of those things that works better one way in some circumstances and the other way in others (depending on PostgreSQL version, database size, etc.). Does anybody else have experience with this?
-
I experienced this outside of osm2pgsql a few years ago. We were organizing a hackathon with a large health-related dataset and provided it through a PostgreSQL instance. Since then, I create my indexes in parallel on the same table as far as possible, then I process the next table...
-
I think the benefits in this case will be limited. From my experience with parallel indexing, until PostgreSQL and PostGIS get much faster GiST indexing, total indexing time is largely determined by indexing the spatial 'way' column. E.g. when I start 4 concurrent threads (each one potentially using PostgreSQL's parallel index builds for btree indexes) and index 50+ columns including a single 'way' column, the total indexing time is fully determined by the 'way' indexing alone (which I deliberately initiate on the first started thread). All the other indexes together cost less time using the remaining 3 threads.

Additionally, this scenario would only pay off when you create a single table with osm2pgsql. As far as I can tell, osm2pgsql already indexes multiple tables in parallel, so the benefit of parallel indexing on a single table will be smaller (and may hit PostgreSQL's limits on parallel workers). So this would likely only really make sense for a style that defines a single table.
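The scheduling described above (start the expensive spatial index first, let the cheap btree indexes share the remaining workers) can be sketched as follows. This is a minimal illustration, not osm2pgsql code: the table and column names are made up for the example, and `run_sql` is a hypothetical caller-supplied function that opens its own connection and executes one statement.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical index definitions; table/column names are illustrative only.
# The GiST index on the geometry column goes first, so it starts immediately
# and runs for the whole duration while the cheaper btree builds share the
# remaining worker slots.
INDEX_STATEMENTS = [
    "CREATE INDEX ON planet_osm_polygon USING gist (way)",
    "CREATE INDEX ON planet_osm_polygon (osm_id)",
    "CREATE INDEX ON planet_osm_polygon (name)",
]

def build_indexes(statements, run_sql, workers=4):
    """Run each CREATE INDEX in its own session, up to `workers` at a time.

    `run_sql` is a caller-supplied function that executes one statement on
    its own connection (each CREATE INDEX needs a separate session, since a
    PostgreSQL session runs one command at a time).
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_sql, stmt) for stmt in statements]
        # Propagate any errors from the individual builds.
        return [f.result() for f in futures]
```

Submission order matters for the makespan: putting the longest-running build first avoids the case where it is still queued while short builds occupy the workers.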
-
It looks like index creation on one table is currently done sequentially:
https://github.com/openstreetmap/osm2pgsql/blob/b1a634a0ed3c0f15fce4d5c18e1eb46eab413e6c/src/table.cpp#L231
Creating an index on a table means reading the whole table sequentially, so creating several indexes on the same table benefits a lot from doing them in parallel: the data has a high chance of still being in some cache, so it requires no additional read I/O.
When they are created sequentially on a large table, they do not benefit from the read cache if it is smaller than the data to read (which is usually the case).
I usually create additional indexes in parallel, and in most cases the total takes not much more time than the longest one alone.
I have no idea if this is easy to implement or not...
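The cache argument above can be made concrete with a rough back-of-the-envelope model (my own simplification, not a measurement): sequential builds on a table larger than the cache re-read the whole table once per index, while parallel builds share one physical scan because pages fetched by one build are still cached for the others.

```python
def index_build_read_io(table_gb, n_indexes, cache_gb, parallel):
    """Rough model of physical read I/O (in GB) for building
    n_indexes on one table.

    - If the table fits in cache, only the first scan hits disk
      either way.
    - Sequential on a larger-than-cache table: each build re-reads
      the whole table -> n_indexes full scans.
    - Parallel: builds scan together, sharing cached pages ->
      roughly one full scan.
    """
    if parallel or table_gb <= cache_gb:
        return table_gb          # ~one physical scan of the table
    return table_gb * n_indexes  # one physical scan per index
```

E.g. for a 100 GB table, 3 indexes and a 16 GB cache, the model gives 300 GB of reads sequentially versus 100 GB in parallel, which matches the observation that the parallel total takes little more time than the single longest build.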