Replies: 3 comments
-
Good point, although this could be one of those things that works better one way in some circumstances and the other way in others (depending on PostgreSQL version, database size, etc.). Does anybody else have experience with this?
-
I experienced this outside of osm2pgsql a few years ago. We were organizing a hackathon with a large health-related dataset and provided it through a PostgreSQL instance. Since then, I create my indexes in parallel on the same table as far as possible, then I process the next table...
-
I think the benefits in this case will be limited. From my experience with parallel indexing, until PostgreSQL and PostGIS get much faster GiST indexing, total indexing time is largely determined by indexing the spatial 'way' column. E.g. when I start 4 concurrent threads (each one potentially using PostgreSQL's parallel index builds for btree indexes) and index 50+ columns including a single 'way' column, the total indexing time is fully determined by the 'way' indexing alone (which I deliberately initiate on the first started thread). All the other indexes together cost less time using the remaining 3 threads.

Additionally, this scenario would only pay off when you create a single table with osm2pgsql. As far as I can tell, osm2pgsql already indexes multiple tables in parallel, so the benefit of parallel indexing on a single table will be smaller (and may hit PostgreSQL's limits on parallel workers). So this would likely only really make sense for a style that defines a single table.
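The scheduling described above (start the expensive spatial index first, let the cheap btree indexes share the remaining workers) can be sketched as follows. This is a minimal illustration, not osm2pgsql code: the table and column names are made up for the example, and `run_sql` is a hypothetical caller-supplied function that opens its own connection and executes one statement.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical index definitions; table/column names are illustrative only.
# The GiST index on the geometry column goes first, so it starts immediately
# and runs for the whole duration while the cheaper btree builds share the
# remaining worker slots.
INDEX_STATEMENTS = [
    "CREATE INDEX ON planet_osm_polygon USING gist (way)",
    "CREATE INDEX ON planet_osm_polygon (osm_id)",
    "CREATE INDEX ON planet_osm_polygon (name)",
]

def build_indexes(statements, run_sql, workers=4):
    """Run each CREATE INDEX in its own session, up to `workers` at a time.

    `run_sql` is a caller-supplied function that executes one statement on
    its own connection (each CREATE INDEX needs a separate session, since a
    PostgreSQL session runs one command at a time).
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_sql, stmt) for stmt in statements]
        # Propagate any errors from the individual builds.
        return [f.result() for f in futures]
```

Submission order matters for the makespan: putting the longest-running build first avoids the case where it is still queued while short builds occupy the workers.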
-
It looks like index creation on one table is currently done sequentially:
https://github.com/openstreetmap/osm2pgsql/blob/b1a634a0ed3c0f15fce4d5c18e1eb46eab413e6c/src/table.cpp#L231
Creating an index on a table means reading the whole table sequentially, so creating several indexes on the same table benefits a lot from doing them in parallel: the data has a high chance of still being in some cache, so it requires no additional read I/O.
When they are created sequentially on a large table, they do not benefit from the read cache if it is smaller than the data to read (which is usually the case).
I usually create additional indexes in parallel, and in most cases the total takes not much more time than the longest one alone.
I have no idea if this is easy to implement or not...
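The cache argument above can be made concrete with a rough back-of-the-envelope model (my own simplification, not a measurement): sequential builds on a table larger than the cache re-read the whole table once per index, while parallel builds share one physical scan because pages fetched by one build are still cached for the others.

```python
def index_build_read_io(table_gb, n_indexes, cache_gb, parallel):
    """Rough model of physical read I/O (in GB) for building
    n_indexes on one table.

    - If the table fits in cache, only the first scan hits disk
      either way.
    - Sequential on a larger-than-cache table: each build re-reads
      the whole table -> n_indexes full scans.
    - Parallel: builds scan together, sharing cached pages ->
      roughly one full scan.
    """
    if parallel or table_gb <= cache_gb:
        return table_gb          # ~one physical scan of the table
    return table_gb * n_indexes  # one physical scan per index
```

E.g. for a 100 GB table, 3 indexes and a 16 GB cache, the model gives 300 GB of reads sequentially versus 100 GB in parallel, which matches the observation that the parallel total takes little more time than the single longest build.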