Parsing PBF Files to Prove a Point
2024-10-01

A while ago I wrote a tool named osmar for querying OpenStreetMap (OSM) databases from the command line. It just ran some SQL queries and printed the results in a line-based format, so that grep and other UNIX tools could be used with it.

It worked well and even gained some attention when it was mentioned on Hacker News, but I had a problem with it.

Prelude: The Pain of Dealing with OSM Databases

Before using osmar, I had to set up a PostgreSQL database. It goes something like this:

cd ~/Databases
wget 'https://download.geofabrik.de/europe/serbia-latest.osm.pbf'
initdb -D osm_serbia
postgres -D osm_serbia  # runs in the foreground; continue in a second terminal
createdb gis
psql -d gis -c 'CREATE EXTENSION postgis;'
osm2pgsql --create --database gis serbia-latest.osm.pbf

This takes roughly 5min (3:30min of which are spent waiting for osm2pgsql) and converts the 193MiB PBF file into a 2.4GiB database. That might not sound too bad, and one would think it's a one-time effort, but over the years I had to learn that PostgreSQL likes to require database migrations after updates, that I don't want to waste gigabytes of disk space on a tool I only use once in a while, and that whenever I needed the tool, I had forgotten how to start the database and had to look it up again.

This led to me using the tool less and less, because I'm not really willing to fumble around with PostgreSQL whenever I just quickly want to find a new restaurant or table tennis spot. I'd just go back to Google Maps and hope its algorithm didn't leave out anything interesting.

A New Idea

While I was frustrated with my creation, a new idea struck me: What if I could just parse the PBF files directly? I had already seen JSON parsers that read gigabytes in a second and CSV parsers that evaluate over 10GiB of data in under 2s, so how hard could it be to parse a few hundred megabytes of PBF data?

The prospect of a simpler version of osmar, or even an easy-to-use library for simple end-user tools, gripped me. I felt the desire to prove that modern hardware is good enough to just forget about big databases and lengthy data conversion processes, at least when performance is not the primary goal. So I fired up my text editor and set out to create some software.

Now, I'm no SIMD wizard and I intended to use Go, which is not the best language for pure performance, but I wanted to see how far I could get with some good libraries and some not-too-crazy optimizations.

The Prototype

I dug right into the documentation of the PBF file format, which is the most common exchange format for OSM data and for which data extracts are easily available at geofabrik.de. It didn't seem too complicated: It's a file containing (most commonly) zlib-compressed blobs, which contain Protobuf-encoded OSM entities (nodes, ways and relations).
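
To make the framing concrete, here is a rough sketch of how such a file could be walked: a repeating sequence of a 4-byte big-endian length, a BlobHeader and a Blob, where each decompressed "OSMData" blob contains the actual entities. It assumes Go types generated from the OSM project's fileformat.proto (the OSMPBF import path below is hypothetical) and is not the actual osmar code:

package pbf

import (
	"bytes"
	"compress/zlib"
	"encoding/binary"
	"io"

	"google.golang.org/protobuf/proto"

	// Assumed: Go types generated from the OSM project's fileformat.proto;
	// the import path is hypothetical.
	"example.com/osm/OSMPBF"
)

// readBlobs walks the outer structure of a PBF file and hands the
// decompressed payload of every blob to the given handler.
func readBlobs(r io.Reader, handle func(blobType string, raw []byte) error) error {
	for {
		var headerLen int32
		if err := binary.Read(r, binary.BigEndian, &headerLen); err != nil {
			if err == io.EOF {
				return nil // clean end of file
			}
			return err
		}
		buf := make([]byte, headerLen)
		if _, err := io.ReadFull(r, buf); err != nil {
			return err
		}
		var header OSMPBF.BlobHeader
		if err := proto.Unmarshal(buf, &header); err != nil {
			return err
		}

		buf = make([]byte, header.GetDatasize())
		if _, err := io.ReadFull(r, buf); err != nil {
			return err
		}
		var blob OSMPBF.Blob
		if err := proto.Unmarshal(buf, &blob); err != nil {
			return err
		}

		// Most blobs in the wild are zlib-compressed.
		zr, err := zlib.NewReader(bytes.NewReader(blob.GetZlibData()))
		if err != nil {
			return err
		}
		raw, err := io.ReadAll(zr)
		zr.Close()
		if err != nil {
			return err
		}
		if err := handle(header.GetType(), raw); err != nil {
			return err
		}
	}
}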

So I put myself to work and wrote a library for retrieving these entities from PBF files while filtering them by location and tags. I didn't focus much on performance and threw together a first prototype within a weekend. Eager to see what it could do, I adapted osmar to use the new library and took it for a ride, searching for bicycle shops in the city center of Bremen, a city for which a tiny 19.3MiB extract was available:

$ export OSMAR_PBF_FILE=/tmp/bremen-latest.osm.pbf
$ time osmar 53.076 8.807 500 shop=bicycle > /dev/null
1.435 real  1.24s user  0.19s sys

1.4s, not too bad. Not great, but good enough for a few searches here and there. Let's see how it does with a bigger 234MiB extract of the German state of Saxony:

$ export OSMAR_PBF_FILE=/tmp/sachsen-latest.osm.pbf
$ time osmar 51.340 12.374 600 shop=bicycle > /dev/null
17.012 real  14.98s user  2.08s sys

17s, now that's not great at all. I don't see myself using the tool if it takes that long, and 234MiB is not even that big of an extract. Let's see what a few performance optimizations can do for us.

Optimization #1: Parallelization

The obvious first optimization is to use all the cores of my multi-core CPU. That should be relatively easy to do, because the PBF file is already split into individually serialized and compressed blobs, so each core can decompress and deserialize its own blobs. There is only one caveat: The order in which the blobs are processed must remain the same, because I had relied on the fact that nodes come before ways and ways come before relations in PBF files in order to save memory. I couldn't find a library for such a "serial worker pool", so I wrote one myself.
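
The core idea is to hand each raw blob to a pool of workers, but to give every job its own result channel and to read those channels back in submission order. Here is a minimal sketch of such a pool, with decodeBlob and Entities as stand-ins for the real decoding logic; it is not the actual osmar code:

package pool

import "runtime"

// Entities is a stand-in for the decoded contents of one blob.
type Entities []any

// decodeBlob would decompress and unmarshal a single blob (details omitted).
func decodeBlob(blob []byte) Entities { return nil }

type job struct {
	blob   []byte
	result chan Entities // buffered with capacity 1, owned by this job
}

// Process decodes blobs on all CPU cores, but yields the results in the
// exact order the blobs were read, so nodes still arrive before ways and
// ways before relations.
func Process(blobs <-chan []byte) <-chan Entities {
	workers := runtime.NumCPU()
	jobs := make(chan job, workers)
	pending := make(chan chan Entities, workers)

	// Workers: decode blobs in parallel; each result goes into the job's
	// own channel.
	for i := 0; i < workers; i++ {
		go func() {
			for j := range jobs {
				j.result <- decodeBlob(j.blob)
			}
		}()
	}

	// Dispatcher: remember the result channels in submission order.
	go func() {
		for blob := range blobs {
			j := job{blob: blob, result: make(chan Entities, 1)}
			pending <- j.result
			jobs <- j
		}
		close(jobs)
		close(pending)
	}()

	// Collector: emit results strictly in the original blob order.
	out := make(chan Entities)
	go func() {
		for result := range pending {
			out <- <-result
		}
		close(out)
	}()
	return out
}

Here are the results: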

$ time osmar 51.340 12.374 600 shop=bicycle > /dev/null
6.236 real  19.06s user  3.50s sys

6s is kind of acceptable, but still annoyingly slow. It's ~38MiB/s. I'm sure we can do better.

Optimization #2: Using a Faster zlib Implementation

At this point I started using a benchmark and the Go tool pprof. I saw that the decompression with zlib took significant time. I was using the standard library's compress/zlib, but had heard about github.com/klauspost/compress/zlib. This library has the same interface, so I just plugged it in and compared the performance; the improvement was not as big as I had hoped, but better than nothing:

$ time osmar 51.340 12.374 600 shop=bicycle > /dev/null
5.597 real  16.79s user  3.22s sys
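
For reference, the swap really is just a change of import path, since klauspost's package mirrors the standard library's API. A sketch of what the decompression step could look like (not the actual osmar code):

package pbf

import (
	"bytes"
	"io"

	// Drop-in replacement for the standard library's compress/zlib: the
	// exported API is identical, only the import path changes.
	"github.com/klauspost/compress/zlib"
)

// inflate decompresses one zlib-compressed blob into a buffer preallocated
// to the blob's declared raw size.
func inflate(data []byte, rawSize int) ([]byte, error) {
	zr, err := zlib.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	buf := bytes.NewBuffer(make([]byte, 0, rawSize))
	if _, err := io.Copy(buf, zr); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}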

Optimization #3: Using a Faster Protobuf Library

The benchmark had also shown me that the Protobuf implementation I used, google.golang.org/protobuf/proto, consumed significant resources. I found that the faster github.com/gogo/protobuf library had been deprecated, so I kept looking and found github.com/planetscale/vtprotobuf. It's a plugin for the implementation I was already using, so it was relatively easy to adopt; using its improved unmarshalling code, I got these results:

$ time osmar 51.340 12.374 600 shop=bicycle > /dev/null
5.385 real  15.90s user  3.28s sys

Again, not a huge improvement, but there was another ace up this plugin's sleeve: memory pools. I figured that memory allocation and garbage collection would be big consumers of resources when parsing large amounts of data, so I hoped for significant improvements from reusing allocated memory. I was not disappointed:

$ time osmar 51.340 12.374 600 shop=bicycle > /dev/null
4.162 real  12.29s user  1.99s sys
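
For the curious, the pooled pattern looks roughly like the sketch below, assuming a PrimitiveBlock message generated with the plugin's unmarshal and pool features (the import path is hypothetical, and this is not the actual osmar code):

package pbf

import (
	// Assumed: messages generated from the OSM .proto files with protoc-gen-go
	// plus the vtprotobuf plugin's unmarshal and pool features enabled; the
	// import path is hypothetical.
	"example.com/osm/OSMPBF"
)

// decodeBlock sketches the pooled decoding pattern: take a PrimitiveBlock
// from the pool, unmarshal into it with the vtprotobuf-generated code, and
// return it afterwards so its allocations can be reused for the next blob.
func decodeBlock(raw []byte, handle func(*OSMPBF.PrimitiveBlock) error) error {
	block := OSMPBF.PrimitiveBlockFromVTPool()
	defer block.ReturnToVTPool()
	if err := block.UnmarshalVT(raw); err != nil {
		return err
	}
	return handle(block)
}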

Optimization #4: Skipping Unused Protobuf Fields

Until now I had used the two .proto files provided by the OSM project. I had noticed that they contain quite a few fields I don't care about. For example, every node includes information about who added it, when, and in which changeset. I wondered if skipping these fields could make life easier for the Protobuf deserializer, so I reduced my .proto files to only contain the fields I actually use and was surprised by the result:

$ time osmar 51.340 12.374 600 shop=bicycle > /dev/null
3.744 real  11.08s user  1.74s sys

Optimization #5: Changing the Way-Parsing Algorithm

Now I was getting to the point where the code I wrote myself was beginning to play a significant role in the benchmark results, specifically during the parsing of ways. Ways contain nodes, and in order to check if a way lies within a given area, I need to check if any of its nodes match the location filter. Checking each node involves an expensive map lookup, and because most ways usually lie outside the location filter, a lot of time was spent looking through all the nodes of irrelevant ways.

This made me think about inverting the algorithm. For my use case it would still be OK to only find ways which lie completely within the area of interest. For that, I need to check whether any of a way's nodes does not match the location filter, which means the algorithm can abort after checking just one node for most ways. I implemented this in the form of a flag called "ExcludePartial" and it improved performance noticeably:

$ time osmar 51.340 12.374 600 shop=bicycle > /dev/null
3.389 real  10.00s user  1.93s sys
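
The inverted check boils down to something like the following sketch, with nodeInArea standing in for the expensive per-node lookup; the names are hypothetical and this is not the actual implementation:

package pbf

// wayInArea reports whether a way lies completely within the area of
// interest, aborting at the first node outside it.
func wayInArea(nodeIDs []int64, nodeInArea func(int64) bool) bool {
	for _, id := range nodeIDs {
		if !nodeInArea(id) {
			return false // most ways take this exit after a single lookup
		}
	}
	return true
}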

Optimization #6: Tweaking Go's Garbage Collector

The benchmark now showed me that Go's garbage collector used a lot of resources. I investigated and learned that it can be relaxed via the GOGC environment variable or the debug.SetGCPercent function. Setting it to 150 increased the memory usage by roughly 70%, but the performance gains were also impressive. I guess this is where the drawbacks of a garbage-collected language show and where Rust or C could shine:

$ time osmar 51.340 12.374 600 shop=bicycle > /dev/null
2.271 real  7.06s user  1.02s sys
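
For reference, the same effect can be had either by setting GOGC=150 in the environment or programmatically; a minimal sketch of the latter:

package main

import "runtime/debug"

func main() {
	// Same effect as running the program with GOGC=150: allow the heap to
	// grow by 150% of the live data before the next collection, trading
	// roughly 70% more memory for noticeably less GC work.
	debug.SetGCPercent(150)

	// ... the rest of the program ...
}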

Optimization #7: Using zstd

Although I had already improved the decompression performance with klauspost's library, I saw that the specification of PBF files allows for different compression algorithms. The only required one is zlib, so switching to another algorithm will break compatibility with some tools, but I had heard a lot of great things about zstd and gave it a try. I also used one of klauspost's libraries here, github.com/klauspost/compress/zstd.
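
Decompressing such blobs is straightforward with that library; a single Decoder can be shared between blobs, which keeps allocations down. Here is a rough sketch (not the actual osmar code):

package pbf

import "github.com/klauspost/compress/zstd"

// A single shared Decoder; with a nil reader it is only usable via DecodeAll,
// which is all we need for blobs.
var zstdDecoder *zstd.Decoder

func init() {
	var err error
	// Each worker decodes one blob at a time, so no inner concurrency needed.
	zstdDecoder, err = zstd.NewReader(nil, zstd.WithDecoderConcurrency(1))
	if err != nil {
		panic(err)
	}
}

// inflateZstd decompresses one zstd-compressed blob into a buffer
// preallocated to the blob's declared raw size.
func inflateZstd(data []byte, rawSize int) ([]byte, error) {
	return zstdDecoder.DecodeAll(data, make([]byte, 0, rawSize))
}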

As PBF files with zstd compression are not readily available, I wrote a simple tool for converting the compression: zstd-pbf. Here are the results:

$ zstd-pbf -best sachsen-latest.osm.pbf sachsen-latest.zstd.osm.pbf
$ export OSMAR_PBF_FILE=/tmp/sachsen-latest.zstd.osm.pbf
$ time osmar 51.340 12.374 600 shop=bicycle > /dev/null
1.962 real  5.58s user  0.84s sys

Conclusion

Here is the optimization progress in a graph:

[Graph: a visualization of the performance improvements]

I'm overall pretty happy with the end result. A runtime of 2s is good enough for me to prefer this version of osmar to the struggle with PostgreSQL databases. I have thus changed the mainline implementation of osmar to the PBF-reading variant: github.com/codesoap/osmar.

The PBF files are parsed at ~100MiB/s; I was able to roughly double this speed by switching from my old ThinkPad T480 to a faster computer, but the tool still has not reached the point where it is I/O limited. I can imagine that this point could be reached with a language that is not garbage collected and perhaps an even faster Protobuf library, but I'll leave that experiment to someone else.