As I’ve been working on analyzing the Green Line Extension with OpenStreetMap, I wanted to highlight some neat performance tricks used in OpenStreetMap’s PBF file format.
OpenStreetMap is like Wikipedia for structured geographical data. Besides literal streets, it contains over 5 billion buildings, parks, train tracks, trees, benches, and much more. This data set powers many maps that you see around the Internet. The OpenStreetMap data model includes three types of element:
- A node defines a point in space (with a lat/long)
- A way contains a list of nodes and is used for anything that’s not a point, e.g. a road, a building, or a body of water
- A relation explains how other elements work together
The most common machine-readable format for storing OpenStreetMap data is actually not PBF, but rather a straightforward XML format. A building represented using XML might look like:
<osm version="0.6" generator="CGImap 0.7.5 (14598 thorn-01.openstreetmap.org)" copyright="OpenStreetMap and contributors" attribution="http://www.openstreetmap.org/copyright" license="http://opendatacommons.org/licenses/odbl/1-0/">
<way id="29520980" visible="true" version="3" changeset="70167175" timestamp="2019-05-12T18:08:40Z" user="Alan_Bragg_Import" uid="8980145">
<nd ref="325407887"/>
<nd ref="325407889"/>
<nd ref="325407891"/>
<nd ref="325407892"/>
<nd ref="325407893"/>
<nd ref="325407894"/>
<nd ref="325407896"/>
<nd ref="325407899"/>
<nd ref="325407901"/>
<nd ref="325407904"/>
<nd ref="325407887"/>
<tag k="addr:housenumber" v="447"/>
<tag k="addr:street" v="Cambridge Street"/>
<tag k="building" v="yes"/>
</way>
</osm>
Although it’s nice to have a human-readable representation, this isn’t a particularly efficient way to encode this data. After removing unnecessary whitespace, the entire XML node above uses 718 bytes of storage in UTF-8 encoding; one byte for each character. Below, I’ll illustrate how the PBF format improves on this.
By starting with Google’s Protocol Buffers serialization format, it’s easy to make some quick improvements on encoding size. Since Protocol Buffers include several different types, node IDs can be represented as integers (in particular, OpenStreetMap uses sint64
). For example, the number 325,407,887 can be encoded as a Protocol Buffers variable-length integer, or varint, using only 5 bytes. That saves about half of the space since the XML encoding required 9 bytes: one byte for each base-10 digit.
In addition to this, Protocol Buffers naturally use less padding than XML. In the original XML, there are 13 bytes of padding for each integer: <nd ref="
is 9 bytes, and "/>\n
is 4 bytes. Using Protocol Buffers’ packed encoding format, the integers above are actually placed right next to each other in the serialized form, so the 11 node IDs above now require only 55 bytes to store (5 bytes for each node ID). This is a big improvement on the XML, in which the list of integers required 152 bytes (22 bytes for each node ID, made up of 13 bytes of padding + 9 bytes for the number itself).
Although this already cuts the memory usage by 2/3, the designers of the PBF format did not rest on their laurels. One thing you’ll notice above is that many of the node IDs are numerically close to each other. This is a common scenario in OpenStreetMap; I imagine this is because these nodes were all created at about the same time. To capture additional space savings, every node ID in a way is represented using the delta from the previous node ID (except for the first node ID). This is an excellent example of using domain knowledge for performance improvements.
325407887
325407889 -> +2
325407891 -> +2
325407892 -> +1
325407893 -> +1
325407894 -> +1
325407896 -> +2
325407899 -> +3
325407901 -> +2
325407904 -> +3
325407887 -> -17
Since Protocol Buffers’ sint64
type uses variable-length encoding, each of the deltas here is small enough to be encoded in a single byte. Putting this all together, this list of node IDs can now be stored in only 15 bytes (5 bytes for the first node ID + 1 byte for each of the subsequent 10 node ID deltas). This offers an impressive 90% savings relative to 152 bytes required by the original XML encoding.
Since this packed integer encoding is so efficient, the PBF format tries to use it in many situations. For example, the naive representation of a node looks like this:
message Node {
required sint64 id = 1;
// Parallel arrays.
repeated uint32 keys = 2 [packed = true]; // String IDs.
repeated uint32 vals = 3 [packed = true]; // String IDs.
optional Info info = 4; // May be omitted in omitmeta
required sint64 lat = 8;
required sint64 lon = 9;
}
This isn’t bad, but there is a small problem: in the serialized data, each field stored prepended with a tag, which is its numeric identifier, e.g. 8
for the lat
field. This adds an overhead of 3–6 bytes per node, depending on whether the keys
, vals
and info
fields are present. I’m beginning to feel like the Ebenezer Scrooge of megabytes here, but remember that the OpenStreetMap data set includes billions of nodes! To address this issue, the PBF format does this:
message DenseNodes {
repeated sint64 id = 1 [packed = true]; // DELTA coded
//repeated Info info = 4;
optional DenseInfo denseinfo = 5;
repeated sint64 lat = 8 [packed = true]; // DELTA coded
repeated sint64 lon = 9 [packed = true]; // DELTA coded
// Special packing of keys and vals into one array. May be empty if all nodes in this block are tagless.
repeated int32 keys_vals = 10 [packed = true];
}
By storing an array of nodes in this more columnar format, the space savings of using packed encoding and delta coding can be applied to the id
, lat
, and lon
fields.
Besides the efficient integer encoding, OpenStreetMap’s PBF format has some additional space-saving tricks up its sleeve. One is to store each string only once, and store a reference to the string. This should be extremely handy for common tags, for example addr:housenumber
. This technique, string interning, is also used by many mainstream programming languages such as Java and Python.
When considered in total, all of these tricks mean that the PBF file for Massachusetts only requires 233 MB of space. It’s not quite fair to compare this to the uncompressed XML file, because the PBF file uses zlib compression internally. Since zlib achieves a compression ratio of about 60% on the PBF data, the size of the PBF file without any compression would be about 400 MB. Compared to the 5.4 GB of the uncompressed XML data, the PBF file is about 93% smaller. Nice!
The PBF format doesn’t only save on space; it improves performance in a few other ways as well. Notably, it divides each file into blocks. Each block is independently encoded into Protocol Buffers format and compressed. This means that a computer can effectively use all of its CPUs to decode the PBF file in parallel.
The layout of the file looks kind of like:
- #1 block header size in bytes
- #1 block header
- #1 block
- #2 block header size in bytes
- #2 block header
- #2 block
...
Well, that’s it for the performance tricks. If you’d like to download some OpenStreetMap data, you can find PBF files hosted on the OpenStreetMap download server hosted by Geofabrik, a consulting and software development firm based in Karlsruhe, Germany specializing in OpenStreetMap services.