When working on a map-related project recently, we ran into performance issues when placing a large number of objects on the map. The operations we needed were simple: inserting markers onto the map and collecting the subsets that met some search criteria (usually markers within some rectangular area or within a given distance of a specific map coordinate). After trying various approaches, we found one with
O(log(n)) worst-case complexity for inserting and searching, resulting in a 10x or even 100x performance improvement over our initial attempts. We’ve made the code available as a library on GitHub. Read on for more details.
The search operation is used heavily when calculating marker clusters. We looked into many implementations, and they all store the geographical data in simple arrays of <latitude, longitude> pairs.
The search implementations traverse the array sequentially and test each item individually, adding the item to the resulting subset if the test passes. The complexity of the search operation is thus O(n).
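For illustration, the sequential approach looks roughly like this (the marker data and function names here are made up for the example, not taken from any particular library):

```javascript
// Toy example of the O(n) approach: every query walks the whole array
// and tests each marker individually.
const markers = [
  { lat: 50.08, lng: 14.42, name: 'Prague' },
  { lat: 48.21, lng: 16.37, name: 'Vienna' },
  { lat: 52.52, lng: 13.40, name: 'Berlin' },
];

// Collect markers inside a bounding rectangle: one predicate test per item.
function searchRect(items, minLat, maxLat, minLng, maxLng) {
  return items.filter(m =>
    m.lat >= minLat && m.lat <= maxLat &&
    m.lng >= minLng && m.lng <= maxLng);
}

console.log(searchRect(markers, 49, 53, 13, 15).map(m => m.name));
// [ 'Prague', 'Berlin' ]
```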
Don’t get me wrong, this works great for sets with tens or maybe even hundreds of items on the map, but when calculating clusters on sets with thousands of items (or more), performance quickly becomes an issue. Especially when you’re running in mobile browsers with limited CPU power, the user experience is degraded by sluggish response times.
So our mission was straightforward: come up with a better performing data structure for holding geolocated objects, where searches don’t take ages even on large data sets.
We looked into data structures for 2D (or even multi-dimensional) data sets, such as k-d trees, quadtrees, and similar, and evaluated their insert and search algorithms. We focused on the worst-case complexity of the operations, as we were usually getting pre-ordered data from the server, which is typically not the optimal scenario for the insert operation. (Why is that? Just imagine inserting a sequence of sorted integers into a binary search tree one by one. Without balancing the tree on the fly, you’ll end up not with a tree but with a linked list with
O(n) insert/search complexity. And the same applies to multi-dimensional data structures.) With this in mind, we didn’t find an existing solution with better than O(n) worst-case complexity.
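The degeneration described above is easy to demonstrate with a toy, non-balancing binary search tree (illustrative code only, not taken from any of the libraries we evaluated):

```javascript
// Toy illustration: inserting sorted keys into a plain, non-self-balancing
// BST produces a right-leaning chain whose depth equals the number of keys,
// i.e. O(n) insert and search.
function insert(node, key) {
  if (node === null) return { key, left: null, right: null };
  if (key < node.key) node.left = insert(node.left, key);
  else node.right = insert(node.right, key);
  return node;
}

function depth(node) {
  if (node === null) return 0;
  return 1 + Math.max(depth(node.left), depth(node.right));
}

let root = null;
for (let k = 1; k <= 1000; k++) root = insert(root, k); // sorted input
console.log(depth(root)); // 1000: effectively a linked list
```

With random insertion order the expected depth would be O(log n); it is precisely the pre-sorted input that triggers the worst case.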
This was pretty frustrating, since we knew that there was no lack of self-balancing (binary) search trees with amortized
O(log(n)) complexity for both insert and search operations, as long as you are working with simple numeric (i.e. one-dimensional) keys. If we were able to map the
<latitude, longitude> pairs to simple numeric keys, we would be able to use these keys in balanced (one-dimensional) trees, benefiting from their fast worst-case performance.
In order to avoid traversing the whole tree later on, the mapping from two-dimensional to one-dimensional keys must have two important properties: it must retain locality and ordering. This means that (1) when two 2D coordinates are “close” in the plane, their resulting numeric keys must also be reasonably “close”, and (2) when mapping a rectangular area, all the calculated values representing coordinates inside the rectangle must lie in the interval
[A, B], where
A is the minimal calculated value of the four rectangle corners, and
B is the maximal value.
There are several such mappings available. We tried both the Hilbert curve and the Z-order curve, and finally picked the latter, since it was easier to integrate into our implementation. The code just interleaves the binary representations of the two coordinate values:
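As a minimal sketch of the interleaving step (the function name is ours, not the library’s actual code), here it is for two 16-bit values, where the bits of x land in the odd positions and the bits of y in the even positions of the result:

```javascript
// Z-order (Morton) encoding sketch for two 16-bit coordinates.
// Bit i of x goes to bit 2i+1 of the key; bit i of y goes to bit 2i.
function interleave16(x, y) {
  let key = 0;
  for (let i = 15; i >= 0; i--) {
    key = (key << 2) | (((x >> i) & 1) << 1) | ((y >> i) & 1);
  }
  return key >>> 0; // force an unsigned 32-bit result
}

console.log(interleave16(0b11, 0b00)); // 10 (binary 1010)
console.log(interleave16(0b00, 0b11)); // 5  (binary 0101)
```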
Just imagine x in the above image to be the binary representation of latitude and y the binary representation of longitude; the resulting index value for this (x, y) pair is the corresponding value in the table. The last problem to solve here is how to convert the latitude and longitude (floating point) values into a binary (fixed-point) representation.
So we implemented a Z-order curve mapping from latitude (float range: -90.0 .. +90.0) and longitude (float range: -180.0 .. +180.0) pairs into integer keys exactly representable in 64-bit floats (integer range: 0 .. 2^53). We then happily used self-balancing red-black trees as the underlying data structure for the actual data, operating on the calculated numeric keys.
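A sketch of this whole conversion might look as follows (assumed names and 26 bits of precision per axis, giving keys below 2^53; this is not the library’s actual code). Note that JavaScript’s bitwise operators truncate to 32 bits, so the sketch uses BigInt for the 52-bit interleaving:

```javascript
// Map a latitude (-90..90) / longitude (-180..180) pair to one integer
// key in [0, 2^52), exactly representable in a 64-bit float.
const BITS = 26n;

// Scale a float in [min, max] to a fixed-point integer in [0, 2^26 - 1].
function toFixed(value, min, max) {
  const range = Number(2n ** BITS) - 1;
  const scaled = Math.round(((value - min) / (max - min)) * range);
  return BigInt(Math.min(Math.max(scaled, 0), range));
}

// Interleave two 26-bit values into a 52-bit Z-order key.
function interleaveBig(x, y) {
  let key = 0n;
  for (let i = BITS - 1n; i >= 0n; i--) {
    key = (key << 2n) | (((x >> i) & 1n) << 1n) | ((y >> i) & 1n);
  }
  return key;
}

function geoKey(lat, lng) {
  const key = interleaveBig(toFixed(lat, -90, 90), toFixed(lng, -180, 180));
  return Number(key); // < 2^53, so exact as a double
}
```

Because the mapping is monotone in each coordinate, the key of any point inside a rectangle lies between the keys of the rectangle’s minimal and maximal corners, which is exactly the ordering property needed for range searches on the tree.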
Our testing on real data validated this approach. The search operation now discards a large proportion of the stored items up front, which speeds it up dramatically. Depending on the data set, it can be 10 or even 100 times faster than traversing the whole set each time. We can now hold far more items in the sets while achieving the same (or even better) search and insert performance.
And yes, you too can use our library in your cartographic applications, since we decided to make it available as the geo-tree npm package and as a stand-alone library for use in the browser. The latest version of the library is available on GitHub.
Try out the library and let us know what you think. We’d love to hear your feedback.