K-d trees: Multidimensional data indexing

1. Right where we left off

Let’s recap where we left off in previous chapters. We are designing software for an e-commerce company: an application to find, for any point on a very large map, the closest warehouse selling a given product. See figure 9.1 for a visualization. To get a ballpark idea of the scale we need to handle, we want to serve millions of clients per day across the country, taking products from thousands of warehouses, also spread across the map.

In section 8.2, we established that a brute-force solution, where we skim through the whole list of points and compare distances, can’t work for a live system.

We have also seen how the multidimensional structure of the data prevents us from using the basic solutions we saw in the first part of the book, from heaps to hash maps.

Viable solutions, however, do exist. In this chapter we will first explain the issues we face in moving to multidimensional spaces; then, in this and the next chapter, we will delve into a few alternatives to efficiently tackle those challenges.

2. Moving to k-D spaces: Cycle through dimensions

You might have the impression that we’ve hit a dead end, and even in the scientific community it certainly seemed so for a long time.

The answer came in the form of a heuristic, by the hand (and brain) of Jon Louis Bentley.

The idea is as brilliant as it is simple, and it stems from the considerations that led us this far in chapter 8. If we restrict ourselves to 2-D spaces, instead of splitting each region into four sub-regions for each point, we can perform a 2-way split, alternating splits along vertical lines with splits along horizontal lines.

For each split, we partition the points in a region into two groups. Then, when we choose the next pivot in each of the two sub-regions, we will use the perpendicular direction to do the next partitioning.

Figure 9.2 shows a few steps of this algorithm. You can see how for the first pivot chosen we draw a vertical line passing through the point, then for the second one we draw a horizontal semi-line (we are splitting a semi-plane now, not the whole plane again!), and then a vertical (semi)line again.

Notice that in the Cartesian plane, a vertical line through a point P=(Px,Py) has a peculiar characteristic: it is parallel to the y axis, and all points on the line have the same value for their x coordinate, Px. Likewise, a horizontal line passing through P is made of all the points in the plane for which y=Py.

So, when we split the plane along a vertical line through P=(Px,Py), what we really mean is that we create two partitions, one for the points L in the plane for which Lx < Px, and one for those points R for which Rx > Px (points lying exactly on the line can, by convention, be assigned to the right partition). And similarly for horizontal lines, using the y coordinates of the points.
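To make this concrete, here is a minimal Python sketch (our own illustration, not the book’s code) of a single vertical-line split. The helper name split_on_x is hypothetical; ties go to the right partition, matching the invariant stated in section 2.2.

```python
def split_on_x(points, pivot):
    """Partition `points` around the vertical line x = pivot[0].

    Points strictly to the left of the line go into `left`; points on
    the line or to its right go into `right` (ties go right).
    """
    left = [p for p in points if p[0] < pivot[0]]
    right = [p for p in points if p[0] >= pivot[0]]
    return left, right

# Example: split around a pivot at (5, 3)
left, right = split_on_x([(1, 2), (7, 4), (5, 9), (4, 0)], (5, 3))
# left  -> [(1, 2), (4, 0)]    right -> [(7, 4), (5, 9)]
```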

This binary partitioning allows us to use binary trees to index our points. Each node in the tree is the pivot we chose to partition the remaining region of space, and its left and right subtrees gather all points in the two partitions, and represent the two sub-regions resulting from the split we perform along the pivot (check out figures 9.2 and 9.5).

Can this algorithm be generalized to higher dimensions? Yes: because each split uses a single coordinate, we can do a round-robin through all the coordinates of a k-dimensional space, so that the i-th level of the binary tree we build corresponds to a split along the (i mod k)-th dimension.

This means that in a 2-D space, the root will split the plane along the x axis, its children will split each of the two semi-planes along the y axis, and then their children will split again along the x axis and so on. In a k-dimensional space, with k > 2, we start with the first coordinate at level 0 (the root), and then move to the second at height 1, the third at height 2, and so on.
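As a minimal sketch of this round-robin scheme (our own Python, not the book’s implementation), the following insert compares the (level mod k)-th coordinate at each node, so splits cycle through the dimensions exactly as described:

```python
class KdNode:
    def __init__(self, point):
        self.point = point          # a tuple with k coordinates
        self.left = None
        self.right = None

def insert(root, point, level=0):
    """Insert `point` into the k-d tree rooted at `root`.

    At depth `level` we compare coordinate (level mod k), so splits
    cycle through the dimensions: x at the root, y at level 1, and
    so on, wrapping around after the k-th dimension.
    """
    if root is None:
        return KdNode(point)
    k = len(root.point)
    j = level % k                   # split coordinate at this level
    if point[j] < root.point[j]:
        root.left = insert(root.left, point, level + 1)
    else:                           # ties go right, by convention
        root.right = insert(root.right, point, level + 1)
    return root
```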

This way, we partition the plane into rectangular areas. With respect to our initial idea of splitting points with vertical lines only, we have fewer areas extending to infinity (while if we always used the same coordinate, all areas would be infinite!) and avoid keeping distant points in the same partition.

At the same time, each node has just two children, and we can maintain all the advantages and guarantees of binary partitioning and binary search.

2.1. Constructing the BST

So far, we have just hinted at the construction of a binary search tree, implying that there is a direct translation of the partitions, or rather the pivots we choose, into a BST.

We have also implied that the BST we are going to construct is an important part of the k-d tree. Let’s give a more formal definition to clarify the relation between these two data structures.

DEFINITION A k-d tree is a binary search tree whose elements are points taken from a k-dimensional space, that is, tuples with k elements, whose coordinates can be compared. (To simplify, let’s assume that each coordinate’s values can be translated into real numbers.) In addition to that, in a k-d tree, at each level i, we only compare the i-th (modulo k) coordinate of points to decide which branch of the tree will be traversed.

In fewer words, a k-d tree can be described in terms of a binary search tree with a fancy comparison method on its keys. The added value is given by the algorithms for search that can be performed on this kind of tree much more efficiently than on other simpler data structures.
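That “fancy comparison method” fits in a few lines; here is a sketch (our own, with a hypothetical helper name) that orders two points by the coordinate assigned to a given tree level:

```python
def compare(point_a, point_b, level):
    """Order two k-dimensional points by their (level mod k)-th
    coordinate; returns -1, 0, or 1, like a classic comparator."""
    j = level % len(point_a)
    return (point_a[j] > point_b[j]) - (point_a[j] < point_b[j])
```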

Figure 9.3 shows the construction of an example tree for the unidimensional case. This is an edge case, because with singletons (tuples of dimension 1) every level ends up using the same coordinate, the x coordinate (that is, the whole singleton).

Notice how each node of the tree “covers” a region of the dataset in a hierarchical way: the root covers the whole dataset, level 1 nodes cover the left and right partitions created using the root as a pivot, and level 2 nodes cover the even smaller partitions created using level 1 nodes as pivots.

Here, when we use the term “cover,” we mean that given an entry X in the search space (in 1-D, a real number X), if we query the tree (as a binary search tree) for X, then all the nodes we traverse during the search (which form a path from the root to some node N) cover X. In particular, the node on which the search stops is the one covering the smallest region containing X.

In other words, each node in the tree is associated to a range of values for which it will be traversed when searching the tree; that range is the region covered by the node.
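The notion of coverage translates directly into the search path. Here is a sketch (our code, reusing the hypothetical KdNode and insert from earlier) that collects, for a target entry, exactly the nodes that cover it, from the root down to the node covering the smallest region:

```python
def covering_path(root, target):
    """Return the points of the nodes traversed while searching for
    `target`: precisely the nodes whose region covers it, root first."""
    path, node, level = [], root, 0
    while node is not None:
        path.append(node.point)
        if node.point == target:    # exact match: smallest covering region
            break
        j = level % len(target)
        node = node.left if target[j] < node.point[j] else node.right
        level += 1
    return path

root = None
for p in [(5,), (2,), (8,)]:        # unidimensional points, as in figure 9.3
    root = insert(root, p)
print(covering_path(root, (8,)))    # [(5,), (8,)]
```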

Make sure to go through the example in figure 9.3 and understand every step; maybe even try to run an example yourself (for instance, by just changing the pivots or their order and checking how the tree changes). It is important to understand the unidimensional case because it will make it simpler to understand the 2-d tree construction.

In order to better show the steps of the construction of a 2-d tree and highlight the advantages of cycling through the dimensions used to partition, we add a few cities to figure 9.1 in order to have a more uniform 2-D spatial distribution. In figure 9.4 we have also added a coordinate system: both the origin and the scale are arbitrary, and completely irrelevant to the results of the algorithm. We can always apply a translation and a uniform scaling: the former preserves Euclidean distances, and the latter preserves their relative order, which is all the algorithm needs.

Figure 9.5 shows the results of the first couple of steps of the algorithm that builds the tree. In this figure you can see a different scale than in the previous picture. While the other one was more realistic, with distances between points closer to the real distances in miles between cities, it could also generate some unnecessary confusion in the drawings; moreover, as we mentioned, the coordinate system is not really important for the algorithm, as long as it’s consistent.

The first point we choose as pivot is “Opal City.”[2] Since we first split using x coordinates, all cities to its left will go into one partition, and all cities to its right will go into the other partition. Then, as a second step, let’s focus on the right partition. We choose, for instance, “Civic City” as a pivot, and this time we have to use y coordinates to split, so all points in the top-right region will go into a sub-partition of the right one, and all points in the bottom-right region will go into the other. We now have three partitions, and we can further split any of them.
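These two steps can be replayed with the insert sketch from earlier. The coordinates below are made up purely for illustration; the figure uses its own arbitrary coordinate system.

```python
# Hypothetical coordinates for the two cities (illustration only).
opal_city, civic_city = (6, 5), (8, 8)

root = insert(None, opal_city)     # first pivot: splits the plane on x = 6
root = insert(root, civic_city)    # 8 >= 6, so Civic City goes to the right
                                   # subtree, and it will split its own
                                   # region along y = 8
```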

In figure 9.6, we show the resulting tree after inserting all the cities (without the warehouses). Edges at the same level of the tree all correspond to the same kind of split, vertical or horizontal, and the two kinds alternate along any path.

Splits now define 10 clearly separated regions, and if you look at how warehouses are distributed, you can get an idea of what city they might be close to with just a glance at the region they are in.

There is not, however, a direct match, and looking at regions is not enough to determine the closest point. For instance, if you look at b-8 in the picture, it’s not clear if Buffalo, Pittsburgh, or Harrisburg is the closest city, and c-6 looks closer to Syracuse than to Harrisburg, despite being “covered” by the latter and not the former.

Determining the closest point(s) requires a few more steps than in regular binary search trees (the unidimensional case), and we will defer the description of the full algorithm to the next section.

As mentioned, this construction method for k-d trees naturally generalizes to higher dimensions, although it becomes more difficult to visualize the trees and the spaces themselves.

For k=3, we can still imagine ℝ³ divided into parallelepipeds, as shown in figure 9.7, but for 4-D and beyond, we lack an immediate geometric interpretation. That said, as long as we treat k-D points as tuples, it is possible to follow the same steps we have seen for a 2-d tree without any change.
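Indeed, the insert sketch from earlier works on 3-tuples without modification:

```python
root = insert(None, (1, 2, 3))     # the root splits on the first coordinate
root = insert(root, (0, 5, 9))     # 0 < 1, goes left; the next split there
                                   # will use the second coordinate
```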

2.2. Invariants

We could sum up the definition of k-d trees in a few invariants. A k-d tree is defined as a binary search tree whose elements are k-dimensional points and that abides by the following invariants (a code sketch checking them follows the list):

  • All points in the tree have dimension k.
  • Each level has a split coordinate index j, such that 0 ≤ j < k.
  • If a node N’s split coordinate index is j, then N’s children have a split coordinate equal to (j+1) mod k.
  • For each node N, with split coordinate index j, all nodes L in its left subtree have a smaller value for N’s split coordinate, L[j] < N[j], and all nodes R in N’s right subtree have a larger or equal value for N’s split coordinate, R[j] ≥ N[j].
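The sketch below (our code, building on the hypothetical KdNode class from earlier) checks these invariants recursively. It is meant as a readable restatement of the invariants, not as an efficient validator.

```python
def subtree_points(node):
    """Collect all points in the subtree rooted at `node`."""
    if node is None:
        return []
    return [node.point] + subtree_points(node.left) + subtree_points(node.right)

def check_invariants(node, level=0, k=None):
    """Check the k-d tree invariants at every node: all points have
    dimension k, and on the split coordinate j = level mod k the left
    subtree is strictly smaller and the right subtree larger or equal."""
    if node is None:
        return True
    k = k if k is not None else len(node.point)
    j = level % k
    return (len(node.point) == k
            and all(p[j] < node.point[j] for p in subtree_points(node.left))
            and all(p[j] >= node.point[j] for p in subtree_points(node.right))
            and check_invariants(node.left, level + 1, k)
            and check_invariants(node.right, level + 1, k))
```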

2.3. The importance of being balanced

So far in this section we’ve consciously ignored a detail, a special characteristic we look for when it comes to binary search trees: whether the tree is balanced or not. As for regular BSTs, you might have already figured out that the order in which we insert elements in our k-d tree determines the shape of the tree. Even having finite areas is not synonymous with having small areas, or good partitioning. If you look at the example in figure 9.6, we have carefully chosen points “by hand” in order to create a balanced tree, but it is easy to provide an example of a terrible insertion sequence that would create an imbalanced tree (just insert points starting from the top-left going in order toward the bottom-right corner, for instance).

Here we also consider a binary partitioning “good” only if it manages to split the whole set being partitioned into two subsets approximately of the same size.

If we manage to obtain a good partitioning at each step, we will have a balanced tree with logarithmic height. Given a sequence of points, it is possible, however, to choose an order of insertion for the points that will produce a skewed tree with linear height.
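As a preview of where this is heading, the classic remedy when all the points are known in advance is to pick, at every step, the median point along the current split coordinate as the pivot. The sketch below is a standard median-based construction (our code, reusing the hypothetical KdNode class), not necessarily the exact method the book develops later.

```python
def build_balanced(points, level=0):
    """Build a balanced k-d tree by choosing, at each step, the median
    point along the current split coordinate as the pivot.

    Re-sorting at every level gives O(n log^2 n) construction time and
    a tree of height O(log n). For simplicity, this assumes coordinate
    values are distinct along each axis.
    """
    if not points:
        return None
    k = len(points[0])
    j = level % k
    points = sorted(points, key=lambda p: p[j])
    mid = len(points) // 2
    node = KdNode(points[mid])
    node.left = build_balanced(points[:mid], level + 1)
    node.right = build_balanced(points[mid + 1:], level + 1)
    return node
```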

We’ll see in a few pages how we can prevent this from happening under certain conditions; the median-based sketch above anticipates the idea. Rebalancing on insertion is not, unfortunately, as viable an option for k-d trees as it was for binary search trees.

Before worrying about balancing trees, though, let’s see in detail how methods such as insert and remove, as well as the various queries, work for k-d trees.

Source: La Rocca, Marcello (2021), Advanced Algorithms and Data Structures, Manning Publications.
