K-d trees: Methods of Multidimensional data indexing

We have now seen a few examples of k-d trees and how we can construct them. By now, it should be clear what a k-d tree looks like, the main ideas behind this data structure, and how we can leverage it.

It’s time to delve into the main methods for k-d trees and see how they work and how they can be implemented. In this section we will present pseudo-code for these methods, and you can find an actual implementation on our repo on GitHub.

Figure 9.8 shows a pre-constructed k-d tree that we are going to use as a starting point for this section. In order to focus on the details of the algorithm, we are going to use a simplified view, showing points on a Cartesian plane where axes are omitted. Points are denoted with capital letters (we are leaving out city names for now) to abstract out the context and focus on the algorithms. Vertical and horizontal splits are
still shown, but we won't highlight the regions as we did in figure 9.6; as a consequence, we can fill tree nodes (instead of edges) with shading to show that vertical or horizontal splits will be used at a certain level.

While figure 9.8 shows coordinates next to both nodes and points, we might sometimes omit them for the sake of neatness in later figures.

We will start with the “easy” methods first: search and insert work almost exactly as in basic BSTs; we will still describe them and provide their pseudo-code, but if you are already familiar with binary search trees, feel free to skim through the next few sub-sections.

But before that, we need to define a model for the k-d tree and its nodes; see listing 9.1.

Listing 9.1 The KdTree class

class KdNode
  #type tuple(k)
  point
  #type KdNode
  left
  #type KdNode
  right
  #type integer
  level
  function KdNode(point, left, right, level)

class KdTree
  #type KdNode
  root
  #type integer
  k
  function KdTree(points=[])

A KdTree just contains a root node and a constructor method, taking an optional array of points as input. We'll take a look at how a k-d tree can be constructed later, after we introduce the insertion method. For now, suffice it to say that it will set the root to a void entry (be it null, a special instance of KdNode, or whatever is more appropriate for the language used).

For the sake of convenience, let's assume a tree also stores the value k, the dimension of the space on which the tree is defined.

The root is, as said, an instance of KdNode. This data structure models a node of a BST, with its left and right children, and its value, a point in the k-dimensional space. We will use the special value null to model an empty node (and thus an empty tree).
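To make the rest of this section more concrete, here is a minimal Python sketch of this model. The field names mirror the pseudo-code above; the remaining details (default arguments, storing k on the tree) are assumptions made here for illustration, not the book's actual code.

class KdNode:
    """A node of a k-d tree: a k-dimensional point plus its left and right children."""
    def __init__(self, point, left=None, right=None, level=0):
        self.point = point    # tuple of k coordinates
        self.left = left      # KdNode, or None if the branch is empty
        self.right = right    # KdNode, or None if the branch is empty
        self.level = level    # depth of the node, used to choose the split coordinate


class KdTree:
    """A k-d tree holds a root node and the dimensionality k of its points."""
    def __init__(self, points=None, k=2):
        self.k = k
        self.root = None      # how to build the tree from `points` is discussed below

An empty tree is then simply a KdTree whose root is None, matching the convention of using null for empty nodes.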

1. Search

In section 9.2 we implicitly described how search works on k-d trees. There is nothing fancy in this algorithm; it's just a regular search on binary search trees storing tuples, with the caveat that instead of comparing the whole tuple at each step, we only use one coordinate: at level i, we compare the i-th coordinate (or the (i mod k)-th, if i >= k).

Listing 9.2 shows a few helper functions that will help us keep our code clean. We encapsulate in these functions the logic of cycling through split coordinates while traversing the tree, instead of duplicating it across all the methods. This way the other methods will be more readable, and if we ever have to change the way this comparison is done (for instance, because we find a bug or we want to implement a fancier algorithm), we just need to touch one single place in the code base.
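Listing 9.2 itself isn't reproduced in this excerpt; the following Python sketch shows what such helpers could look like. The names (get_coordinate, compare, split_distance) and the module-level constant K are choices made here for illustration.

K = 2  # dimensionality of the space; 2 in all the examples of this chapter

def get_coordinate(point, level):
    """Return the coordinate of `point` used for splits at the given tree level."""
    return point[level % K]

def compare(point, node):
    """Compare `point` to `node.point` along the node's split coordinate.
    Returns -1, 0 or +1; a negative result means 'go to the left branch'."""
    diff = get_coordinate(point, node.level) - get_coordinate(node.point, node.level)
    return (diff > 0) - (diff < 0)

def split_distance(point, node):
    """Distance between `point` and the split line (hyper-plane) through `node`."""
    return abs(get_coordinate(point, node.level) - get_coordinate(node.point, node.level))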

Listing 9.3 shows the pseudo-code for the search method. The pseudo-code for all these methods will assume that we are providing an internal version that takes a KdNode as argument. The public API for KdTree will provide adapter methods that will call the internal ones; for instance KdTree::search(target) will call search(root, target).

This way we have more flexibility in reusing these methods (for instance, we can run search on just a subtree, not the whole tree).

Notice that this recursive implementation is eligible for tail-recursion optimization on those languages and compilers supporting it (check appendix E for an explanation of tail-recursion).
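The pseudo-code of listing 9.3 is not included in this excerpt, so here is a minimal Python sketch of such a search; note that the line numbers cited in the walkthroughs below refer to the book's listing, not to this sketch.

def search(node, target):
    """Return the node holding `target`, or None if it's not in this (sub)tree."""
    if node is None:                      # empty (sub)tree: unsuccessful search
        return None
    if node.point == target:              # exact match: target found
        return node
    if compare(target, node) < 0:         # target is smaller along the split coordinate
        return search(node.left, target)
    else:                                 # larger, or tied: ties are stored on the right
        return search(node.right, target)

On the example tree, search(tree.root, (-1.5, -2)) would reproduce the unsuccessful lookup of figure 9.9, while search(tree.root, (2, -5)) would return node E as in figure 9.10.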

Let’s follow the example in figure 9.9 step by step.

We start by calling search(A, (-1.5, -2)), where node A is the root of the k-d tree, as shown in the figure. Since A is not null, at line #2 the condition fails, and we can compare A.point, which is the tuple (0, 5), to our target at line #4. They obviously don't match, so we move on to line #6 and use the compare helper function to check which direction we should take. A.level will evaluate to 0, so we compare the first value in each of the tuples: -1.5 < 0, so we traverse the left subtree and call search(C, (-1.5, -2)).

For this call, we repeat more or less the same steps, except this time C.level is equal to 1, so we compare the second value in each tuple. -2 < 6, so we still go left, calling search(D, (-1.5, -2)).

Once again, we go through lines #2, #4, and #6, and we take a left turn; only this time, D.left == null, so we call search(null, (-1.5, -2)), which will return null at line #2. The execution backtracks through the call stack, and our original call will also return null, stating that the target point was not found on the k-d tree.

Figure 9.10 shows another example, calling search(A, (2, -5)). On the first call, the conditions at lines #2 and #4 are false, as is the condition at line #6, since 2 > 0. This time, therefore, we take a right turn at node A and recursively call search(B, (2, -5)), then in turn search(E, (2, -5)), for which the condition at line #4 is true (E.point matches target), and thus we finally return node E as the result of the original call.

How fast is search on a k-d tree? Like for regular BSTs, its running time is proportional to the height of the tree. If we keep the tree balanced, therefore, the running time will be O(log(n)) for a tree holding n points.

2. Insert

As we have seen on BSTs, insertion can be implemented in two steps. The first step is running a search for the point to add, which will either find that the point is already in the tree, or stop at its parent-to-be, the node to which the new point should be added as a child. If the point is already in the tree, then what we do next depends on the policy we adopt on duplicates. If duplicates are not allowed, we might ignore the new point or fail; otherwise we have a wide range of solutions, from using a counter on nodes to keep track of how many times a point was added, up to consistently adding duplicates to either branch of a node—for instance, always in the left sub-branch, although as we have seen in appendix C, this leads to slightly unbalanced trees on average.

If, instead, the point was not on the tree, search will fail on the node that should have the new point as its child, and the second step will consist of creating a new node for the point and adding it in the correct branch of its parent.

Listing 9.4 shows an approach that doesn't reuse the search method. This approach, while not DRY, allows us to simplify both methods. To be reusable in insert, search would also have to return the parent of the node found (and, even more importantly, the last node traversed when the target is not found), which is not of any particular use for search itself, whose implementation would thus become unnecessarily complicated. Moreover, this way we can write insert in a more elegant way, using a pattern that naturally supports immutability.
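Listing 9.4 isn't reproduced here either; a Python sketch of this approach could look like the following. Unlike the book's version, which is described as naturally supporting immutability, this sketch updates the children in place, so treat it as an illustration rather than the exact algorithm.

def insert(node, point, level=0):
    """Insert `point` in the subtree rooted at `node` and return that subtree's root."""
    if node is None:                             # found the spot: create the new node here
        return KdNode(point, None, None, level)
    if node.point == point:                      # duplicate: ignore it (one possible policy)
        return node
    if compare(point, node) < 0:                 # strictly smaller: go left
        node.left = insert(node.left, point, level + 1)
    else:                                        # larger or tied: ties go right (see below)
        node.right = insert(node.right, point, level + 1)
    return node

A call such as tree.root = insert(tree.root, (-1.5, -2)) covers both the empty-tree case and the general one.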

Figures 9.11 and 9.12 show two examples of insertion of new points on the k-d tree in figure 9.10.

Let's follow the first example step by step. It starts with a call to insert(A, (-1.5, -2)), where we don't pass any value for level, thus defaulting it to the right value for the root (as defined in the function signature, this value is 0).

A <> null, so the condition at line #2 won't match; A.point <> (-1.5, -2), so the condition at line #4 is false as well. When we get to line #6, -1.5 < 0, so compare will return -1, and we traverse the left subtree and call insert(C, (-1.5, -2), 1).

The next few calls will proceed similarly (as we have seen for search), and in turn we call insert(D, (-1.5, -2), 2) and insert(null, (-1.5, -2), 3). For the latter, the condition at line #2 will be true, so we create a new node, KdNode((-1.5, -2), null, null, 3), and return it to the previous call in the stack trace. There, at line #7, we set D.left to this new KdNode we created, and then return D.

Notice how at line #6 we are breaking ties by sending points with the same value for the split coordinate as the current node to its right branch. This decision might seem of little importance, but we will see that it's a big deal when it comes to deleting nodes.

If we take a look at the stack trace for the example in figure 9.12, we can see how it is entirely similar to the one for the previous case:

insert(A, (2.5, -3))
  insert(B, (2.5, -3), 1)
    insert(E, (2.5, -3), 2)
      insert(null, (2.5, -3), 3)
        return new KdNode((2.5, -3), null, null, 3)
      return E
    return B
  return A

3. Balanced tree

Before moving on to the most advanced methods designed for k-d trees, let's take a step back. In section 9.2.3 we already described one key point of k-d trees: we need our tree to be balanced. In the previous sections we saw how search and insertion have running time O(h), proportional to the height of the tree, and thus having a balanced tree means that h = O(log(n)), and in turn that all these methods will run in logarithmic time.

Unfortunately, a k-d tree is not a self-balancing tree, like RB-trees or 2-3-trees, for example. This means that if we perform a large number of insertions and deletions, on average the tree will tend to stay roughly balanced, but we have no guarantee of that. Moreover, if we resolve ties on coordinate comparison by always going to the same side, then it can be shown that we will break the balance over many operations.

To solve this problem, we can slightly change the compare method that we defined in listing 9.2 so that it will never return 0. Whenever we have a match on the split coordinate, half of the time it will return -1, and half of the time +1, thereby achieving a better balance for the tree. The choice needs to be consistent, so we can't just use randomness to get a perfect balance; instead, a possible way to implement this correction, as shown in listing 9.5, is to return -1 when the node's level is even, and +1 when it's odd (or vice versa; it doesn't really matter, as long as it's consistent).

Listing 9.5 Revised compare
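The listing itself is not part of this excerpt; the idea translates into a small change to the compare helper sketched earlier, along these lines.

def compare(point, node):
    """Like the basic compare, but never returns 0: ties on the split coordinate are
    broken using the node's level, so that roughly half go left and half go right."""
    diff = get_coordinate(point, node.level) - get_coordinate(node.point, node.level)
    if diff != 0:
        return (diff > 0) - (diff < 0)
    return -1 if node.level % 2 == 0 else +1   # or the opposite, as long as it's consistent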

This helps us achieve a better-balanced tree on average, but it still doesn't provide any guarantee. To date, there is no known way to easily keep a k-d tree balanced while maintaining O(h) running time for insertion.

Nevertheless, if we know the set of points to insert in advance (when we create the k-d tree), then we can find an optimal order of insertion that allows us to construct a balanced tree. If the tree is not changed after construction, or if the number of elements inserted/deleted is negligible in comparison to the size of the tree, then we can have a worst-case balanced k-d tree, and insert and search will run in worst-case logarithmic time.

Listing 9.6 shows the algorithm to create a balanced k-d tree from a set of points.

The principle is simple: the tree has to hold all the points in the set and, ideally, we would like the left and right subtrees of the root to have the same number of elements. To achieve that, we can find the median of the set of points with respect to the first coordinate and use it as a pivot for the root, so that half of the remaining points end up in the root's left branch and half in its right branch. But each of the branches of the root is a k-d tree itself, so we can repeat the same step for the root's left and right subtrees, with the caveat that we need to find the medians comparing the second coordinate of each point instead. And so on for each level of the tree; we just need to keep track of the depth of the recursion, which tells us in what level of the tree we currently are.

The key point in listing 9.6 is the call to the partition method at line #7. We need to pass level as an argument because it tells us which coordinate we need to use to compare points. The result of this call will be a tuple with the median of the points array and two new arrays, each with at most n/2 elements, if size(points) == n.

Each point in left will be “smaller” (with respect to the coordinate at index level % k) than median, and each point in right will be “larger” than median; therefore, we can recursively construct both (balanced) subtrees using these two sets.

Notice that we can’t just sort the array once and chunk it up into halves recursively, because at each level the sorting criteria change!
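Listing 9.6 is not reproduced here, but the idea can be sketched in Python as follows (the line numbers in the analysis below refer to the book's pseudo-code). For brevity, this partition re-sorts the points at every call, by a different coordinate each level, which costs O(n*log(n)) per call; the bound derived below assumes a linear-time median selection (for example, median of medians) instead.

def construct_kd_tree(points, level=0):
    """Build a balanced k-d tree from `points`, splitting around the median at each level."""
    if not points:
        return None
    median, left_points, right_points = partition(points, level)
    left = construct_kd_tree(left_points, level + 1)
    right = construct_kd_tree(right_points, level + 1)
    return KdNode(median, left, right, level)

def partition(points, level):
    """Split `points` around their median with respect to the coordinate used at `level`."""
    idx = level % K
    ordered = sorted(points, key=lambda p: p[idx])
    mid = (len(ordered) - 1) // 2     # with an even number of points, pick the lower median
    return ordered[mid], ordered[:mid], ordered[mid + 1:]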

To understand how this works, let's consider a call to constructKdTree([(0,5), (1,-1), (-1,6), (-0.5,0), (2,5), (2.5,3), (-1,1), (-1.5,-2)]).

The median for this set (with regard to the first coordinate, the median of all the first values of the tuples) is either -0.5 or 0: there is an even number of elements, so there are technically two medians. You can double-check the values by sorting the array.

Say we choose -0.5 as the median; then we have

(median, left, right) = ((-0.5,0), [(-1,1), (-1.5,-2), (-1,6)], [(1,-1), (2.5,3), (2,5), (0,5)])

So, at line #8 we call constructKdTree([(-1,1), (-1.5,-2), (-1,6)], 1) to create the root's left subtree. This call will in turn partition the sub-array, but comparing the second coordinate of each tuple: the median of the y coordinates is 1, so we have

(median, left, right) = ((-1,1), [(-1.5,-2)], [(-1,6)])

And so on; the method runs similarly on the other partitions created from the initial array.

What's the running time of method constructKdTree? We will use Tk(n) to denote the running time for a k-d tree of dimension k, on an array with n elements. Let's check the method step by step: lines #2-5 only require a constant amount of time, as does line #10, which just creates a new node. Lines #8 and #9 are recursive calls, and they will be called on sets of points with at most n/2 elements, so we know they will take Tk(n/2) steps each.

Finally, line #7, where we call partition: it’s possible to find a median in linear time, and we can also partition an array of n elements around a pivot with O(n) swaps (or create two new arrays, also with O(n) total assignments).

So, summing up, we have this formula for the running time:

Tk(n) = 2 * Tk(n/2) + O(n)

There are a few ways to solve this equation, for example the substitution method or telescoping, but the easiest is probably using the master theorem.[4] All of these methods are beyond the scope of this book, so we will just provide you with the solution, leaving it to the curious reader to work out the details:

Tk(n) = O(n * log(n))

In other words, the balanced construction method takes linearithmic time.

To complete our analysis, let's look at the extra memory needed by this method: it will require O(n) memory for the tree. However, there is more. If we don't partition the array in place, and instead create a copy for each left and right sub-array, then each call to partition will use O(n) extra memory. Deriving a formula similar to the one for the running time, we would find that Mk(n) = O(n * log(n)) as well.

Conversely, by partitioning the array in place, we can obtain the best possible result:

Mk(n) = O(n) 

That’s because we only need a constant amount of memory for the internals of the function, plus the O(n) memory needed for the tree.

4. Remove

After search and insert, we can continue with the third basic operation on a container: remove. This is despite the fact that on a k-d tree, delete is not such a common operation, and some implementations don't even offer this method. As we discussed in the previous section, k-d trees are not self-balancing, so they perform best when they are created with a static set of points, avoiding frequent insertion and removal.

Nevertheless, in any real-world application you’ll likely need to be able to update your dataset, so we are going to describe how to remove elements. Figures 9.13 and 9.14 show the remove method in action on our example k-d tree, and the result of removing point D from it.

Similarly to insert and search, this method is based on binary search tree removal. There are, however, two issues that make the k-d tree's version considerably more complicated, and they are both connected to how we find a replacement in the tree for the node we are going to remove.

To see what these issues are, we need to take a step back. In binary search trees, when we delete a node, we can face one of three situations (see figure 9.15).

  1. The node we need to remove is a leaf. In this situation, we can just safely remove the node from the tree, and we are done.
  2. The node N to be removed has only one child. Here simply removing the node would disconnect the tree, but we can instead bypass it by connecting N's parent to N's child (regardless of whether that child is in the left or right subtree of N). This won't violate any invariant of the BST. For instance, in the case shown in figure 9.15B, N is a left child of its parent P, so it's smaller than or equal to P; likewise, all elements in the subtree rooted at N will be smaller than or equal to P, including N's child L.
  3. If the node N that we need to remove has both children, we can't just replace it with one of its children. For instance, if we were to replace the root in figure 9.15C with its right child R, then we would have to merge R's left subtree with N's, and that would require worst-case linear time (and also be a waste).

Instead, we can find the successor of N. By construction, it will be the minimum node in its right subtree, let's call it M, which in turn means the leftmost node of its right subtree. Once we have found it, we can delete M and replace N's value with M's value. No invariant will be violated, because M was no smaller than N, and N in turn was no smaller than any node in its left subtree.

Moreover, M will certainly fall into either case (A) or case (B): being the leftmost node, it won't have a left child. This means that deleting M will be easy, and the recursion stops at M.

When we move from regular BSTs to k-d trees, the first difference is caused by the fact that at each level we only use a single coordinate to partition points in the two branches. If we have to replace a node N at level i, at that level we are using coordinate j = i mod k, so we know its successor (for coordinate j) will be in N’s right subtree. However, when we move to N’s child, that node will use another coordinate, j1 = (i+1) mod k, to partition its children. As you can see in figure 9.16, that means that the successor of N doesn’t have to be in R’s left subtree.

That's bad news, because it means that while for BSTs we could quickly traverse N's right subtree to its leftmost node in order to find N's successor, now we can only do that for levels l where l mod k == i mod k. At all the other levels, we will have to traverse both subtrees to look for the minimum.

Listing 9.7 shows the pseudo-code for the findMin method. The first difference you can see with respect to the BST version is that we need to pass the index of the coordinate for which we look for the minimum. For instance, we could call findMin(root, 2) to search for the node in the whole tree with the minimum value for the third coordinate (assuming k >= 3).
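Once again the pseudo-code is not part of this excerpt; a Python sketch of findMin, following the logic just described, could be:

def find_min(node, coordinate_index):
    """Return the node with the smallest value of the given coordinate in this subtree."""
    if node is None:
        return None
    if node.level % K == coordinate_index:
        # This level splits exactly on the coordinate we are minimizing: smaller
        # values can only be in the left branch (or the node itself is the minimum).
        candidates = [node, find_min(node.left, coordinate_index)]
    else:
        # Any other level: the minimum could be on either side, so visit both branches.
        candidates = [node,
                      find_min(node.left, coordinate_index),
                      find_min(node.right, coordinate_index)]
    candidates = [c for c in candidates if c is not None]
    return min(candidates, key=lambda c: c.point[coordinate_index])

For instance, find_min(tree.root, 1) returns the node with the smallest y coordinate in the whole tree.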

This greater complexity in the findMin method unfortunately reflects on its running time. It can't be logarithmic, as it is with BSTs, because we are traversing both branches at every level except the ones matching coordinateIndex, so in (k-1)/k of the cases.

And, in fact, it turns out that the running time for findMin is O(n^((k-1)/k)). If k == 2, this means O(n^(1/2)) = O(√n), which is not as good as logarithmic, but still noticeably better than a full scan. As k grows, though, this value gets closer and closer to O(n).

The enhanced version of findMin solves the issue with the coordinates. Now you might hope that plugging it into the regular BST's remove is enough, but that's unfortunately not the case. There is another issue that complicates things a bit further.

If you go back to figure 9.15 (B), for BSTs there were two lucky cases for which deleting a node was easy: deleting a leaf and deleting a node with only one child.

For k-d trees, only leaves can be deleted easily. Unfortunately, even if the node N to be deleted has only one child C, we can't just replace N with its child, because this would change the direction of the splits for C and its entire subtree, as shown in figure 9.17.

In that example, we attempt to remove node B, which has only one child and no right branch (it works similarly in the symmetric case). If we tried to simply replace B with its child E, this node would appear one level up in the tree, and likewise all of its children.

Before the move, node E was using x coordinates to partition the nodes in its subtree; that's why node H was on the right of E, because H's x coordinate (2.5) is larger than E's (2).

After we move E and its subtree up one level, we would need to use y coordinates to partition the nodes in E's subtree. But H's y coordinate (-8) is smaller than E's (-5), so node H doesn't belong in E's right branch anymore, and the k-d tree invariants are violated.

In this case it might look like something easy to fix, but we would need to reevaluate every single node in E's subtree and rebuild it. This would certainly require O(n) time, where n is the number of nodes in the subtree rooted at the node we remove.

A better solution would be to replace the node N that we want to remove with its successor or its predecessor. If N only has a right child, we can just find its successor using findMin, as we described in the example in figure 9.16.

When node N only has a left child, can we replace it with its predecessor? As much as you might be tempted to think so, in this situation another issue comes up.

We mentioned when we described the insert method that the way we break ties on insert has an influence on the remove method as well.

And indeed, figure 9.18 shows an example where this becomes relevant. The problem is that when we break ties on insert and search by going right, we implicitly assume an invariant: for any internal node N, no node in its left branch will have the same value for the coordinate used to partition N's subtree. In figure 9.18, this means that no node in the left branch of node B has a y coordinate equal to B's.

If we replace N with the max of its left branch, however, it is possible that in N’s old left branch there was another node with the same y coordinate. In our example, that would be the case since there are two nodes with the same maximal value for y, node I and node H.

By moving node I to replace B, we would therefore break search, because node H would never be found by the search method in listing 9.3.

Luckily the solution is not too complicated. We can instead run findMin on the left branch, replace N’s point with the node M found by findMin, and set N’s old left branch as the right branch of this new node we are creating, as shown in figure 9.19.

Then we just need to remove the old node M from that right branch. Notice that unlike what happened with binary search trees, we can make no assumptions on M here, so we might need to repeat these steps in the recursive call deleting M.

Listing 9.8 sums up all these considerations into a pseudo-code implementation for method remove.
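Listing 9.8 is not reproduced in this excerpt (the line numbers referenced in the analysis below are the book's); the following Python sketch captures the logic described above, reusing the find_min and compare helpers sketched earlier.

def remove(node, point):
    """Remove `point` from the subtree rooted at `node` and return the new subtree root."""
    if node is None:
        return None                                   # point not found: nothing to do
    if node.point == point:
        if node.right is not None:
            # Replace this node with the minimum, along its split coordinate,
            # of the right branch, then delete that minimum from the right branch.
            min_node = find_min(node.right, node.level % K)
            node.point = min_node.point
            node.right = remove(node.right, min_node.point)
        elif node.left is not None:
            # No right branch: take the minimum of the left branch, and move what is
            # left of the old left branch to the right, to preserve the invariants.
            min_node = find_min(node.left, node.level % K)
            node.point = min_node.point
            node.right = remove(node.left, min_node.point)
            node.left = None
        else:
            return None                               # it's a leaf: just drop it
    elif compare(point, node) < 0:
        node.left = remove(node.left, point)
    else:
        node.right = remove(node.right, point)
    return node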

If we look at the running time for remove, the cost of the calls to findMin drives up the total cost, which thus can’t be logarithmic anymore (as it was for BSTs). To perform a more rigorous analysis, let’s again denote as Tk(n) the running time for this method, where k is the dimensionality of the points’ space and n is the number of nodes in the tree. When we look more closely at each conditional fork, if we assume that we are working on a balanced tree, then

  • Each branch of the conditional at line #15 would trigger a recursive call on approximately half the nodes, and so require time Tk(n/2).
  • If we enter the code blocks after conditionals at lines #2 or #13, those only require constant time.
  • Conditionals at lines #5 and #9 both run code blocks that require creating a new node, O(1), running findMin, O(n^(1-1/k)), and recursively calling remove.

The worst case here is the latter: we don’t know where the minimum node will be in its branch. It could be far down the tree or just the root of the branch; thus, if we end up in a case similar to figure 9.17 (either with a missing left or right branch, indifferently), we might have to call remove recursively on n-1 nodes, in the absolute worst case.

However, we assumed that our k-d tree is balanced. Under this assumption, left and right branches should have more or less the same number of nodes, and there­fore if the right branch of a node is empty, the probability that the left branch has one node is still high, that it has two nodes is less likely, and it goes down with three nodes, and so on. For a certain constant, for example 5, we can say that in a situation like the one in figure 9.17, where a node has a single branch, then it is highly unlikely that branch has more than a constant number of nodes (say five). And we can thus assume that in a balanced tree, such an imbalanced case can happen at most a constant num­ber of times during a call to remove. Or, more precisely, on a large number of remov­als, we can assume that the amortized cost of running the code blocks starting at lines #5 and #9 would be Tk(n/2).

Our recurrence would therefore become

Tk(n) = Tk(n/2) + O(n^(1-1/k))

Using the master theorem's third case, since 1 - 1/k > log2(1) = 0, and (n/2)^(1-1/k) < n^(1-1/k), we can then conclude that the amortized time required by remove on a balanced k-d tree is

Tk(n) = O(n^(1-1/k))

In other words, remove's running time is dominated by findMin; this also means that in a 2-D space, the amortized running time for remove would be O(√n).

5. Nearest neighbor

We are now ready to study the most interesting operation provided by k-d trees, nearest neighbor (NN) search. First, we are going to restrict ourselves to the case where we search for the single closest point in the dataset with respect to a target point (which, in general, doesn't have to be contained in the same dataset). Later we will generalize this operation to return an arbitrary number m of points such that no other point in the dataset is closer to the target.

In a brute-force scenario, we would have to compare each point in the dataset to our target, compute their relative distances, and keep track of the smallest one, exactly the way we search for an element in an unsorted array.

However, a k-d tree, much like a sorted array, has structural information about the relative distance and position of its elements, and we can leverage that information to perform our search more efficiently.

Listing 9.9 shows the pseudo-code implementing nearest neighbor search. To understand this code, however, we first need to ask: how does nearest neighbor search work?

We start from the consideration that each tree node covers one of the rectangular regions in which the space was partitioned, as we have shown in figures 9.5 and 9.6. So first we want to find which region contains our target point P. That's a good starting point for our search, because it's likely that the point stored in the leaf covering that region (G in this example) will be among the closest points to P.

Can we be sure that G will be the closest point to P, though? That would have been ideal, but unfortunately that's not the case. Figure 9.20 shows this first step in the algorithm: traversing a path in the tree from the root to a leaf, to find the smallest region containing P.

Figure 9.20 The first few steps of nearest neighbor search. The first phase of nearest neighbor search consists of a search on the tree, during which we keep track of the distance of each node (aka point) we traverse: more precisely, if we need to find N nearest neighbors for P, we need to keep track of the N smallest distances we find. In this case we are showing the query for N=1. Thus, if the search is successful, then we definitely have the nearest neighbor, at distance 0. Otherwise, when search ends, we are not yet sure we have found the actual nearest neighbor. In this example, the point at the minimum distance we have found during traversal is D but, as you can see from the figure, we can’t be sure that another branch of the tree doesn’t have a point within the radius given by dist(D).

As you can see, we check the distance of every intermediate point along the path, because any of them can be closer than the leaf. Even if intermediate points cover larger regions than leaves, inside each region we don't have any indication of where the dataset points might lie. Referring to figure 9.20, if point A had been at (0,0), the tree would have had the same shape, but P would have been closer to A (the root) than to G (a leaf).

But even that is not enough. After finding the region containing P, we can't be sure that the neighboring regions don't contain one or more points even closer than the closest point we found during this first traversal.

Figure 9.21 exemplifies this situation perfectly. So far, we have found that D is the closest point (among those visited) to P, so the real nearest neighbor can't be at a distance larger than the one between D and P. We can thus trace a circle (a hyper-sphere, in higher dimensions) centered at P and with a radius equal to dist(D, P). If this circle intersects other partitions, then those regions might contain a point closer than D, and we need to get to them.

How do we know if a region intersects our current nearest neighbor's hyper-sphere? That's simple: each region stems from the partitioning created by the split lines. When traversing a path, we go to one side of the split (the one on the same side as P), but if the distance between a split line and P is less than the distance to our current nearest neighbor, then the hyper-sphere intersects the other partition as well.

To make sure we visit all partitions still intersecting the NN hyper-sphere, we need to backtrack our traversal of the tree. In our example, we go back to node D, then check the distance between P and the vertical split line passing through D (which, in turn, is just the difference of the x coordinates of the two points). Since this is smaller than the distance to D, our current NN, then we need to visit the other branch of D as well. When we say visit, we mean traversing the tree in a path from D to the closest leaf to P. While we do so, we visit node F and discover that it’s closer than D, so we update our current NN (and its distance: you can see that we shrink the radius of our nearest neighbor perimeter, the circle marking the region where possible nearest neighbors can be found).

Are we done now? Not yet; we need to keep backtracking to the root. We go back to node C, but its split line is further away than our NN perimeter (and it doesn’t have a right branch, anyway), so we go back to node A, as shown in figure 9.22.

At node A we took the left branch during search, meaning P is on the left semi-plane. If we were to traverse the right subtree of A, all the points in that subtree would have their x coordinate greater than or equal to A’s. Therefore, the minimum distance between P and any point in the right sub-tree of A is at least the distance between P and its projection on the vertical line passing through A (i.e., A’s split line). In other words, the minimum distance for any point right of A will at least be the absolute value of the difference between P’s and A’s x coordinates.

We can therefore prune search on A’s right branch, and since it was the root, we are finally done. Notice how the more we climb back on the tree, the larger are the branches (and the regions) that we can prune—and the larger the saving.
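Putting the whole procedure together, a rough Python equivalent of listing 9.9 (which is not reproduced here) could look like this, reusing the compare and split_distance helpers from before.

import math

def distance(a, b):
    """Euclidean distance between two k-dimensional points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbor(node, target, best=None, best_dist=math.inf):
    """Return (best_point, best_distance) for the subtree rooted at `node`."""
    if node is None:
        return best, best_dist
    # Update the current best with this node, if it's closer to the target.
    d = distance(node.point, target)
    if d < best_dist:
        best, best_dist = node.point, d
    # Visit the branch on the same side of the split as the target first...
    if compare(target, node) < 0:
        close_branch, far_branch = node.left, node.right
    else:
        close_branch, far_branch = node.right, node.left
    best, best_dist = nearest_neighbor(close_branch, target, best, best_dist)
    # ...and the other branch only if the NN hyper-sphere crosses the split line.
    if split_distance(target, node) < best_dist:
        best, best_dist = nearest_neighbor(far_branch, target, best, best_dist)
    return best, best_dist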

We can generalize this method to find an arbitrarily large set of the closest points in the dataset; this is also known as the n-nearest-neighbor search method.

The only differences are

  • Instead of the distance of a single point, we need to keep track of the m shortest distances if we want the m closest points.
  • At each step, we use the distance of the m-th closest point to draw our NN perimeter and prune search.
  • To keep track of these m distances, we can use a bounded priority queue. We described something similar in chapter 2, section 2.7.3, when we described a method to find the m largest values in a stream of numbers.

Listing 9.10 details the pseudo-code for the nNearestNeighbor method.
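That pseudo-code is not part of this excerpt; a possible Python sketch, using the standard heapq module as the bounded priority queue (a max-heap simulated with negated distances), could be:

import heapq

def n_nearest_neighbors(node, target, m, heap=None):
    """Collect the m points closest to `target` in a bounded max-heap of (-distance, point) pairs."""
    if heap is None:
        heap = []
    if node is None:
        return heap
    d = distance(node.point, target)          # distance() as in the previous sketch
    if len(heap) < m:
        heapq.heappush(heap, (-d, node.point))
    elif d < -heap[0][0]:                     # closer than the m-th closest point so far
        heapq.heapreplace(heap, (-d, node.point))
    # Visit the branch on the same side of the split as the target first.
    if compare(target, node) < 0:
        close_branch, far_branch = node.left, node.right
    else:
        close_branch, far_branch = node.right, node.left
    n_nearest_neighbors(close_branch, target, m, heap)
    # Visit the far branch only if we still need points, or if the split line is closer
    # than the m-th best distance (so that region may still contain closer points).
    if len(heap) < m or split_distance(target, node) < -heap[0][0]:
        n_nearest_neighbors(far_branch, target, m, heap)
    return heap

The m closest points can then be read out, from closest to farthest, with [p for _, p in sorted(n_nearest_neighbors(tree.root, target, m), reverse=True)].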

What’s the running time for nearest neighbor search? Let’s start with the bad news: in the worst-case scenario, even for a balanced tree, we might have to traverse the whole tree before finding a point’s nearest neighbor(s).

Figure 9.23 shows a couple of examples of such a degenerate case. While the second example is artificially constructed as a literal edge case, with all the points lying on a circle, the one in figure 9.23 (A) shows the same tree we used in our previous examples, and demonstrates how, even on random, balanced trees, it is possible to find counter-examples where the method behaves poorly just by carefully choosing the target point for the search.

So, unfortunately there isn’t much to do: the worst-case running time for this kind of query is O(n).

That’s the bad news. Luckily, there is also a silver lining.

It turns out that the average running time for nearest neighbor search on a balanced k-d tree is O(2^k + log(n)). The proof of this probabilistic bound is particularly complex and would require too much space to cover properly here. You can find it in the original paper by Jon Bentley that first introduced k-d trees.

Nevertheless, to give you an intuition about why it works this way, consider a two-dimensional space: one point divides it into two halves, two points into three regions, three points into four regions, and so on; in general, n points will create n+1 regions. If the tree is balanced and n is sufficiently big, we can assume these regions are approximately equally sized.

Now suppose the dataset covers a unitary area. When we run a nearest neighbor search, we first traverse the tree from the root to the closest leaf,[9] and this, for a balanced tree, means traversing O(log(n)) nodes. Since we hypothesized that each region is approximately the same size, and we have n+1 of them, the area covered by the closest leaf will also be a rectangular region of area approximately equal to 1/n. That means that there is a reasonably high probability that the distance between the target point and the nearest neighbor we have found during this traversal is no larger than half the diagonal of the region, which in turn is smaller than the square root of the region's area, that is, roughly 1/√n.

The next step in the algorithm is backtracking to visit all the regions that are within that distance from the target point. If all the regions are equally sized and regularly shaped, this means that any region within that distance must be one of the rectangles neighboring our leaf's region, and from geometry we know that in such a situation, with regular equally sized rectangles that can be approximated to a rectangular grid, each rectangle has eight neighbors. From figure 9.24, however, it is possible to see how, on average, even if we would potentially have to traverse at most eight more branches, it's likely we only have to check four of them, because only the ones adjacent to the current region's sides will be within that distance. Hence, this makes the total average running time O(4 * log(n)).

If we move to R3, then each cube has 26 possible neighbors, but the minimum distance will be roughly 1/n^(1/3). With similar considerations, we can infer that fewer than eight regions will be close enough to contain points within the minimum distance found so far; likewise if we move to R4, and so on.

So, in summary, after finding the region where the target point of our search lies, we need to examine another O(2^k) points, making the total running time O(2^k + log(n)). While k is a constant for a given dataset, and we could in theory ignore it in big-O analysis, in the general case k is considered a parameter, as we measure the performance of this method as both the size and the dimension of the k-d tree change. It's also apparent that for large values of k, 2^k becomes so big that it dominates the running time, since the log(n) term prevails only when log(n) > 2^k, that is, when n > 2^(2^k); in practice for k > 7 there is already no chance of having a dataset big enough to satisfy the inequality. For n > 2^k, however, this method still has an advantage over the brute-force search.

A final consideration: pruning heavily depends on the “quality” of the nearest neighbor we have found so far. The shorter the distance to the current nearest neighbor, the more branches we can prune, and in turn the higher the speed-up we can get. Therefore, it is important to update this value as soon as possible (in our code, we do this for each node the first time we visit it) and to traverse the most promising branches first. Of course, it is not easy to determine what the most promising branch is at any point. A good, although imperfect, indicator could be the distance between the target point and the split line for a branch: the closer the target, the larger the intersection between the nearest neighbor perimeter (the hyper-sphere within which it's still possible to find a point closer to the target) and the region on the other side of the split line. That's the reason why we traverse the tree using a depth-first search: we backtrack on the smallest branches first, so hopefully when we reach larger branches close to the top of the tree, we can prune them.

6. Region search

While k-d trees were mainly designed to perform nearest neighbor search, they turned out to be particularly efficient for another kind of operation: querying the intersection of our dataset with a given region in the k-dimensional space.

In theory this region could be of any shape, but this operation becomes meaningful only if we can efficiently prune the branches we traverse during the query, and this heavily depends on the region's morphology.

In practice, there are two main cases we are interested in:

  • Hyper-spherical regions, as shown in figure 9.25; the geometric interpretation is to query points within a certain distance from a certain point of interest.
  • Hyper-rectangular regions, as shown in figure 9.26; here, the geometric interpretation is to query points whose values are in certain ranges.

When we deal with spherical regions, we are in a similar situation as with nearest neighbor search: we need to include all the points within a certain distance from a given point. It's like performing an NN-search where we never update the distance to the nearest neighbor, and instead of keeping track of only one point (the NN), we gather all the points closer than that distance.

In figure 9.25 you can see how we are going to prune branches. When we are at a split, we will certainly have to traverse the branch on the same side as the center of the search region, P. For the other branch, we check the distance between P and its projection on the split line. If that distance is lower than or equal to the search region's radius, it means that there is still an area of intersection between that branch and the search region, and so we need to traverse the branch; otherwise, we can prune it.

Figure 9.25 Region search on a k-d tree returns all the points in a k-d tree within a given hyper-sphere. This means looking for points within a given Euclidean distance from the sphere’s center. We start our search from the root: it’s not within the sphere. The point is on A’s left branch, so we need to traverse it, but even if A is not within the sphere, the split line through it intersects the sphere, so there is a portion of the sphere intersecting the right branch of A as well (highlighted in the top-left figure). For the following steps, we are showing in parallel the execution on all branches at a given level, for the sake of space. (This is also a good hint that these processes could be executed in parallel.)

Listing 9.11 shows the pseudo-code for this method. As you can see, it’s pretty similar to the regular NN-search.
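Since the listing isn't reproduced here, this is a rough Python sketch of pointsInSphere, again reusing the helpers from the earlier sketches.

def points_in_sphere(node, center, radius, results=None):
    """Collect all the points in this subtree within `radius` of `center`."""
    if results is None:
        results = []
    if node is None:
        return results
    if distance(node.point, center) <= radius:
        results.append(node.point)
    # The branch on the same side of the split as the center always has to be traversed.
    if compare(center, node) < 0:
        close_branch, far_branch = node.left, node.right
    else:
        close_branch, far_branch = node.right, node.left
    points_in_sphere(close_branch, center, radius, results)
    # The other branch is traversed only if the split line cuts through the sphere.
    if split_distance(center, node) <= radius:
        points_in_sphere(far_branch, center, radius, results)
    return results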

The other region we are going to use for these queries is a rectangular shape. As you can imagine, the only difference with respect to pointsInSphere is the way we check whether we have to prune a branch. If we assume that the rectangle is oriented along the Cartesian axes used for the splits, then pruning is arguably even easier, as shown in figure 9.26. Suppose we are at a horizontal split: we need to understand whether the split line intersects the search region, in which case we will traverse both branches, or whether it's above or below the region, which tells us which branch we can prune. This can be checked by simply comparing the y coordinate of the current node's point (call it Ny) with the top (Rt) and bottom (Rb) y coordinates of the rectangular region, as we can have three cases:

  • Rb < Ny < Rt: we need to traverse both branches.
  • Ny > Rt: the search region lies entirely below the split line, so we can prune the right branch.
  • Rb > Ny: the search region lies entirely above the split line, so we can prune the left branch.

The same reasoning applies to vertical splits, checking the x coordinates instead of the y coordinates, and it generalizes to k-dimensional spaces by cycling through the dimensions.

This method is particularly useful when we need to search values within simple boundaries for each feature in data. For instance, if we have a dataset containing the tenures and salaries of employees, we might want to search all employees who worked for the company between two and four years and have salaries between 40K and 80K . . . and give them a raise![10]

This search, implemented in listing 9.12, translates into a rectangle with boundaries parallel to the dataset feature axes, meaning that in our boundaries each feature is independent of any other feature. If, instead, we had conditions mixing more than one feature (for instance, a salary lower than 15K for each year of tenure), then the boundaries of the search region would be segments of generic lines, not parallel to any axis.
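Listing 9.12 isn't included in this excerpt; a Python sketch of such a range search, where the rectangle is given as one (low, high) pair per coordinate, could be:

def points_in_rectangle(node, ranges, results=None):
    """Collect all points whose i-th coordinate lies within ranges[i] = (low, high)."""
    if results is None:
        results = []
    if node is None:
        return results
    if all(low <= c <= high for c, (low, high) in zip(node.point, ranges)):
        results.append(node.point)
    low, high = ranges[node.level % K]       # range for the coordinate split at this level
    split_value = node.point[node.level % K]
    if low < split_value:                    # part of the range lies below the split: check left
        points_in_rectangle(node.left, ranges, results)
    if split_value <= high:                  # part of the range lies at or above the split: check right
        points_in_rectangle(node.right, ranges, results)
    return results

For the tenure/salary example above, assuming tenure is the first coordinate and salary the second, the query would look like points_in_rectangle(tree.root, [(2, 4), (40000, 80000)]).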

Figure 9.26 Region search on a k-d tree returns all the points in a k-d tree within a given hyper-rectangle. This means looking for points that, for each coordinate, satisfy two inequalities: each coordinate must be within a range. For instance, in this 2-D example, the points' x coordinates need to be between -2 and 3.5, and y coordinates between -4 and 2.

In that case, the problem becomes harder to solve, and we might need something more complex, such as the simplex algorithm, to search the points satisfying our range query.

What's the performance of both region searches? Well, as you can imagine, it heavily depends on the regions we search; we range between two edge cases:

  • For very small regions intersecting only a single branch corresponding to a leaf, we will prune every other branch and just follow a path to the leaf, so the running time will be O(h), where h is the height of the tree, that is, O(log(n)) for a balanced tree with n points.
  • When the region is large enough to intersect all points, we will have to traverse the whole tree, and the running time will be O(n).

Therefore, we can only say that the worst-case running time is O(n), even if the methods will efficiently prune the tree whenever possible.

7. A recap of all methods

As we have seen, k-d trees provide a speed-up over brute-force search on the whole dataset; table 9.1 summarizes the performance of the methods described in this chapter. While the worst-case running time for nearest neighbor search (and removal) is still linear (exactly as for brute-force), in practice the amortized performance on balanced k-d trees is noticeably better. The improvement is greatest in low-dimensional spaces and still consistent in medium-dimensional spaces.

In high-dimensional spaces, the exponential term in k for nearest neighbor search becomes dominant and makes supporting the extra complexity of such a data structure not worth it.

Source: La Rocca, Marcello (2021), Advanced Algorithms and Data Structures, Manning Publications.
