The nearest neighbors search: Moving to k-dimensional spaces

In section 8.2, we showed that it is possible to efficiently solve the nearest neighbor problem in 1-D by using a binary search tree. If you read through the book and the appendices on core data structures, you should be familiar with binary trees. (If you did skip appendix C, this is your clue: go check out binary trees!)

When we go from 1-D to 2-D, however, the situation becomes slightly more compli­cated, since at each node we don’t have a clear fork between two paths, namely left and right children. We have seen the same concept in ternary trees, where at each node we have a fork with three paths (or more, in n-ary trees), and the direction we have to follow depends on the result of comparison with more than just true/false possible outcomes. But would an n-ary tree help us in this situation? Let’s analyze what happens in 1-D and see if we can generalize.

1. Unidimensionai binary search

To recap what was described in section 8.2, it’s easy to perform binary search when our entries can lie on unidimensional space. Figure 8.7 exemplifies how the entries can be translated to points on a line (that’s basically a unidimensional space) so that each point on that line implicitly defines a left and a right.

So, each node has a value that corresponds to a point on that line, and each point on the line defines a left and right. But wait—in a binary search tree we also have left and right paths for each of our nodes: that’s why it’s so easy to know what to do in binary trees search!

2. Moving to higher dimensions

Now that we understand the mechanism for real numbers, what about R2? What about points in a Euclidean bidimensional space? What about C (the set of complex numbers)?

NOTE Notice that R2 and C are bidimensional Euclidean spaces, and entries of these spaces can be represented with a pair of real numbers.

Well, if we move to higher dimensions, how to run binary search is not as clear as in the unidimensional case.

Binary search relies on recursively dividing a search space into halves, but if you consider a point p in the Cartesian plane, how can we partition the plane into two regions, left and right of P? Figure 8.8 shows a few possible ways to do so.

A visually intuitive approach could be splitting the plane along a vertical line pass­ing through P, so that in our representation of the Cartesian plane, the two semi­spaces will actually be drawn on the left and on the right of P. See figure 8.8.

This solution can look fine while using P as a pivot, but if we take other points, a cou­ple of drawbacks will emerge:

• Looking at point R in figure 8.9, if we draw a vertical line, parallel to the y axis, and use the x coordinate to partition points, we get W, P, O, Q, and U in the left partition, and S and T in the right one. This means that despite U and S being much closer than U and O or S and T, they end up in different partitions (while the other two pairs of points are in the same partition).
• If we consider O, in which partition will Q be? And what about any other point on the y axis? We can arbitrarily assign points with the same x coordinate of our pivot to the left or right partition, but we’ll have to do so for all of them, no matter how far they are from O on the y axis.

Both examples show issues that are derived from the same mistake: we are ignoring the points’ y coordinates alto­gether. Using only a fraction of the available information can’t be ideal; whenever we give up some info about a data­set, we are missing out on the opportunity to organize data more efficiently.

3. Modeling 2-D partitions with a data structure

Using the same, single-direction approach for all points doesn’t really work, so maybe dividing a plane into four quadrants would be a better idea.

Indeed, figure 8.10 shows that this works better than our previous attempts. Of course, since there are four quadrants, left and right partitioning doesn’t apply any more.

We can use a tree where each node has four children instead of two, one child for each possible quadrant. Points on the axes or on the lines passing by a point can be arbitrarily assigned to one of the quadrants (as long as this is done consistently).

This seems to work for R2, allowing us to overcome the main limit we identified for our previous attempt. Both x and y coordinates are taken into consideration to group points together (this would be even more apparent if you add more points to the diagram).

Now the question is whether we extend this solution to R[1]. To move to a 3-D space, we need to answer one question: Into how many sub-hyperplanes do we split the plane for each point? In 1-D, it was two segments for each point, in 2-D it was four quad­rants, and similarly in 3-D we will need eight octants.3

Therefore, for each point in the dataset we will add a node with eight children, one for each octant resulting from splitting the 3-D space along lines parallel to the carte­sian axes and passing through the point.

In general, for a k-dimensional space, we will need 2k children for each node in the tree, because each point would partition the hyperspace in 2k parts.

For real datasets, with the advent of big data, we will have to deal with high­dimensional spaces, meaning that k might easily be in the order of 10 to 30, or even 100. It’s not unusual for datasets to have hundreds of features, and millions of points, and there are use cases where we need to perform nearest neighbor search on these datasets to make sense of them (clustering is one, as we will see in chapter 12).

Even with a smaller number of features, in the order of 10, each node would already have around a thousand children. As we saw in chapter 2 when talking about d-way heaps, when the branching factor of a tree grows too much, the tree flattens and becomes closer to a list.

But with 100-dimensional datasets, the number of children per node would be closer to 1030, a number so large that it becomes challenging to store even a single node of such a tree. We certainly need to do better than that. But how?

As we will see in the next couple of chapters, computer scientists have found a few different ways to cope with these issues. In the next chapter, in particular, we intro­duce k-d trees, a data structure that uses a tiny variation on the approach in section 8.4.3 to avoid the exponential growth of the tree.

Source: Rocca Marcello La (2021), Advanced Algorithms and Data Structures, Manning Publications (2021)