K-d trees: Limits and possible improvements

If we look back at our “find the closest hub” problem, we started this chapter with basically nothing better than brute-force search to find the nearest neighbor of a point in a multi-dimensional dataset. Then, going through a few less-than-ideal attempts, we built our understanding of the traps and challenges in such a task, and finally introduced k-d trees.

K-d trees are great because they offer a major speed-up over linear search; yet, they still have potential issues:

  • K-d trees might be hard to implement efficiently.
  • K-d trees are not self-balancing, so they perform best when they are constructed from a stable set of points and the number of inserts and removes is limited with respect to the total number of elements. Unfortunately, static datasets are not the norm in the big-data era.
  • When we deal with high-dimensional spaces, k-d trees become inefficient. As we have seen, the time needed for removal and nearest neighbor search is exponential in k, the dimension of the dataset, and for sufficiently large values of k this cancels out any performance benefit over brute-force search. We have seen that nearest neighbor search performs better than naive search only if n > 2^k, so starting at k ≈ 30 we would need an ideal dataset (one with a regular distribution of points) with billions of elements for k-d trees to outperform brute-force search (see the rough estimate sketched after this list).
  • K-d trees don’t work well with paged memory; they offer poor locality of reference, because points are stored in tree nodes, so points that are close in space won’t, in general, be stored close together in memory.
  • While they handle points well, k-d trees can’t handle non-punctiform objects, such as shapes or any object with a non-zero measure.
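To put the n > 2^k rule of thumb into numbers, here is a minimal back-of-the-envelope sketch in Python, written for this discussion rather than taken from the book. It assumes a deliberately rough cost model: a linear scan touches all n points, while a k-d tree nearest neighbor search visits on the order of 2^k + log2(n) nodes; the function name `kd_tree_wins` is hypothetical.

```python
import math


def kd_tree_wins(n: int, k: int) -> bool:
    """True if, under this simplified cost model, a k-d tree nearest
    neighbor search is expected to beat a linear scan of n points
    in k dimensions."""
    return 2 ** k + math.log2(n) < n


for k in (2, 10, 20, 30):
    # Find the smallest power of ten for which the k-d tree comes out ahead.
    n = 10
    while not kd_tree_wins(n, k):
        n *= 10
    print(f"k = {k:2d}: need roughly n >= 1e{int(math.log10(n))} points")
```

Since 2^30 is already above one billion, the crossover point for k ≈ 30 indeed lands in the billions of points, even under this optimistic model.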

The inefficiency of k-d trees on high-dimensional datasets stems from the fact that in these datasets data becomes very sparse. At the same time, when we traverse a k-d tree during NN search, we can only prune a branch when it doesn’t intersect the hyper-sphere centered at the target point with a radius equal to the minimum distance found so far. It is highly likely that this sphere will intersect many of the branches’ hyper-cubes in at least one dimension, as the sketch below illustrates.
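Here is a minimal sketch of that pruning condition, assuming each branch’s region is described by an axis-aligned bounding box; the names and the box-based representation are illustrative, not the book’s implementation. A branch can be pruned only when its box does not intersect the current search sphere.

```python
from typing import Sequence


def region_intersects_sphere(low: Sequence[float],
                             high: Sequence[float],
                             target: Sequence[float],
                             best_dist: float) -> bool:
    """True if the axis-aligned box [low, high] intersects the sphere of
    radius best_dist centered at target (so the branch can NOT be pruned)."""
    # Squared distance from target to the closest point of the box.
    sq_dist = 0.0
    for lo, hi, t in zip(low, high, target):
        if t < lo:
            sq_dist += (lo - t) ** 2
        elif t > hi:
            sq_dist += (t - hi) ** 2
        # If lo <= t <= hi, this coordinate contributes 0.
    return sq_dist <= best_dist ** 2
```

In high dimensions the target tends to fall inside the box’s range along many coordinates, so most coordinates contribute nothing to `sq_dist`, the test almost always returns True, and very few branches can actually be pruned.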

To overcome these limitations, we could try a few approaches:

  • We could use different criteria to decide where to partition the k-dimensional space, using different heuristics:
    • Don’t use a splitting line passing through points, but just divide a region into two balanced halves (either with respect to the number of points or to the sub-regions’ size; basically, choose the mean instead of the median).
    • Instead of cycling through dimensions, choose at every step the dimension with the greatest spread or variance, and store the choice in the tree node (see the sketch after this list).
  • Instead of storing points in nodes, each node could describe a region of space and link (directly or indirectly) to an array containing the actual elements.
  • We could approximate nearest neighbor search. For instance, we could use locality sensitive hashing.
  • Or we could find new ways to partition the k-dimensional space, ideally trying to reduce sparsity.
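As an illustration of the second heuristic above (splitting along the dimension with the greatest spread), here is a minimal sketch; the function name and the use of plain variance as the spread measure are assumptions made for illustration, not the book’s implementation.

```python
import statistics
from typing import List, Sequence


def widest_spread_dimension(points: List[Sequence[float]]) -> int:
    """Return the index of the coordinate with the largest variance."""
    k = len(points[0])
    return max(range(k),
               key=lambda d: statistics.pvariance([p[d] for p in points]))


# Each node would store the chosen dimension (and the split value, e.g. the
# median along that dimension), so that search knows how the space was
# partitioned at that node.
points = [(1.0, 2.0), (1.1, 9.0), (0.9, -3.0), (1.05, 4.5)]
print(widest_spread_dimension(points))  # -> 1: the y coordinate varies most
```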

These heuristics help on some datasets, but in general they won’t solve the issue with higher-dimensional spaces.

The approximate approach doesn’t produce an exact solution, but there are many cases where we can settle for a sub-optimal result, or where we can’t even define a perfect metric. For example, think about retrieving the closest document to an article, or the closest item to something you want to buy that is out of stock. We won’t go down this path for now. Instead, in the next chapter we will delve into the last approach, finding new ways to partition the space, with SS-trees. We will also defer to chapter 11 the discussion of applications of nearest neighbor search.

Source: La Rocca, Marcello (2021), Advanced Algorithms and Data Structures, Manning Publications.
