Similarity Search

Before discussing how to improve the balancing of SS-trees, we can finish our discussion of their methods. So far, we have seen how to construct such a tree, but what can we use it for? Not surprisingly, nearest neighbor search is one of the main applications of this data structure; you probably guessed that. Range search, like for k-d trees, is another important application for SS-trees; both NN and range searches fall into the category of similarity search: queries on large, multi-dimensional spaces where our only criterion is the similarity between a pair of objects.

As we discussed for k-d trees in chapter 9, SS-trees can also (more easily) be extended to support approximated similarity search. If you remember, in section 9.4, we mentioned that approximate queries are a possible solution to k-d tree performance issues. We’ll talk in more depth about these methods in section 10.4.2.

1. Nearest neighbor search

The nearest neighbor search algorithm is similar to what we saw for k-d trees; obviously, the tree structure is different, but the main change in the algorithm’s logic is the formula we need to use to check whether a branch intersects the NN query region (the sphere centered at the query point and with a radius equal to the distance to the current guess for the nearest neighbor)—that is, if a branch is close enough to the target point to be traversed. Also, while in k-d trees we check and update distance at every node, SS-trees (and R-trees) only host points in their leaves, so we will only update the initial distance after we traverse the tree to the first leaf.

To improve search performance, it’s important to traverse branches in the right order. While it is not obvious what this order is, a good starting point is sorting nodes based on their minimum distance from the query point. The node whose bounding envelope is closest to the target point is not guaranteed to actually contain a point that close to our target, so we can’t be sure that this heuristic will produce the best possible ordering; on average, however, it does better than following a random order.

To remind you why this is important, we discussed in section 9.3.5 how getting to a better guess of the nearest neighbor distance helps us prune more branches, and thus improve search performance. In fact, if we know that there is a point within distance d from our query point, we can prune all branches whose bounding envelopes are further than d.

Listing 10.19 shows the code for the nearestNeighbor method, and figures 10.20 and 10.21 show examples of a call to the method on our example tree. As you can see, the code is quite compact: we just need to traverse all branches that intersect the sphere centered at the search point, with radius equal to the distance to the current nearest neighbor, and update the best value found in the process.

Of the helper methods in listing 10.19, it’s important to spend some time explaining function nodeDistance. If we refer to figure 10.22, you can see why the minimum distance between a point and a node’s bounding envelope is equal to the distance between the point and the envelope’s centroid, minus the envelope’s radius: this is just the geometric formula for the distance between a point and a sphere.
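Since listing 10.19 is not reproduced here, the following is a minimal Python sketch of the search, under the assumption that each node exposes leaf, points, children, centroid, and radius attributes (these names are illustrative and not necessarily the ones used in the book’s code):

```python
import math

def distance(p, q):
    # Euclidean distance between two points, given as coordinate tuples.
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def node_distance(node, target):
    # Minimum distance between a point and a node's bounding envelope (a
    # sphere): distance from the point to the centroid, minus the radius,
    # clamped at zero when the point lies inside the envelope.
    return max(distance(node.centroid, target) - node.radius, 0.0)

def nearest_neighbor(node, target, nn_dist=math.inf, nn=None):
    # Returns the best distance found so far and the corresponding point.
    if node.leaf:
        # Only leaves store points, so the current guess is updated here.
        for point in node.points:
            d = distance(point, target)
            if d < nn_dist:
                nn_dist, nn = d, point
    else:
        # Visit children from the most to the least promising one...
        for child in sorted(node.children, key=lambda c: node_distance(c, target)):
            # ...pruning branches whose envelope is further away than the
            # current nearest-neighbor distance.
            if node_distance(child, target) < nn_dist:
                nn_dist, nn = nearest_neighbor(child, target, nn_dist, nn)
    return nn_dist, nn
```

Sorting the children by node_distance implements the traversal-order heuristic discussed above, while the comparison against nn_dist implements the pruning rule.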

We can easily extend the nearest neighbor search algorithm to return the n nearest neighbors. Similar to what we did in chapter 9 for k-d trees, we just need to use a bounded priority queue that keeps at most n elements, and use the furthest of those n points as a reference, computing the pruning distance as the distance from this point to the search target (as long as we have found fewer than n points, the pruning distance will be infinity).

Likewise, we can add a threshold argument to the search function, which becomes the initial pruning distance (instead of passing infinity as the default value for nnDist), to also support search in spherical regions. Since these implementations can be trivially obtained, referring to chapter 9 for guidance, we leave them to the readers (for a hint, check the implementation on the book’s repo on GitHub).
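For readers who want a concrete starting point, here is one possible way to combine both extensions, using a bounded max-heap (with negated distances) to keep the n best candidates; it reuses the hypothetical distance and node_distance helpers from the previous sketch, so treat it as a sketch rather than the reference implementation:

```python
import heapq
import math

def n_nearest_neighbors(node, target, n, threshold=math.inf, heap=None):
    # heap holds up to n pairs (-distance, point); heap[0] is the furthest
    # of the current candidates, because distances are negated.
    if heap is None:
        heap = []

    def pruning_distance():
        # Until n candidates have been found, fall back to the threshold.
        return -heap[0][0] if len(heap) == n else threshold

    if node.leaf:
        for point in node.points:
            d = distance(point, target)
            if d < pruning_distance():
                heapq.heappush(heap, (-d, point))
                if len(heap) > n:
                    heapq.heappop(heap)  # drop the furthest candidate
    else:
        for child in sorted(node.children, key=lambda c: node_distance(c, target)):
            if node_distance(child, target) < pruning_distance():
                n_nearest_neighbors(child, target, n, threshold, heap)
    # Return (distance, point) pairs sorted from closest to furthest.
    return sorted((-neg_d, p) for neg_d, p in heap)
```

Passing a finite threshold makes the function return only points within that distance of the target, which is exactly the spherical-region search mentioned above.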

2. Region search

Region search will be similar to what we have described for k-d trees; the only differences are how we compute the intersection between each node and the search region, and the structural change due to the fact that points are only stored in leaves.

Listing 10.20 shows a generic implementation of this method for SS-trees that assumes the region passed as argument includes a method to check whether the region itself intersects a hyper-sphere (a node’s bounding envelope). Please refer to section 9.3.6 for a detailed explanation and examples about search on the most common types of regions, and their algebraic meaning.
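Since listing 10.20 is also not shown here, a minimal sketch of such a generic region search might look like the following, assuming the region object exposes two hypothetical predicates, intersects_sphere and contains_point:

```python
def points_within_region(node, region):
    # Collects all points stored in this subtree that fall inside region.
    results = []
    if node.leaf:
        # Leaves hold the actual points: test each one individually.
        results.extend(p for p in node.points if region.contains_point(p))
    else:
        for child in node.children:
            # Traverse only branches whose bounding envelope intersects the
            # search region; the other branches cannot contain any match.
            if region.intersects_sphere(child.centroid, child.radius):
                results.extend(points_within_region(child, region))
    return results
```

Any region type (hyper-rectangle, hyper-sphere, half-space, and so on) can be plugged in, as long as it can answer those two geometric questions.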


3. Approximated similarity search

As mentioned, similarity search in k-d trees, as well as in R-trees and SS-trees, suffers from the so-called curse of dimensionality: the methods on these data structures become exponentially slower as the number of dimensions of the search space grows. K-d trees also suffer from additional sparsity issues that become more relevant in higher dimensions.

While using R-trees and SS-trees can improve the balance of the trees, resulting in better-structured trees and faster construction, there is still something more we can do to improve the performance of the similarity search methods.

The way forward is to give up exact results: approximate search methods trade a bit of accuracy for performance, and there are a few different (and sometimes complementary) strategies that can be used to obtain faster approximate queries:

  • Reduce the dimensionality of the objects—Using algorithms such as PCA or Discrete Fourier Transform, it is possible to project the dataset’s objects into a different, lower-dimensional space (a minimal sketch of this idea follows the list). The idea is that this space will retain only the information essential to distinguish between different points in the dataset. With dynamic datasets, this method is obviously less effective.
  • Reduce the number of branches traversed—As we have seen in the previous section, our pruning strategy is quite conservative, meaning that we traverse a branch if there is any chance (even the slightest) that we can find a point in that branch closer than our current nearest neighbor. By using a more aggressive pruning strategy, we can reduce the number of branches (and ultimately dataset points) touched, as long as we accept that our results might not be the most accurate possible.
  • Use an early termination strategy—In this case, the search is stopped when the current result is judged good enough. The criterion to decide what’s “good enough” can be a threshold (for instance, when a NN closer than some distance is found), or a stop condition connected to the probability of finding a better match (for instance, if branches are visited from closest to furthest with respect to the query point, this probability decreases with the number of leaves visited).
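As a toy illustration of the first strategy (not the book’s code, and with hypothetical names), a plain-numpy projection onto the top k principal components could look like this:

```python
import numpy as np

def project_to_lower_dimension(points, k):
    # points: an (m x d) array of m feature vectors with d dimensions each.
    # Returns the points projected onto the k directions of largest variance.
    centered = points - points.mean(axis=0)
    # SVD of the centered data; the rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

# Example: compress 1,000 random 64-dimensional vectors down to 10 dimensions
# before indexing them in an SS-tree.
vectors = np.random.rand(1000, 64)
reduced = project_to_lower_dimension(vectors, 10)
```

The catch mentioned above is that with a dynamic dataset the principal components computed today may no longer capture the variance of the points inserted tomorrow.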

We will focus on the second strategy, the pruning criterion. In particular, we can provide a method that, given a parameter ϵ, called the approximation error, with 0.0 ≤ ϵ ≤ 0.5, guarantees that in an approximated n-nearest neighbor search, the n-th nearest neighbor returned will be within a factor (1+ϵ) of the true n-th nearest neighbor.

To explain this, let’s restrict ourselves to the case where n==1, plain NN search. Assume that the approximated NN-search method, called on a point P, returns a point Q, while the real nearest neighbor of P is another point N≠Q. Then, if the distance between P and its true nearest neighbor N is d, the distance between P and Q returned by the approximated search will be at most (1+ϵ)*d.

Guaranteeing this condition is easy. When deciding whether to prune a branch, instead of checking whether the distance between the target point and the branch’s bounding envelope is smaller than the current NN distance, we prune the branch unless that distance is smaller than the current NN distance divided by (1+ϵ).
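In code, the change with respect to the exact search sketched earlier is a single line in the pruning test (again assuming the same hypothetical node attributes and helpers):

```python
import math

def approximate_nearest_neighbor(node, target, epsilon, nn_dist=math.inf, nn=None):
    # Same traversal as the exact search, but a branch is pruned unless its
    # envelope is closer than nn_dist / (1 + epsilon); with epsilon == 0.0
    # this degenerates to the exact algorithm.
    if node.leaf:
        for point in node.points:
            d = distance(point, target)
            if d < nn_dist:
                nn_dist, nn = d, point
    else:
        for child in sorted(node.children, key=lambda c: node_distance(c, target)):
            if node_distance(child, target) < nn_dist / (1.0 + epsilon):
                nn_dist, nn = approximate_nearest_neighbor(
                    child, target, epsilon, nn_dist, nn)
    return nn_dist, nn
```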

If we denote with Z the query point, with C the current nearest neighbor, and with A the closest point to Z in the pruned branch (see figure 10.21), we have, in fact,

dist(Z, A) ≥ dist(Z, C)/(1+ϵ), and therefore dist(Z, C) ≤ (1+ϵ)*dist(Z, A)
So, if the distance to the closest point in the pruned branch is at least the current NN’s distance divided by (1+ϵ), we are guaranteed that the current nearest neighbor is at most a factor (1+ϵ) further from the query point than any point held in that branch.

Of course, there could be several points in the dataset that are within a factor (1+ϵ) from the true nearest neighbor, so we are not guaranteed that we get the second-closest point, or the third, and so on.

However, the probability that such points exist is proportional to the size of the ring region with radii nnDist and nnDist/(1+ϵ), so the smaller we set ϵ, the lower the chance that we are missing closer points.
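To get a feel for the numbers, in two dimensions the area of this ring is

π*nnDist² − π*(nnDist/(1+ϵ))² = π*nnDist²*(1 − 1/(1+ϵ)²)

so with ϵ = 0.1 the ring covers roughly 17% of the disc of radius nnDist, and with ϵ = 0.05 only about 9%: shrinking ϵ quickly shrinks the region where a missed closer point could hide.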

A more accurate estimate of this probability is given by the area of the intersection of the aforementioned ring with the bounding envelope of the node we skip. Figure 10.23 illustrates this idea and shows a difference between SS-trees and k-d trees or R-trees: the probability is maximal when the ring’s inner radius is just tangent to the bounding region, and a sphere has a smaller intersection with the ring than any rectangle does.

If we set ϵ==0.0, then we get back the exact search algorithm as a special case, because nnDist/(1+ϵ) becomes just nnDist.

When ϵ>0.0, a traversal might look like figure 10.24, where we use an example slightly different from the one in figures 10.20 and 10.21 to show how approximate NN search could miss the nearest neighbor.

In many domains we can be not just fine, but even happy, with approximate search. Take our example of the search through a dataset of images: Can we really be sure that an exact match is better than an approximate one? If we were working with geographic coordinates—say, on a map—then a factor ϵ difference could have dire consequences (at the very best, taking a longer route would be expensive, but it might get as bad as safety issues). But when the task is finding the dresses that most closely resemble a purchase, we can’t even guarantee the precision of our metric. Maybe a couple of feature vectors are slightly closer than another pair, but in the end, to the human eye, the latter images look more similar!

So, as long as the approximation error ϵ is not too large, chances are that an approximated result for “most similar image” will be as good as, or maybe even better than, the exact result.

The interested reader can find plenty of literature on the topic of approximated similarity search and delve deeper into the concepts that we could only superficially examine here. As a starting point I suggest the remarkable work of Giuseppe Amato.

Source: La Rocca, Marcello (2021), Advanced Algorithms and Data Structures, Manning Publications.
