Similarity Search Trees: Right where we left off

Let’s briefly recap where we left off in previous chapters. We were designing software for an e-commerce company: an application to find, for any point on a very large map, the closest warehouse selling a given product. Check out figure 9.4 to visualize it. To get a ballpark idea of the scale we need to handle, we want to serve millions of clients per day across the country, shipping products from thousands of warehouses also spread across the map.

In section 8.2, we established that a brute-force approach is not practical for applications at scale, and that we need a brand-new data structure designed for multidimensional indexing. Chapter 9 described k-d trees, a milestone in multidimensional data indexing and a true game changer, which worked perfectly with the example from chapters 8 and 9, where we only needed to work with 2-D data. The only issue we faced was that our dataset was dynamic, so insertions and removals would produce an imbalanced tree. We could, however, rebuild the tree every so often (for instance, after 1% of its elements had changed through insertions or removals) and amortize the cost of the operation by running it in a background process, keeping the old version of the tree in the meantime and either putting inserts/deletes on hold or reapplying them to the new tree once it had been created and “promoted” to current.
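To make the amortized-rebuild idea concrete, here is a minimal Python sketch of the bookkeeping involved. `RebuildingIndex` and its `_build_tree` placeholder are hypothetical names standing in for a real k-d tree with a balanced bulk-load routine; a production version would run the rebuild in a background process, as described above.

```python
class RebuildingIndex:
    """Hypothetical wrapper that rebuilds its tree after too many updates.

    `_build_tree` is a stand-in for whatever balanced bulk-load routine
    the underlying k-d tree provides; it is not a real library call.
    """
    REBUILD_FRACTION = 0.01  # rebuild after 1% of the elements have changed

    def __init__(self, points):
        self.points = list(points)
        self.tree = self._build_tree(self.points)
        self.updates_since_rebuild = 0

    def _build_tree(self, points):
        # Placeholder for a balanced k-d tree construction.
        return sorted(points)

    def insert(self, point):
        self.points.append(point)
        self._note_update()

    def _note_update(self):
        self.updates_since_rebuild += 1
        if self.updates_since_rebuild > self.REBUILD_FRACTION * len(self.points):
            # In production this would happen in a background process,
            # swapping the new tree in once it's ready.
            self.tree = self._build_tree(self.points)
            self.updates_since_rebuild = 0

idx = RebuildingIndex([(i, i % 7) for i in range(200)])
for x in range(3):
    idx.insert((200 + x, 0))
print(idx.updates_since_rebuild)  # → 0: the third insert crossed the 1% threshold
```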

While in that case we could find workarounds, in other applications we won’t necessarily be so lucky. There are, in fact, intrinsic limitations that k-d trees can’t overcome:

  • K-d trees are not self-balancing, so they perform best when they are constructed from a stable set of points and the number of inserts and removals is limited with respect to the total number of elements.
  • The curse of dimensionality: When we deal with high-dimensional spaces, k-d trees become inefficient, because the running time for search grows exponentially with the dimension of the dataset. For points in k-dimensional space with k around 30 or larger, k-d trees can’t give any advantage over brute-force search.
  • K-d trees don’t work well with paged memory, because they are not memory-efficient with respect to locality of reference: points are stored in tree nodes, so points that are close in space won’t generally lie in nearby memory areas.

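The curse of dimensionality can be observed directly: for uniformly random points, the ratio between the farthest and nearest distance from a query point shrinks toward 1 as the dimension grows, so a space-partitioning tree can prune less and less of the search space. A small self-contained experiment (the function name and constants below are ours, chosen for illustration):

```python
import random

def distance_spread(dim, n_points=200, seed=7):
    """Ratio between the farthest and nearest distance from the origin
    for random points in the unit hypercube [0, 1]^dim.

    As dim grows, the ratio approaches 1: all points look roughly
    equidistant, which is why k-d trees degrade toward a linear scan."""
    rng = random.Random(seed)
    dists = []
    for _ in range(n_points):
        p = [rng.random() for _ in range(dim)]
        dists.append(sum(x * x for x in p) ** 0.5)
    return max(dists) / min(dists)

for dim in (2, 10, 100, 1000):
    print(dim, round(distance_spread(dim), 2))
```

In low dimensions the nearest point is much closer than the farthest one; by dimension 1000 the two are nearly indistinguishable.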
1. A new (more complex) example

To illustrate a practical situation where k-d trees are not the recommended solution, let’s pivot from our warehouse search and imagine a different scenario, where we fast-forward 10 years. Our e-commerce company has evolved and doesn’t sell just groceries anymore, but also electronics and clothes. It’s almost 2010, and customers expect valuable recommendations when they browse our catalog; even more importantly, the company’s marketing department expects that you, as CTO, increase sales by showing customers suggestions they actually like.

For instance, if customers are browsing smartphones (the hottest product in the catalog, ramping up to rule the world of electronics back in the day!), your application is supposed to show them more smartphones in a similar price/feature range. If they are looking at a cocktail dress, they should see more dresses that look similar to the one they (possibly) like.

Now, these two problems look (and partially are) quite different, but they both boil down to the same core issue: given a product with a list of features, find one or more products with similar features. Obviously, the way we extract these feature lists from a consumer electronics product and a dress is very different!

Let’s focus on the latter, illustrated in figure 10.1. Given an image of a dress, find other products in your catalog that look similar—this is a very stimulating problem, even today!

Figure 10.1 Feature extraction on an image dataset. Each image is translated into a feature vector (through what’s represented as a “black box” feature extractor, because we are not interested in the algorithm that creates these vectors). Then, to search for an entry P, we compare P’s feature vector to each image’s vector, computing their mutual distance according to some metric (in the figure, Euclidean distance). Notice that when looking for the minimum of these Euclidean distances, we can compare the squared distances instead, avoiding a square root operation for each entry.
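The linear scan described in the caption can be sketched in a few lines of Python. The catalog vectors below are made-up 4-feature examples; since squaring is monotone for non-negative values, comparing squared distances ranks neighbors exactly as true Euclidean distances would, while skipping the square root.

```python
def squared_distance(u, v):
    """Squared Euclidean distance: monotone in the true distance,
    so it yields the same nearest neighbor without computing sqrt."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def nearest(query, catalog):
    """Brute-force scan over the catalog: O(n * d) per query."""
    return min(catalog, key=lambda vec: squared_distance(query, vec))

# Hypothetical 4-feature vectors for three catalog images.
catalog = [
    (0.9, 0.1, 0.3, 0.5),
    (0.2, 0.8, 0.7, 0.1),
    (0.85, 0.15, 0.35, 0.45),
]
print(nearest((0.88, 0.12, 0.3, 0.5), catalog))  # → (0.9, 0.1, 0.3, 0.5)
```

This baseline is exactly what the data structures in this chapter are designed to beat once the catalog grows large.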

The way we extract features from images completely changed in the last 10 years. In 2009, we used to extract edges, corners, and other geometrical features from the images, using dozens of algorithms specialized for the single feature, and then build higher-level features by hand (quite literally).

Today, instead, we use deep learning for the task, training a convolutional neural network (CNN) on a large dataset and then applying it to all the images in our catalog to generate their feature vectors.

Once we have these feature vectors, though, the same question arises now as then: How do we efficiently search the most similar vectors to a given one?

This is exactly the same problem we illustrated in chapter 8 for 2-D data, now applied to a huge dataset (with tens of thousands of images/feature vectors) whose tuples have hundreds of features.

In contrast to feature extraction, the search algorithms haven’t changed much in the last 10 years, and the data structures that we introduce in this chapter, invented between the late 1990s and the early 2000s, are still cutting-edge choices for efficient search in vector spaces.

2. Overcoming k-d trees’ flaws

Back in chapter 9, we also mentioned a couple of possible structural solutions to cope with the problems discussed in the previous section:

  • Instead of partitioning points using a splitting line passing through a dataset’s points, we can divide a region into two balanced halves with respect to the number of points or the sub-region’s size.
  • Instead of cycling through dimensions, we can choose at every step the dimension with the greatest spread or variance and store the choice made in each tree node.
  • Instead of storing points in nodes, each node could describe a region of space and link (directly or indirectly) to an array containing the actual elements.

These solutions are the basis of the data structures we will discuss in this chapter, R-trees and SS-trees.
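As a sketch of the second refinement above, here is how choosing the split dimension by greatest variance (rather than cycling round-robin) and splitting at the median might look in Python. The function names are ours, not from any specific library, and a real tree node would store the chosen dimension as the bullet suggests.

```python
def choose_split_dimension(points):
    """Pick the dimension with the greatest variance, instead of
    cycling through dimensions round-robin as a plain k-d tree does."""
    dims = len(points[0])

    def variance(d):
        vals = [p[d] for p in points]
        mean = sum(vals) / len(vals)
        return sum((v - mean) ** 2 for v in vals) / len(vals)

    return max(range(dims), key=variance)

def split_at_median(points):
    """Split the points into two balanced halves along that dimension;
    the chosen dimension would be stored in the tree node."""
    d = choose_split_dimension(points)
    ordered = sorted(points, key=lambda p: p[d])
    mid = len(ordered) // 2
    return d, ordered[:mid], ordered[mid:]

# Dimension 1 has the larger spread, so the split happens along it.
points = [(1, 50), (2, 10), (3, 90), (4, 40), (5, 70), (6, 20)]
d, left, right = split_at_median(points)
print(d, len(left), len(right))  # → 1 3 3
```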

Source: Rocca Marcello La (2021), Advanced Algorithms and Data Structures, Manning Publications (2021)
