Applications of nearest neighbor search: Other applications

Maps aren’t the only field of application for nearest neighbor search. They weren’t even the application for which k-d trees were originally invented; k-d trees just happened to work very well in this domain, but these containers were designed to solve problems in higher-dimensional spaces, rather than in 2-D or 3-D space.

To give you an idea of the vast number of fields that can leverage these algorithms, we’ll briefly go through a few examples from very different domains.

1. Color reduction

The problem is simple: you have an RGB bitmap image using a certain palette; for example, the usual 16-million-color RGB palette, where each pixel of the image has three channels associated with it—red, green, and blue—and each channel has an intensity value between 0 and 255. Figure 11.9 shows an example of how a bitmap image is encoded. For instance, if a pixel is meant to be completely red, it will be associated with the tuple (R=255, G=0, B=0), a darker blue pixel with something like (0, 0, 146), and a yellow one with (255, 255, 0); black is (0, 0, 0), and white is (255, 255, 255).

To store such an image, we need 3 bytes per pixel.[2] For an image with a resolution of 1280 x 720 pixels,[3] this means 1280 × 720 × 3 = 2,764,800 bytes, roughly 2.6 MB; and a 1920 x 1080 pixel[4] image requires almost 6 MB at 16M colors.

To save space, you could either use a compressed format, which may cause a loss of information or transform the data into a different space,[5] or reduce the number of colors used for each pixel. Suppose that you are storing images for a specific purpose, such as images of road signs to be fed to a machine learning model for training.

If you don’t care about the background, road signs themselves use a very limited set of colors, so you can decide it is fine to down-sample the colors you use to a 256-color scale.[6] This cuts your storage needs by a factor of 3, which means saving roughly 4 GB for every thousand full-HD pictures stored (and if you are storing or processing them in the cloud, this likely means huge savings of time and money).

So the problem is this: How do you transform each image from one color scale to the other while keeping the highest fidelity?

You will have to choose 256 colors and “bin” each of the 16M original colors into these 256 buckets. The key is how you choose the destination color scale. There are, of course, many ways to do that: you could choose the same 256 colors for all images, perhaps by uniformly sampling the original 16M scale, but you could also choose, for each image, the scale that best reduces the loss of information.

How you choose this scale was actually the subject of an interview question I was asked by one of the best engineers with whom I had the pleasure to work. We’ll leave it as an exercise for the reader to think about how this problem can be best solved, so we avoid spoiling the question.

But once you somehow come up with the best choice for the 256 colors to use, how do you transform each pixel from one scale to the other?

Here is where nearest neighbor search comes into play: we create a k-d tree or SS-tree containing each of the 256 colors selected for the destination scale; the dimension of the search space will be, as you can imagine, 3.

For each pixel in the original image, we search for its nearest neighbor in the tree and store the index of the color in the destination scale closest to the pixel’s original color. Listing 11.9 shows pseudo-code for these operations, using an SS-tree.
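That listing isn’t reproduced here, but to make the idea concrete, here is a minimal sketch of the mapping step, assuming the 256-color palette has already been chosen. SciPy’s k-d tree stands in for the SS-tree (any container supporting 3-D nearest neighbor queries would do), and the function and variable names are illustrative, not taken from the book’s listing.

import numpy as np
from scipy.spatial import cKDTree

def reduce_colors(image, palette):
    # image: (height, width, 3) array of RGB pixels; palette: (256, 3) array of RGB colors.
    # Returns a (height, width) array of indices into the palette.
    tree = cKDTree(palette)                  # index the 256 destination colors (3-D points)
    pixels = image.reshape(-1, 3)            # flatten the image into a list of RGB triples
    _, indices = tree.query(pixels, k=1)     # nearest palette color for each pixel
    return indices.reshape(image.shape[:2]).astype(np.uint8)  # 1 byte per pixel instead of 3

# Example usage with a random palette and a random 720p image
palette = np.random.randint(0, 256, size=(256, 3))
image = np.random.randint(0, 256, size=(720, 1280, 3))
indexed = reduce_colors(image, palette)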

You can see how simple it is to perform such advanced tasks when you have the right data structure! That’s because all the complexity is encapsulated in the implementation of the SS-tree (or k-d tree, or an equivalent structure), and that’s also why the main goal of this book is helping readers recognize the situations where they can use these data structures; that alone will make you stand out as a developer, producing faster and more reliable code.

2. Particle interaction

In particle physics simulations, scientists need to model systems where a large number of atoms, molecules, or sub-atomic particles interact in a closed environment. For instance, you could simulate the evolution of a gas when the temperature changes, when a laser beam hits it, and so on.

Figure 11.10 shows a simplification of what such a simulation could look like. Considering that at an average 25°C room temperature there are approximately 10²² molecules in a liter of air, you can imagine that even with small boxes and rarefied gases, your simulation would have to handle billions of billions of items at each step.

These particle simulations are all about computing the interactions between particles, but with such numbers, checking how each particle interacts with every other particle is simply not feasible: there would be ~10⁴⁰ pairs to check, and that number is just too big for any traditional computer.

Moreover, it doesn’t always even make sense: electrical force and gravity have a limited range of action, and so outside of a certain radius the magnitude of the interaction between two particles would be negligible.

You see where I’m going, right? This is the perfect use case for either an n-nearest neighbor search, where we approximate the simulation by assuming that each particle is only influenced by the n particles closest to it (and by tuning n we can trade off precision for speed), or a range search, where we only check the interaction of each particle with the ones inside the radius of action of the four fundamental forces (or a subset of them, depending on the type of particles).

Listing 11.10 describes a possible implementation of such a simulation, leveraging an SS-tree to perform range-based queries and filter, for each particle, the surrounding particles with which it is relevant to compute the interaction. The catch is that the tree (k-d tree, SS-tree, or an equivalent data structure) needs to be updated after every step of the simulation (since the position of each particle changes), but even so, the speedup that can be obtained is impressive.
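Listing 11.10 isn’t reproduced here either; the sketch below shows what a single step of such a simulation could look like under some simplifying assumptions (a fixed cutoff radius, unit-mass particles, and a caller-supplied pairwise_force function, none of which come from the book’s listing). Again, SciPy’s k-d tree plays the role of the SS-tree, and it is rebuilt from scratch at every step because the particles move.

import numpy as np
from scipy.spatial import cKDTree

def simulation_step(positions, velocities, cutoff, dt, pairwise_force):
    # positions, velocities: (N, 3) float arrays; pairwise_force(p_i, p_j) returns a 3-D force vector.
    tree = cKDTree(positions)                     # rebuilt at every step, since particles move
    forces = np.zeros_like(positions)
    for i, p in enumerate(positions):
        # Only the particles within the cutoff radius contribute to the force on p
        for j in tree.query_ball_point(p, r=cutoff):
            if j != i:
                forces[i] += pairwise_force(p, positions[j])
    velocities += forces * dt                     # unit mass, so acceleration equals force
    positions += velocities * dt
    return positions, velocities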

3. Multidimensional DB query optimization

As we saw in chapter 9, k-d trees support multidimensional range queries (MDRQ), searches that select intervals in two or more dimensions of a multidimensional search space (for instance, as we suggested in chapter 9, a query that searches for every employee between thirty and sixty years old who earns between 40 and 100 thousand dollars per year).

These queries are common in business applications, and many databases support optimized techniques to speed them up. While you won’t find it in MySQL, PostgreSQL has supported NN search indexes since version 9, and Oracle implements them in Extensible Indexing.

When indexing tables with a single key field (and many non-key fields), we can use a binary search tree to provide fast (logarithmic) lookup in searches based on the indexed field.

K-d trees provide a natural extension of that use case when we need to index tables with composite keys: while traversing the tree, we cycle through all the fields in the composite key. Moreover, k-d trees provide methods for exact match, best match (nearest neighbor search), and range search; partial match queries can also be supported.

Figures 11.11 and 11.12 show how a partial range search works on a k-d tree.

Many SQL queries can be directly translated into calls to our data structure’s methods. Let’s see a few examples of how that is possible.

First, let’s set the context. Imagine we have a SQL table with three fields: name, birthdate, and salary. The first two could be enough for a primary key,[8] but we also want to create an index on salary, because for some reason we run a lot of queries on salary. Thus, our index will use a 3-D tree with the same fields.

Table 11.1 shows a few examples of SQL snippets translated into calls to the k-d tree’s methods.
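The table isn’t reproduced here, but the flavor of the translation can be sketched with a toy, brute-force stand-in for the index. The ToyIndex class and its method names (search, range_search, nearest_neighbor) are illustrative assumptions, not the book’s API; a real implementation would back these methods with the k-d tree described above.

class ToyIndex:
    def __init__(self, rows):
        self.rows = rows  # list of (name, birthdate, salary) tuples

    def search(self, key):
        # exact match on the full composite key
        return [r for r in self.rows if r == key]

    def range_search(self, ranges):
        # ranges: dict mapping field position -> (low, high); unlisted fields are unconstrained
        def ok(row):
            return all(lo <= row[f] <= hi for f, (lo, hi) in ranges.items())
        return [r for r in self.rows if ok(r)]

    def nearest_neighbor(self, key):
        # "best match": here crudely by salary distance only, just to show the call shape
        return min(self.rows, key=lambda r: abs(r[2] - key[2]))

index = ToyIndex([('Alice', '1985-03-01', 70000),
                  ('Bob',   '1958-07-20', 95000),
                  ('Carol', '1990-11-05', 42000)])

# SELECT * FROM employees WHERE name = 'Alice' AND birthdate = '1985-03-01' AND salary = 70000;
exact = index.search(('Alice', '1985-03-01', 70000))

# SELECT * FROM employees WHERE salary BETWEEN 40000 AND 100000
#   AND birthdate BETWEEN '1960-01-01' AND '1990-12-31';
in_range = index.range_search({1: ('1960-01-01', '1990-12-31'),   # birthdate (ISO strings compare lexically)
                               2: (40000, 100000)})               # salary

# "Find the employee closest to this profile" has no plain-SQL analogue,
# but maps directly to a nearest neighbor (best match) query on the index:
closest = index.nearest_neighbor(('Dave', '1980-01-01', 65000))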

4. Clustering

Finally, we get to one of the most important applications of nearest neighbor search: clustering. This application is so important that we will devote a whole chapter, the next one, to explaining two clustering algorithms that use NN search at their core: DBSCAN and OPTICS.

We’ll provide a proper description of what clustering is in the next chapter. For now, suffice it to say that clustering is an unsupervised learning method, where a machine learning model is fed a set of unlabeled points and outputs a possible grouping of these points into meaningful categories. For instance, we could develop a clustering algorithm that, given a dataset of people (with age, education, financial situation, and so on), groups them into categories that share similar interests. The algorithm won’t be able to tell us what these categories are, though; it’s a data scientist’s job to study the algorithm’s output and see, for instance, that one category matches middle-class teenagers, another seems to fit college-educated baby boomers, and so on. This kind of algorithm is often used to target online advertisement.

Clustering is also used as a preliminary step for other, more sophisticated algorithms, because it provides a cheap way to break down large datasets into coherent groups. We’ll see this and a lot more in the next chapter.

Source: La Rocca, Marcello (2021), Advanced Algorithms and Data Structures, Manning Publications.
