The nearest neighbors search problem and solutions

1. The nearest neighbors search problem

Let’s start our journey for this chapter from figure 8.1: a map showing a few cities (taken from an alternate universe) and the locations of some warehouses in their area.

Imagine you are living in the 1990s, the dawn of the internet era, with e-commerce taking its very first steps.

You are running an online store where you sell locally produced products by collaborating with a few retailers. They sell to shops in the real world, and you provide them with the infrastructure to also sell online, for a small price.

Each warehouse will take care of deliveries for its orders, but to gain traction and have more retailers hop on board, you offer a special deal: for each delivery further than 10 km away, you will cut your commission proportionally to the distance.

Back to figure 8.1. You are the chief architect of this company, and your main goal is to find, whenever a client orders a product, the closest warehouse that has the product in stock, staying, if possible, within a distance of 10 km.

Long story short, to keep your company in business and to keep your job, it’s vital that you always redirect each user to the closest warehouse.

Imagine that someone from Gotham City tries to order some French cheese. You look at your list of warehouses, compute the distance between your customer’s mailing address and each of them, and pick the closest one, P-5. Immediately after that, someone from Metropolis buys two wheels of the same cheese; unfortunately, you can’t reuse any of the distances computed before, because the source point, the location of the customer, is completely different. So, you just go through that list of stores again, compute all the distances, and choose warehouse B-2. If the next request comes, say, from Civic City, off you go! You once again need to compute all n distances, to all n warehouses.
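The brute-force approach just described fits in a few lines of code. In this sketch, the warehouse names come from the chapter’s example, but the coordinates are invented purely for illustration:

```python
import math

# Hypothetical warehouse coordinates (in km); the numbers are made up.
warehouses = {
    "P-5": (3.0, 7.5),
    "B-2": (12.0, 1.0),
    "W-3": (8.5, 9.0),
}

def closest_warehouse(customer, warehouses):
    """Linear scan: compute the distance to every warehouse, keep the minimum.
    Every query costs O(n) distance computations, which is exactly the cost
    we would like to avoid."""
    best_name, best_dist = None, math.inf
    for name, location in warehouses.items():
        d = math.dist(customer, location)
        if d < best_dist:
            best_name, best_dist = name, d
    return best_name, best_dist
```

Every new customer, from Gotham City or Metropolis, triggers a full scan over all n warehouses, since nothing computed for one source point helps with another.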

2. Solutions

Now, I know, figure 8.1 only shows five warehouses, so it seems a trivial, quick operation to go through all warehouses for each user. You could even handle orders manually, choosing case by case based on your gut feeling and experience.

But suppose that after one year, since your business is working well, more stores have decided to sell on your website, and you have close to a hundred of them in that same area. That’s challenging, and your customer care department can’t cope with a thousand orders a day: manually selecting the closest place for each order doesn’t work anymore.

So, you write a small piece of code that automatically performs the previous steps for each order and checks all the distances for each order.

After another year, however, the business is going so well that your CEO decides you are ready to go national after closing a deal that will see hundreds or thousands of medium and large stores (spread across the country) join your platform.

Computing millions of distances per user starts to seem a little overwhelming and inefficient—also, since you are only in the late ‘90s, servers are not that fast, server farms are a rarity, and data centers are something for large hardware companies such as IBM. They are not yet a thing for e-commerce.

2.1. First attempts

Your first proposal can be precomputing the closest warehouse for each user once and for all products, but that doesn’t really work, because users can and will move, or sometimes they want to have stuff sent to their office or to the post office and not to their homes. Plus, the availability of goods will change over time, so the closest shop isn’t always the best one. You would need to keep a list of shops, ordered by distance, for each customer (or at least each city). Haven’t I already mentioned that data centers aren’t yet a thing?

2.2. Sometimes caching is not the answer

So, this is one of those cases where caching doesn’t really help us. As we mentioned in chapter 7, there aren’t many such situations, but sometimes they do happen.

Reasoning in 2-D, we might try a different approach, inspired by real maps: divide our map into tiles, using a regular grid. This way, we can easily find which tile contains a point from its coordinates (simply dividing the value of each coordinate by the size of a tile; see figure 8.2) and search the closest points in the same tile or in the neighboring tiles. This, indeed, seems to help reduce the number of points we need to compare; there is, however, a catch. This approach works if data is regularly spaced, which is usually not the case with real datasets, as shown in figure 8.2.

Real data forms clusters, dense conglomerates that gather many points close to each other, alternating with sparse areas with few points. With a regularly spaced grid, the risk is having many empty tiles and a few tiles that gather hundreds or thousands of points, hence defeating the purpose of this approach. We need something different, something more flexible.

2.3. Simplifying things to get a hint

The solution to this problem seems elusive. In these cases, it sometimes helps to solve a simplified version of the problem and then come up with a more general solution that works for our initial problem.

Suppose, for instance, that we were able to restrict our search to a 1-D search space. Say that we need to serve only customers on a single road that extends for miles and miles, and all our warehouses are also placed along this same road.

To simplify things even further, suppose that the road is perfectly straight, and that the total distance we cover is short enough that we don’t have to worry about the earth’s surface being curved, latitude, longitude, etc. Basically, we assume that an approximation with a 1-D segment is close enough to our space, and we can use Euclidean distance in 1-D as a proxy for the real distance between cities.

Figure 8.3 might help you picture the scenario. We have a line segment with a starting point marked as 0, and the cities and warehouses that we saw in figure 8.1, but now all of them are on the same line.

This is an approximation of the initial scenario, where those points belong to a 2-D plane, which in turn is an approximation of the reality, where the same points are on a 3-D curved surface. Depending on the use case, we might be fine with either approximation, or require a more precise model taking into account the curvature of the earth’s surface.

Given a random point on the segment, we would like to know which one of the reference points is closer. Being in a 1-D case, this looks awfully similar to binary search, right? Check figure 8.4 to see how binary search can be used to find the closest 1-D point.
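On a sorted array of 1-D coordinates, the standard library’s bisect module gives us exactly this behavior: binary search finds the insertion point for the target, and the nearest neighbor must be one of the two values that bracket it. A sketch with made-up coordinates:

```python
import bisect

def nearest_1d(points, target):
    """Find the value in a sorted list closest to target.
    Binary search gives the insertion point; the answer is one of the
    (at most) two values around it."""
    i = bisect.bisect_left(points, target)
    candidates = points[max(i - 1, 0):i + 1]  # at most two neighbors
    return min(candidates, key=lambda p: abs(p - target))
```

For a target at distance 75 from the origin, as in figure 8.6, the search narrows down to the two points straddling 75 and picks the closer one.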

2.4. Carefully choose a data structure

Binary search on an array is cool, but arrays are not really known for their flexibility (as we discuss in appendix C). If we would like to add another point between W-3 and B-2, for instance, we would have to move all the array entries from B-2 to B-4, and possibly reallocate the array if it’s a static array.

Luckily, we do know a data structure that is more flexible than arrays and would allow us to perform a binary search efficiently. As its name suggests, a binary search tree (BST) is what we are looking for. In figure 8.5 we show a balanced binary search tree; remember that we need the tree to be balanced to guarantee logarithmic running time on the most common operations.

For this example, we show a tree that contains both cities and warehouses. You can imagine, for the sake of simplification, that each city has a big warehouse or distribution center, so our searches can just return the closest entry (either a city or a warehouse) to a customer (that is not in one of the cities in the tree).

And indeed, insertion, removal, and search are guaranteed to be logarithmic on our balanced BST. This is much better than our initial linear time search, isn’t it? A logarithmic running time grows amazingly slowly; just think that for a million points, we would go down from a million distances to be computed to just about 20!

Figure 8.6 shows how we would run binary search on the binary search tree shown in figure 8.5 to find the nearest neighbor of a point whose distance from the origin (aka its x coordinate) is 75. If we had an exact match, the nearest neighbor would be the result of binary search. When we don’t have an exact match, which is the most common case, then the nearest neighbor will always be either the node where the binary search fails, or its parent.

So, what’s the algorithm to find the nearest neighbor of a 1-D point, when our dataset is stored in a binary search tree?

  • Run search on the binary search tree.
  • If there is an exact match, the entry found is the nearest neighbor (with distance equal to 0).
  • Otherwise, check which entry is closest to your target between the last entry visited (the one on which search stopped) and its parent.
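The steps above translate almost directly into code. This sketch assumes a plain BST node with key, left, and right fields (balancing logic is omitted); rather than comparing only the last node and its parent, it tracks the best key seen along the search path, which gives the same answer and keeps the code simpler:

```python
class Node:
    """A bare-bones BST node; rebalancing is out of scope for this sketch."""
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def nearest_neighbor(root, target):
    """Walk down the tree as in ordinary BST search, remembering the
    closest key encountered. In 1-D the nearest neighbor always lies on
    the search path, so this returns the true closest key."""
    best = None
    node = root
    while node is not None:
        if best is None or abs(node.key - target) < abs(best - target):
            best = node.key
        if target == node.key:
            return node.key  # exact match: distance 0
        node = node.left if target < node.key else node.right
    return best
```

On a balanced tree, this visits O(log n) nodes per query, matching the cost of a plain search.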

Now that we have brilliantly solved the problem in 1-D, the question arises: Can we use a similar data structure to solve the problem in 2-D?

3. Description and API

Of course, the answer is yes. Probably the fact that we asked the question already led you to suspect it. But nonetheless, getting from 1-D to 2-D is a big leap. There is no easy way to imagine a tree that works in two dimensions. Worry not, though; we’ll get into a detailed explanation in the next section. Once we have taken that leap, it will be easy to move to 3-D and in general to hyper-spaces with an arbitrary number of dimensions.

We also won’t be limited to datasets that lie in 2-D or 3-D geometric space. The dimensions can be anything, as long as we can define a distance measure on them, with the caveat that the distance measure respects some requirements; namely, it needs to be Euclidean distance. For example, we can have 2-D entries where the first coordinate is their price and the second coordinate is their rating, and then we can ask for the closest entry to a target tuple, like ($100, 4.5 rating). Even more, we will be able to ask for the n closest entries to that target.

In this and the following chapters we are going to describe three data structures, three containers, that allow for efficient nearest neighbor queries. But not just that: they will also provide a few special operations:

  • Retrieving the N closest points to a target point (not necessarily in the container)
  • Retrieving all the points in the container within a certain distance from a target point (geometrically interpretable as all points within a hyper-sphere)
  • Retrieving all the points in the container within a range (all points lying inside a hyper-rectangle, or a semi-hyperspace)

Let’s now briefly introduce the three structures we will describe in these chapters:

  • K-d tree—A k-d tree is a special binary tree in which every non-leaf node represents a splitting hyper-plane that divides the k-dimensional space into two half-spaces. Points on one side of this splitting hyper-plane are stored in the left subtree of the node, and points in the other half-space created by the hyper-plane are stored in the right subtree. We’ll focus on k-d trees in chapter 9.
  • R-tree—The R here is for rectangle. An R-tree groups nearby points and defines the minimum bounding box (that is, hyper-rectangle) containing them. Points in the container are partitioned in a hierarchical sequence of minimum bounding boxes, one for each intermediate node, with the one at the root containing all the points, and the bounding box of each node fully containing all its children’s bounding boxes. In chapter 10 we will give a brief description of how R-trees work.
  • SS-tree—Similar to R-trees, but Similarity Search trees use hyper-spheres as bounding regions. Hyper-spheres are constructed with a recursive structure: the leaves only contain points, while inner spheres’ children are other hyper-spheres. Either way, a hyper-sphere can gather up to a certain number n of points (or spheres), within a certain distance from the sphere’s center. When it gets more than n children, or some of them are too far away with respect to the others, the tree is rebalanced. We’ll describe in detail how this is done in chapter 10, which is devoted to SS-trees.

And finally, let’s define a generic interface, common to all concrete implementations.
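As a sketch of what such an interface might look like, here is a minimal abstract base class covering the three special operations listed above. The method names are our own invention; the book’s actual API may differ:

```python
from abc import ABC, abstractmethod

class NearestNeighborContainer(ABC):
    """Hypothetical common interface for k-d trees, R-trees, and SS-trees."""

    @abstractmethod
    def insert(self, point):
        """Add a point to the container."""

    @abstractmethod
    def remove(self, point):
        """Remove a point from the container, if present."""

    @abstractmethod
    def nearest_neighbors(self, target, n=1):
        """Return the n points closest to target (target need not be stored)."""

    @abstractmethod
    def points_within_distance(self, target, radius):
        """Return all stored points inside the hyper-sphere of the given
        radius centered at target."""

    @abstractmethod
    def points_in_region(self, lower_corner, upper_corner):
        """Return all stored points inside the axis-aligned hyper-rectangle
        defined by its two opposite corners."""
```

Each of the three concrete structures described in the following chapters would then implement this same contract, differing only in how the space is partitioned internally.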

Source: Marcello La Rocca (2021), Advanced Algorithms and Data Structures, Manning Publications.
