Applications of nearest neighbor search: Moving to a distributed application

So far, so good: we have solved the “closest hub” problem assuming we are in control of all the pieces of the application, and we can be sure that the systems registering the order and starting the process to send goods to customers are always available and never fail.

If only the world worked this way! Unfortunately, not only do systems (applications, computers, networks, and so on) fail, but there is also a good chance that, for an e-commerce web application like the one we described, some crucial pieces are not even under our control. That is, we are likely dealing with a distributed application that includes different services, running on different machines (possibly located far away from each other), and communicating through remote calls over a network. Figure 11.5 builds upon the simplified architecture in figure 11.3, depicting a more realistic situation, where the retailers’ servers live on their separate machines, and thus in a different address space, only accessible through remote calls (over HTTP, or any other communication protocol, such as IPFS).

As you can imagine, this changes the rules of the game. While we can have tight control of what runs on our machine (virtual or physical), once we introduce remote calls, we also introduce additional failure points, and we need to deal with latency. For instance, if we synchronously call a method in our application, we know it can fail, and hopefully we also know why it would fail. We know it can take some time to compute (or, possibly, even loop forever), but we are also sure that the method was called and started doing its job.

Processing an order, when we take all these factors into consideration, becomes considerably more complicated, as shown in figure 11.6.

1. Issues handling HTTP communication

When we move to distributed systems, communication over the network is an additional source of uncertainty. We know that we sent an HTTP request, but if we don’t get an answer, we have no way of knowing if the message arrived, if the network is broken, if the server is broken, or if it is hanging on a time-consuming task.[2]

We therefore have to decide how to handle this uncertainty. First of all, are we going to wait until we get a response (a synchronous call); send a request and do something else while we wait for an answer (an asynchronous call); or “fire and forget”—that is, send a request and not wait for any answer from the remote server?
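These three strategies can be sketched with Python's `concurrent.futures`; the endpoint is stubbed out with a local function, and the payloads are invented for illustration:

```python
import concurrent.futures
import time

def send_request(payload):
    # Stand-in for a remote HTTP call to a shop's server (hypothetical);
    # it just simulates a slow network round trip.
    time.sleep(0.1)
    return {"status": 200, "echo": payload}

executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)

# 1. Synchronous: block until the response arrives, or give up after a timeout.
future = executor.submit(send_request, "order-42")
try:
    response = future.result(timeout=1.0)  # wait at most 1 second
except concurrent.futures.TimeoutError:
    response = None                        # we must decide how to recover

# 2. Asynchronous: send the request, keep working, collect the result later.
pending = executor.submit(send_request, "order-43")
# ... do other useful work here ...
later = pending.result()                   # join only when we need the answer

# 3. Fire and forget: send the request and never look at the future again.
executor.submit(send_request, "order-44")

executor.shutdown()
```

Which strategy fits depends on who is waiting on the other end, which is exactly the point of the next paragraph.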

This communication channel is part of a workflow of user interaction, with users waiting for a response, and their patience is usually limited. Who would wait 10 minutes (or just two!) before hitting reload on the page?

And indeed, for an e-commerce page, where the user expects to see a live update within a reasonable time, web servers usually have short timeouts on the calls they receive, meaning they would respond with a 5XX error after a few seconds, usually less than 10.

This introduces additional challenges, because if we keep a longer timeout for our call to the shop’s server, there is a chance that the HTTP call from the customer fails, but our call to the shop succeeds afterward,[3] and we introduce a discrepancy, possibly even causing the customer to buy the same item twice. Figure 11.7 illustrates this case with sequence diagrams.

If the shops’ servers have a timeout set to 8 seconds, we need to complete all the remaining operations within 2 seconds, which likely leaves us less than a second to run the nearest neighbor search.
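As a back-of-the-envelope check, the budget can be made explicit in code; the constants below are the assumed values from this discussion (a 10-second customer-facing timeout, an 8-second shop-call timeout, roughly a second of overhead), not universal settings:

```python
# Assumed values from the discussion above, not universal constants.
CUSTOMER_TIMEOUT = 10.0   # customer-facing server answers with a 5XX after this
SHOP_CALL_TIMEOUT = 8.0   # timeout on our call to the shop's server
OVERHEAD = 1.0            # request parsing, DB access, building the response

# Whatever is left is all the nearest neighbor search can use.
nn_budget = CUSTOMER_TIMEOUT - SHOP_CALL_TIMEOUT - OVERHEAD
print(nn_budget)  # 1.0: at most one second for the NN search
```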

In short, when we move to distributed architectures, there are a lot more factors we need to be careful about that go beyond the mere use of algorithms. Nevertheless, the choice of the search algorithm becomes even more important: a bad choice can have dire consequences for our web application.

That means we need to be careful about the algorithm we implement:

  • Different data structures have different levels of performance for the average and worst case. You can decide to go with the algorithm that is the fastest on average to serve as many requests as possible in the shortest time, or to go with the algorithm that has the best worst-case performance to be sure that all requests will complete within the allotted time (even if, on average, it will be slower).
  • If the dataset to search through keeps growing, past a certain point it will probably become too large to allow you to run a NN search in the time available. At that point, you need to think about other ways to scale your application—for example, with a geographical sharding of the data; or if that doesn’t make sense for your business, with an approximate algorithm leveraging a random sharding with parallelization, and then choosing the best of the solutions returned for each shard.
  • If you are using an approximate algorithm, then usually you have a trade-off between performance and accuracy. In that case, you need to make sure you can compromise on the quality of the results to obtain an answer within the time you can afford to wait.
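The random-sharding idea from the second bullet can be sketched as follows; here `nn_search` is a plain linear scan standing in for whatever per-shard index (k-d tree, SS-tree) you would actually use:

```python
import math
from concurrent.futures import ThreadPoolExecutor

def nn_search(shard, target):
    # Exact NN within one shard; a stand-in for a k-d tree or SS-tree query.
    return min(shard, key=lambda p: math.dist(p, target))

def sharded_nn(points, target, n_shards=4):
    """Randomly partition the dataset, search each shard in parallel,
    then keep the best of the per-shard answers."""
    shards = [points[i::n_shards] for i in range(n_shards)]  # round-robin split
    with ThreadPoolExecutor(max_workers=n_shards) as pool:
        candidates = list(pool.map(lambda s: nn_search(s, target), shards))
    return min(candidates, key=lambda p: math.dist(p, target))
```

Note that as long as every shard is searched exactly and none is skipped, the combined answer is still exact; the approximation the bullet mentions enters when each shard runs a time-bounded or approximate search.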

2. Keeping the inventory in sync

If the situation doesn’t already look complicated enough to you, there is another troubling issue that we haven’t considered yet: Where do we get the information about the availability of items?

So far, we have assumed that this information is in the node of our k-d tree (or SS-tree), but that might not be the case. If you think about it, when we place an order with a retailer, their inventory goes down, but the inventory in our container does not necessarily reflect that.

There are a number of issues to take into account. The retailer could sell goods through other channels (either another e-commerce site or a brick-and-mortar shop), and we need to update our copy of the inventory when that happens. Moreover, since we communicate over a network, we need to be careful about race conditions to avoid placing orders twice or missing them.

While we could think of workarounds for both issues, it is best to switch to a different approach. Figure 11.8 shows an even more complex architecture for our app that includes a DB (which could be something like Memcached, a SQL DB, or a combination of both) and another service that runs as a daemon, independently of the main app, and periodically asks the retailers’ servers for an updated inventory. Once a response is received asynchronously, the service updates the local DB with the fresh values.
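A minimal sketch of such a daemon, with a plain dict standing in for the DB, a stubbed remote call, and an invented retailer payload (none of these names come from the book's architecture):

```python
import threading
import time

def fetch_inventory(retailer):
    # Stand-in for the remote call to a retailer's server (hypothetical payload).
    return {"item-1": 10, "item-2": 0}

def sync_daemon(retailers, db, interval, stop):
    """Periodically refresh the local DB with each retailer's inventory."""
    while not stop.is_set():
        for retailer in retailers:
            db[retailer] = fetch_inventory(retailer)  # update local copy
        if stop.wait(interval):   # sleep, but wake up early on shutdown
            break

db = {}
stop = threading.Event()
t = threading.Thread(target=sync_daemon, args=(["shop-a"], db, 0.01, stop),
                     daemon=True)
t.start()
time.sleep(0.05)   # let the daemon run a few refresh cycles
stop.set()
t.join()
print(db["shop-a"])
```

In a real deployment the interval would be minutes rather than milliseconds, and the dict would be replaced by the actual DB or cache.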

This also means that when we run the nearest neighbor search, we need to make sure that our in-memory copy of the inventory for the shops is up to date. Here, too, we will have to compromise between performance and accuracy, because making one DB call per query (even if mediated through a fast cache) is likely going to be too slow. So we probably want another daemon, running on a thread in the application server, that fetches from the database a diff containing only the values changed since the last update, goes through the list of shops (kept in some shared memory area), and updates those values.
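A sketch of the diff-based refresh, assuming a hypothetical versioned schema where each DB row is a `(shop, item, quantity, version)` tuple and we remember the highest version already applied:

```python
def apply_diff(in_memory, db_rows, last_version):
    """Patch the in-memory inventory (read by the NN search) with the
    rows changed since last_version; return the new high-water mark."""
    newest = last_version
    for shop, item, qty, version in db_rows:
        if version > last_version:           # only the diff, not a full reload
            in_memory.setdefault(shop, {})[item] = qty
            newest = max(newest, version)
    return newest

# Hypothetical usage: versions 5 and 6 are newer than our last sync (3).
inventory = {"shop-a": {"item-1": 10}}
rows = [("shop-a", "item-1", 7, 5),
        ("shop-b", "item-2", 3, 6),
        ("shop-a", "item-3", 4, 2)]   # already applied in a previous sync
last = apply_diff(inventory, rows, 3)
print(last, inventory["shop-a"]["item-1"])  # 6 7
```

Keeping a version (or timestamp) high-water mark is what lets the daemon pull only the changed rows instead of reloading the whole inventory on every pass.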

3. Lessons learned

We have delved into our e-commerce application, iterating from a coarse-grained design for a centralized application to the smallest details of a distributed system.

While this discussion can’t be exhaustive, and isn’t meant to be, I hope it was useful to provide you with an idea of how you can structure the design process that leads from an algorithm to a full, production-ready application that leverages it. Hopefully, it also provided useful pointers to the possible issues you could face in the development of a web application and what you could look for if you’d like to keep learning in this area.

Now it’s time to move on and present you with a few more problems, in completely different contexts, that can be solved using nearest neighbor search.

Source: La Rocca, Marcello (2021), Advanced Algorithms and Data Structures, Manning Publications.
