Simulated annealing + traveling salesman

I hope you have found the discussion about simulated annealing interesting so far. We learned about a tremendously useful tool, and now it’s time to put it into practice; luckily, we have just the perfect application for it!

Do you remember our e-commerce company? We left it in chapter 14 dealing with deliveries, optimizing routes for single-destination deliveries inside town.

As we mentioned, planning individual deliveries from a warehouse to customers for each order is uneconomic and unrealistic.

This doesn’t mean that what we learned in chapter 14 about optimizing routes with Dijkstra’s and A* algorithms was useless. It’s quite the opposite: that’s some finegrained optimization that we can always perform for a single delivery, going from the generic i-th destination to the next one. Since we will have to compute these routes on the fly, it’s even more important to use an efficient algorithm.

And yet, since to amortize delivery costs and stay on the market we need to load each truck with several shipments (and possibly load each truck to maximum capacity) and have it go out on daily tours with several deliveries, finding the optimum path from a source to a destination is not enough anymore.

In this section, we’ll focus on optimizing the route of a delivery truck, assuming its load (and in turn its destinations) are already fixed. Once we have improved this phase, we still have another, higher-level optimization in front of us: assigning the deliveries to the trucks to minimize the distance (or travel time, or, more likely, cost) across all trucks and all deliveries. But we need to take one step at a time, so in this chapter we’ll focus on the following problem: given a list of cities, each connected to each other by roads, find the optimal tour, aka a sequence of the cities such that we move from each city to the next one, at some cost (for instance, the distance between the two cities) and eventually return to the first city while keeping the total cost minimal.

Figure 17.8 illustrates this situation: you can see 10 real and DC Universe cities, the connections between each pair of cities, and—highlighted—the shortest tour touching each city exactly once.

Now, we are considering a single delivery per cityjust to make the example clearer, with all steps being at the same scale. Nothing changes, however, if instead we have multiple deliveries in each city. Because the intra-city distances will be smaller than the ones between different cities, deliveries in the same city will be naturally clustered together. It is also likely possible for the problem to be optimized at different levels in two steps: first, find the best order in which cities should be visited, and then, within each city, compute the best route, using the same algorithm.^[1] But these are just low- level details.

The abstract formulation of this riddle is a well-known computer science problem known as the traveling salesman problem (TSF). As you might have guessed, it’s a difficult puzzle to solve, in particular a NP-complete problem; this implies it’s both NP-Hard and in NP—the former meaning that there is no known deterministic algorithm that can solve it in polynomial time, while the latter means that if we have a candidate solution, we can check it in polynomial time.

Informally, we can say that we expect any deterministic algorithm that solves the TSP will need exponential time (although we can’t be sure, because we don’t have an answer to the P vs NP problem).

The first consequence for us of the TSP being NP-Hard is that we can’t assume we’ll be able to solve an instance of this problem on the fly, on a driver’s mobile phone, or even a laptop computer (unless the number of cities to deliver to is small; but for the 10 cities in figure 17.8, there are already 10! ~ 3.6 million possible sequences). We need computer power, and planning ahead, and possibly even computing ahead, to reuse the results as much as possible.

Even with a supercomputer at our service, in fact, the running time of an exact algorithm just grows too fast. With 15 cities, there are ~1.3 trillion possible solutions, which becomes 2.4e18 (2 billion of billions) with 20 cities. Even assuming that we can find an algorithm running in exponential time (which is asymptotically better than factorial), we couldn’t handle more than ~40 different deliveries (probably far fewer).

1. Exact vs approximated solutions

The last consideration we make is that we can afford to compromise on the optimal solution. We can be fine using a near-optimal solution; we don’t need the best possible. For instance, if we add a few miles over a total of 1,000 miles per trip, it feels acceptable, while doubling the total distance of the tour would be quite expensive, and making it 10 times costlier would be a disaster.

This is not true for every context, and there are situations where we need the absolute best because even a small difference is translated into a large cost, or maybe it could help save lives. In surgery simulations, for instance, a small error can have dire consequences, as you can imagine.

But since we are okay with sub-optimal solutions, this means we can use a heuristic to get an acceptable solution within a reasonable time.

Several heuristics have been developed specifically for TSP. For instance, for graphs whose distances obey the triangle inequality (like ours), a class of algorithms using the minimum spanning tree of a graph (see section 2.8.2) and running in linearith- mic time, O(n*log(n)), can guarantee a solution whose cost is at most twice the minimum cost, and on average^[2] between just 15 and 20% higher than the shortest tour.

We want to try a different way, though. This is clearly an optimization problem, so why not try to tackle it with our new shining tool, simulated annealing?

Simulated annealing can potentially bring us a better solution than the average provided by the MST heuristic; it can even lead us to global optimum. Moreover, let’s assume we don’t have expertise on the specific problem inside the company: overall, it might be worth trying simulated annealing, which is high-level and potentially could be reused for other optimization problems in the future.

When it comes to the cost function, we are in luck. It naturally stems from the problem’s very definition: it’s the sum of the edges between adjacent vertices in the sequence (including between its last and first entries, of course).

2. Visualizing cost

One nice thing about the solutions to the TSP is that the problem space is the set of all the possible permutations of a graph’s vertices, and since each sequence can be mapped to an integer, we have a way to show a nice 2-D chart for the cost function, and even see how the algorithm progresses. The problem space, of course, is huge when the number of vertices grows, as it appears clearly in figure 17.9, where we have to zoom in on a small portion of the domain to be able to distinguish the landscape of the cost function.

To provide a clearer view and description of the process, we need to keep the set of cities small: for instance, we can restrict to the six DC Universe cities in figure 17.8, obtaining the graph shown in figure 17.10, for which there are just 720 possible permutations of the vertices. This produces the landscape shown in figure 17.11, which is far less clogged than what we saw in figure 17.9.

Figure 17.10 Solving TSP for a subgraph of the K₁₀ graph shown in figure 17.8. The figure highlights the K₆complete graph formed by the cities belonging to the DC Universe, and the best solution found. The rest of the vertices/edges of the original graph are grayed out for clarity.

Now, since we have the full landscape, you might point out that we could just brute- force search it and find the best solution, and you’d be right; we have evaluated the cost of each single permutation to draw the chart in figure 17.11, so we can just extract its minimum. As shown in the figure, the best solution is the sequence:

The point is that we wouldn’t be able to do that for larger instances. With the full graph in figure 17.8, for example, it already takes several minutes to generate all 3.6 million possible permutations!

3. Pruning the domain

The reason we wanted to show you the cost function on the full domain of the subproblem with six vertices is because this chart can teach us a lot. For instance, we can see there are many local minima that appear to have the same cost. Can you guess why that is?

As always, take a couple of minutes (if you’d like) to think about the answer, before moving on to the solution.

To answer the question, consider these permutations of the cities in the graph:

[Opal City, Civic City, New Carthage, Happy Harbor, Gotham City, Metropolis]

[New Carthage, Happy Harbor, Gotham City, Metropolis, Opal City, Civic City]

What’s the difference between these two permutations, in terms of solutions? Since we are considering closed tours (from the first city, through all other cities, and then back to the start), they are completely identical, except the second one is shifted left by two cities. In fact, the cost of the two tours is the same (because they involve the same edges).

There are six equivalent sequences that can be derived from the one we gave as a solution, one for each city used as a starting point.

We can therefore fix in advance the city from which the tour starts, knowing that this won’t affect the result, but it will cut the number of permutations we need to examine. From 720 we go down to only 120—not bad, and it will be an even better gain for larger graphs.

Moreover, consider that this will work well with the specific instance of the problem we need to solve for our e-commerce company. We always need to start (and end) our tours at the warehouse, where the goods to deliver are loaded on the trucks.

If we set, for instance, New Carthage as the starting point, there might still be several equally good solutions (if multiple sub-paths have the same total cost), but if the graph is undirected there will only be, at most, two equivalent solutions:

[New Carthage, Happy Harbor, Gotham City, Metropolis, Opal City, Civic City]

[New Carthage, Civic City, Opal City, Metropolis, Gotham City, Happy Harbor]

That’s because if edges have the same cost in both directions, we can travel through a simple cycle (a tour) in either direction (in figure 17.10, clockwise or counter-clockwise).

But we don’t mind having two possible global solutions, so there is no further action we should take here.

4. State transitions

Now it’s time to translate this constraint into code. Luckily for us, that’s not too difficult, given how we designed the simulated annealing algorithm in listing 17.1. We need to pass the definition of the function that computes the transition to the next state, and right there, we can always set the first vertex of the sequence to the same value. We also have the cost function figured out, so we are ready to implement all the missing pieces.

Listing 17.2 starts by exploring the cost function, which, as we have discussed, is the sum of the weights of the edges between two adjacent vertices in the permutation. Still, while computing this value we need to be careful about a couple of details. First and foremost, we assume that the input graph has an edge between every pair of vertices. If this isn’t true, we need to check for it and return a special value (for instance, infinity, or any large-enough weight) to basically guarantee any solution including missing edges will be naturally discarded. Another alternative could be checking the solutions when transitions are computed and making sure that those with missing edges between adjacent vertices will be discarded early.

The other detail we would like to highlight is that we need to wrap around the array, because we have to add the cost of the edge between the last and first vertices in the sequence (the edge that closes the tour). This can be handled in many ways, using modulo being the most succinct, while the most efficient would be treating this last edge as a separate case outside the for loop.

When it comes to the transitions to a new solution, we have a few options:

Swapping adjacent vertices—After selecting a random position in the sequence, we would swap the vertex at the given index with its successor. For instance, the solution [1,2,3,4] could change into [2,1,3,4] or [1,3,2,4], and so on. This transition corresponds to a local search, where only the immediate neighborhood of the current solution is explored; this raises the probability that fine tuning leads to improvement, but makes leaps and getting out of local minima harder.
Swapping any pair of vertices—Both positions for the vertices to swap are chosen at random, and they are simply swapped: [1,2,3,4] can also become [3,2,1,4], which was not allowed by the previous operator. These kinds of transitions allow medium-range search, although the two solutions are still quite close to each other.

Generating a new sequence at random—This allows us to leap across the whole domain at any stage of the algorithm, making it less likely to lead to fine-grain improvements toward a local minimum in the final stage of the simulation, because relatively few attempts will be made in the neighborhood of current solutions. At the same time, it will also leave the door open for distant leaps to any area of the domain, whenever a better solution is (randomly) found.

And, of course, many more intermediate options are available, such as doing a fixed or random number of swaps per transition or moving a whole segment of the solution elsewhere.

Which one of these works best? That’s a good question, which is hard to answer from a theoretical point. Let’s try them out on our K₁₀ graph, as shown in figure 17.8, and see which one gives the best average result. We will perform 1,000 simulations, each running for the same number of steps and with the same values for k and a, and compare the average cost for the solutions. We’ll also assume that each sequence starts with “New Carthage”, so that we cut duplicate solutions by a factor 10. Results are summed up in table 17.1.

Of course, the domain this time is still huge, ~3.6 million permutations, but not so large that we can’t afford brute-force (armed with some patience; it takes a while); therefore, we know that the best possible solution has cost 1625 (as shown in figure 17.8):

[“New Carthage”, “Syracuse”, “Buffalo”, “Pittsburgh”, “Harrisburg”,

➥ “Opal City”, “Metropolis”, “Gotham City”, “Civic City”]

It’s interesting that the best result is obtained with the medium-range search, while the worst one is given by the local search. This means that with the configuration used, the local search gets stuck in local optima, while the random permutation is too erratic and fails to obtain local convergence in the final stage of the algorithm, when the temperature gets low.

We must stress that with different parameters, this could change completely. Take, for instance, a, the decaying factor: with a difference choice, the cooling process would be slower, so we wonder, could this allow the local search to work better? Let’s try it out; the results are in table 17.2.

Increasing the temperature decay rate from 0.97 to 0.99 means that the cooling is performed more slowly and uniformly (you can refer to figure 17.7 to visualize the decay curves). In turn, this seems to help when we use only local search around the current solution, while making thing worse when used for full-domain searches. The medium- range search performed through random vertex swaps gets even closer to the best cost, on average.

These results show a couple more things that are worth highlighting:

Even on a small case (10 vertices) and running thousands of iterations, we get arbitrarily close to the best solution, but we don’t always get the best result. That’s a calculated risk with heuristics.
Finding the best configuration for an optimization algorithm requires time, experience, and sometimes a bit of luck.

We also tried larger values for alpha, such as a=0.995, which are not shown in the table, partly because they lead to poor results. Probably the cooling becomes too slow and the algorithm gets closer to random sampling (this suspicion is corroborated by the fact that with a smaller value for k, the normalizing Boltzmann constant, the deterioration in the results smoothens).

To conclude this section on TSP, listing 17.3 shows a method that implements a random transition from current state to the next one, by combining all three methods discussed so far. An implementation of these methods for library JsGraphs can be found on GitHub.

How does this method comparatively perform? Table 17.3 compares it to the three “pure” solutions: it manages to get the best of all the possible strategies and drives the average cost down with all choices for α.

This seems to suggest that an ensemble method, with the right ratio between the long and distant local searches, can leverage the strengths of both types of transition heuristics.

The best result, with the ensemble method, is obtained with a larger decay rate, so with a more uniform cooling process.

Figure 17.12 shows the algorithm in action: how cost evolves while cooling the system. While initially the cost is fluctuating, the oscillation becomes progressively narrower and, once the temperature is low enough, search is channeled toward the global minimum.

We haven’t explored all the possible values for a, k, T₀, and even the relative probabilities of applying each of the three transition operators within the ensemble method could be further tuned. To be more systematic, we should build a few more examples with different graphs, and then write a small piece of code that changes one parameter at a time and records the mean or median cost found. Try it out using the code included in JsGraph library as a starting point; it could be a nice exercise.

5. Adjacent vs random swaps

The last point I’d like to briefly discuss is why it seems that swapping random vertices works better than swapping adjacent pairs.

One factor that contributes to this can be seen with the following example, illustrated in figure 17.13: it shows a case where the local-search transition heuristic, swapping only adjacent pairs, might struggle. First, we have an example of a swap between two random vertices (17.13 (A)) that grants a good improvement in the cost function. This transition will be always accepted, regardless of the temperature of the system.

When only swaps between adjacent pairs are allowed (17.13 (B)), in this example several pejorative moves (two, in this case, but it can be made an arbitrarily long sequence) must be accepted before getting to the same final configuration shown in figure 17.13 (A). Accepting a sequence of transitions uphill might happen in the early stages of the cooling process, but as we get to the end of the cooling cycle, it becomes progressively less likely for a sequence of bad moves to be accepted. Hence, by using only narrow search, the algorithm might find it impossible to improve an intermediate result.

The consideration that the acceptance rate for sequences of moves exponentially decays with the temperature can also suggests a different approach for our randomStep method—passing the temperature along, and making narrow search more likely to happen in the early stages (when the temperature is higher), while applying an inverse dependence for long search probability, which could then be more likely in late stages.

I encourage you to try it out by changing the code cloned from JsGraphs’s GitHub repo. Getting your hands dirty is a great way to become familiar with simulated annealing.

6. Applications of TSP

Besides our issue with delivery routes, efficient TSP approximation algorithms can benefit a lot of real-world scenarios.

The story goes that one of the pioneers of research about the TSP was motivated, in the 1940s, by the need to optimize bus routes to pick children up for school.

Later, this problem was applied pervasively to the logistic of most services that involve planning tours, from mail delivery to farming.

In time, cable companies started to use it to improve scheduling of service calls, and more recently (and more relevantly to our field), it has become a must to make the automated process of drilling holes in and soldering printed circuit boards (PCB) more efficient, and even to aid some processes in genome sequencing.

Source: Rocca Marcello La (2021), Advanced Algorithms and Data Structures, Manning Publications (2021)

Data structures and algorithms