Simulated annealing: Optimization beyond local minima

In chapter 16 we introduced local optimization techniques as a way to improve over mere random sampling algorithms. We discussed gradient descent at length and alluded to a randomized optimization technique. If you remember our analogy to marble races, gradient descent would always move a marble (our solution) along the steepest path on the track (the landscape produced by the cost function), but stop in valleys. Figure 17.1 contrasts the gradient descent approach with the random-search local optimization called hill descent. In this case, the algorithm picks a direction randomly and checks whether making a short (possibly random) step in that direction leads to an improvement. If so, it moves to the new solution; otherwise, no change is made, and another attempt is performed at the next iteration. As we also saw in chapter 16, gradient descent can, in theory, step over local minima if the learning rate is large enough; in practice, however, this is unlikely to happen.
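To make the comparison concrete, here is a minimal JavaScript sketch of hill descent as just described; costFn, randomStep, and maxSteps are illustrative placeholders for the problem-specific pieces, not names taken from the book's code:

```javascript
function hillDescent(costFn, randomStep, startSolution, maxSteps) {
  let current = startSolution;
  let currentCost = costFn(current);
  for (let i = 0; i < maxSteps; i++) {
    const candidate = randomStep(current);   // pick a random direction and a short (possibly random) step
    const candidateCost = costFn(candidate);
    if (candidateCost < currentCost) {       // greedy: keep the move only if it improves the cost
      current = candidate;
      currentCost = candidateCost;
    }
  }
  return current;
}
```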

Be warned, though, that in figure 17.1 we are not showing all the failed attempts where local random optimization tried the wrong direction: overall, getting to the same point will need more steps than gradient descent, because instead of going in the direction of maximum change, we are wandering around randomly. This is more apparent looking at a 2-D domain/3-D surface, like the one shown in figure 17.2.

As it is, this approach shares some of the same issues we discussed in chapter 16 about gradient descent. It is likely to get stuck in local minima and plateaus. And, even worse, it’s taking a slower way to get there.

On the plus side, it also has some advantages. First and foremost, we can relax the constraint that the cost function be differentiable, and we can even ignore the definition of the function altogether, as long as we have a way to compute it.[2] This is especially useful with functions such as the number of edge crossings of an embedding, a step function that is not differentiable at the points where its value abruptly changes and has a derivative identically equal to zero everywhere else.

The obvious downside, as shown in figure 17.2, is that the random optimization is going to require many more steps than gradient descent, because it will take long detours. Even worse, since it will not take the direction in which the function decreases fastest, it will likely end up in a different spot than gradient descent. It’s impossible to predict where, and whether it will find a better or worse final value. To cope with this, the only strategy is perseverance: increasing the number of runs and storing the best result across them is likely to lead to a satisfying result (the more runs, the better the expected solution).

Finally, there are issues faced in gradient descent that not even the randomized method can shake off.

For instance, while it’s possible that the random optimization algorithm gets out of a local minimum, it’s not guaranteed. Since both algorithms are greedy,[3] they only move from the current position when they find a domain point with a lower cost; it’s possible that the range of the random steps is not wide enough to get them out of a pit. You can see this in figure 17.1, where the “leap” the algorithm has to make to get itself out of a local minimum is pretty large compared to the other updates. If such a large delta is even allowed (depending on how the algorithm is configured through its hyper-parameters), it will likely take a lot of random attempts before it generates a step that is both in the right direction and sufficiently far away.

1. Sometimes you need to climb up to get to the bottom

An example of an even more troubling configuration for the local optimization algorithm is shown in figure 17.3. The gap between the local minimum where the randomized algorithm is stuck and the closest point in the cost function’s landscape with a lower value (the closest improvement) is far too wide to cross in a single step. What we would need is for the algorithm to be able to say: fine, it doesn’t matter if I’m in a minimum; I’ll climb over this hill and see if the next valley (or the one after the next) is deeper.

Of course, there are many ways to do that. For instance, we could store somewhere the best solution we have found and keep exploring, moving past a local minimum. Another option could be deciding that sometimes a step uphill is fine. We can plan it systematically, every few steps, or probabilistically, accepting a step uphill with a certain probability.

Simulated annealing uses the latter approach; while in its original formulation it doesn’t keep track of the best solutions found, this can be easily added.

This heuristic[4] takes its name from a technique used in metallurgy, annealing, that consists of repeated cycles where a material is heated up and then controllably cooled to improve its strength: something like a blacksmith forging an iron sword by quenching it in cold water, but in a more controlled way (and without water!).

Simulated annealing does something quite similar. In its simplest version it consists of a single cooling phase, but variants exist that alternate phases where the temperature rises with phases where the system cools down.

The system’s temperature, in turn, is directly connected to the energy of the simulated system and the probability that it is allowed to transition to a higher-energy state (in other words, to a solution for which the cost function takes higher values).

Figure 17.4 shows a possible path on a 3-D surface that resembles the progress of simulated annealing optimization. This time, the algorithm is capable of getting itself out of a local minimum, even by taking a counterintuitive step uphill: that’s the difference compared to greedy algorithms (including the ones summarized in section 17.1) and the true advantage of this technique.

The path shown in figure 17.4, however, though possible,[5] is not typical of simulated annealing. A more common path would be much more chaotic, and would have made the figure a bit too messy.

The algorithm, in real applications, is more likely to initially jump back and forth, exploring several areas of the landscape, and often moving uphill. Then, as cooling progresses, it will be less likely to transition to higher positions, and at the same time we can directly or indirectly reduce the length of the random steps. Overall, after the initial exploration phase, it should enter a fine-tuning phase.

An example of the whole process is shown in figure 17.5. As you can see, I was not exaggerating about how chaotic it looks!

2. Implementation

So, enough talking about how cool this algorithm is; let’s look at some pseudo-code. Listing 17.1 presents the algorithm, and a JavaScript implementation, provided as part of the JsGraphs library, is linked on the book’s repo on GitHub.
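As a rough JavaScript sketch of that template (the names simulatedAnnealing, randomStep, acceptanceProbability, and updateTemperature are illustrative choices and may not match the listing or the JsGraphs code exactly), the main loop could look like this:

```javascript
function simulatedAnnealing(costFn, start, randomStep, acceptanceProbability,
                            updateTemperature, T0, maxSteps) {
  let current = start;
  let currentCost = costFn(current);
  let best = current;                        // optional: remember the best solution ever found
  let bestCost = currentCost;
  let T = T0;

  for (let i = 0; i < maxSteps; i++) {
    const candidate = randomStep(current, T);            // note the dependency on the temperature
    const candidateCost = costFn(candidate);
    // downhill moves are always taken; uphill moves only with a temperature-dependent probability
    if (Math.random() < acceptanceProbability(currentCost, candidateCost, T)) {
      current = candidate;
      currentCost = candidateCost;
      if (currentCost < bestCost) {
        best = current;
        bestCost = currentCost;
      }
    }
    T = updateTemperature(T, i);                         // cooling schedule
  }
  return best;
}
```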

The algorithm looks beautifully simple, doesn’t it? And quite concise, although, as always, we try to present generic methods as templates, abstracting as many sub-routines as possible into helper methods that can later be implemented according to the context.

In this case, there are three of them; in the next section we’ll discuss both the function computing how the temperature evolves and the one giving the probability of accepting a transition. Now, let’s discuss the function randomStep, which allows us to build the next tentative solution to which the algorithm could transition.

This function needs, obviously, to be domain-dependent: the size of the problem space and the type of solutions will determine how we can change the current solution (a point in the problem space). For the graph-embedding problem, for instance, we can randomly move each vertex along both axes, within the maximum area in which the graph must be embedded.
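For instance, a possible sketch of randomStep for the graph-embedding case is shown below; the embedding representation (a Map from vertex to {x, y} coordinates) and the parameters width, height, and maxOffset are assumptions made for this example, and in practice they would likely be captured in a closure so that the signature matches the template above.

```javascript
function randomStep(embedding, T, width, height, maxOffset) {
  const next = new Map();
  for (const [v, pos] of embedding) {
    // maxOffset could also be scaled with T, shrinking the steps as the system cools
    const dx = (Math.random() * 2 - 1) * maxOffset;
    const dy = (Math.random() * 2 - 1) * maxOffset;
    next.set(v, {
      x: Math.min(width,  Math.max(0, pos.x + dx)),   // clamp to the drawing area
      y: Math.min(height, Math.max(0, pos.y + dy))
    });
  }
  return next;
}
```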

But as you can see at line #4, we added a dependency on the temperature for this function!

We mentioned in the previous section that you can adjust the length of the random step directly or indirectly. The short answer is that you want to be careful if you do it directly, but it’s an option. To see why, we first have to explain the other two functions in more detail and understand why simulated annealing works.

3. Why simulated annealing works

If you feel that the way simulated annealing works is slightly counterintuitive, well, you are not alone. After all, if the algorithm can make large steps, and it can also move to worse positions without making use of any knowledge of the cost function (either previous or acquired while running), how do we know it will end up in the area of the global minimum and not get stuck in some sort of funnel to a local minimum when we start reducing the temperature? Well, we don’t. But with most landscapes, if we run the algorithm long enough and find the right configuration, in practice it will get pretty close to the global optimum and outmatch gradient descent.

There are lots of “ifs” in that proposition. The truth is that, like most heuristic algorithms, it works well in practice when it’s in the hands of someone who knows how to tune it properly and has enough computing time to make it run for lots of iterations.

Still, if that’s the case, then simulated annealing is a great alternative for scenarios where gradient descent would suffer. It’s not that one is better than the other; like all tools, there are problems for which gradient descent works better, and others where it would be hard or impossible to apply it. But let’s see in detail why it does work.

The key to the algorithm is the probabilistic mechanism that lets it move uphill (lines #2-5), toward worse values of the cost function, and in particular the function that spits out the probability of accepting a positive delta, depending on the temperature and on the magnitude of this delta.

Assuming that at a given iteration, when the temperature has the value T, the algorithm attempts to transition from the current point P0 to a point P, then we can express this probability function A(P0, P, T) as

  A(P0, P, T) = e^(-(C(P) - C(P0)) / (k·T))    if C(P) > C(P0)

where C(P) is the cost of solution P (similarly for P0), and k is a constant that must be calibrated on the initial temperature and the maximum delta in the cost function, so that in the initial phase of the algorithm (when the system’s temperature is close to the initial temperature), the probability of accepting a positive delta is close to 1 for any two states.

Typically, the probability of transitioning to a lower-energy state is set to 1, so that such a transition is always allowed in any phase of the simulation.
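Putting together the formula and the rule above, a minimal sketch of the acceptance function could be the following (the default k = 1 is just a placeholder; as noted, k should be calibrated on the initial temperature and on the maximum expected delta):

```javascript
function acceptanceProbability(currentCost, candidateCost, T, k = 1) {
  const delta = candidateCost - currentCost;
  if (delta <= 0) {
    return 1;                          // transitions to lower (or equal) energy are always accepted
  }
  return Math.exp(-delta / (k * T));   // uphill: the larger the delta (or the lower T), the less likely
}
```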

But of course, this is not the only way to define this function. It is, however, a typical definition for the acceptance probability, because it stems directly from the metallurgic analogy: this formula is directly inspired by the Boltzmann distribution that measures the probability that a system is in a state with a certain energy and temperature.[7] But instead of an absolute value for the energy, for simulated annealing we consider a variation from a lower to a higher energy state.

Now, what’s the effect of this probability distribution on the single step of the algorithm? Let’s consider the case where the amplitude of the update step in the problem space is not constrained (so we could even move between opposite corners of the domain) and take a look at figure 17.6.

Initially, when the temperature of the system is high (figure 17.6 (A)), and by construction the probability distribution should allow almost any update (no matter how bad it seems), the algorithm can move all over the landscape and even get out of local minima, even if this means that it might abandon a near-optimal position for a much worse one. Actually, at this stage, it is even easy for the algorithm to walk away from good solutions.

This mimics high-energy systems, where particles (or molecules) move chaotically in all directions.

As the system cools down, the distribution changes (figure 17.6 (B)). Going uphill becomes less likely, and some positions above a certain delta (with respect to their cost) become completely unreachable.

Finally, when the temperature gets close to T_low, the halting temperature (or equivalently, as we’ll see, when we are close to the maximum number of iterations), going uphill becomes so unlikely as to be basically forbidden, and the algorithm can only transition to lower-energy states. It behaves like hill descent. Looking at figure 17.6 (C), however, you can see that there is a difference, and transitions to points far away in the domain are allowed if they correspond to a lower cost. This means the algorithm can still get out of local minima (and it’s hopefully even probable that it will!) and converge to a better minimum.

Notice that the acceptance or rejection of a transition to a new state is not related to the distance of the new state from the current one in the problem space. It’s only based on the energy levels of the two states, which in turn are given by (or at least pro­portional to) the value of the cost function in those states.

Even more importantly, after the very initial stages where basically all transitions are allowed, the further uphill a new state is, the less likely it becomes for the transition to be accepted. This is why the algorithm works well: it progressively encourages transitions toward areas where the cost is lower (see figure 17.6 (B-C)), while not limiting the search to the neighborhood of the current position in the problem space.
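To get a feel for the numbers, assume the exponential acceptance function above with k = 1 and a candidate solution whose cost is 10 units higher than the current one (purely illustrative values). At T = 100 the transition is accepted with probability e^(-10/100) ≈ 0.90; at T = 10 with probability e^(-1) ≈ 0.37; at T = 1 with probability e^(-10) ≈ 0.00005. The same uphill move goes from being almost always accepted to being practically forbidden as the system cools.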

Why do transitions uphill work? Because when going uphill, the algorithm might be able to get over a ridge and reach a deeper valley. While it might seem counterintuitive on a 2-D chart, this becomes even more relevant in high-dimensional spaces.

Similar to the metallurgic process, however, cooling down the system with the right pace is instrumental to the quality of the final result.

And, of course, being a purely stochastic process, it also needs some “luck”; especially toward the end, many random steps will produce transitions uphill, and as such will be rejected. But if we try hard enough, for many iterations, chances are that a positive change will be randomly found.

And that’s the strength of the algorithm and, at the same time, its weakness. We can reach an improvement, but since we discard progressively more of the attempted updates, we’ll need several iterations (technically: a lot of them!) and, of course, a good function for the random steps to probe the problem space.

This makes the algorithm particularly slow and resource-consuming, especially if iterations (generating a random point and evaluating the cost function) are expensive to compute.

4. Short-range vs long-range transitions

Now, again, the question arises: Do we want to progressively limit the range of transitions, and in turn how far the algorithm can move at each step in the domain space?

We could make the length of the maximum update step dependent on the temperature parameter T, so that it shrinks over time.

But the interesting part is that even if we keep the same maximum step length for the whole process, there is some kind of indirect reduction while the temperature cools. As shown in figure 17.6, the filtering actually happens on the co-domain of the cost function, but the indirect effect is limiting the domain to a subset of the original problem space; and, in turn, the effect is the same as water running down funnels—it gets channeled and velocity slows down (as pressure builds up).

If we further restrict the acceptable transitions based on proximity in the problem space, on one hand we get a greater number of attempted updates in the area surrounding the current point, which could be good if we are close to the global minimum, because it would speed up convergence.

On the other hand, we would likely lose the ability to get out of local minima, which is the best reason to use simulated annealing in the first place.

Therefore, one “safe” solution is to implement the randomStep method as pure random sampling; however, this means that we will need a lot of iterations to find a transition downhill, and the algorithm will keep bouncing between valleys without focusing on fine-tuning. An interesting compromise could be increasing the probability of small steps, but still allowing far-reaching updates, even trying them every few iterations. This can be done in combination with shrinking the range of the local search (the small steps) as the temperature cools down.
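One way to sketch such a compromise is shown below; the 5% chance of a long jump and the linear scaling of the local range with T/T0 are arbitrary, illustrative choices, and sampleDomain and perturbLocally are placeholders for domain-specific helpers:

```javascript
function mixedRandomStep(current, T, T0, sampleDomain, perturbLocally, longJumpChance = 0.05) {
  if (Math.random() < longJumpChance) {
    return sampleDomain();                 // occasional far-reaching update, anywhere in the domain
  }
  const range = T / T0;                    // shrink the local search as the temperature cools down
  return perturbLocally(current, range);
}
```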

The last thing we need to discuss is how to update the temperature. Similar to the acceptance probability, there are several viable options for this function; there are no restrictions on it.

Typically, however, geometric (aka exponential) decay is used, with the temperature value updated not every iteration, but after some interval (for instance, every 1000 iterations or so).

The mathematical formula for this function is

  T_i = α · T_(i-1),    0 < α < 1

The temperature at iteration i is a fraction of the temperature at iteration i-1; α, which must be between 0 and 1, controls how much cooler the temperature gets between two iterations.

Now, you might have to fiddle a bit with the value of α and the interval between temperature updates to tune them and get the best results. In general, with exponential decay the temperature drops quickly at the beginning and more slowly toward the end. Figure 17.7 shows a few examples of how the pace of the cooling depends on the choice of α.

In practice,  α ≈ 0.98 is usually a safe bet for an initial choice (then, you can take it from there and tune it).
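As a purely illustrative calculation: starting from an initial temperature T0 and applying α = 0.98 every 1000 iterations, after 100 updates (that is, 100,000 iterations) the temperature would be T0 · 0.98^100 ≈ 0.13 · T0. A minimal sketch of such a schedule, compatible with the updateTemperature helper used earlier, could be:

```javascript
function makeGeometricCooling(alpha = 0.98, interval = 1000) {
  return function updateTemperature(T, iteration) {
    // lower the temperature only once every `interval` iterations
    return (iteration > 0 && iteration % interval === 0) ? alpha * T : T;
  };
}
```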

5. Variants

As we have seen, simulated annealing—despite being an indisputably powerful tool—also has some flaws, the most concerning of which is its slow convergence, caused by its stochastic nature.

And so, as for all algorithms, over time many variants have been studied to remedy its shortcomings and make it even better.

We already alluded to a trivial modification that we could add: storing the best solution found across all iterations. It might happen, in fact, especially on large, multidimensional problem spaces, that during the initial high-energy phase we serendipitously land in proximity to a global minimum, then move uphill afterward, and never manage to get back to such a good result again. How likely this is to happen depends on many factors, such as the shape of the landscape (the cost function) and whether the right configuration for the parameters was found. In any case, remembering the best result ever can be an easy win.

Pushing this consideration a little further, simulated annealing with restart stores one or a few of the best results found across iterations, and when it gets stuck, it moves to one of these previous (and advantageous) positions, resetting the descent or just moving to a different area of the problem space.
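The bookkeeping behind this variant could be sketched as follows; the pool size, the ordering by cost, and the random choice of the restart position are assumptions, since the variant can be implemented in several ways:

```javascript
class RestartPool {
  constructor(size = 3) {
    this.size = size;
    this.pool = [];                              // entries of the form { solution, cost }
  }

  record(solution, cost) {
    this.pool.push({ solution, cost });
    this.pool.sort((a, b) => a.cost - b.cost);   // keep only the lowest-cost solutions
    this.pool.length = Math.min(this.pool.length, this.size);
  }

  restartFrom() {
    const i = Math.floor(Math.random() * this.pool.length);
    return this.pool[i].solution;                // resume the search from one of the stored positions
  }
}
```

The main loop would then call record whenever it finds a new low-cost solution, and restartFrom when, for instance, no improvement has been accepted for a long stretch of iterations.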

We have mentioned, if you remember, that although it is lower than for greedy algorithms, the probability of getting stuck in local minima is still not null.

In particular, this event can be caused by a few factors, possibly acting together:

  • Non-optimal choice of the algorithm’s parameters. For instance, cooling is too fast, and the algorithm gets stuck far from the global minimum.
  • Bad luck. As we said, it’s a stochastic algorithm, after all, and it might reach the optimum too soon, wander away from it, and never be able to get back once the system cools down.
  • The update step (more likely). While we saw that pure random sampling imposes the fewest constraints on reaching other areas of the problem space with lower energy, it slows down local convergence (possibly too much to be acceptable).
  • Sometimes it’s possible to have update rules that allow long steps, while other times (as we’ll see) it might be easier to implement only small steps, which means that those long leaps to different areas that could bring the algorithm out of local minima become rarer.

When some or all of these happen, using random restart could save the day.

Another issue with simulated annealing is that tuning its parameters can be tricky and annoying. In adaptive simulated annealing the algorithm’s parameters, k and α, are automatically adjusted as the algorithm progresses, or depending on the trend in the energy level. The latter approach uses mechanisms borrowed from thermodynamics, even allowing increases in temperature (simulating cycles of cooling down and warming up).

6. Simulated annealing vs gradient descent: Which one should I use?

We have seen that simulated annealing is slower than gradient descent to get to minima: the latter takes the fastest route, so it’s hard to beat on a clear course.

The caveat is that gradient descent requires a differentiable cost function and gets easily stuck in local minima.

Whenever the cost function is not differentiable (or is step-shaped), as with the minimum-crossing-embedding problem, or the problem space is discrete (for problems like the TSP, which we’ll describe later in this chapter), then simulated annealing is preferred over gradient descent.

Likewise, for problems where we have flexible requirements about running time and the quality of the solution, simulated annealing might still be preferable to gradient descent. Keep in mind that simulated annealing is a Monte Carlo algorithm, and so (as we discuss in appendix F) it returns a sub-optimal solution, whose quality increases with the amount of time we allot for the algorithm to run.

When, instead, we have guarantees about the differentiability and shape of the function (for instance, if we are sure we have a bowl-shaped function, as with linear/logistic regression, and so on), then we want to take advantage of the superpowers of gradient descent.

Are there cases where simulated annealing is best avoided?

Obviously, as we have discussed, if a problem doesn’t admit near-optimal solutions, but rather demands the best possible one, then simulated annealing is not the best tool in our belt.

The shape of the cost function matters as well. When the cost function has narrow valleys, the algorithm will have a low probability of finding them, and if they hold the global minimum, it’s unlikely simulated annealing will converge to a near-optimal solution.

Finally, as we will see in our examples, an important requirement is that the cost function should be easily calculated for a new candidate solution (possibly allowing us to directly compute the delta based only on the difference with the current solution). Since this cost is computed at each iteration, a computationally intensive cost function will slow down the optimization, forcing us to run the algorithm for fewer iterations.

Source: Marcello La Rocca (2021), Advanced Algorithms and Data Structures, Manning Publications.
