How optimization works

So, random sampling seems to work. If you try the version implemented in the JsGraphs library, you can see that (on average) the crossing number provided by the random algorithm with restart is better than the one returned by the one-pass random version. Intuitively, we can understand why: if we toss a coin 100 times, it’s easier to get at least one head than if we toss the same coin just once.
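To put a rough number on that intuition: if a single random embedding reaches a given crossing number with probability p, then at least one of k independent restarts reaches it with probability 1 - (1 - p)^k. A minimal sketch in JavaScript (the value of p below is a made-up assumption, purely for illustration):

```javascript
// Probability that at least one of k independent random attempts "succeeds"
// (for example, reaches a given crossing number), when each attempt
// succeeds with probability p. The p used below is hypothetical.
function atLeastOneSuccess(p, k) {
  return 1 - Math.pow(1 - p, k);
}

console.log(atLeastOneSuccess(0.05, 1));   // one-pass version: 0.05
console.log(atLeastOneSuccess(0.05, 100)); // 100 restarts: ~0.994
```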

To see how this operates in a more formal framework, we could use a visualization of how the optimization proceeds. This is not trivial, because what we are optimizing here is a function of 2n parameters for a graph with n vertices: for each vertex, we can change its x and y coordinates.

Now, humans are really good at visualizing functions of 1 parameter in a 2-D plot: the x axis is usually the parameter, and the y axis shows the value of the function; with 2 parameters, we can still visualize things meaningfully. Until AR/VR goes mainstream, we have to make do with a 2-D projection of a 3-D plot, which is not ideal, but still feasible (as you’ll see in figure 16.8).

The issue becomes harder when we get to 3 parameters. One workaround is to introduce “time” as the fourth dimension: the plot can then be shown as a 3-D surface that changes over time, with time standing in for the third parameter.

Besides being hard to make sense of, this solution doesn’t solve the problem when we have 4 or more parameters. What we can usually do in these cases is lock k-2 parameters, out of k total variables, and see how the function behaves as the other 2 variables change.
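Here is what that parameter-locking trick can look like in code: a minimal, hypothetical sketch (in JavaScript, the language of the JsGraphs library mentioned earlier) that freezes all parameters of a function except two and samples it over a grid. The names slice2D, f, and baseParams are illustrative, not library code.

```javascript
// Hypothetical sketch: explore a k-parameter function f by locking all
// parameters except those at indices i and j, and sampling f on a grid.
function slice2D(f, baseParams, i, j, values) {
  const grid = [];
  for (const vi of values) {
    const row = [];
    for (const vj of values) {
      const params = baseParams.slice(); // copy the locked configuration
      params[i] = vi;                    // vary only two coordinates
      params[j] = vj;
      row.push(f(params));
    }
    grid.push(row);
  }
  return grid; // a 2-D table of values we can plot as a surface
}
```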

We’ll do something similar with our graph embedding problem to show you how this works, and to better understand our problem.

But first of all, what function are we talking about in this case?

1. Cost functions

We call the target of our optimization a cost function: the name expresses well the idea that each solution we try has a cost, and that we are trying to minimize it.

There is a vast category of problems called optimization problems for which finding a solution is equivalent to exploring the problem space and eventually picking the solution with the lowest cost. For these problems, there are usually many possible definitions of the cost function that can be used. Some are better than others (we’ll see why), and when we do have a choice, it’s important to spend time and choose wisely, because the choice can have a great impact on how fast we can find an optimal solution, or sometimes even on whether we can find one at all. In general, however, many interesting optimization problems are NP-hard, regardless of the specific choice of cost function.

The cost function we have chosen for our graph embedding problem is simply the number of edge intersections for a given embedding. Spoiler alert: this choice is problematic with some optimization algorithms, as we’ll see later.
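To make that concrete, here is a minimal sketch of such a cost function for a straight-line embedding: it counts the pairs of edges that properly cross. This is illustrative code, not the JsGraphs implementation, and it deliberately ignores degenerate configurations (such as a vertex lying exactly on an edge), which we’ll treat as infinitely costly later on.

```javascript
// Orientation of the ordered triple (p, q, r):
// > 0 counterclockwise, < 0 clockwise, 0 collinear.
function cross(p, q, r) {
  return (q.x - p.x) * (r.y - p.y) - (q.y - p.y) * (r.x - p.x);
}

// True if segments ab and cd properly intersect (endpoints strictly on
// opposite sides); shared endpoints and collinear overlaps are ignored here.
function segmentsCross(a, b, c, d) {
  const d1 = cross(a, b, c), d2 = cross(a, b, d);
  const d3 = cross(c, d, a), d4 = cross(c, d, b);
  return d1 * d2 < 0 && d3 * d4 < 0;
}

// Cost function: number of edge crossings in a straight-line embedding.
// Vertices are {x, y} points; edges are pairs of vertex indices.
function crossingNumber(points, edges) {
  let crossings = 0;
  for (let i = 0; i < edges.length; i++) {
    for (let j = i + 1; j < edges.length; j++) {
      const [a, b] = edges[i], [c, d] = edges[j];
      // Edges sharing a vertex can't properly cross in a straight-line drawing.
      if (a === c || a === d || b === c || b === d) continue;
      if (segmentsCross(points[a], points[b], points[c], points[d])) {
        crossings++;
      }
    }
  }
  return crossings;
}

// Example: K4 with its vertices in convex position -> the two diagonals cross.
const points = [{x: 0, y: 0}, {x: 100, y: 0}, {x: 100, y: 100}, {x: 0, y: 100}];
const edges = [[0, 1], [0, 2], [0, 3], [1, 2], [1, 3], [2, 3]];
console.log(crossingNumber(points, edges)); // 1
```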

So, in figure 16.7 we can see the complete graph K4, an embedding that has 8 degrees of freedom: let’s agree to lock 7 of them, and only allow the horizontal position of vertex v4 to change.

The cost function looks like a step function, with its value going abruptly from 1 to 0 when the vertex enters the internal face of the sub-graph induced by the other three vertices (the triangle marked by the unnamed vertices in figure 16.7).

If we had allowed vertex v4 to move both vertically and horizontally, we would have had, instead, a 3-D chart showing a surface.

This is shown in figure 16.8 for the complete graph K5, or to be more precise, for one of the possible embeddings for this graph.

Notice that here we have surfaces of discontinuity, where the cost function changes its value: when |x|==|y| (v lying on one of the diagonals of the square), and when x or y is equal to 0 or 100 (v lying on the perimeter of the square). For any point on these surfaces, the cost will be infinite because, as shown in the third example in the figure, vertex v will lie on one of the edges between the other vertices (and our constraints assign an infinite cost to these invalid configurations).

Moreover, as you can see, no matter how hard we try, we can’t find a position for vertex v that will guarantee us an embedding with the minimum possible crossing number; this is because the position of the other 4 vertices is not optimal for a straight-line drawing of K5.

In turn, the underlying reason is that if we consider the larger problem of finding the best embedding (with none of the vertices locked), we hit a local minimum of the cost function: a point or region in the n-dimensional problem space where the cost function has a lower value than in the surrounding area, but not the lowest possible overall.

If we could take a 2-D projection of the cost function on a generic configuration of the graph (not specifically the embedding in figure 16.8), it might hypothetically look somewhat similar to figure 16.9.

2. Step functions and local minima

Local minima aren’t really good news. Ideally, we would like to have a single global minimum and a cost function that smoothly converges toward it.

A bowl-shaped function, for example, like the conic section shown in figure 16.10, would suit us particularly well and work fine with most learning algorithms.

Why is that? Well, you can imagine an optimization algorithm as a marble left rolling on the surface of the cost function. There are, of course, marbles of different weights and friction, and some marbles on some surfaces get stuck and need to be pushed around; likewise, there are different learning algorithms.

If you release a marble on a surface like the cost function in figure 16.9, there is a good chance that it will just stay on the plateau where it lands. If you give it a little push, it might end up in a pit corresponding to a local minimum, and be stuck there, unless you pick it up and release it elsewhere.

Conversely, if you release a marble on a smooth surface like the one in figure 16.10 (think of tossing it into a bowl), it will roll down to the bottom, maybe oscillate a little, and in the end settle at the lowest point, where gravity can’t pull it down any further.

3. Optimizing random sampling

Using this “marble” analogy, an optimization problem can be broken down into two parts:

  • The cost function is analogous to the marble track: the smoother its path to the optimal cost, the better algorithms can (in theory) work. Notice that we can build several “tracks” between a starting point and a finish line—engineering the best possible track is part of the solution.
  • An optimization algorithm is like a marble rolling down the path. For the analogy to be more accurate, though, we need to say that both the marble and the way that it’s tossed and interacts with the track are part of the analogy of the optimization algorithm.

What about our random sampling algorithm? How can we express it in our analogy? Regardless of the cost function, the algorithm does the same thing. Imagine the track is made of sand or mud, so when a marble is tossed onto the track, it digs a small hole in the sand or mud and stops where it lands. This mechanism is independent of the shape of the cost function, and it works the same even with a smooth, bowl-shaped function like the one in figure 16.10.

Figure 16.11 is an attempt at capturing how random sampling (which doesn’t perform any optimization after initialization) works.

A randomized heuristic is like tossing dozens, hundreds, or thousands of marbles at our muddy track (without being able to see where the finish line is). They stay exactly where they land, and in the end, we just choose the one that landed closest to the finish line. The key is making many attempts, hoping that at least one will land close enough to the optimal solution. Of course, the track could be so long that no matter how many times we try, we won’t be able to get any closer to a good solution: that’s the case with exponential problems, where the number of possible solutions becomes huge when the input is large enough.
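Expressed in code, the marble-tossing heuristic might look like the sketch below: draw k independent random embeddings and keep the cheapest one. The cost parameter would be a crossing counter like the one sketched earlier; the drawing-area bounds and the names are illustrative assumptions.

```javascript
// Sketch of random sampling with restarts: toss k "marbles" (random
// embeddings) and keep the one that lands with the lowest cost.
function randomSampling(numVertices, edges, cost, k, width = 100, height = 100) {
  let best = null;
  let bestCost = Infinity;
  for (let attempt = 0; attempt < k; attempt++) {
    // Each marble is tossed independently: a fresh random embedding.
    const points = Array.from({ length: numVertices }, () => ({
      x: Math.random() * width,
      y: Math.random() * height
    }));
    const c = cost(points, edges);
    if (c < bestCost) { // remember the marble closest to the finish line
      bestCost = c;
      best = points;
    }
  }
  return { embedding: best, crossings: bestCost };
}
```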

Moreover, we need to be careful when using a completely randomized algorithm. We might toss several marbles in the same position, and we could generate the same solution more than once.

If there is a good thing about this approach (besides being extremely cheap), it’s that the shape of the cost function doesn’t matter. We don’t need to engineer a good “track,” because there isn’t going to be any “marble rolling” after initialization.

At the same time, this is also the worst part: we can’t take advantage of a good cost function that smoothly decreases toward a global minimum.

If we think about marble races, we can perhaps find a workaround for that. If you’ve ever played with marbles on the beach, you probably know what to do when one is stuck in a pit on the sand track: you give the marble a nudge to get it out of there and start it rolling again.

In the language of randomized algorithms, that nudge is local optimization. We can add sub-heuristics performing a local optimization; for example, trying to move vertices around, one by one, within a short distance from the randomly assigned initial position, and checking to see if the crossing number improves.

This algorithm is called hill climbing, or in our case, since we try to minimize a function instead of maximizing it, we can call it hill descent.

Figure 16.12 visualizes the analogy. In our case, despite the nudge, the marble only travels a short distance (well, it’s a muddy track, isn’t it?), but we might still get some improvement. What we really do is explore the cost function in a small area around each solution, and if we find a position for which the cost is lower, we move our “marble” there.
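In code, one possible (and deliberately simplified) version of this hill-descent step: repeatedly pick a random vertex, move it a short distance in a random direction, and keep the move only if the crossing number improves. As before, cost stands for a crossing counter, and steps and radius are arbitrary illustrative choices.

```javascript
// Sketch of hill descent as local optimization: nudge one vertex at a time
// by a small random offset, accepting the move only if the cost improves.
function hillDescent(points, edges, cost, steps = 1000, radius = 5) {
  let current = points.map(p => ({ ...p }));
  let currentCost = cost(current, edges);
  for (let s = 0; s < steps; s++) {
    const i = Math.floor(Math.random() * current.length); // pick a vertex
    const candidate = current.map(p => ({ ...p }));
    // Probe a random direction within a short distance (the "nudge").
    const angle = Math.random() * 2 * Math.PI;
    candidate[i].x += Math.cos(angle) * radius;
    candidate[i].y += Math.sin(angle) * radius;
    const candidateCost = cost(candidate, edges);
    if (candidateCost < currentCost) { // move the marble only downhill
      current = candidate;
      currentCost = candidateCost;
    }
  }
  return { embedding: current, crossings: currentCost };
}
```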

If you look closely at figure 16.12, you can see that in this case, the shape of the cost function does matter. While with a differentiable, bowl-shaped function we always get an improvement, with a step function (like our example, “minimum number of intersections”), sometimes the marbles are stuck in local minima and sometimes they are on large plateaus, so moving them around won’t get us anywhere better.

Another thing that we must note is that with this algorithm, we have to try moving in both directions. By randomly exploring the area surrounding a solution, we are blindly poking at similar solutions, without any rationale. For instance, in the bottom example in figure 16.12, by looking at the marbles and the shape of the cost function, we would know that to get an improvement, we should increase the value of the only parameter for solutions 1 and 3, and decrease it for solution 2. Using hill descent, we would try to both increase and decrease it for all solutions, and just see what happens. For functions of more than a single parameter, we would either search the area surrounding the current solution, or probe a random direction and move if we can get a better value.

Applied to our graph problem, this means we would try moving a vertex in all directions, each time computing the number of edge intersections of the new embedding.

And if this looks bad with a 2-D cost function, remember that as the dimensionality of the search space grows, we face the curse of dimensionality (see chapters 9-11), and we will have 2n parameters to tune for a graph with n vertices.

So, we might want to try a more efficient approach.

Source: Marcello La Rocca (2021), Advanced Algorithms and Data Structures, Manning Publications.
