Gradient descent for graph embedding

So, now that we have discussed at length how gradient descent works, along with its (many) strengths and flaws, you should be able to figure out how to apply it to our case study: a heuristic that finds a straight-line drawing for a graph with the minimum number of intersections between edges.

And your answer should be . . . that gradient descent can’t really help. If we look at figures 16.7–16.12, it’s clear that the cost function for “minimum number of intersections” is step-shaped, with large plateau regions (where the gradient is null) and sudden drops.

You might be wondering, then, why we’ve introduced gradient descent at all. Some readers could guess the next step, but if you haven’t, take a minute to mentally go over what we’ve learned in the last couple of chapters, and then let’s delve into our next challenge.

Before revealing it, let me also highlight that the discussion in sections 16.3 and 16.4 allowed us to develop a better intuition about cost functions and provided a semiformal characterization of them: even if the reward were just that, we wouldn’t have wasted our time, because the framework we have established will help us describe and understand the algorithms presented here and in the next two chapters.

But there is more. If you went through chapter 15, you’ll remember that in section 15.3 we reasoned about what it means for an embedding to be good, or just better than another. Edge intersections are a part of this, even an important one, but there are other considerations, for instance, that adjacent vertices should be close to each other. Conversely, when there is no edge between a pair of vertices, they can and should be drawn far away from each other.

Drawing a graph in an aesthetically pleasing way can be as important as, or even more important than, just reducing the number of edge crossings.

It can make the graph look cleaner and more easily understandable, and it can help you use available space better (especially on dynamic websites) or make more meaningful charts.

And last but not least, an aesthetically pleasing appearance can be expressed with a better cost function, a smoother one for which we can use optimization algorithms like gradient descent.

1. A different criterion

When using straight-line drawings, we can imagine vertices as ions, electrically charged particles. When there is an edge between two vertices, the particles attract each other (as if they had opposite charges), while a pair of vertices not connected by edges repel each other.

Then we can try to find an equilibrium point for the system, a disposition of the particles such that all the forces balance each other and the system can maintain a stable condition. Sometimes, instead of explicitly computing the point of equilibrium, we can try to approximate it by simulating the evolution of the system using a heuristic.

It turns out that there is a whole class of graph embedding algorithms that adopt this principle: the so-called force-directed graph drawing algorithms.

The goal of these algorithms is to lay out a graph’s vertices in 2-D space so that adjacent vertices are more or less at the same distance, and as such, all edges have roughly the same length in the plane, and, of course, there are as few edge intersections as possible. This is done by computing forces among the adjacent vertices (attractive forces) and among all non-adjacent vertices (repulsive forces), based on their relative positions, and then updating the system (that is, those positions) based on the forces computed and some parameters, trying to minimize the energy of the whole system.

To further refine our initial analogy, we can use springs (or gravity) as the physics counterpart of edges, and a fainter electrical repulsion[2] among all pairs of vertices. Note that all these forces depend on the distance between the vertices—you can replace them with different formulas, as long as you keep this characteristic. Figure 16.20 gives you an idea of how such systems can work.
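As an illustration of this physical analogy (this sketch is mine, not from the book; the function names and the default constants `beta` and `delta` are hypothetical), adjacent vertices could be pulled together by a Hooke-style spring force and any pair of vertices pushed apart by a Coulomb-style repulsion, both depending only on the distance between the points:

```python
import math

def attractive_force(p_u, p_v, beta=0.1):
    # Spring-like pull exerted on u toward an adjacent vertex v:
    # grows linearly with distance (Hooke's law with rest length 0).
    dx, dy = p_v[0] - p_u[0], p_v[1] - p_u[1]
    dist = math.hypot(dx, dy)
    if dist == 0:
        return (0.0, 0.0)
    magnitude = beta * dist
    return (magnitude * dx / dist, magnitude * dy / dist)

def repulsive_force(p_u, p_v, delta=0.05):
    # Coulomb-like push on u away from any other vertex v:
    # fades with the square of the distance.
    dx, dy = p_v[0] - p_u[0], p_v[1] - p_u[1]
    dist = math.hypot(dx, dy)
    if dist == 0:
        return (0.0, 0.0)
    magnitude = -delta / (dist * dist)  # negative: points away from v
    return (magnitude * dx / dist, magnitude * dy / dist)
```

Both forces are functions of the distance alone, which is exactly the characteristic to preserve if you swap in different formulas.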

The next thing we need is to formalize these criteria into a formula for the cost function. That formula will describe the landscape of the problem, which we can then try to explore by using gradient descent, or one of the other categories of algorithms we’ll discuss later in the book:
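A plausible form for such a cost function (a reconstruction of mine, consistent with the description below; the exact formula in the book may differ) weights attraction over the edge set E with β and repulsion over non-adjacent pairs with δ:

```latex
f(\mathbf{p}) \;=\; \beta \sum_{(u,v) \in E} \lVert \mathbf{p}_u - \mathbf{p}_v \rVert_2^{2}
\;-\; \delta \sum_{(u,v) \notin E} \lVert \mathbf{p}_u - \mathbf{p}_v \rVert_2^{2}
```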

where the term inside the summations is the squared 2-norm, which, when computed on the (vector) difference of the two points, gives us the square of the distance between the two points.

If you are wondering why we use the squared distance, it’s not just because it’s cheaper to compute.[3] The main reason is that the derivative of a square root is a pain. And, of course, the shape of the function’s surface would also be different.

Now, that’s a big improvement with respect to a step function. This function is at least differentiable, and that’s pretty good. The partial derivatives with respect to the x and y coordinates of a generic vertex w can be computed precisely:
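Assuming a cost of the form β·Σ(squared distances over edges) − δ·Σ(squared distances over non-adjacent pairs), the partial derivative for a vertex w would be the following (a hedged reconstruction of mine; the book’s formulas may differ), with the analogous expression for the y coordinate:

```latex
\frac{\partial f}{\partial x_w} \;=\; 2\beta \sum_{u \,:\, (u,w) \in E} (x_w - x_u)
\;-\; 2\delta \sum_{u \,:\, (u,w) \notin E} (x_w - x_u)
```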

Scalars β and δ are the so-called hyperparameters of the algorithm. They balance the relative importance of the attractive and repulsive forces, and we need to adjust their values to get the result we want; this can be done manually or automatically.

This isn’t always easy, of course. For instance, a large value for the attractive-force parameter will work well for sparse graphs, keeping vertices from drifting apart, but for a dense graph, if β>δ, all the vertices will end up converging to the center of the drawing.

A possible alternative is to decide, based on the number of vertices and edges in the graph and on the size of the canvas where we embed it, an ideal length for the edges (or an ideal range for that length). This way, optimization will move away from solutions where all the vertices are clustered too closely:
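One hedged possibility for such a cost function, with L denoting the chosen ideal edge length (an assumption of mine; the book’s actual function may differ), penalizes each edge’s deviation from L:

```latex
f(\mathbf{p}) \;=\; \sum_{(u,v) \in E} \left( \lVert \mathbf{p}_u - \mathbf{p}_v \rVert_2 - L \right)^{2}
```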

Neither of these cost functions aims to directly reduce the number of intersections, but as you can imagine, having shorter edges and keeping adjacent vertices close to each other indirectly helps drive that number down. And neither function is ideal, because they are not bowl-shaped, so they will have several local minima. While we can’t easily correct this shortcoming, we still have a workaround: we can use a random-restart algorithm, randomly selecting an initial position for the vertices, and then surf the cost function downhill with gradient descent.

2. Implementation

Perseverance is the key, so if we repeat a gradient descent step a few times (or maybe a lot of times—it really depends on the context!), starting from different positions and perhaps even with different learning rates, the final result might not be that bad.

In the next chapter, we’ll see a more sophisticated technique to make our optimization more flexible and raise our chances of landing a good result.

For now, let’s implement the single-iteration gradient descent solution, starting with listing 16.5.

The code in listing 16.5 is a duplicate of the body of the general-purpose gradient descent method we previously saw in listing 16.4. While it’s still possible to use that method, in this case we have a more specific domain that allows some optimization and, I believe, lets us express more clearly how this algorithm works internally.

For instance, you can see that we never use the cost function itself; we only need to be able to compute its partial derivatives. (Now it should make more sense why, as we mentioned, we prefer to use the squared distance in the cost function: to avoid square roots.)

Speaking of the gradient, the partial derivatives can easily be computed using the formula we have provided (or a similar one, if you use a different cost function). It only requires running two nested for loops over the vertices, so the explicit code is not shown here.
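To make this concrete, here is a hedged sketch of what such a routine might look like. This is not the book’s listing 16.5; the name `forceDirectedEmbedding` comes from the text, but the body, the default parameters, and the assumed squared-distance cost (attraction weight `beta` over edges, repulsion weight `delta` over non-adjacent pairs) are mine:

```python
def gradient(positions, adjacency, beta=0.1, delta=0.01):
    # Two nested loops over the vertices: attractive terms (weight beta)
    # for adjacent pairs, repulsive terms (weight delta) for the rest.
    grad = {v: [0.0, 0.0] for v in positions}
    for w in positions:
        for u in positions:
            if u == w:
                continue
            coeff = 2 * beta if u in adjacency[w] else -2 * delta
            grad[w][0] += coeff * (positions[w][0] - positions[u][0])
            grad[w][1] += coeff * (positions[w][1] - positions[u][1])
    return grad

def forceDirectedEmbedding(positions, adjacency, learning_rate=0.05, max_steps=100):
    # Plain gradient descent: repeatedly move each vertex against its gradient.
    positions = {v: list(p) for v, p in positions.items()}
    for _ in range(max_steps):
        grad = gradient(positions, adjacency)
        for v in positions:
            positions[v][0] -= learning_rate * grad[v][0]
            positions[v][1] -= learning_rate * grad[v][1]
    return positions
```

Note that, as the text points out, the cost function itself never appears; only its partial derivatives are needed.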

It’s easy to use this method in a random-restart algorithm: just decide how many attempts you’d like to perform, and run a loop calling the method forceDirectedEmbedding.

The caveat is that in this case, we do need an explicit definition of the cost function, because (as shown in listing 16.6) after each call to forceDirectedEmbedding, we will have to check the cost of the solution returned and compare it to the best result so far.
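A hedged sketch of such a random-restart loop follows. This is not the book’s listing 16.6: the cost function assumes the squared-distance formulation with weights `beta` and `delta`, the inner descent routine is a stand-in for listing 16.5 (not the actual book code), and the name `randomRestartEmbedding` is mine:

```python
import random

def cost(positions, adjacency, beta=0.1, delta=0.01):
    # Squared-distance cost: attraction (beta) over edges, repulsion (delta)
    # over non-adjacent pairs; lower is better.
    total = 0.0
    verts = list(positions)
    for i, u in enumerate(verts):
        for w in verts[i + 1:]:
            d2 = (positions[u][0] - positions[w][0]) ** 2 \
               + (positions[u][1] - positions[w][1]) ** 2
            total += beta * d2 if w in adjacency[u] else -delta * d2
    return total

def forceDirectedEmbedding(positions, adjacency, learning_rate=0.05,
                           max_steps=50, beta=0.1, delta=0.01):
    # Stand-in for the descent routine of listing 16.5 (not the book's code).
    positions = {v: list(p) for v, p in positions.items()}
    for _ in range(max_steps):
        for w in positions:
            gx = gy = 0.0
            for u in positions:
                if u == w:
                    continue
                coeff = 2 * beta if u in adjacency[w] else -2 * delta
                gx += coeff * (positions[w][0] - positions[u][0])
                gy += coeff * (positions[w][1] - positions[u][1])
            positions[w][0] -= learning_rate * gx
            positions[w][1] -= learning_rate * gy
    return positions

def randomRestartEmbedding(vertices, adjacency, attempts=10, width=100, height=100):
    # Run the descent from several random starting layouts and keep the
    # embedding whose explicit cost is lowest.
    best, best_cost = None, float('inf')
    for _ in range(attempts):
        start = {v: [random.uniform(0, width), random.uniform(0, height)]
                 for v in vertices}
        candidate = forceDirectedEmbedding(start, adjacency)
        candidate_cost = cost(candidate, adjacency)
        if candidate_cost < best_cost:
            best, best_cost = candidate, candidate_cost
    return best
```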

This concludes our discussion of gradient descent. In the next chapters, we will explore alternative algorithms for optimization of cost-based solutions.

Source: La Rocca, Marcello (2021), Advanced Algorithms and Data Structures, Manning Publications.
