Gradient descent: Optimization problems (not just) on graphs

Why do we have to play “go fish” when we try local optimization? Remember that for a graph with n vertices, we are working in a 2n-dimensional space, so trying to move randomly in every direction[1] . . . is a lot of directions to try!

Looking at our 2-D example, where the cost function depends on a single variable, the direction we should explore seems pretty obvious!

But in a multidimensional space, where we can’t visualize the shape of the surface, how do we know which parameters should be tuned and in which direction?

The mathematical solution to this quest is called gradient descent. Gradient descent is an optimization technique that, under certain conditions, can be applied to different categories of optimization problems.

The idea is simple: we can look at the slope of a function at a given point and move toward the direction in which the function decreases the fastest. The name stems from the fact that for a differentiable function, the direction of the steepest slope is given by its gradient.

This is illustrated in figure 16.13 for a single-variable, differentiable function.

Before discussing it a bit more formally, let's see how we could frame it in our marbles example: it's as if we are allowed to push the marbles with a nudge, and they can travel a short distance before stopping again in the sand. Just like when you play with marbles in the sand, you need several nudges to reach the goal, because each push only moves a marble a short distance. If you give a marble a nudge in the right direction, it moves a little further toward the finish line (or your goal). But to make things more interesting, in our game you can only see a short portion of the track next to your current position, as if you were playing in the fog or with lens distortion.

Gradient descent is formally described using calculus. Don't worry if you haven't had an introduction to calculus, because you are not going to need it to apply gradient descent: there are many great libraries that already implement it for you. Actually, thinking about writing your own version is a bad idea, because this algorithm needs to be fine-tuned and highly optimized to exploit a GPU's power.

1. The math of gradient descent

If you did take a calculus class, you probably remember the notion of a derivative: given a single-variable, continuous function f(x), we define its derivative as the ratio between how much f changes in response to a small change in its argument x, as that change becomes vanishingly small. Formally, we write

f'(x) = lim_{Δx→0} [f(x + Δx) − f(x)] / Δx
This value can be finite, infinite, or even not defined for a given function and a specific value of x; if for a function f its first derivative is always defined, we call f a differentiable function.

For differentiable functions, there are formulas that allow us to find the exact mathematical expression of their derivatives, which in turn are functions themselves. For instance, if f(x) = x, then f'(x) = 1 (the constant function). The derivative of the quadratic function f(x) = x² is f'(x) = 2x, and the derivative of the exponential function f(x) = eˣ is f'(x) = eˣ (yes, it's the exponential function itself!).
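If you want a quick sanity check of these formulas (a small illustration of ours, not from the book), you can compare a finite-difference ratio against the exact derivative:

```python
# Quick numerical check (illustration only, not from the book): compare the
# finite-difference ratio to the exact derivative f'(x) = 2x of f(x) = x^2.
def numerical_derivative(f, x, h=1e-6):
    """Approximate f'(x) with a symmetric difference quotient."""
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x: x ** 2
for x in [-1.0, 0.5, 3.0]:
    print(x, numerical_derivative(f, x), 2 * x)  # approximation vs. exact 2x
```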

There are many interesting results stemming from the geometric interpretation of function derivatives, but we can't go through them all here.

The most important result, from our point of view, is that the value of the first derivative of a function at a given point tells us whether the function is growing at that point, and how fast. In other words, and as a simplification, it tells us whether, by slightly increasing the value of x, f(x) also increases, decreases, or stays the same.

We can apply this to our optimization algorithm. For instance, in figure 16.13, if we computed the first derivative of the cost function at point x0, we’d get a negative value that would tell us that f grows when x becomes smaller. Since we want to move toward smaller values of f, we know that we should update x by assigning it a larger value.

We can repeat this step over and over, thus following a downhill path along the cost function’s surface.
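As a toy illustration of this idea (ours, not the book's), here is what those repeated updates look like for the quadratic function f(x) = x², whose derivative is 2x:

```python
# Tiny 1-D illustration (not from the book): minimize f(x) = x^2, whose
# derivative is f'(x) = 2x, by repeatedly moving against the slope.
def derivative(x):
    return 2 * x

x = 4.0          # starting point x0
alpha = 0.1      # step size (learning rate)
for step in range(25):
    x -= alpha * derivative(x)   # move downhill: opposite to the slope
print(x)  # ends up close to 0, the minimum of x^2
```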

While this looks easy to compute, in multidimensional spaces it gets far more complicated: for an n-dimensional function g, the gradient of the function at a given point is a vector whose components are the partial derivatives of the function, computed at that point.

For instance, with a 2-D domain (see figure 16.15 to visualize it), we define the partial derivative of g(x,y) along x as

∂g/∂x (x, y) = lim_{Δx→0} [g(x + Δx, y) − g(x, y)] / Δx

And we define the gradient of g at a point P₀ = (x₀, y₀) as a vector with one column and two rows, whose components are the partial derivatives of function g with respect to x and y, computed at point P₀:

∇g(P₀) = [ ∂g/∂x (x₀, y₀) , ∂g/∂y (x₀, y₀) ]ᵀ

The geometrical interpretation of the gradient of a function is that it's a vector pointing in the direction of fastest growth of the function. That's why gradient descent actually uses the negative gradient, −∇g, which is simply the opposite of the gradient.
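To make this concrete with a simple example: for g(x, y) = x² + y², the partial derivatives are ∂g/∂x = 2x and ∂g/∂y = 2y, so at P₀ = (1, 2) the gradient is ∇g(P₀) = (2, 4)ᵀ. The negative gradient, (−2, −4)ᵀ, points back toward the origin, which is exactly where this function has its minimum.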

2. Geometrical interpretation

Listing 16.4 gives a summarized description of the gradient descent algorithm.

It's important to understand that we don't have a full view of the cost function when we run gradient descent, and we don't “move” over the surface; rather, at each step we only compute how much we should change the input variables, depending on the slope of the surface at that point. See figure 16.14 to get an idea of the steps, and figure 16.15 to get an idea of what it looks like with a 2-D function's domain.

In our marble-race analogy, it's as if the track is swathed in a dense fog, and we can only see a few feet ahead: enough to see where to aim for a cautious next step, because if we push the marble too hard, we risk sending it off the track, or in the wrong direction.

The method itself is actually surprisingly short and simple, isn't it? It's an iterative optimization algorithm, so we have a loop to perform, and we pass an argument with the maximum number of steps allowed, just to be sure it won't loop forever (we'll see that there are situations where this is possible).

We repeat the same update step until we get to a minimum of the function, or at most a certain number of times. The step itself is also basic: we compute the gradient of the input function coordinate by coordinate, by computing the first-order partial derivative of f along each of the directions of f’s domain (the problem space).

As mentioned, we can stop when we reach a minimum. One of the most important results in calculus is Fermat's theorem,[3] which proves that if a point in the domain of a differentiable function is a (local) minimum or maximum, then the derivative of the function is zero at that point; therefore, we can just check that all the partial derivatives are zero (or, more realistically, that their absolute values are below some precision threshold).
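To make the procedure concrete, here is a minimal Python sketch of the loop just described (our own code, not a reproduction of the book's Listing 16.4; the function and parameter names are ours):

```python
import random

def gradient_descent(gradient, n_dims, alpha=0.01, max_steps=1000, tol=1e-7):
    """Sketch of the method described above (not the book's Listing 16.4).

    gradient:  function returning the list of partial derivatives at a point.
    n_dims:    dimensionality of the problem space.
    alpha:     learning rate (step size).
    max_steps: cap on iterations, so the loop can't run forever.
    tol:       precision below which partial derivatives count as zero.
    """
    point = [random.uniform(-1.0, 1.0) for _ in range(n_dims)]  # random start P0
    for _ in range(max_steps):
        grad = gradient(point)
        if all(abs(g) < tol for g in grad):   # (near-)stationary point reached
            break
        # move against the gradient, coordinate by coordinate
        point = [p - alpha * g for p, g in zip(point, grad)]
    return point

# Example: minimize g(x, y) = x^2 + y^2, whose gradient is (2x, 2y).
print(gradient_descent(lambda p: [2 * p[0], 2 * p[1]], n_dims=2))
```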

By using the gradient of a function f to decide how much (and in what direction) we should move, we naturally take big steps when f changes fast, and small steps when f changes slowly. Transferred to our marble-race example, the marble would travel fast in a steep, straight section, while we would need to be careful when a turn is near to avoid going off-track.

As for the starting point P₀, you might wonder how we choose it. There are different ways, but unless you have domain knowledge telling you otherwise, it's best to choose it randomly and possibly run the optimization several times, starting each time from a new, randomly chosen point and keeping track of the best overall result.

Can you see it? We are back to random sampling, but applying a sophisticated local optimization heuristic after each sample is taken.
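Continuing the hypothetical sketch above, such random restarts could be wrapped like this (again our own code; the helper name best_of_restarts is made up for illustration):

```python
# Hedged sketch of random restarts around the gradient_descent function above
# (our own wrapper, not from the book): run the optimization several times from
# random starting points and keep the best result found.
def best_of_restarts(cost, gradient, n_dims, restarts=10, **kwargs):
    best_point, best_cost = None, float("inf")
    for _ in range(restarts):
        candidate = gradient_descent(gradient, n_dims, **kwargs)
        c = cost(candidate)
        if c < best_cost:
            best_point, best_cost = candidate, c
    return best_point, best_cost
```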

3. When is gradient descent applicable?

To be able to apply gradient descent, we need the cost function to be differentiable, at least in the neighborhood of the points where we compute the gradient.

Moreover, it helps if we already know the exact formula for the function we would like to optimize, so that we can also express the partial derivatives with mathematical formulas and compute gradients exactly. If we don't have the definition for the function to optimize, however, we can always resort to the formal definition of derivatives as mathematical limits, and compute the gradient numerically by explicitly evaluating the ratio between Δf and Δxᵢ for increasingly small increments of each of the coordinates of the problem space.
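As a rough illustration of that fallback (our own helper, not the book's code), a gradient can be approximated numerically like this:

```python
# Illustration (ours): approximate the gradient numerically when we only have a
# black-box cost function, by evaluating the ratio Δf / Δx_i for a small step
# along each coordinate of the problem space.
def numerical_gradient(f, point, h=1e-6):
    grad = []
    for i in range(len(point)):
        shifted = list(point)
        shifted[i] += h
        grad.append((f(shifted) - f(point)) / h)   # Δf / Δx_i along coordinate i
    return grad

print(numerical_gradient(lambda p: p[0] ** 2 + p[1] ** 2, [1.0, 2.0]))  # ~ [2, 4]
```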

One question you might want to ask is this: If we do have the definition of f, and it is differentiable, why do we have to run an iterative optimization? Can't we just find its exact minimum using calculus?

Well, that’s of course possible, in theory; it is also doable, at least for low-dimensional spaces and for some kinds of functions.

However, finding exact solutions becomes hard to automate, and even to compute, in high-dimensional spaces. The number of equations needed to analytically find the global minimum grows exponentially with the problem's size. Moreover, these functions can have hundreds, thousands, or even an infinite number of local minima (think, for example, about sin(x+y) or even x*sin(y)), and to automate the search for a global optimum, we'd need to check all those points.

In general, gradient descent works well when we have a chance to design a cost function that has either a global minimum or, at most, a few local minima (better if they are of approximately the same cost). As we’ll see in section 16.4, that’s why it works perfectly with the kind of cost functions we design for supervised learning.

4. Problems with gradient descent

One important thing to note in listing 16.4 is that we provide a learning rate, alpha. This is a hyper-parameter of the algorithm, regulating how big the steps we take are. As in the marble analogy, when we don't have a clear view of the track, taking large steps can speed us up, but it can also send the marble off-course; similarly, in gradient descent, large steps can miss minima (figure 16.16 (A)) or, even worse, in some situations they can cause loops or move us far away from the best solution, as shown in figure 16.16 (B).

Vice versa, when alpha is too small (figure 16.16 (C)), convergence can be far too slow and the optimization algorithm will never get to a minimum within a reasonable time (or within the maximum number of iterations allowed).

Which values of alpha are too big or too small also depends on the context, and in particular on the specific function we are trying to optimize.

If we didn't have the chance to pass this learning parameter, then for cost functions such as the one in figure 16.16 (B), where the slope of the curve is steep, the optimization would not converge to the minimum (but rather diverge), and at the same time, for examples like figure 16.16 (C), convergence would be too slow, or the algorithm could get stuck in local minima.

Because the learning rate is exposed as a parameter, we can tune[4] this alpha hyper-parameter and adapt the optimization algorithm to the function we need to optimize.

An even better solution, however, is to use a variable alpha; for example, a value that decreases as the steps progress. Initially, it’s large enough to let the optimization quickly explore a wide area and possibly get out of local minima, and then it gets smaller and smaller, so that in the end, fine-tuning can be done and oscillation around stationary points (minima) is avoided.
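For illustration only (this particular schedule is our assumption, not a formula from the book), a decaying alpha could look like this:

```python
# One possible decay schedule (an assumption for illustration, not the book's
# formula): shrink alpha as the step count t grows.
def decayed_alpha(alpha0, t, decay=0.01):
    return alpha0 / (1.0 + decay * t)   # large early steps, smaller later on

# e.g. alpha0 = 0.5: step 0 -> 0.5, step 100 -> 0.25, step 1000 -> ~0.045
```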

Another great option is to introduce the concept of momentum. Instead of basing the next step on just the last gradient, gradient descent with momentum smooths the update by computing the delta as a linear combination of the last few gradients (with older gradients having a lower weight on the final result than newer ones).

As the term suggests, having momentum (the way it happens in kinematics) means that if the algorithm's speed was high, and it was therefore updating a coordinate with large steps, then when the slope of the curve changes, the speed will smooth out gradually rather than change abruptly.

The easiest formula to add momentum to our update rule for points can look like this:

Pₜ₊₁ = β·Pₜ + (1 − β)·(Pₜ − α·∇f(Pₜ))
where Pₜ is a point in the problem space, specifically the point reached by the algorithm at time t, and ∇f(Pₜ) is the gradient of the cost function at that point. The higher beta is, the smoother (and slower) the update will be:

P₂ = β·P₁ + (1 − β)·(P₁ − α·∇f(P₁)) = β²·P₀ + β(1 − β)·(P₀ − α·∇f(P₀)) + (1 − β)·(P₁ − α·∇f(P₁))

So, after 2 steps, if β = 0.99, then about 98% of the value of P₂ comes directly from P₀ (the β² term); conversely, if β = 0.1, P₀ directly contributes only about 1% of P₂.
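To see the rule in action, here is a small sketch consistent with the update formula above (our own code, not the book's; the function and parameter values are made up for illustration):

```python
# Sketch of momentum matching the update rule above (our code, not the book's):
# the next point is an exponential moving average of the previous point and the
# plain gradient-descent update.
def momentum_step(point, gradient, alpha=0.1, beta=0.9):
    plain = [p - alpha * g for p, g in zip(point, gradient(point))]   # plain GD update
    return [beta * p + (1 - beta) * q for p, q in zip(point, plain)]  # smoothed update

# Example on g(x, y) = x^2 + y^2:
grad = lambda p: [2 * p[0], 2 * p[1]]
p = [4.0, -3.0]
for _ in range(50):
    p = momentum_step(p, grad)
print(p)  # gradually moves toward the minimum at (0, 0)
```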

Source: Marcello La Rocca (2021), Advanced Algorithms and Data Structures, Manning Publications.
