Applications of gradient descent

As mentioned earlier, gradient descent is an optimization technique that, given a cost function, helps find a solution (a point in the problem space) that has an optimal (or nearly optimal) cost.

As such, it is only a piece of the process of solving a problem, and at the same time, it can be applied to several different problems and techniques.

The overall algorithm depends first, as we have seen, on the cost function used, but also on the goal of the optimization.

We have already discussed optimizing a cost function to find the cheapest solution to a well-defined problem, and this is a category of algorithms that greatly benefits from the application of gradient descent, whenever we can describe a differentiable cost function.

Lately, though, a different category of algorithms using gradient descent has gained popularity: learning algorithms.

NOTE I’ve always found the name machine learning a bit deceptive, because it somehow suggests this branch involves machines that can learn in the same way humans do, which unfortunately is not the case, although there are some similarities.

When we apply gradient descent to solving a problem such as traveling salesman or graph embedding, we have a static (usually huge) domain, and our goal is to find a point in that domain. In machine learning, instead, we have a dataset, and we want to make sense of it by “learning” a model that describes the dataset and (more importantly) generalizes to inputs that were not in the dataset.

Take, for instance, supervised learning (whose most prominent examples are shown in figure 16.17); this is a field where gradient descent is widely applied!

The goal of supervised learning is to develop a mathematical model[1] that can succinctly describe the dataset and is able to predict the output for new, previously unseen inputs, be it a real value (linear regression), a category (logistic regression), or a label (classification).

For all these types of learning, there is an extra step that we didn’t have in the optimization problems we’ve seen so far in this chapter, where we were simply exploring a problem’s space: now we also have to choose which model we want to use, and this is actually the first thing we need to do.

To better explain this, we’ll go into some details of linear regression.

1. An example: Linear regression

Speaking of deceptive names, we can’t avoid mentioning linear regression. The story of how this learning technique got its name is also fascinating, and it’s worth Googling; hopefully we’ve stimulated your curiosity about it.

But what is really important for us is that linear regression ultimately is about finding a model that describes the relation between one or more inputs (aka independent variables) and a real number outputted by the model (the dependent variable).

We’ve mentioned this “model” a few times now, and you can also see it in figure 16.17, so you might be asking what it is.

The model is a category of mathematical functions that we choose to approximate the true relation between dependent and independent variables in our dataset.

For instance, we might have a dataset associating some characteristics of cars (the year they were built, their engine, how many miles they’ve traveled, and so on) to their market price, and we would like to learn the relation between the former and the latter so that we can input the description (in terms of the same independent variables we had in the dataset) of a car we spotted at the dealer’s, and see if the price they are asking is fair (and hopefully avoid being tricked into paying too much for a wreck).

We need to choose the model we think could best fit the data.[2] To keep it simple, let’s restrict ourselves to functions of a single input variable (it could be the engine power). As shown in figure 16.18, we can choose a constant function (y = m, a line parallel to the x axis), a generic line of the form y = m*x + b, a quadratic curve (y = m2*x^2 + m1*x + b), or even more complicated models.
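To make these model families concrete, here is a minimal sketch in Python (the function names and signatures are purely illustrative, not taken from the book): each factory returns a model, that is, a function mapping a single input x to a prediction y.

```python
def constant_model(m):
    return lambda x: m                        # y = m (horizontal line)

def linear_model(m, b):
    return lambda x: m * x + b                # y = m*x + b

def quadratic_model(m2, m1, b):
    return lambda x: m2 * x**2 + m1 * x + b   # y = m2*x^2 + m1*x + b
```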

The simpler the model, in general, the fewer data points are needed to “learn” it, because after we choose the complexity of the model, we have a category of functions, and we still have to learn the parameters that tell us which specific function in that category is the best for our dataset.

For example, if we choose a generic line, then we still have to decide the values for the parameters m and b: it could be y = x + 1 or y = -0.5*x + 42.

The way we choose those is through training, which is nothing other than applying gradient descent.

In linear regression, in fact, we define a cost function (usually referred to as a loss function in machine learning) that measures the distance between the value predicted by the model for the dependent variable associated with each point in the dataset,[3] and the actual value from the data.

This function is generally the sum of squared errors (least squares), or some variant of it. As shown in figures 16.17 and 16.19, we minimize the squared distance, along the y axis, between each point and the model line, and this gives us a convex, bowl-shaped function with a global minimum; as we have seen, that’s pure gold for gradient descent!
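For the line y = m*x + b, this loss could be sketched as follows (a hypothetical helper, assuming the dataset is given as two parallel lists of x and y coordinates):

```python
def squared_error_loss(m, b, xs, ys):
    """Sum of the squared vertical distances between the line y = m*x + b
    and each dataset point (x_i, y_i)."""
    return sum((m * x + b - y) ** 2 for x, y in zip(xs, ys))
```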

One important fact to highlight in both figures 16.17 and 16.19 is that the loss function is a function of the parameters m and b,[4] not of the points of the dataset (which act as constants during training); therefore, when we compute its partial derivatives, they are computed with respect to the model parameters, which are then updated by gradient descent.
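Putting it all together, a bare-bones training loop for the line y = m*x + b might look like the following sketch (the learning rate, step count, and initialization are arbitrary illustrative choices, not values from the book): the partial derivatives are taken with respect to m and b, while the dataset points appear only as constants.

```python
def train_line(xs, ys, learning_rate=0.01, steps=1000):
    m, b = 0.0, 0.0                           # initial guess for the parameters
    n = len(xs)
    for _ in range(steps):
        # Partial derivatives of the mean squared-error loss with respect to
        # m and b; the dataset points (x_i, y_i) only appear as constants.
        grad_m = sum(2 * (m * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (m * x + b - y) for x, y in zip(xs, ys)) / n
        m -= learning_rate * grad_m           # gradient descent update
        b -= learning_rate * grad_b
    return m, b

# For points lying close to y = x + 1, training should recover m ≈ 1 and b ≈ 1.
m, b = train_line([0, 1, 2, 3], [1.1, 1.9, 3.05, 4.0])
```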

There is so much more to say about linear regression and supervised learning that it would take another full book!

And, in fact, there are so many books you can check, if you’d like to delve into machine learning. Here are a few suggestions that I’ve personally found extremely useful:

  • Grokking Machine Learning, a nice starting point for beginners, written by Luis Serrano (Manning Publications, 2019). You couldn’t ask for a better guide.
  • Grokking Deep Learning, by DeepMind’s Andrew W. Trask (Manning Publications, 2019), an excellent introduction, ideal for approaching the world of deep learning.
  • Deep Learning with Python, written by Francois Chollet, the author of the Keras library (Manning Publications, 2017); you’ll learn how to use Keras to build image and text classification models and generators.
  • Deep Learning with JavaScript, by Shanqing Cai, et al. (Manning Publications, 2020), in case you’d like to build models for the web that run in the browser using TensorFlow.js; it’s written by the library’s main authors.

There are, of course, many more great books out there. It’s impossible to list them all here, but these are a good starting point.

Source: La Rocca, Marcello (2021), Advanced Algorithms and Data Structures, Manning Publications.
