Finding extreme values, either the maximum or minimum, is a very common problem in computing, not just in science but in many disciplines. It is sometimes referred to as optimization. Naturally finding a best (in some sense) value would be appealing. What is the least amount of fuel needed to travel from Chicago to Atlanta? What route between those two cities requires the least amount of driving time? What route is shortest in terms of distance? There are many reasons to want an optimum and many ways to define what an optimum is.
In the following discussion, the function to be optimized is provided. The problem is how to find the location (parameters) where the minimum or maximum occurs.
1. Newton Again
Figure 11.2 shows an example of a function to be maximized. There is a maximum of zz at the point x= x. How can this be found? If the nature of the function is known, for instance that it is a quadratic polynomial, then the optimum can be found immediately. It will be at the point where the derivative is zero. The problem of optimization is that one does not know much, if anything, about the function. It can only be evaluated, and perhaps the derivatives can be found numerically. Given that, how can the min or max be found?
If the derivative can be found, then it may be possible to search for an optimum point. At a value x, if the derivative is negative, then the slope of the curve is negative at that point; if the derivative is positive, then the slope is positive. If an x value can be found where the slope is negative (call this point x0) and another where it is positive (call this xt), then the optimum (slope = 0) must be between these two points. Finding that point can be done as follows:
- Select the point between these two (x = (x0+ xt)/2).
- If the derivative is negative at this point, let x0 = x. If positive, let xt = x.
- Repeat from Step 1 until the derivative is close enough to 0.
This process is almost random. Finding the two starting points is a matter of guessing until they are found. The search range gets smaller by a factor of 2 each iteration. The fact that the function can be evaluated at any point means that it is possible to make better guesses. In particular, it’s possible to assume that the function is approximately quadratic at each step. Quadratics have an optimum at a predictable place. The method called Newton’s Method fits a quadratic at each point and moves towards its optimal point.
The method is iterative, and without doing the math, the iteration is
A function to calculate the first and second derivative is needed. The formula for the second derivative is based on the definition of differentiation, as was the formula for the first. It is
The program should be straightforward. Repeat the calculation of xn1 – ƒ'(x)/ƒ”(x) until it converges to the answer. This will be the location of the optimum. An example function is as follows:
def newtonopt (f, x0, x1, delta=0.0001, niter=20):
x = (x0+x1)/2
fa = 1000.0
fb = f(x)
i = 0
print (“Iteration 0: x=”, x, ” f=”, fb)
while (abs(fa-fb) > delta): fa = fb
x = x – deriv(f, x)/derivsecond(f, x)
fb = f(x)
i = i + 1
print (“Iteration “, i, “: x=”, x, ” f=”, fb)
if i>niter:
return 0
This finds a local optimum between the values of x0 and x1. A local optimum may not be the largest or smallest function value that the function can produce, but may be the optimum in a local range of values.
Figure 11.2a shows a typical quadratic function. It is f(x) = x2 – 2x + 8, and has an optimum at x=1. Because it is quadratic, the Newton optimization function above finds the result in a single step. Figure 11.2b is a sine function, and can be seen to have many minima and maxima. Any one of them might be found by the Newton method, which is why a range of values is provided to the function.
The newtonopt() function successfully finds the optimum in Figure 11.2a at x = 1, and finds one in Figure 11.2b at x = 90 degrees (p/2 radians). If there is no optimum the iteration limit will be reached. If either derivative does not exist, then an exception occurs.
2. Fitting Data to Curves – Regression
Scientists collect data on nearly everything. Data are really numerical values that represent some process, whether it be physical, chemical, biological, or sociological. The numbers are measurements, and scientists model processes using these measurements to further understand them. One of the first things that is usually done is to try to find a pattern in the data that may give some insight into the underlying process, or at least allow predictions for situations not measured. One of the common methods in data analysis is to fit a curve to the data; that is, to determine whether a strong mathematical relationship exists between the measurements.
As an example, a set of measurements of tree heights will be used. The height of a set of a specific variety of trees is made over a period of ten years, and the data resides in a file named “treedata.txt.” Is there a linear relationship (i.e., does a tree grow generally the same amount each year)? Specifically, what is that relationship (i.e., how much can we expect a tree to grow)? Figure 11.3 shows a visualization of these data in the form of a scattergram or scatter plot, in which the data are displayed as points in their (x,y) position on a grid.
The “curve” to be fit in this case is a line. What is the equation of the line that best represents the data in the Figure? If that were known, then it would be possible to predict the height of a tree with some degree of confidence or to estimate a tree’s age from its height.
One form of the equation of a line is the point-slope form:
y = mx + b
where m is the slope (angle) of the line and b is the intercept, the place where the line crosses the Y axis. The goal of the regression process, in which the best line is found, is to identify the values of m and b. A simple observation is needed first. The equation of a line can be written as
mx + b – y = 0 (11.9)
If a point actually sits on this line, then plugging its x and y values into the equation results in a 0 value. If a point is not on the line, then mx + b – y results in a number that amounts to an error; its magnitude indicates how far away the point is from the line. Fitting a line to the data can be expressed as an optimization problem: find a line that minimizes the total error over all sample data points. If (x., y.) is data point I, then the goal is to minimize
by finding the best values of m and b. The expression is squared so that it will always be positive, which simplifies the math. It may be possible to do this optimization using a general optimization process such as Newton’s, but fortunately, the math has been done in advance for a straight line. Other situations are more complicated, depending on the function being fit and the number of dimensions
A simple linear regression is done by looking at the data and calculating the following:
Each of these can be calculated using a separate function. Then the slope of the best line through the data would be:
And the intercept is:
b = meany – m*meanx (11.12)
The function regress() that does the regression accepts a tuple of X values and a corresponding tuple of Y values, and returns a tuple (m, b) containing the parameters of the line that fits the data. It depends on other functions to calculate the mean, standard deviation, and correlation; these functions could generally be more useful in other applications. The entire collection of code is:
3. Evolutionary Methods
A genetic algorithm (GA) or an evolutionary algorithm (EA) is an optimization technique that uses natural selection as a metaphor to optimize a function or process. The idea is to create a collection of many possible solutions (a population), which are really just sets of parameters to the objective function. These are evaluated (by calling the function) and the best of them are kept in the population; the remainder are discarded. The population is refilled by combining the remaining parameter sets with each other in various ways in a process that mimics reproduction, and then this new population is evaluated and the process repeats.
The population contains the best solutions that have been seen so far, and by recombining them, a new, better set of solutions can be created, just as nature selects plants and animals to suit their environment. This method does not require the calculation of a derivative, so it can be used to optimize functions that cannot be handled in other ways. It can also deal with large dimensions, that is, functions that take a large number of parameters.
Consider the problem of finding the minimum of a function of two variables. This is an attempt to find values for x and y that result in the smallest function result. Evolutionary algorithms are often tested on difficult functions with numerous local minima or large flat regions. Two such functions are used here: the Goldstein-Price function,
(1 + (x + y + 1)2 (19 – 14x + 3x2 – 14y + 6xy + 3y2)
(30 + (2x +3y)2 (18 – 32x + 12x2 + 48y – 36xy + 27y2)) (11.13)
and Bohachevsky’s function,
Graphs of these functions are shown in Figure 11.4.
The first step in the evolutionary algorithm is to create a population of potential solutions. This is a collection of parameter pairs (x,y) created at random. The population size for this example is 100, and is a parameter of the EA process. This is done in the following way:
def genpop (population size):
pop = ()
for i in range(0, population size):
p = (randrange(-10, 10), randrange(-100, 100))
pop = pop + (p,)
return pop
The population is a tuple of a hundred (x, y) parameter pairs. These need to be evaluated, and so the objective function must be written. This differs for each optimization problem. In this case, it is the sum of the errors between a given line (one of the parameters) and the data points. One way to calculate this is:
def objective (x, y): # One of many possiblke
# objective functions
return goldsteinprice (x, y)
# return boha (x, y)
def goldsteinprice (x, y): # Goldstein-Price function
# f(0, -1) = 3
return (1+(x+y+1)**2 * (19-14*x+3*x*x – 14*y +6*x*y + 3*y*y)) * \ (30+(2*x-3*y)**2 *
(18-32*x+12*x*x+48*y-36*x*y+27*y*y))
def boha (x, y):
z = x*x + y*y -0.3*cos(3*pi*x) – 0.4*cos(4*pi*y) + 0.7
return z
All members of the population are evaluated, and the best ones (in this case, the ones with the smallest objective function value) are kept. A good way to do this is to have the values in a tuple E, where E[i] is the result of evaluating parameters P[i], and sorting the collection is the descending order on E. Since there are 100 entries in E, this means that E[0:n] contains the best n% of the population. The function eval()creates a tuple of the function evaluations for the whole population, and sort() organizes these and the corresponding parameters. These contain nothing new, and are not be shown here. The program here selects the best 10% and discards the remainder, replacing them with copies of the good ones.
The key issue is one of introducing variety in the population. This means changing the values of the parameters while, one would hope, improving the overall performance of the group. Using the metaphor of natural selection and genetics, there are two ways to introduce change into the population: mutation and crossover. Mutation involves making a small change in a parameter. In real DNA, a mutation changes one of the base pairs in the sequence which would usually amount to a rather small change, but which would be fatal in some cases. In the EA we are writing, a mutation is a random amount added to or subtracted from a parameter. Mutations occur randomly and with a small probability, which is named pmut in the program. Values between 0.015 and 0.2 are typical for pmut, but the best value cannot be determined, and it is problem specific. A value of 0.02 is used here.
The function mutate() examines all elements in the global population, mutating them at random (i.e., adding random values).
def mutate (m):
global pmut, population
for i in range (int(m), len(population)):
c = population[i]
if random () < pmut: # Mutate the x parameter
c[0] = c[0] + random()*10.0-5
if random () < pmut: # Mutate the y parameter
c[1] = c[1] + random()*10.0-5
population[i] = c
A crossover is more complex, involving two sets of parameters. It involves swapping parts of the parameters sets from two “parents.” Some parameters could be swapped entirely, in this case meaning that (x0, y0) and (x1, y1) become (x0, y1) and (x1, y0). Other times, parts of one parameter would be combined with parts of another. There are implementations involving bit strings that make this easier, but when using floating point values as is being done here, a good way to do a crossover is to select two parents and replace one of the parameters in each with a random value that lies between the original two.
def crossover (m):
global population, pcross
for i in range (m, len(population)): # Keep the best
# ones unchanged
if random () < pcross: # Crossover at
# the given rate
k = randrange(m, len(population)) # Pick a
# random mate
w = randrange (0, 1) # Change X or Y?
c = population[i] # Get individual 1
g1 = c[w] # Get X or Y for
# this guy
cc = population[k] # get individual 2
g2 = cc[w] # Get X or Y
if (g1>g2): t = g1; g1 = g2; g2 = t
# swap so g1 is smallest
c[w] = random()*(g2-g1) + g1 # Generate new
# parameter for 1
cc[w] =random()*(g2-g1) + g1 # Generate new
# parameter for 2
Sample output from three attempts to find the optimal value of Bohachevsky’s function is as follows:
These results show that sometimes the process takes much more time to arrive at a solution than others. It depends on the initial population, as well as on the parameters of the program: the mutation and crossover probabilities, the percentage of the top individuals to retain, and the nature of the mutation and crossover operators themselves.
Figure 11.5 outlines the overall process involved in the optimization. Details on specific techniques can be found in the references.
Source: Parker James R. (2021), Python: An Introduction to Programming, Mercury Learning and Information; Second edition.