There are countless problems, both on graphs and on generic data structures, that can be formulated as optimization problems, for which a genetic algorithm could be an effective way to get to a near-optimal solution in a reasonable time.
We’ll briefly discuss two of them that are particularly relevant for their applications: finding the maximum flow of a graph, and protein folding.
1. Maximum flow
The maximum flow problem is defined on a specific type of directed graphs called networks (see figure 18.22). A network N=(V,E) is a directed, weighted graph where there are two special nodes s,t ϵ V: s is called the source, and t is called the sink of the network. The peculiarity of these vertices is that the source only has outgoing edges, while the sink only has ingoing edges.
In networks, edges’ weights are called capacity. The exact characterization of edges’ capacity depends on the context. For instance, for a hydraulic network, edges model pipes, and their weight is the maximum volume of liquid that can go through the pipe at any time.
A network is used to model the flow through its vertices, starting from a source and arriving at a sink, where an edge’s flow is the actual amount (for instance, of water) that goes through the edge. Each vertex v of the network has an inbound flow (for instance, the total amount of water through its ingoing edges) that can be redistributed to its outgoing edges, with the caveat that the total outbound flow must be equal to the inbound one. Referring to figure 18.22, showing an example of a network, node D can have a maximum inbound flow (the sum of its ingoing edges’ capacity) of 5 units, and a maximum theoretical outbound flow of 6. Therefore, the real flow through D can never be larger than 5, and that only happens when its inbound flow is at max: if D at some point had an inbound flow of 4, because some of the flow from A was diverted to E instead of D, its outbound flow would then be limited to 4. When the maximum possible outbound flow is larger than the inbound flow, as in this case, it means that we need to make a choice about how to route the outgoing flow, because we can’t use the capacity of outgoing edges to the fullest. For instance, considering that the max flow from F to the sink is 1, it makes more sense to route flow from D toward vertex G, so that we maximize the total inbound flow of the sink t: this is also the goal of the maximum flow problem.
More formally, a flow for a network is defined as a mapping between edges and real numbers, an assignment to each edge, such that:
- The flow of an edge is smaller or equal to the edge’s capacity.
- The total flow entering a vertex must be equal to the total flow exiting a vertex (except for s and t, which by definition have respectively null inbound and null outbound flows).
The value of the flow is defined as the sum of the flow exiting the source vertex s or, equivalently, the sum of the flow entering the sink vertex t.
A maximum flow for a network graph G is a valid flow whose value is the maximum possible for G.
Notice that a maximum flow problem is not NP-Hard, because there are several polynomial-time algorithms that can solve it exactly. These algorithms, however, are still at least O(|v|3), which makes them impractical for very large graphs; moreover, there are some variants of this problem that are NP-complete.
Now that the problem is clearly stated, how do we design a genetic algorithm for it?
First and foremost, as always, we need to decide about the fitness function and chromosome encoding. The latter is easy—we need one gene per edge, and each of them can be assigned any value between 0 and the edge’s capacity.
This encoding (shown in figure 18.23) automatically validates the first constraint in the formal definition of flow, but it can’t do anything for the second one. We still need to check each solution, because it will only be valid if the second constraint is abided by.
We can do this in two ways: either we discard invalid flows (probably not a good choice, for the same reasons as it wasn’t for vertex cover), or we add a term to the fitness function accounting for this.
For instance, we can compute the absolute value of the flow difference for all vertices (except s and t), and divide the solution’s flow by the cumulative difference, if that’s not zero.
Crossover and mutations are easier. We can use single point recombination and a punctiform mutation that randomly increases or decreases an edge’s flow by 1 (within the edge’s capacity).
If you’ve read through the chapter so far, you should have all the tools to not only implement this algorithm, but also to design more sophisticated, and possibly better, operators and try different definitions for the fitness function.
Instead of delving into the implementation, therefore, I’d like to spend a few words on why this is an important problem.
There are several practical applications of maximum flow in the physical world; for example, the scheduling of an airline’s flight crew, or even more interesting to us, the circulation-demand problem. If you remember, our example e-commerce company was looking for ways to maximize the trucks’ load and minimize the excess of capacity that is needed to face peaks in demand; it turns out that we could write that problem as a maximum flow problem!
In the circulation-demand problem, in fact, we have a source of goods, for instance, our warehouse, and a set of destinations where we need to deliver these goods. There are, however, also constraints. One is that the maximum capacity of each truck is an insurmountable limit, because we can’t overload them. And if we have several warehouses/factories, the problem becomes even more challenging.
But, besides these practical issues, maximum flow is a classic problem in computer science and software engineering!
A few examples of problems that can be modeled as maximum flow optimization are image segmentation, scheduling, network connectivity, and compilers optimization.
2. Protein folding
In the last few chapters we have mostly discussed graph problems that can be solved through optimization (meta)heuristics. Before wrapping up our review of GAs’ applications, I’d like to mention a different kind of problem, which is terribly important, as we’ve learned in these difficult times.
Proteins are large molecules, sequences of amino acids that are encoded in organisms’ DNA and (once DNA is decoded by cells) produced by their cells to perform many different functions within each organism: regulating the response to stimuli, transporting simpler molecules within and across cells, catalyzing metabolic reactions, even DNA replication itself, and many more—they are instrumental to the functioning of any organism.
Among other things, the 3-D structure of surface proteins determines the ability of viruses to tie to receptors and infect certain types of cells—something that recently moved protein folding to the top of the list of the most pressing problems to solve.
Two proteins differ by their sequence of amino acids, which determines the 3-D structure of the protein, which in turn determines the protein’s activity.
Protein folding is the physical process by which a linear protein chain (aka polypeptide) acquires its native 3-D conformation.
Now, the sequence of amino acids of a protein is determined by the DNA that encodes it, and is relatively easier to find out experimentally. It is the protein’s 3-D structure, however, that determines its functionality, and that’s much harder to determine in a lab (although new microimaging techniques are being developed).
One of the earliest and most successful applications of genetic algorithms was determining the most likely 3-D structure of proteins30 whose sequence of amino acids was known. The idea is similar to what we discussed in chapters 16 and 17 for force- directed graph drawing (another reason why that was an important topic!). We can design a fitness function taking into account the attraction and repulsion forces between peptides (sequences of amino acids), amino acids, and even down to their atoms, and search for the 3-D configuration of minimal energy for the system.
Genetic algorithms and artificial immune systems have been the heuristics of choice for the protein folding problem. It’s worth noting that at the time of writing, the current state of the art in this field is obtained by another biologically inspired algorithm, neural networks.31
3. Beyond genetic algorithms
Genetic algorithms are just a branch of the larger field of evolutionary algorithms (EA), a category including all generic population-based meta-heuristic optimization algorithms.
I couldn’t have chosen a better way to close this chapter, and the book, than with a summary of the most interesting EAs in computer science’s literature. Consider it a starting point to deepening your understanding of optimization algorithms.
Memetic algorithm is a class of optimization meta-heuristics directly derived from GA. Its peculiarity is that one or more of the mutation operators actually perform a local-search heuristics, something like random sampling over the current solution’s neighborhood, driving each individual toward a local optimum at each iteration (for instance, trying to flip each bit in a chromosome individually, and retaining the best result), and making this algorithm a hybrid between GA and random sampling, or potentially even gradient descent. The key in either situation is leveraging additional domain knowledge for the problem to perform the local optimization.
Artificial immune system (AIS) is a class of biologically inspired algorithms modeled after the immune system and its ability to learn and retain memory, that extends the work on genetic algorithms with operators like clonal selection, affinity maturation, and negative selection.
Particle swarm optimization,which is instead better used on numerical optimization, uses flock dynamics to explore the cost function landscape. The mechanism is similar to genetic algorithms, with a population (swarm) of solutions that is maintained and moved around, only not through genetic operators, but simpler rules inspired by a mix of kinematics and biology.
Ant colony optimization is a probabilistic technique inspired by the way ant colonies use pheromone-based communication to form paths to food sources. It’s particularly well suited for problems on graphs and specifically for those that can be reduced to finding paths in a graph.
And these examples are just the tip of the iceberg; there are plenty of evolutionary algorithms inspired by different biological principles, so many that they deserve a book of their own.
4. Algorithms, beyond this book
We are at the end of this book, but this is not the end of your algorithmic journey!
First of all, if you enjoyed the topics in this book, here is a list of suggested readings that can help you delve further into some of them:
- Grokking Machine Learning, by Andrew Trask, talks extensively about machine learning and some of the training algorithms we have briefly described in chapters 12, 13, and 16. If you’d like to learn how gradient descent is used in linear regression and logistic regression, this book is a good starting point.
- Grokking Artificial Intelligence Algorithms, by Rishal Hurbans, delves into some of the topics we summarized in this book, such as search and optimization and evolutionary algorithms, and presents advanced concepts we have just mentioned, such as swarm intelligence.
- Graph-Powered Machine Learning, by Allesandro Nego, is a good choice if you’d like to put what you’ve learned about graphs to good use and tackle machine learning from a different angle.
- Graph Databases in Action, by Dave Bechberger and Josh Perryman, explores another recent application of graphs. With this book, you’ll learn how graphs allow a better modeling of relations in data and why graph databases are the choice of election for highly-intercorrelated data.
- Algorithms and Data Structures for Massive Datasets, by Dzejla Medjedovic et al., expands on the topics in this book, focusing on techniques to adapt data structures and algorithms to massive datasets and handle modern big data applications.
And then, if you’d like to test your understanding of the data structures we have discussed, here is my challenge for you: implement these algorithms in your favorite language, and add them to this repository[4] I created on GitHub! You’ll find detailed instructions on the repo’s README.
Good luck for the next steps in your learning path, and I hope you will enjoy the coding challenge!
Source: Rocca Marcello La (2021), Advanced Algorithms and Data Structures, Manning Publications (2021)