Longest Common Subsequence (Edit Distance) in Python

So far in this chapter, the methods being discussed are numerical ones. There are, however, many algorithms that are not numeric in nature, but are more sym­bolic, involving patterns, pictures, sounds, or other more complex data forms. It is true that at some level all problems to be solved on a computer must be formu­lated using numbers, but in the examples so far, the numbers are the subject of the problem, and the problem would be solved numerically even if done with a pencil and paper. In other cases, this is not so.

As a major example, consider the problem of comparing two sequences of DNA. A sequence in this instance consists of a string of letters, each one referred to as a base in the sequence. DNA consists of a long sequence of base pairs involv­ing four molecules: Adenine (A), Guanine (G), Thymine (T), and Cytosine (C), linked together chemically. These ultimately define the structure of a protein, and it is the sequence that is important. A common problem in computational biology is to find the longest sequence in common between two DNA strands, where the samples may be from different individuals or even different species. Methods for doing this tend to involve the edit distance or Levenshtein distance.

The edit distance is a way of specifying how similar or dissimilar two strings are to one another by finding the minimum number of editing operations required to transform one string into the other. An editing operation can be a change in a character, a deletion, or an insertion. For example, what is the edit distance be­tween the word “planning” and the word “pruning”? It is 3:

p l a n n i n g

p r a n n i n g change “l” to “r”

p r u n n i n g change “a”   to      “u”

p r u n i n g    delete “n”

How is this used when looking at DNA? A DNA sequence is a set of the codes read from a piece of DNA, and it is a string containing only the letters G,A,T, and C. Comparing two pieces of DNA is a matter of comparing the two strings. The two strings AGGACAT and ATTACGAT are distance 3 from each other. The longest common sub-sequence has 5 characters in it:

AGGAC AT

ATTACGAT

1. Determining Longest Common Subsequence (LCS)

Exhaustive searching the two strings S1 and S2 for the longest common sub­sequence is simply be too slow for any practical purpose. Fortunately, we now have the Smith-Waterman method. It builds a matrix (two-dimensional array) where each character of the first string represents a column of the matrix, and each character in the other string forms a row, in order of appearance. The matrix is filled with numbers using the following relation:

The function σ(a,b) gives a penalty for a match/mismatch between two char­acters a and b. Here, it is 2 for a match and -2 for a miss. The gap penalty is the value assigned to leaving a gap in the sequence to perform a better match. Usu­ally, this is -1. The scheme offers a degree of flexibility, so that different penalties (and rewards) can be applied in different circumstances.

The first step in the Smith-Waterman method is to create a matrix (a table) T in which there are len(S1+1) columns and len(S2+1) rows. The first index in T(i,j) refers to the column, and the second index is the row. The values in the cur­rent row are a function of those in the previous one. Place a 0 in each element of the first row and the first column. For the two strings used previously, this would look like Table 11.1.

Now, for any element T(i,j) the neighboring elements are as follows:

The first cell to fill in the table T is T(1,1), marked with a * character in Table 11.1. The relation used to fill this cell has four parts:

The characters in the row and column match, so σ(S1 (i), S2 (j)) =σ(A, A) = 1 + T(0,0) = 2

  1. Gap penalty is -1, T(i-1, j) = 0, T(0,1) = 0. Result is -1
  2. Gap penalty is -1, T(i,j-1) = T(1, 1) = 0. Result is -1
  3. Result is 0

The maximum value of these four calculations is 1, so T(1, 1) = 2.

The next cell to compute is T(2,1). This time, the two characters are not the same, so

  1. T(i-1,j-1)+ σ(S1(i),S2(j))) where σ(S1(i),S2(j))=σ ( G,A)= -1 + T(1, 0) = -2
  2. Gap penalty is -1, T(i-1,j) = T(1, 1) = 2.

Result is 2 – 1 = 1

  1. Gap penalty is -1, T(i,j-1) = T(2, 1) = 0. Result is -1
  2. Result is 0 and so T(2, 1) = 1

For T(3, 1),

  1. G and A are not the same, o(G, T) = -2:
  2. T(i-1.j) = T(2,1) = 1 so 1-1 = 0
  3. T(i, j-1) = T(3,0) = 0 so 0 – 1 = -1
  4. 0

Result is 0 For T(4, 1),

  1. σ(A, A) = 2,
  2. T(i-1, j) = T(3, 1) = 0, 0-gap = -1
  3. T(i, j-1) = T(4, 0) =0, 0-gap = -1
  4. 0

Result is 2

For T(5,1),

  1. σ(C, A) = -2,
  2. T(i-1, j) = T(4, 1) = 2, 2-gap = 1
  3. T(i, j-1) = T(5, 0) =0, 0-gap = -1
  4. 0

Result is 1</NL>

For T(6,1),

  1. σ(A, A) = 2,
  2. T(i-1, j) = T(5, 1) = 1, 1-gap = 0
  3. T(i, j-1) = T(6 0) =0, 0-gap = -1
  4. 0

Result is 2 Finally

For T(7,1),

  1. σ(T, A) = -2,
  2. T(i-1j) = T(6, 1) = 2, 2-gap = 1
  3. T(i, j-1) = T(7 0) =0, 0-gap = -1
  4. 0

Result is 1

The result after row 2 is complete is shown in Table 11.3.

Now, move to the next row. The process repeats until all cells have been examined and assigned values. For this example, the final matrix is shown in Table 11.4.

The lower right entry is column 7, row 8, or (7,8).

This matrix indicates the degree of match at points in the string. To determine the actual match between the strings, begin with the largest value in the matrix.

In this case, it is in the lower right corner, but that’s not always true. Wherever the maximum is, start at that point in the matrix and trace left and upwards; this is essentially moving from the end of each string back to the beginning. At each step the move is left, up, or diagonally.

The process builds two ways to match the string. One indicates how to change si into s2 (call this M1), and the other indicates how to turn s2 into si (call this M2). Both strings are constructed from, in this case, (7,9) back to (0,0).

If there is more than one cell in T with a maximum value, then a route should be traced back from each maximum.

For the example string, the result is

M1= AGGACCAT

M2 =ATTAC_AT

There is a mismatch at the GG/TT pair and an inserted gap in M2.

2. NumPy

Let us now discuss the numerical Python package, NumPy. Python executes slowly by standards of languages like C and Julia because it is interpreted, not compiled into machine code. It also provides high level data structures like lists and dictionaries, and these involve a certain amount of computational overhead (they can be slow).

NumPy tries to get past this by using a new data structure that has much less overhead than does a list: the ndarray. It is a homogeneous (all elements have the same type) multi-dimensional (vectors, matrices, and more) table of the same kind as a C or Java array. An ndarray is indexed by unsigned integers beginning at 0. NumPy offers other features, but let’s start here.

3. One Dimensional Arrays (Vectors)

A one-dimensional array is usually used to represent a mathematical vector in scientific programming. A common use of a vector is to represent a direction. Let’s have an object at a location P = (x,y) on the screen, and give it a velocity of 4.5 in the x direction and 3 in the y direction. The velocity vector is V = (4.5, 3) as a tuple. Each iteration, the object will move by this amount, so in the first itera­tion x = x + 4.5 and y = y+3 will give the new position of the object. As vectors, we say that P = P + V.

There are four ways to create a vector like this in NumPy. First, we can use the built-in function array and pass it the tuple or list:

V = array((4.5, 3))               # from numpy import *

or

V = numpy.array((4.5, 3))         # import numpy

depending on how NumPy was imported. The variable V is now a NumPy array. Another way to create an array is by using the arange function:

V = arange(start, end, inc)

This initializes the array as numbers from start to end in increments of inc. The statement

V = arange(0, 10, 1)

gives V the value [0 1 2 3 4 5 6 7 8 9]. Starting at value 0 for 10 values, add 1 each time. A vector of length 2 could be created for V and then the values we want could be assigned:

V = arange(1., 2., 1)

V[0] = 4.5

V[1] = 3

Unlike other parts of Python, NumPy pays some attention to types. Notice that the call to arange use floating point numbers as start and end values. This creates a floating point nparray. If the statement had been V = arange(1, 2, 1) then the array V would contain integers.

The linspace function also creates an array, and is used when floating point arguments are used as parameters. It can be difficult to predict how the arange function will work sometimes. How many elements are created by the call arange(1,3,0.3)? The answer is 7:

[1.  1.3  1.6   1.9  2.2   2.5  2.8]

What linspace does is divide a range equally into N parts. So

linspace(1, 3, 9)

divides the range between 2 and 3 into 9 divisions:

[1.  1.25 1.5  1.75 2.   2.25 2.5   2.75 3.   ]

Finally, an array full of zeros is created:

z = zeros((3))

This creates an 1D array with 3 zero values:

[0. 0. 0.]

We can do basic arithmetic on arrays. If V = array((4.0, 3.0)) and W = ar- ray((5.0, 7.0)) then,

V+W is [ 9. 10.]

V-W = [-1. -4.]

V*W = [20. 21.]

V/W = [0.8     0.42857143]

We can do arithmetic with simple numbers (scalars), too:

2*W = [10. 14.]

The vector dot product is

V.dot(W) = 41.0.

Relational operators can be applied. So,

V<W = [ True True]

We can apply NumPy mathematics functions (not standard math library ones):

log(W) = [1.60943791 1.94591015]

The power of this new type is better illustrated in two dimensions as matrix operations.

4. Two Dimensional Arrays (Matrices)

Creating a matrix is done by passing the array function a list of lists, one for each row of the matrix. Consider the 3×3 matrix,

A = array( [[1, 0, 2],      [0, 1, 1]  [0, 2, 1] )

which is written in mathematical form as

All of the operators that have been described so far apply to these too, but also matrix multiplication. Multiplying A*A is simply done by using the @ operator:

A@A = [[1 4 4]

[0 3 2]

[0 4 3]]

Another convenient aspect of nparray is upcasting, meaning that the type of the array can change depending on the type of the operands. Multiplying an integer array by a floating point one gives a floating-point result.

Want a matrix of random numbers? First we create an instance of the NumPy random number generator, because the usual one does not know about NumPy types:

r = random.default rng(1)

Now, call random with the size of the matrix:

x = r.random((3,3))

which gives the result

[[0.51182162 0.9504637 0.14415961]

[0.94864945 0.31183145 0.42332645]

[0.82770259 0.40919914 0.54959369]]

5. Sample Problem: Finding Paths

We are going to solve a very prac­tical problem: Is there a way to get from point A to point B, and how many steps will it take? Admittedly, a step could be arbitrary, but this problem can be re-coded to find that answer, too. Consider the map in Figure 11.6. This is a simplified version of part of the New York subway system. Ten points have been identified here, and each point has a way to get to some other nearby points directly. An adja­cency matrix for this map has a row and a column for each point, and has a 1 if the two points are adjacent and a 0 otherwise. Adjacency here means con­nected with no nodes in between. For example, point 6 is adjacent to points 7 and 5 here. The points represent the following places:

  1. Marble Hill
  2. Pelham Bay Park
  3. Columbia University
  4. Lincoln Center
  5. Times Square
  6. Chambers Street
  7. Wall Street
  8. City Hall
  9. Grand central
  10. Queens Plaza

Is City Hall reachable from Wall Street? Using this map, how many stops are required for the journey?

In the adjacency matrix, we place a 1 if the row and column locations are im­mediately adjacent and a 0 otherwise. The matrix looks like this:

The algorithm for finding whether paths exist is not intuitive, but easy to implement using NumPy. When the adjacency matrix is multiplied by itself, the result has non-zero values in locations that represent points that are two steps apart. Multiply it again, and we can find points that are three steps apart, and so on. Using NumPy, this matrix could be initialized as follows:

Indexing a value in a two-dimensional array requires two indices. The first index references the row desired, and the second references the column. Numpy gives us the matrix multiply operations, so the program is as follows:

count = 1

b = adj*adj

while b[6][7] == 0:

b = adj@b

count = count + 1

print (count, b)

if count > 10:

break

print (“steps:  “, count)

As usual, the indices are always at 0, so adj[6] [7] refers to a connection be­tween points 7 and 8, which means Wall Street and City Hall.

6. Linear Regression Again

The linear regression program that was written earlier can be written with fewer lines of code, and it will be faster too, if we use NumPy. The array opera­tions make the code more compact. For example, finding the means of the x and y data can be done in one line each if both are NumPy arrays:

mx = x.sum()/len(x)          # Mean X

my = y.sum()/len(y)          # Mean y

The same is true of the standard deviation:

sdx = sqrt ( sum ( (x-mx)**2 )/(len(x)-1) )

                                # standard deviation X

sdy = sqrt ( sum ( (y-my)**2 )/(len(y)-1) )

                                # standard deviation Y

The expression x-mx actually results in a new array in which each element is the corresponding value of the x array with mx, the mean, subtracted from it. (x-mx)**2 squares each element in this array. The code that computes the correla­tion is also much more compact.

This NumPy implementation is about twice as fast as the original:

from numpy import *

def correlate (x, y, meanx, meany):

a = sum ( (x-meanx)*(y-meany) )

b = sum ((x-meanx)**2)

c = sum ((y-meany)**2)

v = a/sqrt(b*c) return v

def regress (x, y):

mx = x.sum()/len(x)          # Mean X

my = y.sum()/len(y)          # Mean y

sdx = sqrt ( sum (  (x-mx)**2 )/(len(x)-1))    # standard deviation X

sdy = sqrt ( sum (  (y-my)**2 )/(len(y)-1))      # standard deviation y

if sdx == 0:

return r = correlate (x, y, mx, my)

m = r * sdy/sdx

b = my – m * mx

return (m, b)

def err (x, y, m, b):

return sum(m*x + b – y)

f = open (“treedata.txt”, “r”)

s = f.readline ()

x = zeros((100))

y = zeros((100))

for j in range(0,100):

for i in range (1,len(s)):

if s[i] == “,”: break

x[j] = float(s[0:i-1])

y[j] = float(s[i+1:])

s = f.readline()

line = regress(x, y)

print (“Equation is y = “, line[0], “*x + “, line[1])

print (“Error is “, err(x,y, line[0], line[1]))

Source: Parker James R. (2021), Python: An Introduction to Programming, Mercury Learning and Information; Second edition.

Leave a Reply

Your email address will not be published. Required fields are marked *