Genomic Sequence Comparison


Introduction

Sequence comparison to determine how similar or different two sequences are is a fundamental operation of computational biology. Perhaps two laboratories want to compare their work on the same gene; or a biologist might want to determine similarities in gene sequences from different organisms, such as yeast and mouse. Programs that recreate a large DNA sequence from fragments use sequence comparison. As another example, some biologists are interested in searching databases for sequences that are locally similar to a given sequence in an attempt to discover gene function. (Refer to the "Genomic Data" module for background material for the current module.)


Alignment

In comparing two sequences, such as ATGAC and ACGC, we discover one or more alignments that gives the highest possible similarity score. An alignment of two DNA sequences has spaces in the sequences so that they are of the same length but so that a space in one sequence is not in the same position as a space in the other sequence. For example, with a dash (-) indicating a space, one alignment of s = ATGAC and t = ACGC is

s:
A
T
G
A
C
t:
A
-
C
G
C

Instead of stacking the alignments, we can write them as a pair, (ATGAC, A-CGC). Another alignment is (ATGAC--, ---ACGC) or

s:
A
T
G
A
C
-
-
t:
-
-
-
A
C
G
C

Scoring can be defined in a number of ways. Let us define the score for an alignment as the total of column scores, where the column scores have the following values:

For example, with indicated column scores, the alignment

s:
A
T
G
A
C
t:
A
-
C
G
C
column scores: 1 -2 -1 -1 1

has a score of 1 + (-2) + (-1) + (-1) + 1 = -2, while the alignment
s:
A
T
G
A
C
-
-
t:
-
-
-
A
C
G
C
column scores: -2 -2 -2 1 1 -2 -2

has a score of (-2) + (-2) + (-2) + 1 + 1 + (-2) + (-2) = -8. The similarity of two sequences is the maximum of all possible alignment scores.

Because we obtain the same column scores regardless of which sequence is in the first row, the order in which we write the sequences is irrelevant for determining an alignment score.  Thus, the similarity of two sequences does not depend on which we write first.  For example, the similarity of ATGAC and ACGC is the same as the similarity of ACGC and ATCAC.

Quick Review Question
Quick Review Question 1  Give the score for the alignment (ATGAC, ACG-C).


Similarity Score

Dynamic programming is a technique to determine the similarity and the alignment(s) that yield this highest score. The algorithm makes the best decision for prefixes, or subsequences from the start of the sequences, as it iterates over the length of those prefixes. We use the notation s[i..j] to indicate the subsequence from position i to position j, where the first position number is 0 as with arrays in C++. For example, with indexing starting at 0 in s = ATGAC, s[1..2] is TG; and s[0..2] is the prefix ATG. The notation s[i] is base i in sequence s, so that s[1] is T.

We write the developing intermediate similarity scores in a two-dimensional array, or matrix, a. The bases of s appear along the left margin of the array, starting at row 1; and the bases of t are on top, starting at column 1 (see Figure 1). Blanks, which we represent with hyphens, provide the headings for row 0 and column 0.  For clarity in Figure 1, we have the indices of s on the extreme left and the indices of the rows of a adjacent to the matrix.  Thus, 1 appears to the left of T, which is s[1]; and T heads row 2 of matrix a. Similarly, the indices of t appear above the corresponding bases; and the column numbers of a are immediately above the matrix.

Figure 1  Initial values in similarity matrix

Array position a[i][j] is the similarity score for prefixes s[0..i-1] and t[0..j-1]. For example, a[3][2] gives the similarity score for s[0..2] (ATG in our example) and t[0..1] (AC here). In row 0, we write the on-going scores for matching all spaces with prefixes of t. For example, aligning a space with the prefix t[0..0] = A costs -2; two spaces corresponding to t[0..1] = AT has a cumulative value of (-2) + (-2) = -4; three spaces and ACG, adds -2 to the previous score to obtain (-4) + (-2) = -6; and four spaces with t yields (-6) + (-2) = -8. Symmetrically, in column 0, we place the on-going scores for matching prefixes of s with all spaces. We draw line segments between subsequent values to indicate the derivation steps.


We give a general formula for score a[i][j] below, but first we illustrate the process with a particular example. To determine a[3][2], we only must know the row 3 heading (G), column 2 heading (C), and the scores in positions above (a[2][2]), to the left (a[3][1]), and diagonally left above (a[2][1]) the desired position. Figure 2 illustrates this situation. We obtain the best alignment of ATG and AC in one of three ways:

The value in a[3][2] is the maximum of the three scores.

Figure 2. Determine a[3][2] from a[2][2],a[3][1], and a[2][1]


The score -1 in a[2][1] indicates the best score for aligning prefix sequences s[0..1] = AT and t[0..0] = A. Going diagonally to a[3][2], we have the score for attaching G to the end of AT and C to the end of A. The alignment situation is as follows:

Best alignment of AT and A
G
C

Because G and C mismatch, this alignment of ATG and AC has a value of (-1) + (-1) = -2, which is the sum of the previous score and the mismatch Value, as the following illustrates:


From above a[3][2], a[2][2] = 0 is the similarity for prefixes sequences s[0..1] = AT and t[0..1] = AC. Thus, when extending the prefix of s to s[0..2] = ATG, we match a space with G from s, as follows:

Best alignment of AT and AC
G
-

Because matching with a space costs -2, this arrangement has a total value of 0 + (-2) = -2, as the following shows:


Finally, we consider a[3][1] = -3, which is to the left of a[3][2]. To go from the best alignment of the prefix ATG of s and A of t, we match a space in s to C in t, as follows:
Best alignment of ATG and A
-
C

The resulting value, a[3][1] + (-2) = (-3) + (-2) = -5, is not as good as the -2 values obtained from above and on the diagonal. The array element a[3][2] is the maximum of -2, -2, and -5, which is -2.

To summarize, a[3][2] is the maximum of the following values:

The general formula is as follows:



Figure 3 illustrates derivation of the value a[i][j].

Figure 3. Derivation of the value a[i][j]


A line segment is drawn to a[3][2] from any of the three positions that yields this maximum. In this example, we draw a line segment between a[2][1] and a[3][2] and between a[2][2] and a[3][2].
Quick Review Questions
Quick Review Question 2  Suppose a[2][2] = a[2][3] = 0, a[3][2] = -2, and row 3 and column 3 headings read "G," as in the following matrix:
G
0 1 2 3
0

 

 

 

 

                 
1

 

 

 

 

                 
2

 

 

0

0

                 
G 3

 

 

-2

 

 

 Determine the score from each position to a[3][3].


a.  Diagonal, a[2][2]


b.  Above, a[2][3]


c.  Left, a[3][2]


d.  The score of a[3][3] is derived from which direction?
Diagonal Above Left Right Below

Quick Review Questions
Quick Review Question 3  Suppose the value above is -5, the left position is 1, and the diagonal position is 0. The row label is G, and the column label is C, as in the following partial matrix:
C

0

-5

G

1

0

. Compute the possible values from each direction and the new value in the array of similarity values.

a.  Above


b.  Left


c.  Diagonal


d.  What is the matrix value?


e.  We draw line segment(s) from which position(s)?
Diagonal Above Left Right Below


Similarity Matrix

To derive the entire array, we initialize the first row and column as Figure 1 indicates. We then proceed row by row, from left to right, determining scores by the above formula. Having obtained values to the left, above, and on the diagonal, we can use the formula to evaluate a[i][j]. The answer to Quick Review Question 4 develops the complete array. 

Quick Review Question
Quick Review Question 4  Fill in the array of values for the similarity matrix below, indicating the direction(s) from which you derive each value.  Do not type pluses for positive numbers.  Employ the following scoring:
  • +1 for a match
  • -1 for a mismatch
  • -2 for a space in one of the corresponding positions
_ A          C        G        C  
0 1   2   3   4  
_ 0

 0

-

-2

  -

-4

  -

-6

  -

-8

 
     |              
A 1

 -2

 
    |        
T 2

 -4

    |        
G 3

 -6

    |        
A 4

-8 

    |        
C 5

 -10

       
    

a. Give the value of a[1][1].


b. Select the direction(s) from which to draw line segment(s) to the element.
top left diagonal

Quick Review Question 4 develops and Figure 4 contains the entire similarity matrix.  The value in the bottom, right corner, 0, is the similarity of ATGAC and ACGC.  Following the line segments from that corner backward to a[0][0], we see how best to align the sequences: Figure 4. Array of similarity values for ATGAC and ACGC
_ A C G C
0 1 2 3 4
_ 0

 0

-

-2

-

-4

-

-6

-

-8

     |  \            
A 1

 -2

1

-

-1

-

-3

-

-5

     |    |      
T 2

 -4

-1

0

-

-2

-

-2

     |    |  \  |   \  
G 3

 -6

-3

-2

-

 -1

| \ | \ | | \
A 4

-8 

-5

-4

 -1

 0

| | \ | \
C 5

 -10

-7

-4

 -3

 0

Figure 5 displays the similarity matrix of Figure 4 with the path from a[5][4] to a[0][0] indicating the best alignment.  Using Figure 5, Figure 6 gives the optimum alignment of ATGAC and ACGC. Sometimes, more than one alignment yields the same similarity.

Figure 5. From Figure 4, path indicating best alignment of ATGAC and ACGC in bold

_ A C G C
0 1 2 3 4
_ 0

 0

-

-2

-

-4

-

-6

-

-8

     |  \            
A 1

 -2

1

-

-1

-

-3

-

-5

     |    \      
T 2

 -4

-1

0

-

-2

-

-2

     |    |  \  |  \   \  
G 3

 -6

-3

-2

-

 -1

| \ | \ | | \
A 4

-8 

-5

-4

 -1

 0

| | \ | \
C 5

 -10

-7

-4

 -3

 0

Figure 6. Optimal alignment of ATGAC and ACGC obtained from Figure 5

s:
A
T
G
A
C
t:
A
C
G
-
C

Dynamic Programming Algorithms

Bases match, bases mismatch, and a base’s association with a space can carry different weights than +1, -1, and -2, respectively. Thus, the Dynamic Programming Algorithm (score), which determines the similarity array for two sequences, has input parameters for match Value, mismatch Value, and spacePenalty, as well as for sequences s and t. The output parameter is the array (a) of similarity values. The Dynamic Programming Algorithm follows:


score(s, t, match Value, mismatch Value, spacePenalty, a)
	Procedure to determine array of similarity values for two sequences
		of bases
Pre:
	s and t are non-empty sequences of bases.
	match Value is the numeric value for two paired bases in an alignment
    		agreeing.
	mismatch Value is the numeric value for two paired bases in an alignment
        	disagreeing.
	spacePenalty is the numeric value for a base being paired with a space
        	in an alignment.
	match Value > mismatch Value
	match Value > spacePenalty
	a is an m-by-n array, where m is the length of s plus 1 and n is the length of t plus 1.

Post:
	a is an array of similarity values for prefixes of s and t. The value
        	in the bottom, right corner is the similarity of s and t.
Algorithm:
	m <-- length of s
	n <-- length of t
	for i from 0 through m, do the following:
		a[i][0] <-- i * spacePenalty
    	for j from 0 through n, do the following:
		a[0][j] <-- j * spacePenalty
    	for i from 1 through m, do the following:
		for j from 1 through n, do the following:
			if (s[i - 1] equals  t[j - 1])
				diagonalValue <-- match Value
			else
				diagonalValue <-- mismatch Value
			a[i][j] <-- max(a[i - 1][j - 1] + diagonalValue,
     				     a[i - 1][j]		+ spacePenalty,
     				     a[i][j - 1]		+ spacePenalty)

Quick Review Question
Quick Review Question 5  The questions below refer to the example in Quick Review Question 4 and Figure 4 as they relate to the score algorithm.  Do not type quotation marks, apostrophes, or plus signs.

a. Give the value of s.

b. Give the value of t.

c. Give the value of matchValue.

d. Give the value of mismatchValue.

e. Give the value of spacePenalty.

f. Give the value of m.

g. Give the value of n.

h. Indicate the component to which the first loop assign values.
column 0 diagonal last column last row

row 0


i. Give the value that the second loop assigns to a[0][3].

j. Indicate how the third loop goes through the matrix.
A column at a time A diagonal at a time     A row at a time

k. When i equals 4, give the value of s[i -1].

l. When j equals 5, give the value of t[j -1].

m. When i equals 4 and j equals 5, give the value of diagonalValue.

n. When i equals 4 and j equals 5, referring to Figure 4, give the value of a[i - 1][j - 1] + diagonalValue.

o. When i equals 4 and j equals 5, referring to Figure 4, give the value of a[i - 1][j] +spacePenalty.

p. When i equals 4 and j equals 5, referring to Figure 4, give the value of a[i][j - 1] + spacePenalty.

q. When i equals 4 and j equals 5, give the value of a[i][j].

The algorithm below uses the array of similarity values to determine the optimal alignment of s and t in a method that is comparable to following line segments from the bottom, right to the top, left in the similarity array. The algorithm employs an initialization function, align, that calls a recursive function recAlign to develop the alignment. The function align has input parameters of this array and the original strings and output parameters of the aligned strings with a dash indicating a space. In recAlign, the plus sign (+) between a string and a character indicates concatenation, and null is the empty string.

align(s, t, a, match Value, mismatch Value, spacePenalty, sAlignment, 
			tAlignment)
    Algorithm to determine optimal alignment of two sequences of bases

Pre:
    s and t are sequences of bases.
    a is the array of similarity values for prefixes of s and t, as completed by score.
    spacePenalty is the numeric value for a base being paired with a space in an alignment.
Post:
    sAlignment and tAlignment are sequences in the optimal alignment for
         sequences s and t, respectively, with a dash indicating a space.
Algorithm:
	sLength <-- length of s
	tLength <-- length of t
	recAlign(s, t, a, match Value, mismatch Value, spacePenalty, sAlignment,
        	tAlignment, sLength, tLength)

recAlign(s, t, a, match Value, mismatch Value, spacePenalty, sAlignment, 
       tAlignment, i, j)
    Algorithm to determine optimal alignment of two prefix sequences of
         bases
Pre:
    s and t are sequences of bases.
    a is the array of similarity values for prefixes of s and t, as completed by score.
    i is an index that is less than or equal to the length of s.
    j is an index that is less than or equal to the length of t.
   spacePenalty is the numeric value for a base being paired with a space in an alignment.

Post:
    sAlignment and tAlignment are sequences in the optimal alignment for
        prefixes s[0..i - 1] and t[0..j - 1], respectively, with a dash indicating 
        a space.
Algorithm:
    if (i equals 0 and j equals 0)
	sAlignment <--	null
	tAlignment <--	null
    else
    	if (i > 0 and a[i][j ] equals a[i - 1][j] + SpacePenalty	// above
    		recAlign(s, t, a, match Value, mismatch Value, spacePenalty, 
    			sAlignment, tAlignment, i - 1, j)
       		sAlignment <--	sAlignment + s[i - 1]
    		tAlignment <--	tAlignment + ‘-‘
    	else if (j > 0 and a[i][j ] equals a[i][j - 1] + spacePenalty)	// left
    		recAlign(s, t, a, match Value, mismatch Value, spacePenalty, 
    			sAlignment, tAlignment, i, j - 1)
    		sAlignment <--	sAlignment + ‘-‘
    		tAlignment <--	tAlignment + t[j - 1]
    	else					// diagonal
    		recAlign(s, t, a, MatchValue, MismatchValue, spacePenalty,
    			sAlignment, tAlignment, i - 1, j - 1)
    		sAlignment <--	sAlignment + s[i - 1]
    		tAlignment <--	tAlignment + t[j - 1]

Quick Review Question
Quick Review Question 6  The questions below refer to the example in Quick Review Question 5 and Figure 5 as they relate to the align and recAlign algorithms.  Do not type quotation marks, apostrophes, or plus signs.

a. Give the value of sLength.

b. Give the value of tLength.

c. Give the value of sAlignment after completion of align.

d. Give the value of tAlignment after completion of align.

e. Referring to Figure 5, for i equals 4 and j equals 3, give the value of a[i]

f. Referring to Figure 5, for i equals 4 and j equals 3, give the value of a[i -1][j] + spacePenalty

g. For i equals 4 and j equals 3, complete the arguments in the call to recAlign.
recAlign("ATGAC", "ACGC", a, -2, sAlignment, tAlignment, , );

h. After the call to recAlign in Part g when i equals 4 and j equals 3, the value of sAlignment is ATG. Give the value subsequently assigned to sAlignment.

i. After the call to recAlign in Part g when i equals 4 and j equals 3, the value of sAlignment is ATG. Give the value subsequently assigned to sAlignment.

j. For i and j equal 1, referring to Figure 5, give the value of a[i][j ].

k. For i and j equal 1, referring to Figure 5, give the value of a[i - 1][j] + spacePenalty.

l. For i and j equal 1, referring to Figure 5, give the value of a[i][j - 1] + spacePenalty.

m. For i and j equal 1, complete the arguments in the call to recAlign.
recAlign("ATGAC", "ACGC", a, -2, sAlignment, tAlignment, , );

n. Indicate the value of sAlignment after the call to recAlign in Part m.
0 "A" "AT" null

o. After the call to recAlign in Part m when i and j equal 1, the value of sAlignment is null. Give the value subsequently assigned to sAlignment.

p. After the call to recAlign in Part m when i and j equal 1, the value of tAlignment is null. Give the value subsequently assigned to tAlignment.

The procedure align gives one optimal alignment, but not every one. When a choice exists, the algorithm gives priority to going vertically up, next to going to the left, and finally to going on the diagonal. An advantage of having the diagonal be the last option is that recAlign then does not need values for matchValue and mismatchValue. By altering the if-else statement, we can change this precedence.  For some sequence pairs, such variations yield different optimal alignments.

Exercises

1. Consider the two sequences ATGAC and ACGC.
    a. List every possible alignment in which ATGAC has no blanks.
    b. List every possible alignment in which ATGAC has at least one blank         and ACGC has only blank(s) at the end.

2. a. Trace through the score algorithm for the strings and similarity array in Quick Review Question 4 and Figure 4.
    b. Trace through align and recAlign algorithms for the strings and similarity array in Quick Review Question 5 and Figure 5.

3. a. Develop the similarity array for strings s = CGTA and t = GCATG using the scoring of +1, -1, -2 from the example in this module.
    b. What is the similarity score?
    c. Give the optimal alignment that align returns.
    d. Give all optimal alignments for these strings.

4. Repeat Exercise 1 for strings s = CGTA and t = GCATG.

5. Repeat Exercise 1 for strings s = ATGAC and t = ACGC using the scoring that a mismatch or space costs 1, while a match has score 2.

6. What is the complexity of score?

7. What is the complexity of recAlign?

8. Change the priority of the align algorithm so that when a choice exists, the algorithm gives priority to going vertically up, next to going on the diagonal, and finally to going to the left. Because priority is given to going up, the resulting alignment is called the "upmost alignment".  Give the additional parameters that align and recAlign need.

9. Change the priority of the recAalign algorithm so that when a choice exists, the algorithm gives priority to going to the left, next to going on the diagonal, and finally to going vertically up. Because priority is given to going to the left, the resulting alignment is called the "downmost alignment".  Give the additional parameters that align and recAlign need.


Projects

1. Develop a program using the dynamic programming technique that returns the similarity and an optimal alignment for pairs of sequences.

2. Complete Project 1, except modify align so that the procedure returns all optimal alignments. Hint: Use a stack and backtracking.

Copyright © 2002, Dr. Angela B. Shiflet
All rights reserved