ED

Published on November 2016


Dynamic Programming Algorithm (DPA) for Edit-Distance
The words `computer' and `commuter' are very similar, and a change of just one letter, p->m, will change the first word into the second. The word `sport' can be changed into `sort' by the deletion of the `p', or equivalently, `sort' can be changed into `sport' by the insertion of `p'.

The edit distance of two strings, s1 and s2, is defined as the minimum number of point mutations required to change s1 into s2, where a point mutation is one of:
1. change a letter,
2. insert a letter or
3. delete a letter

The following recurrence relations define the edit distance, d(s1,s2), of two strings s1 and s2:
d('', '') = 0                     -- '' = empty string
d(s, '')  = d('', s) = |s|        -- i.e. length of s
d(s1+ch1, s2+ch2)
  = min( d(s1, s2) + if ch1=ch2 then 0 else 1 fi,
         d(s1+ch1, s2) + 1,
         d(s1, s2+ch2) + 1 )
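The recurrence relations can be transcribed almost literally into a recursive function. The following Python sketch is an illustration only: the source gives no implementation, and the function name d simply mirrors the notation used here. It takes exponential time, so it is suitable for very short strings only.

```python
def d(s1: str, s2: str) -> int:
    """Edit distance by direct (ternary) recursion on the last characters."""
    # d('', '') = 0 and d(s, '') = d('', s) = |s|
    if not s1 or not s2:
        return len(s1) + len(s2)
    # ch1 and ch2 are the last characters of s1 and s2
    change = 0 if s1[-1] == s2[-1] else 1
    return min(
        d(s1[:-1], s2[:-1]) + change,  # match or change ch1 into ch2
        d(s1, s2[:-1]) + 1,            # edit s1+ch1 into s2, then insert ch2
        d(s1[:-1], s2) + 1,            # delete ch1, edit s1 into s2+ch2
    )


print(d("sport", "sort"))  # 1
```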

The first two rules above are obviously true, so it is only necessary to consider the last one. Here, neither string is the empty string, so each has a last character, ch1 and ch2 respectively. Somehow, ch1 and ch2 have to be explained in an edit of s1+ch1 into s2+ch2. If ch1 equals ch2, they can be matched for no penalty, i.e. 0, and the overall edit distance is d(s1,s2). If ch1 differs from ch2, then ch1 could be changed into ch2, at a cost of 1, giving an overall cost d(s1,s2)+1. Another possibility is to delete ch1 and edit s1 into s2+ch2, d(s1,s2+ch2)+1. The last possibility is to edit s1+ch1 into s2 and then insert ch2, d(s1+ch1,s2)+1. There are no other alternatives. We take the least expensive, i.e. min, of these alternatives.

The recurrence relations imply an obvious ternary-recursive routine. This is not a good idea because it is exponentially slow, and impractical for strings of more than a very few characters.

Examination of the relations reveals that d(s1,s2) depends only on d(s1',s2') where s1' is shorter than s1, or s2' is shorter than s2, or both. This allows the dynamic programming technique to be used. A two-dimensional matrix, m[0..|s1|,0..|s2|], is used to hold the edit distance values:
m[i,j] = d(s1[1..i], s2[1..j])

m[0,0] = 0
m[i,0] = i,  i=1..|s1|
m[0,j] = j,  j=1..|s2|

m[i,j] = min( m[i-1,j-1] + if s1[i]=s2[j] then 0 else 1 fi,
              m[i-1, j] + 1,
              m[i, j-1] + 1 ),  i=1..|s1|, j=1..|s2|

© L. Allison

m[,] can be computed row by row. Row m[i,] depends only on row m[i-1,]. The time complexity of this algorithm is O(|s1|*|s2|). If s1 and s2 have a `similar' length, about `n' say, this complexity is O(n^2), much better than exponential!
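The matrix formulation can be sketched in Python as follows. This is a minimal illustration rather than the author's code: the function name edit_distance is an assumption, and Python strings are indexed from 0, so s1[i-1] plays the role of s1[i] in the definitions of m.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Edit distance using the full (|s1|+1) x (|s2|+1) matrix m."""
    rows, cols = len(s1) + 1, len(s2) + 1
    m = [[0] * cols for _ in range(rows)]
    # Boundary conditions: m[i,0] = i and m[0,j] = j
    for i in range(1, rows):
        m[i][0] = i
    for j in range(1, cols):
        m[0][j] = j
    # Fill the matrix row by row; row i depends only on row i-1
    for i in range(1, rows):
        for j in range(1, cols):
            change = 0 if s1[i - 1] == s2[j - 1] else 1
            m[i][j] = min(
                m[i - 1][j - 1] + change,  # match or change
                m[i - 1][j] + 1,           # delete s1[i]
                m[i][j - 1] + 1,           # insert s2[j]
            )
    return m[rows - 1][cols - 1]


print(edit_distance("computer", "commuter"))  # 1
print(edit_distance("sport", "sort"))         # 1
```

The two nested loops make the O(|s1|*|s2|) running time directly visible.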

Complexity
The time-complexity of the algorithm is O(|s1|*|s2|), i.e. O(n^2) if the lengths of both strings are about `n'. The space-complexity is also O(n^2) if the whole of the matrix is kept for a trace-back to find an optimal alignment. If only the value of the edit distance is needed, only two rows of the matrix need be allocated; they can be "recycled", and the space complexity is then O(|s2|), the length of one row, i.e. O(n).
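The two-row idea might look like this in Python. It is a sketch under the same assumptions as the matrix definitions (unit costs, rows indexed by s1); the names edit_distance_two_rows, prev and curr are illustrative and do not come from the source.

```python
def edit_distance_two_rows(s1: str, s2: str) -> int:
    """Edit distance keeping only two rows of the matrix at any time."""
    n2 = len(s2)
    prev = list(range(n2 + 1))  # row 0: m[0,j] = j
    curr = [0] * (n2 + 1)       # scratch row, overwritten on each pass
    for i, ch1 in enumerate(s1, start=1):
        curr[0] = i             # m[i,0] = i
        for j, ch2 in enumerate(s2, start=1):
            change = 0 if ch1 == ch2 else 1
            curr[j] = min(prev[j - 1] + change, prev[j] + 1, curr[j - 1] + 1)
        # Recycle the rows: the old previous row becomes the next scratch row
        prev, curr = curr, prev
    return prev[n2]


print(edit_distance_two_rows("computer", "commuter"))  # 1
```

Because earlier rows are discarded, this version returns only the distance; a trace-back to recover an alignment still needs the whole matrix.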

Variations
The costs of the point mutations can be varied to be numbers other than 0 or 1. Linear gap-costs are sometimes used where a run of insertions (or deletions) of length `x' has a cost of `ax+b', for constants `a' and `b'. If b>0, this penalises numerous short runs of insertions and deletions.

Longest Common Subsequence
The longest common subsequence (LCS) of two sequences, s1 and s2, is a subsequence of both s1 and of s2 of maximum possible length. The more alike that s1 and s2 are, the longer is their LCS.

Other Algorithms
There are faster algorithms for the edit distance problem, and for similar problems. Some of these algorithms are fast if certain conditions hold, e.g. the strings are similar, or dissimilar, or the alphabet is large, etc. Ukkonen (1983) gave an algorithm with worst case time complexity O(n*d), and the average complexity is O(n+d^2), where n is the length of the strings, and d is their edit distance. This is fast for similar strings where d is small, i.e. when d<<n.
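The length of the longest common subsequence can be computed by a dynamic programming routine of the same shape as the one for edit distance. The sketch below is the standard textbook LCS recurrence, not one of the faster algorithms mentioned here; the function name lcs_length is an assumption.

```python
def lcs_length(s1: str, s2: str) -> int:
    """Length of the longest common subsequence of s1 and s2."""
    prev = [0] * (len(s2) + 1)  # LCS lengths for the empty prefix of s1
    for ch1 in s1:
        curr = [0]              # LCS with the empty prefix of s2 is 0
        for j, ch2 in enumerate(s2, start=1):
            if ch1 == ch2:
                # Extend the best common subsequence of the shorter prefixes
                curr.append(prev[j - 1] + 1)
            else:
                # Drop the last character of one string or the other
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


print(lcs_length("computer", "commuter"))  # 7, e.g. "comuter"
```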

Applications

File Revision
The Unix command diff f1 f2 finds the difference between files f1 and f2, producing an edit script to convert f1 into f2. If two (or more) computers share copies of a large file F, and someone on machine-1 edits F=F.bak, making a few changes, to give F.new, it might be very expensive and/or slow to transmit the whole revised file F.new to machine-2. However, diff F.bak F.new will give a small edit script which can be transmitted quickly to machine-2, where the local copy of the file can be updated to equal F.new.

diff treats a whole line as a "character" and uses a special edit-distance algorithm that is fast when the "alphabet" is large and there are few chance matches between elements of the two strings (files). In contrast, there are many chance character-matches in DNA where the alphabet size is just 4, {A,C,G,T}. Try `man diff' to see the manual entry for diff.

Remote Screen Update Problem
If a computer program on machine-1 is being used by someone from a screen on (distant) machine-2, e.g. via rlogin etc., then machine-1 may need to update the screen on machine-2 as the computation proceeds. One approach is for the program (on machine-1) to keep a "picture" of what the screen currently is (on machine-2) and another picture of what it should become. The differences can be found (by an algorithm related to edit-distance) and the differences transmitted... saving on transmission band-width.

Spelling Correction
Algorithms related to the edit distance may be used in spelling correctors. If a text contains a word, w, that is not in the dictionary, a `close' word, i.e. one with a small edit distance to w, may be suggested as a correction. Transposition errors are common in written text. A transposition can be treated as a deletion plus an insertion, but a simple variation on the algorithm can treat a transposition as a single point mutation.

Plagiarism Detection
The edit distance provides an indication of similarity that might be too close in some situations ... think about it.

Molecular Biology
The edit distance gives an indication of how `close' two strings are. Similar measures are used to compute a distance between DNA sequences (strings over {A,C,G,T}), or protein sequences (over an alphabet of 20 amino acids), for various purposes, e.g.:
1. to find genes or proteins that may have shared functions or properties
2. to infer family relationships and evolutionary trees over different organisms

Speech Recognition
Algorithms similar to those for the edit-distance problem are used in some speech recognition systems: find a close match between a new utterance and one in a library of classified utterances.
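The transposition variation mentioned under Spelling Correction can be sketched as follows. The source does not spell the variation out; this sketch assumes the common "optimal string alignment" form (a restricted Damerau-Levenshtein distance), in which two adjacent characters may be swapped at a cost of 1. The function name osa_distance is illustrative.

```python
def osa_distance(s1: str, s2: str) -> int:
    """Edit distance where swapping two adjacent letters costs 1."""
    n1, n2 = len(s1), len(s2)
    m = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for i in range(n1 + 1):
        m[i][0] = i
    for j in range(n2 + 1):
        m[0][j] = j
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            change = 0 if s1[i - 1] == s2[j - 1] else 1
            best = min(m[i - 1][j - 1] + change, m[i - 1][j] + 1, m[i][j - 1] + 1)
            # Extra case: the last two characters of each prefix are swapped
            if (i > 1 and j > 1
                    and s1[i - 1] == s2[j - 2]
                    and s1[i - 2] == s2[j - 1]):
                best = min(best, m[i - 2][j - 2] + 1)
            m[i][j] = best
    return m[n1][n2]


print(osa_distance("teh", "the"))  # 1; the plain edit distance is 2
```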

Example
An example of a DNA sequence from `Genebank' can be found [here]. The simple edit distance algorithm would normally be run on sequences of at most a few thousand bases.

////////////////////////////////////////////

What is Levenshtein Distance?
Levenshtein distance (LD) is a measure of the similarity between two strings, which we will refer to as the source string (s) and the target string (t). The distance is the number of deletions, insertions, or substitutions required to transform s into t. For example,
• If s is "test" and t is "test", then LD(s,t) = 0, because no transformations are needed. The strings are already identical.
• If s is "test" and t is "tent", then LD(s,t) = 1, because one substitution (change "s" to "n") is sufficient to transform s into t.

The greater the Levenshtein distance, the more different the strings are.

Levenshtein distance is named after the Russian scientist Vladimir Levenshtein, who devised the algorithm in 1965. If you can't spell or pronounce Levenshtein, the metric is also sometimes called edit distance.

The Levenshtein distance algorithm has been used in:
• Spell checking
• Speech recognition
• DNA analysis
• Plagiarism detection


The Algorithm

Steps
Step 1: Set n to be the length of s. Set m to be the length of t. If n = 0, return m and exit. If m = 0, return n and exit. Construct a matrix containing 0..m rows and 0..n columns.
Step 2: Initialize the first row to 0..n. Initialize the first column to 0..m.
Step 3: Examine each character of s (i from 1 to n).
Step 4: Examine each character of t (j from 1 to m).
Step 5: If s[i] equals t[j], the cost is 0. If s[i] doesn't equal t[j], the cost is 1.
Step 6: Set cell d[i,j] of the matrix (column i, row j) equal to the minimum of:
   a. The cell immediately to the left plus 1: d[i-1,j] + 1.
   b. The cell immediately above plus 1: d[i,j-1] + 1.
   c. The cell diagonally above and to the left plus the cost: d[i-1,j-1] + cost.
Step 7: After the iteration steps (3, 4, 5, 6) are complete, the distance is found in cell d[n,m].
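The seven steps can be followed one by one in a short Python function. This is a sketch, not the applet's code: the name levenshtein is an assumption, d is stored as a list of lists with the first index running over s and the second over t, and Python's 0-based strings mean s[i-1] stands for s[i] in the steps.

```python
def levenshtein(s: str, t: str) -> int:
    """Levenshtein distance, annotated with the step numbers."""
    # Step 1: lengths, trivial cases, and the matrix
    n, m = len(s), len(t)
    if n == 0:
        return m
    if m == 0:
        return n
    d = [[0] * (m + 1) for _ in range(n + 1)]
    # Step 2: d[i][0] = i and d[0][j] = j
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):          # Step 3: each character of s
        for j in range(1, m + 1):      # Step 4: each character of t
            # Step 5: cost of aligning s[i] with t[j]
            cost = 0 if s[i - 1] == t[j - 1] else 1
            # Step 6: minimum over the three neighbouring cells
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    # Step 7: the distance is in cell d[n][m]
    return d[n][m]


print(levenshtein("test", "test"))  # 0
print(levenshtein("test", "tent"))  # 1
```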

Example
This section shows how the Levenshtein distance is computed when the source string is "GUMBO" and the target string is "GAMBOL".

Steps 1 and 2
      G  U  M  B  O
   0  1  2  3  4  5
G  1
A  2
M  3
B  4
O  5
L  6

Steps 3 to 6, when i = 1
      G  U  M  B  O
   0  1  2  3  4  5
G  1  0
A  2  1
M  3  2
B  4  3
O  5  4
L  6  5

Steps 3 to 6, when i = 2
      G  U  M  B  O
   0  1  2  3  4  5
G  1  0  1
A  2  1  1
M  3  2  2
B  4  3  3
O  5  4  4
L  6  5  5

Steps 3 to 6, when i = 3
      G  U  M  B  O
   0  1  2  3  4  5
G  1  0  1  2
A  2  1  1  2
M  3  2  2  1
B  4  3  3  2
O  5  4  4  3
L  6  5  5  4

Steps 3 to 6, when i = 4
      G  U  M  B  O
   0  1  2  3  4  5
G  1  0  1  2  3
A  2  1  1  2  3
M  3  2  2  1  2
B  4  3  3  2  1
O  5  4  4  3  2
L  6  5  5  4  3

Steps 3 to 6, when i = 5
      G  U  M  B  O
   0  1  2  3  4  5
G  1  0  1  2  3  4
A  2  1  1  2  3  4
M  3  2  2  1  2  3
B  4  3  3  2  1  2
O  5  4  4  3  2  1
L  6  5  5  4  3  2

Step 7
The distance is in the lower right hand corner of the matrix, i.e. 2. This corresponds to our intuitive realization that "GUMBO" can be transformed into "GAMBOL" by substituting "A" for "U" and adding "L" (one substitution and one insertion = 2 changes).
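The edits behind that distance can be recovered by walking back through the completed matrix from cell d[n,m] to d[0,0]. The trace-back below is a sketch that goes beyond the steps listed in this article: the function name edit_script and the wording of the operations are assumptions, and when several optimal edit scripts exist it returns just one of them.

```python
def edit_script(s: str, t: str) -> list[str]:
    """One minimal list of edits turning s into t, found by trace-back."""
    n, m = len(s), len(t)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)

    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        cost = 0 if i > 0 and j > 0 and s[i - 1] == t[j - 1] else 1
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + cost:
            # Diagonal move: a match (no edit) or a substitution
            if cost == 1:
                ops.append(f"substitute {s[i - 1]} -> {t[j - 1]}")
            i, j = i - 1, j - 1
        elif j > 0 and d[i][j] == d[i][j - 1] + 1:
            ops.append(f"insert {t[j - 1]}")
            j -= 1
        else:
            ops.append(f"delete {s[i - 1]}")
            i -= 1
    ops.reverse()  # edits were collected from the end of the strings
    return ops


print(edit_script("GUMBO", "GAMBOL"))
# ['substitute U -> A', 'insert L']
```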
