Levenshtein Distance in C++

Frederik Bærentsen included in category Programming

2016-05-02 2016-05-02 884 words 5 minutes

Contents

Introduction

When working with strings in programming, I’ve often had to compare them to other strings in order to find the same string. This is pretty simple using things like = or strcmp, but if the two strings are not 100% the same, but maybe have one letter different, a problem arise.

When working on my Bachelor thesis, I had a camera that scanned text strings and using OCR found the letters that matched the best. The OCR matched decently, but often one or two letters were not recognized correctly.

To combat this, I looked into the Levenshtein Distance which basically gives each word a score based on how much alike they are.

Example

The best way of explaining how the algorithm works is by looking at an example. The table below shows the Levenshtein Distance matrix for the words `Hello` and `Holle`.

		H	e	l	l	o
	0	1	2	3	4	5
H	1	0	1	2	3	4
o	2	1	1	2	3	3
l	3	2	2	1	2	3
l	4	3	3	2	1	2
e	5	4	3	3	2	2

The first row contains the string; `Hello`, in the first column contains the second string; `Holle`. The second row and the second column gives the distance for each character to an empty string. This means that if the first string is `Hello` and the second string is nothing, then the distance from nothing to nothing is `0`, `H` to nothing is `1`, `He` to nothing is `2`, `Hel` to nothing is `3` and so on. Every insertion for a new character cost 1. The same is true for the column with `Holle`.

Implementation

The code for the first step of the Levenshtein Distance matrix looks something like:

1
2


d[i][0] = i;
d[0][j] = j

When the first row and column has been filled out, another part of the algorithm has to be used. This time there are three ways of finding the shortest distance to the new character.

Inserting a value
Deleting a value
Replacing a value

Both inserting and deleting a value has a cost of 1 whereas replacing a value can either have a cost of 1 or 2 or even more if that is needed. For this the replacement cost will just be 1. The code for this is:

1
2
3
4
5


d[i-1][j]+1
d[i][j-1]+1
d[i-1][j-1]+cost
if x[i] = y[j] then cost = 0
if x[i] != y[j] then cost = 1

Looking at the first table the result of Line 1 from the previous code will be `1 + 1 = 2`, Line 2 will be `1 + 1 = 2` and Line 3 will be `0 + 0 = 0`. This gives us three results; `0`, `2`, `2`, taking the smallest of the three is the distance between the fist two letters in the two words.

Taking the next letter `e` from word `Hello` and compute the distance from `H` in `Holle` (see the green highlight in the table below) gives the three results; `2 + 1 = 3` (yellow and green), `0 + 1 = 1` (blue and green) and `1 + 1 = 2` (pink and green). The smallest value is then 1, which means that from `He` to `H` there is a distance of 1.

		H	e	l	l	o
	0	`1`	`2`	3	4	5
H	1	`0`	`1`	2	3	4
o	2	1	1	2	3	3
l	3	2	2	1	2	3
l	4	3	3	2	1	2
e	5	4	3	3	2	2

This can be computed for all the characters in the two strings using the code:

1
2
3
4
5
6


for (int i = 1; i <= n; i++) {
  for (int j = 1; j <= ml j++) {
    int cost = (t[j-1] == s[i-1]?0:1);
    d[i][j] = min(min(d[i-1][j]+1,d[i][j-1]+1).d[i-1][j-1]+cost);
  }
}

Line 1: Run through the first string where n is the string length.
Line 2: Run through the second string where m is the string length.
Line 3: Calculate the cost. If the two characters are identical the cost is 0 else it is 1. t is the first string and s is the second string.
Line 4: Calculate the cost to be placed in the matrix d. Find the minimum between insertion and deletion then check the result against the replacement and find the smallest value. The result is stored in d.

Final

The final code for the function can be seen in the code below:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24


int distanceString(string s, string t) {
  int n = s.length();
  int m = t.length();
  int d[n+1][m+1];
  if (n == 0) {
    return m;
  }
  if (m == 0) {
    return n;
  }
  for (int i = 0; i <= n; i++) {
    d[i][0] = i;
  }
  for (int j = 0; j <= m; j++) {
    d[0][j] = j;
  }
  for (int i = 1; i <= n; i++) {
    for (int j = 1; j <= m; j++) {
      int cost = (t[j-1] == s[i-1])?0:1;
      d[i][j] = min(min(d[i-1][j]+1,d[i][j-1]+1),d[i-1][j-1]+cost);
    }
  }
  return d[n][m];
}

We can now use the function `distanceString(“Hello”,“Holle”);` which will return 2.