Levenshtein Distance in C++
Introduction
When working with strings in programming, I’ve often had to compare them to other strings in order to find the same string. This is pretty simple using things like = or strcmp, but if the two strings are not 100% the same, but maybe have one letter different, a problem arise.
When working on my Bachelor thesis, I had a camera that scanned text strings and using OCR found the letters that matched the best. The OCR matched decently, but often one or two letters were not recognized correctly.
To combat this, I looked into the Levenshtein Distance which basically gives each word a score based on how much alike they are.
Example
The best way of explaining how the algorithm works is by looking at an example. The table below shows the Levenshtein Distance matrix for the words `Hello` and `Holle`.
H | e | l | l | o | ||
---|---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | 5 | |
H | 1 | 0 | 1 | 2 | 3 | 4 |
o | 2 | 1 | 1 | 2 | 3 | 3 |
l | 3 | 2 | 2 | 1 | 2 | 3 |
l | 4 | 3 | 3 | 2 | 1 | 2 |
e | 5 | 4 | 3 | 3 | 2 | 2 |
The first row contains the string; `Hello`, in the first column contains the second string; `Holle`. The second row and the second column gives the distance for each character to an empty string. This means that if the first string is `Hello` and the second string is nothing, then the distance from nothing to nothing is `0`, `H` to nothing is `1`, `He` to nothing is `2`, `Hel` to nothing is `3` and so on. Every insertion for a new character cost 1. The same is true for the column with `Holle`.
Implementation
The code for the first step of the Levenshtein Distance matrix looks something like:
|
|
When the first row and column has been filled out, another part of the algorithm has to be used. This time there are three ways of finding the shortest distance to the new character.
- Inserting a value
- Deleting a value
- Replacing a value
Both inserting and deleting a value has a cost of 1 whereas replacing a value can either have a cost of 1 or 2 or even more if that is needed. For this the replacement cost will just be 1. The code for this is:
|
|
Looking at the first table the result of Line 1 from the previous code will be `1 + 1 = 2`, Line 2 will be `1 + 1 = 2` and Line 3 will be `0 + 0 = 0`. This gives us three results; `0`, `2`, `2`, taking the smallest of the three is the distance between the fist two letters in the two words.
Taking the next letter `e` from word `Hello` and compute the distance from `H` in `Holle` (see the green highlight in the table below) gives the three results; `2 + 1 = 3` (yellow and green), `0 + 1 = 1` (blue and green) and `1 + 1 = 2` (pink and green). The smallest value is then 1, which means that from `He` to `H` there is a distance of 1.
H | e | l | l | o | ||
---|---|---|---|---|---|---|
0 | 1 |
2 |
3 | 4 | 5 | |
H | 1 | 0 |
1 |
2 | 3 | 4 |
o | 2 | 1 | 1 | 2 | 3 | 3 |
l | 3 | 2 | 2 | 1 | 2 | 3 |
l | 4 | 3 | 3 | 2 | 1 | 2 |
e | 5 | 4 | 3 | 3 | 2 | 2 |
This can be computed for all the characters in the two strings using the code:
|
|
- Line 1: Run through the first string where n is the string length.
- Line 2: Run through the second string where m is the string length.
- Line 3: Calculate the cost. If the two characters are identical the cost is 0 else it is 1. t is the first string and s is the second string.
- Line 4: Calculate the cost to be placed in the matrix d. Find the minimum between insertion and deletion then check the result against the replacement and find the smallest value. The result is stored in d.
Final
The final code for the function can be seen in the code below:
|
|
We can now use the function `distanceString(“Hello”,“Holle”);` which will return 2.