ColdFusion Levenshtein Distance: String comparison and highlighting
This is a fun project I put out there a while back. I recently went through and optimized the performance a bit so I could officially blog it. It is an implementation of the Levenshtein Distance Algorithm in CFScript that I based off of a C# version written by Siderite Zackwehdex. Finding the "distance" between two strings is a means of comparing two strings to see how similar they both are. This can be done by finding the Longest Common String or LCS. It is as much a brain bender as it can be occasionally useful.
The basic gist of the concept is this: Iterate over two strings making a note of how many characters were inserted, deleted, or transposed from one string to the other. When a difference is found, "bookmark" where you are and start looking ahead in each string to see if the strings are going to start matching up again down the line. How far down the line you look is controlled by in an input called maxOffset. The LCS is the number of characters between the two strings which were identical. The "distance" of the two strings is simply the average string length minus the longest common string. The similarity of the two strings can be expressed as a percentage given 1 minus the strings' distance divided the length of the longest string. Enough theory-- let's look at an example:
I love to go ride my go-kart. (29 chars)
I love to ride my go-cart outside. (34 chars)
- There is a deletion of the word "go "
- There is a substitution of "c" for "k" in go-cart
- There is an addition of the word "outside"
Overall, the average length of the strings is 31.5 characters.
There are 25 characters in the two strings that are identical. This is our LCS.
String 1 is 4 characters different than string 2, and string 2 has 9 characters different than string 1.
So on average, we will say 6.5 characters would need to be changed to make the strings identical. This is our distance. (4 + 9 = 13 / 2 = 6.5)
Divide that distance by the length of our longest string and we can come to the conclusion that the strings are an 81% match. (1 - (6.5 / 34) = .8088 = 81%)
Here's another example:
|
The rain in Spain stays mainly on the plains The rain in Spain stays mainly on the plains The rain in Spain stays mainly on the plains Lorum Ipsum, yadda yadda. Lorum Ipsum, yadda yadda. La La La La Luke, I am your father. |
The rain in Madrid stays totally on the plains The rain in Spain stays mainly on the plains The rain in Barcelona stays entirely in the air Lorum Ipsum, Yabba dabba doo. Whatcha eatin? Nutin' Honey. Da Da Da Duke, I am your father. |
- Roughly 67 characters are different between the two strings.
- The Longest Common String (LCS) is 180.
- The strings are a 73% match.
In addition to porting the logic over to CFScript, there are a few things I added in to my function:
- If both strings are empty, I short circuit and return a 0 for distance and LCS. Similarity is 100%.
- If either string is empty, I short circuit and return the length of the non-empty string as the distance. The LCS and similarity is 0.
- To better detect differences at the start of the string I check the first three characters. That way a matching first letter wouldn't be confused in two strings like "top hat" and "the hat"
- When looking ahead in the strings trying to reconcile a difference I search for the distance of the maxOffset until I find THREE contiguous matching characters. This is to try and eliminate false positives.
- My function will highlight the differences between the strings by wrapping the deviations in the HTML tag of your choice. Default is <span style="background: yellow;"></span>
The function is fairly effective for finding basic insertions, deletions, and transpositions between to strings ranging form a few words, to a few paragraphs. Since the algorithm iterates through two strings without looking back, it WON'T find sentences that had their order rearranged. The maxOffset controls how hard the code tries to look and ahead and reconcile an insert or deletion. If you expect entire sentences to be inserted, your maxOffset needs to be at least as big as the largest insertion. Of course, the larger offset you allow for, the more performance will be impacted and the more likely you are to get a false positive when looking ahead.
Have fun with it. If you can think of a way to improve the code I would love to hear about it. I have included the code for the function, an example way to call it, and a zipped version of both below.
2
3 /*
4
5 StringSimilarity
6 Brad Wood
7 brad@bradwood.com
8 May 2007
9 Code adopted from Siderite Zackwehdex's Blog
10 http://siderite.blogspot.com/2007/04/super-fast-and-accurate-string-distance.html
11
12 Parameters:
13 s1: First string to be compared
14 s2: Second string to be compared
15 maxOffset: Average number of characters that s1 will deviate from s2 at any given point.
16 This is used to control how far ahead the function looks to try and find the
17 end of a peice of inserted text. Play with it to suit.
18
19 */
20
21 function stringSimilarity(s1,s2,maxOffset)
22 {
23 var c = 0;
24 var offset1 = 0;
25 var offset2 = 0;
26 var lcs = 0;
27 // These two strings will contain the "highlighted" version
28 var _s1 = createObject("java","java.lang.StringBuffer").init(javacast("int",len(s1)*3));
29 var _s2 = createObject("java","java.lang.StringBuffer").init(javacast("int",len(s2)*3));
30 // These chaactes will surround differences in the strings
31 // (Inserted into _s1 and _s2)
32 var h1 = "<span style=""background: yellow;"">";
33 var h2 = "</span>";
34 // If both strings are empty
35 if (not len(trim(s1)) and not len(trim(s2)))
36 {
37 return_struct.lcs = 0;
38 return_struct.similarity = 1;
39 return_struct.distance = 0;
40 return_struct.s1 = "";
41 return_struct.s2 = "";
42 return return_struct;
43 }
44 // If s2 is empty, but s1 isn't
45 if (len(trim(s1)) and not len(trim(s2)))
46 {
47 return_struct.lcs = 0;
48 return_struct.similarity = 0;
49 return_struct.distance = len(s1);
50 return_struct.s1 = h1 & s1 & h2;
51 return_struct.s2 = "";
52 return return_struct;
53 }
54 // If s1 is empty, but s2 isn't
55 else if (len(trim(s2)) and not len(trim(s1)))
56 {
57 return_struct.lcs = 0;
58 return_struct.similarity = 0;
59 return_struct.distance = len(s2);
60 return_struct.s1 = "";
61 return_struct.s2 = h1 & s2 & h2;
62 return return_struct;
63 }
64
65 // Examine the strings, one character at a time, anding at the shortest string
66 // The offset adjusts for extra characters in either string.
67 while ((c + offset1 lt len(s1))
68 and (c + offset2 lt len(s2)))
69 {
70 // Pull the next charactes out of s1 anbd s2
71 next_s1 = mid(s1,c + offset1+1,iif(not c,3,1)); // First time through check the first three
72 next_s2 = mid(s2,c + offset2+1,iif(not c,3,1)); // First time through check the first three
73 // If they are equal
74 if (compare(next_s1,next_s2) eq 0)
75 {
76 // Our longeset Common String just got one bigger
77 lcs = lcs + 1;
78 // Append the characters onto the "highlighted" version
79 _s1.append(left(next_s1,1));
80 _s2.append(left(next_s2,1));
81 }
82 // The next two charactes did not match
83 // Now we will go into a sub-loop while we attempt to
84 // find our place again. We will only search as long as
85 // our maxOffset allows us to.
86 else
87 {
88 // Don't reset the offsets, just back them up so you
89 // have a point of reference
90 old_offset1 = offset1;
91 old_offset2 = offset2;
92 _s1_deviation = "";
93 _s2_deviation = "";
94 // Loop for as long as allowed by our offset
95 // to see if we can match up again
96 for (i = 0; i lt maxOffset; i=i+1)
97 {
98 next_s1 = mid(s1,c + offset1 + i+1,3); // Increments each time through.
99 len_next_s1 = len(next_s1);
100 bookmarked_s1 = mid(s1,c + offset1+1,3); // stays the same
101 next_s2 = mid(s2,c + offset2 + i+1,3); // Increments each time through.
102 len_next_s2 = len(next_s2);
103 bookmarked_s2 = mid(s2,c + offset2+1,3); // stays the same
104
105 // If we reached the end of both of the strings
106 if(not len_next_s1 and not len_next_s2)
107 {
108 // Quit
109 break;
110 }
111 // These variables keep track of how far we have deviated in the
112 // string while trying to find our match again.
113 _s1_deviation = _s1_deviation & left(next_s1,1);
114 _s2_deviation = _s2_deviation & left(next_s2,1);
115 // It looks like s1 has a match down the line which fits
116 // where we left off in s2
117 if (compare(next_s1,bookmarked_s2) eq 0)
118 {
119 // s1 is now offset THIS far from s2
120 offset1 = offset1+i;
121 // Our longeset Common String just got bigger
122 lcs = lcs + 1;
123 // Now that we match again, break to the main loop
124 break;
125 }
126
127 // It looks like s2 has a match down the line which fits
128 // where we left off in s1
129 if (compare(next_s2,bookmarked_s1) eq 0)
130 {
131 // s2 is now offset THIS far from s1
132 offset2 = offset2+i;
133 // Our longeset Common String just got bigger
134 lcs = lcs + 1;
135 // Now that we match again, break to the main loop
136 break;
137 }
138 }
139 //This is the number of inserted characters were found
140 added_offset1 = offset1 - old_offset1;
141 added_offset2 = offset2 - old_offset2;
142
143 // We reached our maxoffset and couldn't match up the strings
144 if(added_offset1 eq 0 and added_offset2 eq 0)
145 {
146 _s1.append(h1 & left(_s1_deviation,added_offset1+1) & h2);
147 _s2.append(h1 & left(_s2_deviation,added_offset2+1) & h2);
148 }
149 // s2 had extra characters
150 else if(added_offset1 eq 0 and added_offset2 gt 0)
151 {
152 _s1.append(left(_s1_deviation,1));
153 _s2.append(h1 & left(_s2_deviation,added_offset2) & h2 & right(_s2_deviation,1));
154 }
155 // s1 had extra characters
156 else if(added_offset1 gt 0 and added_offset2 eq 0)
157 {
158 _s1.append(h1 & left(_s1_deviation,added_offset1) & h2 & right(_s1_deviation,1));
159 _s2.append(left(_s2_deviation,1));
160 }
161 }
162 c=c+1;
163 }
164 // Anything left at the end of s1 is extra
165 if(c + offset1 lt len(s1))
166 {
167 _s1.append(h1 & right(s1,len(s1)-(c + offset1)) & h2);
168 }
169 // Anything left at the end of s2 is extra
170 if(c + offset2 lt len(s2))
171 {
172 _s2.append(h1 & right(s2,len(s2)-(c + offset2)) & h2);
173 }
174
175 // Distance is the average string length minus the longest common string
176 distance = (len(s1) + len(s2))/2 - lcs;
177 // Whcih string was longest?
178 maxLen = iif(len(s1) gt len(s2),de(len(s1)),de(len(s2)));
179 // Similarity is the distance divided by the max length
180 similarity = iif(maxLen eq 0,1,1-(distance/maxLen));
181 // Return what we found.
182 return_struct.lcs = lcs;
183 return_struct.similarity = similarity;
184 return_struct.distance = distance;
185 return_struct.s1 = _s1.toString(); // "highlighted" version
186 return_struct.s2 = _s2.toString(); // "highlighted" version
187 return return_struct;
188 }
189
190
191 </cfscript>
2 The rain in Spain stays mainly on the plains
3 The rain in Spain stays mainly on the plains
4 Lorum Ipsum, yadda yadda.
5 Lorum Ipsum, yadda yadda.
6 La La La La Luke, I am your father.">
7
8 <cfset string2 = "The rain in Madrid stays totally on the plains
9 The rain in Spain stays mainly on the plains
10 The rain in Barcelona stays entirely in the air
11 Lorum Ipsum, Yabba dabba doo.
12 Whatcha eatin? Nutin' Honey.
13 Da Da Da Duke, I am your father.">
14
15 <cfset comparison_result = stringSimilarity(string1,string2,10)>
16
17 <cfoutput>
18 Roughly #comparison_result.distance# characters are different between the two strings.<br>
19 The strings are a #numberformat(comparison_result.similarity*100)#% match.<br>
20 The Longest Common String is #comparison_result.lcs#.<br>
21 <br>
22 <table border="1" cellpadding="10" cellspacing="0">
23 <tr>
24 <td>
25 #replacenocase(comparison_result.s1,chr(10),"<br>","all")#
26 </td>
27 <td>
28 #replacenocase(comparison_result.s2,chr(10),"<br>","all")#
29 </td>
30 <tr>
31 </table>
32 </cfoutput>

I was about to add a feature to a project that needs to tell the user that two pieces of text aren't different enough (for SEO purposes) and this looks like it'll work a treat.
Many thanks.
I need to correct an error in some software that has led to a database being out of touch with another reference database - don't ask about the bad DB design.
If this works, I'll owe you a cupcake!
Seriously though, if you need to compare two databases, Redgate software makes some very kick butt tools for that. They will show you line by line comparisons of stored procs and stuff, but when it comes to the differences in data I think it just tells you they aren't the same.