Tagged with fuzzy matching

Fuzzy string matching and geocoding

One of the biggest problems encountered when trying to geocode a table is how to handle variations in spelling and naming between points. Databases are often messy, filled with spelling and typing mistakes – and this makes it impossible to simply match one table against another. That’s when we need to resort to the concept of “fuzzy string matching” – the “technique of finding strings that match a pattern approximately (rather than exactly)” 1.

The traditional method for fuzzy string matching during geocoding is the Levenshtein distance. The Levenshtein distance is perfect for automatically correcting spelling mistakes and small variations in spelling (eg  “Frankston-Flinders Rd” as opposed to “Frankston Flinders Rd“). When geocoding, you can automatically match any results with a distance of just 1 or 2 characters, or which is within a set percentage of the length of a string. For instance, the Levenshtein distance between “Swanton St” and “Swanston St” is one character (10% of the length of the first string), and the distance between “Box Hil Railway-Station”  and “Box Hill Railway Station” is two characters (~8.5% of the length of the first string). I’ve found that automatically matching distances between 1 character and 10% of the string length covers most simple typing errors with almost no false matches.

Unfortunately, the Levenshtein distance isn’t perfect. While it works well when the strings only contain small differences, it fails when trying to match two very different strings. This becomes critical when we’re trying to match the names of features. There’s often a huge variety in possible names for features. A train station could be entered as any number of different names, ranging from  “Box Hill Railway Station“, “Box Hill Train Station“, “Box Hill Station“, all the way to “Railway Station, Box Hill” or just “Railway Station“.

Let’s try a possible example. Trying to match the string “coles” to the name “coles supermarket” will have the high Levenshtein distance of 12 characters. By comparison, the Levenshtein distance between “coles” and “fat apes” is only six characters. If we are ranking matches by Levenshtein distance alone we’ll end up choosing the match “fat apes” over the obviously more relevant match of “coles supermarket“. Not terribly effective at all…

A better technique to use for matching these feature names is the the algorithm called “longest common substring2. This algorithm returns the longest string which is common to both input strings. Taking our previous examples, when matching “coles” to “coles supermarket” the longest common substring (LCS) is “coles“. Comparing “box hill railway station” to “railway station, box hill“, the LCS is “railway station“.

I’ve found that using the longest common substring for geocoding works best interactively. First, you take the string you’re trying to match, and sort a list of possible candidates by the length of the LCS, with longest matches first. This sorted list is then presented to the user to manually choose the most appropriate match.

Why not just do this automatically, without bothering the user? While most times the best match will be the first one in the list, there’s a few times when it may be second or third. Let’s say we’re trying to match the string “coles supermarket, westfields shopping centre”, and our feature database contains a specific point for “coles supermarket” and a general point for “westfields shopping centre”. If we were to automatically match to the point with the largest LCS, we’ll end up matching to the general point “westfields shopping centre” (LCS length of 26), rather then the slightly more accurate point for “coles supermarket” (LCS length of 17). Unfortunately, I can’t think of an automated way of prioritising the specific point over the general point in this case, so it’s safest to leave that choice to the user. (Let me know if you’ve got any clever ideas for this!)

This method can be taken one step further by padding both strings with a space. This has the effect of increasing the length of the LCS if both input strings match on a full word. For example, the LCS between “coles supermarket” and “coles” has length 5, while the longest common substring between “coles supermarket” and “dandenong markets” has length 6. If we just choose to use the standard LCS method on these strings we’ll incorrectly prioritise “dandenong markets” over “coles“. However, if we append spaces to the start and end of the strings, then we get:

" coles supermarket " and " coles ", LCS = " coles ", length 7
" coles supermarket " and " dandenong markets ", LCS = "market", length 6

Effectively, this process bumps up the length of an LCS by two characters if both input strings match with word breaks, rather than a portion of a word. It’s also possible to replace every space in the input strings by multiple spaces, if you’d like to further prioritise whole word matches.

In summary – Levenshtein distance is great for small variations in names, but for large possible variations in naming I’ve found LCS to be much more effective.

Notes:

  1. http://en.wikipedia.org/wiki/Approximate_string_matching
  2. Watch out, there’s a similarly named algorithm called “longest common sub sequence”. You don’t want this one at all.
Tagged , , , ,