geocoding | nyalldawson.net

Fuzzy string matching and geocoding

One of the biggest problems encountered when trying to geocode a table is how to handle variations in spelling and naming between points. Databases are often messy, filled with spelling and typing mistakes – and this makes it impossible to simply match one table against another. That’s when we need to resort to the concept of “fuzzy string matching” – the “technique of finding strings that match a pattern approximately (rather than exactly)”[ref]http://en.wikipedia.org/wiki/Approximate_string_matching[/ref].

The traditional method for fuzzy string matching during geocoding is the Levenshtein distance. The Levenshtein distance is perfect for automatically correcting spelling mistakes and small variations in spelling (eg “Frankston-Flinders Rd” as opposed to “Frankston Flinders Rd“). When geocoding, you can automatically match any results with a distance of just 1 or 2 characters, or which is within a set percentage of the length of a string. For instance, the Levenshtein distance between “Swanton St” and “Swanston St” is one character (10% of the length of the first string), and the distance between “Box Hil Railway-Station” and “Box Hill Railway Station” is two characters (~8.5% of the length of the first string). I’ve found that automatically matching distances between 1 character and 10% of the string length covers most simple typing errors with almost no false matches.

Unfortunately, the Levenshtein distance isn’t perfect. While it works well when the strings only contain small differences, it fails when trying to match two very different strings. This becomes critical when we’re trying to match the names of features. There’s often a huge variety in possible names for features. A train station could be entered as any number of different names, ranging from “Box Hill Railway Station“, “Box Hill Train Station“, “Box Hill Station“, all the way to “Railway Station, Box Hill” or just “Railway Station“.

Let’s try a possible example. Trying to match the string “coles” to the name “coles supermarket” will have the high Levenshtein distance of 12 characters. By comparison, the Levenshtein distance between “coles” and “fat apes” is only six characters. If we are ranking matches by Levenshtein distance alone we’ll end up choosing the match “fat apes” over the obviously more relevant match of “coles supermarket“. Not terribly effective at all…

A better technique to use for matching these feature names is the the algorithm called “longest common substring” [ref]Watch out, there’s a similarly named algorithm called “longest common sub sequence”. You don’t want this one at all.[/ref]. This algorithm returns the longest string which is common to both input strings. Taking our previous examples, when matching “coles” to “coles supermarket” the longest common substring (LCS) is “coles“. Comparing “box hill railway station” to “railway station, box hill“, the LCS is “railway station“.

I’ve found that using the longest common substring for geocoding works best interactively. First, you take the string you’re trying to match, and sort a list of possible candidates by the length of the LCS, with longest matches first. This sorted list is then presented to the user to manually choose the most appropriate match.

Why not just do this automatically, without bothering the user? While most times the best match will be the first one in the list, there’s a few times when it may be second or third. Let’s say we’re trying to match the string “coles supermarket, westfields shopping centre”, and our feature database contains a specific point for “coles supermarket” and a general point for “westfields shopping centre”. If we were to automatically match to the point with the largest LCS, we’ll end up matching to the general point “westfields shopping centre” (LCS length of 26), rather then the slightly more accurate point for “coles supermarket” (LCS length of 17). Unfortunately, I can’t think of an automated way of prioritising the specific point over the general point in this case, so it’s safest to leave that choice to the user. (Let me know if you’ve got any clever ideas for this!)

This method can be taken one step further by padding both strings with a space. This has the effect of increasing the length of the LCS if both input strings match on a full word. For example, the LCS between “coles supermarket” and “coles” has length 5, while the longest common substring between “coles supermarket” and “dandenong markets” has length 6. If we just choose to use the standard LCS method on these strings we’ll incorrectly prioritise “dandenong markets” over “coles“. However, if we append spaces to the start and end of the strings, then we get:

" coles supermarket " and " coles ", LCS = " coles ", length 7
" coles supermarket " and " dandenong markets ", LCS = "market", length 6

Effectively, this process bumps up the length of an LCS by two characters if both input strings match with word breaks, rather than a portion of a word. It’s also possible to replace every space in the input strings by multiple spaces, if you’d like to further prioritise whole word matches.

In summary – Levenshtein distance is great for small variations in names, but for large possible variations in naming I’ve found LCS to be much more effective.

Investigating MapInfo’s Geocode Routine

After doing a bit of work using MapInfo’s built-in Geocode routine I started getting curious about how the routine handles various special cases. After a bit of experimenting I thought I’d document what I found out.

Starting with the simplest case, a street with odd numbers (1-9) on the left, and even numbers (2-10) on the right. The red stars show where MapInfo geocodes house numbers 3 and 4:

So far so good – we can see that MapInfo has correctly determined that one side of our street corresponds to odd numbers and one to even, and it has correctly matched the address points to the corresponding side of the road. Let’s try a one sided road next:

MapInfo treats -9999 values in an address ranges table as “no address points”. This road segment is correctly handled by MapInfo, and house number 4 gets geocoded to the correct side of the road. Nothing unexpected so far, but now let’s split the street into two lanes, with odd numbers on one side and even on the other:

More or less what we want to see. I should point out that in all these tests I’ve left the default setting of insetting addresses 15% from the ends of the street, which explains why points 2/10 and 1/9 don’t fall right on the beginning and end of the road segments. I’m not sure why the points curve out and fall at varying distances to the road — but it’s close enough and I can live with that.

Just to see what happens, lets go back to one road segment with valid even numbers (2-10) on only one side, and try geocoding an odd number house (9):

MapInfo has geocoded the point right in the centre of the road segment. Not what we’d like, but at least the Find command returns a result of 11 (exact match, side of street undetermined) for this case and it’s possible to find and avoid these types of errors. What happens now if we mix odd and even numbers on the same road side?

Here we’ve set one side of the road to range from 1 to 10, and the other to contain no addresses (-9999). Unfortunately none of the house numbers, either odd or even, match correctly in this case. All numbers from 1-10 are placed at the centroid, with a result code of 11. I guess mixing odd and even numbers in a street segment like this should be avoided. If you have a street segment which does contain a range of odd and even numbers in reality, it looks like the only way to get this to work correctly is to duplicate the road segment, once with an odd number range and once with an even range. Hmmm… I wonder what happens if we have two matching street segments, one with this mix of odd and even numbers, and the second with just odd numbers?

Good – that’s encouraging to see. MapInfo can match the odd numbered houses to the segment with an odd address range of 1-9, even when we have a bad segment which covers the same number range. The even numbers don’t return a match, with a result code of -411 (multiple matches, side of street undetermined, exact match). Let’s take a step back and try a different type of conflict, where the address numbers on either side of the road overlap:

After a bit more testing (which I won’t go into here), it looks like when there’s an overlapping range like this and an address point could fall on either side of the road, MapInfo places it on the right hand side. For reference, overlapping ranges across two different road segments will geocode points which unambiguously fall on one road segment, and return a code of -401 (multiple matches, exact match) for the others:

(Address points 2 and 10 geocode, whereas 4, 6 and 8 don’t). One last thing to try – let’s see what happens when the address ranges table contains a point object. In this case we’ll add a point object with a left range of 1-9 and a right range of 2-10.

I wasn’t expecting this to work, but it appears to correctly geocode any numbers between 1 and 10 directly to the same location as the address ranges point. This is great news (for reasons I’ll possibly go into in a later blog post).

Quick Summary:

Don’t have an address range on the one side of a segment which consists of both odd and even numbers. If you do need to have a segment like this, you’ll need to duplicate the segment with one row having and odd range and the other having an even range.
Make sure to correctly handle any matches with a result code of side of street undetermined, since these may have just matched to the centre of the road
It’s OK to have point objects in an address ranges table

nyalldawson.net

Mapping, QGIS and other Open Source Ramblings

Tagged with geocoding …

Fuzzy string matching and geocoding

Investigating MapInfo’s Geocode Routine