Matching Strings that don’t exactly match
Today I was helping a fellow in our office tackle what appeared on the surface to be a really simple problem. He needs to confirm that records in one system match those in another. However, some of this data is maintained manually or at different times. Therefore, they might be the same thing, but the strings don’t match perfectly.
At the time were were trying to think of various comparison and sub-string functions. Like usual though, after walking away for a few minutes, I remembers some common functions like soundex. Soundex is Phonetic Algorithm
SQL Difference() and Soundex() Function
One implementation using soundex is built into Transact-SQL of MS SQL and perhaps Sybase. Difference() returns a ranking of 0 to 4 with 4 being highest match. Have to try this when I’m back in the office.
Tame the Beast
In his Tame the Beast article Simon White shows us a number of techniques for finding near matches to strings in an effort to improve usability in GUI applications. Really interesting. Simon discusses several techniques for almost matching including:
- Equivalence Methods
- Synonyms and Regular Expressions
- The Soundex Algorithm
- Similarity Ranking Methods
- Editing and Hamming Distances
You never know what sort of gold you’re going to find while looking for something else 🙂
This definitely ranks up there with terms like orthogonal and non-deterministic as conversation enhancers. Really makes the boring topic of trying to see if two things are the same sound exciting.
I think tomorrow back in the office, we’ll give the difference() and soundex() calls a try and see if that doesn’t reduce the number of manual comparisons we have to do.