Posts Tagged ‘software’

Matching Strings that don’t exactly match

September 16, 2009 Leave a comment

Today I was helping a fellow in our office tackle what appeared on the surface to be a really simple problem. He needs to confirm that records in one system match those in another. However, some of this data is maintained manually or at different times. Therefore, they might be the same thing, but the strings don’t match perfectly.

At the time were were trying to think of various comparison and sub-string functions.  Like usual though, after walking away for a few minutes, I remembers some common functions like soundex.  Soundex is Phonetic Algorithm

SQL Difference() and Soundex() Function

One implementation using soundex is built into Transact-SQL of MS SQL and perhaps Sybase.  Difference() returns a ranking of 0 to 4 with 4 being highest match.  Have to try this when I’m back in the office.

Tame the Beast

In his Tame the Beast article Simon White shows us a number of techniques for finding near matches to strings in an effort to improve usability in GUI applications.  Really interesting. Simon discusses several techniques for almost matching including:

  • Equivalence Methods
  • Synonyms and Regular Expressions
  • The Soundex Algorithm
  • Similarity Ranking Methods
  • Editing and Hamming Distances

You never know what sort of gold you’re going to find while looking for something else 🙂

Probabilistic Matching

This definitely ranks up there with terms like orthogonal and non-deterministic as conversation enhancers.  Really makes the boring topic of trying to see if two things are the same sound exciting.

In his blog post on Probabilistic Matching Steve Sarsfield talks about whether this highly marketable feature in many data quality tools is really a good idea to use in production.


I think tomorrow back in the office, we’ll give the difference() and soundex() calls a try and see if that doesn’t reduce the number of manual comparisons we have to do.


Mandated vs Compelling Shared Services

July 10, 2009 Leave a comment

If you’ve worked in a large enterprise you’ve surely come across the situation where there are more than one implementation of what seems to be the same thing. This sort of duplication really upsets our sensibilities. Our first reaction is to say how wasteful that is and wonder how to get everyone to reuse the same service.

I was reminded of this issue in a recent conversation with a Big n consulting firm. The fellow I was speaking with referred to the strong governance process in my organization as the reason why we should see this sort of redundancy eliminated. To this I replied with a my rant about how strong governance a euphemism for mandating that solution architects use the standard or enterprise technologies in their designs was not the right way to achieve the best outcome for the business. I point out that central planning has proven in a grand to be a failure. Why if this is so for an economy is it not so large enterprises?  Read more…