Archive

Posts Tagged ‘data migration’

Matching Strings that don’t exactly match

September 16, 2009 Leave a comment

Today I was helping a fellow in our office tackle what appeared on the surface to be a really simple problem. He needs to confirm that records in one system match those in another. However, some of this data is maintained manually or at different times. Therefore, they might be the same thing, but the strings don’t match perfectly.

At the time were were trying to think of various comparison and sub-string functions.  Like usual though, after walking away for a few minutes, I remembers some common functions like soundex.  Soundex is Phonetic Algorithm

SQL Difference() and Soundex() Function

One implementation using soundex is built into Transact-SQL of MS SQL and perhaps Sybase.  Difference() returns a ranking of 0 to 4 with 4 being highest match.  Have to try this when I’m back in the office.

Tame the Beast

In his Tame the Beast article Simon White shows us a number of techniques for finding near matches to strings in an effort to improve usability in GUI applications.  Really interesting. Simon discusses several techniques for almost matching including:

  • Equivalence Methods
  • Synonyms and Regular Expressions
  • The Soundex Algorithm
  • Similarity Ranking Methods
  • Editing and Hamming Distances

You never know what sort of gold you’re going to find while looking for something else 🙂

Probabilistic Matching

This definitely ranks up there with terms like orthogonal and non-deterministic as conversation enhancers.  Really makes the boring topic of trying to see if two things are the same sound exciting.

In his blog post on Probabilistic Matching Steve Sarsfield talks about whether this highly marketable feature in many data quality tools is really a good idea to use in production.

Conclusion

I think tomorrow back in the office, we’ll give the difference() and soundex() calls a try and see if that doesn’t reduce the number of manual comparisons we have to do.

Advertisements

Data Migration

June 16, 2009 Leave a comment

Is it just me, or is Data Migration suddenly interesting?

I recently joined a big Australian Bank and noticed there are many data migration exercises to come as a result of a recent merger and major system replacement programs in the works.  I thought with so much data migration work in the pipeline, it would make sense to standardize the way data migrations were performed and perhaps even setup a competency centre or the like.

Some Research

I know a small consulting firm in Sydney named Lucsan Capital that specializes in helping financial markets firms with implementations of systems such as Murex and Calypso.  They have been engaged in many system implementation projects and found a need for a tool that combines Data Migration, Reconciliation and Process Management into a single easy to use platform.  They’ve built this tool called LMIG and use is as a practice aid on their implementation projects.

What’s the competition doing?

Though I have been pretty impressed with what I’ve seen of LMIG and the Lucsan people, before going too far, I thought I better do a bit of research to see what the competition has to offer.  I looked at Informatica and IBM Infosphere.  Both of these tools are leading ETL products and obvious candidates.  But there is definitely a dinstinction between the requirements of large mission critical ETL platforms – things that populate your data warehouse or act as information gateways and the needs of a project team working to quickly and safely migrate the data from one or more legacy systems to the target environment.

Informatica Data Migration Solutiona  contains information on the work informatica has been doing on both the tool and their Velociy Methodology to adapt to requirements of Data Migration.  There is also a fair bit of research by Bloor in this space which looks at the market opportunity and competing products.

Both these products appear to be world class.  Where the seem to fall down is their being almost too good.  They both seem to have many modules and options and many moving parts for their full deployments.  For example there are developer studios, process servers, schedulers, etc… all these things have to be idenfitied and costed in your final solution.  That may be suitable in the case of building a stable ETL environment.  But when working in Data Migration, you need a bit more agility and simplicity.  Something that gets the job done but doesn’t become the focus of your entire project.  Keep in mind, Data Migration is really just a necessary evil to achieve a strategic objective such as system consolidation or upgrade from a legacy to a shiny new system 🙂  You need to be sure your DM solution doesn’t divert your attention from the real objective.

How about OpenSource?

I generally love open source and Java for everything.   A few of my old colleagues at Macquarie Bank turned me on to Talend which is an OpenSource Data Integration platform.  It has a data profiling engine which sounds very interesting.  If I had a development team working for me, I’d probably be keen to go OpenSource.  But at the moment, I’m looking for something that is out-of-the-box and easy for Business Analysts to use.  I’ll look at Talend a little bit later.

Conclusion

I’d love to tell you the conclusion.  But I’m afraid the jury is still out.  I’m really looking forward to getting past the analysis stage and getting on with delivering some benefits to the business in terms of greatly reduced lead times to Data Analysis results and migration of data.  Let me know if you have any views on this topic.

Resources