Data Deduplication & Matching Services

The Data Cleansing Group is an Australian provider of data deduplication & matching services. We are passionate geeks who do whatever it takes to solve your data issues.

Our data deduplication services include identifying potential duplicates, reviewing suspect data and merging similar records into a target record across CRM, financial, HR and production datasets. This process improves the quality of your data.

We also provide a matching service for identifying and linking the same asset, customer, patient, part, product etc. across multiple data sources. This process enriches your data, as the previously inaccessible gems hidden in siloed data can now be analysed collectively.

The Data Deduplication and Data Matching Cycle

Cost-Effective Data Deduplication & Matching

A typical data deduplication service includes the following phases:

  1. Prepare Data

    Meticulous data preparation is paramount to a successful data matching or deduplication solution. One needs to ensure that the same type of data exists in each column. For example, in the case of CRM data, this step ensures that your contact’s first name, last name, email address, mobile phone number etc. are all in their respective columns. This is called standardisation.

    Similar data is normalised (e.g. mister, Mr. and mr are all converted to Mr; street, st. and strt. are all converted to St). Numbers are converted to a consistent format (e.g. telephone numbers are all converted to a (xx) xxxx-xxxx format). Email and web addresses are also normalised as required.

    Whilst our normalisation data tables are quite extensive, we also mine & analyse your project data to ensure that our system covers your data’s quirks and idiosyncrasies. Applicable outliers are added to our normalisation data tables in order to improve the accuracy of the matching process. Wherever possible, missing information (e.g. post codes, states, country, phone area codes, gender etc.) is recreated in this phase.
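
    To make phase 1 concrete, here is a minimal Python sketch of this kind of standardisation and normalisation. The lookup tables, function names and the Australian (xx) xxxx-xxxx phone format shown are illustrative assumptions only, not our production rules, which are far more extensive.

    ```python
    import re

    # Tiny illustrative lookup tables; real normalisation tables are far larger.
    TITLE_MAP = {"mister": "Mr", "mr.": "Mr", "mr": "Mr"}
    STREET_MAP = {"street": "St", "st.": "St", "strt.": "St"}

    def normalise_token(token: str, lookup: dict) -> str:
        """Map a raw token to its canonical form; unknown tokens pass through unchanged."""
        cleaned = token.strip()
        return lookup.get(cleaned.lower(), cleaned)

    def normalise_phone(raw: str) -> str:
        """Convert a 10-digit Australian number to the (xx) xxxx-xxxx format."""
        digits = re.sub(r"\D", "", raw)
        if len(digits) == 10:
            return f"({digits[:2]}) {digits[2:6]}-{digits[6:]}"
        return raw  # anything unexpected is left for manual review

    print(normalise_token("mister", TITLE_MAP))  # Mr
    print(normalise_token("strt.", STREET_MAP))  # St
    print(normalise_phone("02 9999 8888"))       # (02) 9999-8888
    ```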

  2. Optimise Match Strings

    Duplicate rows are identified by matching one string of data against another. A string of data is created by joining the fields of each record into one ‘big word’ (e.g. FirstName-LastName-StreetAddress-Suburb-PostCode-Country).

    Whilst a long match string gives better results than a short one, the key is to ensure that each of the data fields included in the match string is unique. Whilst this is conceptually simple, real data is often unkind and simple issues like missing data add complications. These are overcome by generating multiple match strings to suit the available data, for example:

    • FirstName-LastName-MobilePhoneNumber
    • FirstName-LastName-AreaCode-PhoneNumber
    • FirstNameInitial-LastName-TaxFileNumber

    In order to achieve high match accuracy, the proposed match strings are tested on a significant sample of your data and the results manually reviewed. During data deduping, the match strings are tweaked and the process reiterated until the match strings are fully optimised.
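
    As a sketch of phase 2, the function below builds whichever of these match strings the available data supports. The field names (first_name, mobile, tfn etc.) are illustrative assumptions about the prepared dataset, not a fixed schema.

    ```python
    def build_match_strings(record: dict) -> list[str]:
        """Build one or more match strings from whichever fields are populated."""
        def get(key: str) -> str:
            return (record.get(key) or "").strip().lower()

        strings = []
        if get("first_name") and get("last_name") and get("mobile"):
            strings.append(f"{get('first_name')}-{get('last_name')}-{get('mobile')}")
        if get("first_name") and get("last_name") and get("area_code") and get("phone"):
            strings.append(f"{get('first_name')}-{get('last_name')}-{get('area_code')}-{get('phone')}")
        if get("first_name") and get("last_name") and get("tfn"):
            strings.append(f"{get('first_name')[0]}-{get('last_name')}-{get('tfn')}")
        return strings

    # Example: a record with no mobile number still yields a usable match string.
    print(build_match_strings({"first_name": "Jane", "last_name": "Doe",
                               "area_code": "02", "phone": "9999 8888"}))
    # ['jane-doe-02-9999 8888']
    ```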

  3. Identify Potential Duplicates

    We use various computerised matching algorithms to identify potential duplicate records. These include Jaro, Jaro-Winkler, Levenshtein, Metaphone and Soundex, as well as customised variants of the above. The algorithms compare the optimised match strings for every row of data and calculate a match score. Items with results above a certain threshold (e.g. 90%) are flagged as potential duplicates.

    As each dataset is different, we run all of our matching algorithms on a significant sample of the data and choose the one that provides the best results. The chosen algorithm is then run across the whole dataset.
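
    The sketch below shows the scoring idea behind phase 3, using a plain Levenshtein-based similarity as a stand-in for the algorithms listed above. It compares every pair of match strings, which is fine for a small illustration; at scale, candidate pairs would be narrowed down before scoring.

    ```python
    from itertools import combinations

    def levenshtein(a: str, b: str) -> int:
        """Classic dynamic-programming edit distance between two strings."""
        if len(a) < len(b):
            a, b = b, a
        previous = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            current = [i]
            for j, cb in enumerate(b, 1):
                current.append(min(previous[j] + 1,                # deletion
                                   current[j - 1] + 1,             # insertion
                                   previous[j - 1] + (ca != cb)))  # substitution
            previous = current
        return previous[-1]

    def similarity(a: str, b: str) -> float:
        """Rescale edit distance to a 0-1 match score."""
        if not a and not b:
            return 1.0
        return 1 - levenshtein(a, b) / max(len(a), len(b))

    def potential_duplicates(match_strings: dict, threshold: float = 0.90) -> list:
        """Flag every pair of record ids whose match strings score at or above the threshold."""
        flagged = []
        for (id_a, s_a), (id_b, s_b) in combinations(match_strings.items(), 2):
            score = similarity(s_a, s_b)
            if score >= threshold:
                flagged.append((id_a, id_b, round(score, 3)))
        return flagged

    print(potential_duplicates({1: "jane-doe-0299998888",
                                2: "jane-doe-0299998888",
                                3: "john-smith-0733334444"}))
    # [(1, 2, 1.0)]
    ```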

  4. Automatic / Manual Confirmation

    For mission-critical projects, we manually review the potential duplicates and select the most applicable one. In some cases (e.g. CRM data, where you understand your customers better than we do), we provide you with a spreadsheet highlighting the potential duplicates for your review and confirmation.

    An alternative and very cost-effective approach is to automate the confirmation process by setting a hurdle rate (e.g. match accuracy of > 98%) and then automatically accepting the results above this benchmark.

    In cases where you are looking to identify “How many from List X exist in List Y” (e.g. marketing penetration analysis), we combine manual confirmations with statistical sampling techniques to provide cost-effective results with high confidence levels.
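
    A minimal sketch of the automated confirmation described in phase 4, assuming the scored pairs produced in phase 3: results at or above the hurdle rate are accepted automatically, while those between the flagging threshold and the hurdle go to a manual-review queue. The thresholds shown are the example figures above, not fixed values.

    ```python
    def triage(scored_pairs: list, hurdle: float = 0.98, review_floor: float = 0.90):
        """Split (id_a, id_b, score) tuples into auto-confirmed duplicates and a manual-review queue."""
        auto_confirmed = [p for p in scored_pairs if p[2] >= hurdle]
        needs_review = [p for p in scored_pairs if review_floor <= p[2] < hurdle]
        return auto_confirmed, needs_review

    auto, review = triage([(1, 2, 1.0), (1, 4, 0.93)])
    print(auto)    # [(1, 2, 1.0)]
    print(review)  # [(1, 4, 0.93)]
    ```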

5. Merge Similar Records into the Target Record

    In this phase, similar records can be merged into the target record, and/or duplicates flagged or deleted from the datasets. The deduplicated data is then exported in your required format.
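
    To round out phase 5, the sketch below shows one simple survivorship rule for merging a confirmed duplicate into its target record: the target’s populated fields win, and its gaps are filled from the duplicate. The field names are again illustrative, and the actual merge rules are agreed per project.

    ```python
    def merge_into_target(target: dict, duplicate: dict) -> dict:
        """Fill gaps in the target record from the duplicate without overwriting populated target fields."""
        merged = dict(target)
        for field, value in duplicate.items():
            if value and not merged.get(field):
                merged[field] = value
        return merged

    target = {"id": 1, "first_name": "Jane", "last_name": "Doe", "mobile": ""}
    duplicate = {"id": 2, "first_name": "Jane", "last_name": "Doe", "mobile": "0400 111 222"}
    print(merge_into_target(target, duplicate))
    # {'id': 1, 'first_name': 'Jane', 'last_name': 'Doe', 'mobile': '0400 111 222'}
    ```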

Typical Data Deduplication Projects

Our purpose-built tools enable us to provide you with cost-effective, high-accuracy data deduplication and matching services. This service is targeted at organisations that are looking to:

  • Deduplicate their data.
  • Conduct marketing penetration analysis.
  • Identify and match assets, customers, patients, parts and products across multiple platforms and data sources.

Our typical data deduplication and matching solutions include:

  • Customer data.
  • Financial and customer spend data.
  • Patient data.
  • Asset tracking.
  • Product & inventory data.
  • Master data.

Improve your organisation’s insight by matching data across your various platforms and removing the cost overhead associated with duplicate records.