Data deduplication is a commonly needed function in any data-driven industry. Customer mailing lists, vendor and supplier databases, employee databases, and voter registration databases are just a few examples where deduplication can improve data accuracy. A complete deduplication and standardization of your data can reduce duplicate payments, prevent multiple mailings from going out the door, and help aggregate corporate reports accurately despite multiple identifiers or multiple name spellings.
Our fuzzy-matching algorithms are designed to identify discrepancies in individual names, company names, addresses, and more. In addition to the Jaro-Winkler and Generalized Edit Distance (GED) functions, our analysts have developed a suite of user-defined name- and address-matching functions.
Listed below are some projects we have completed involving deduplication:
- The World Bank vendor file deduplication and standardization project.
- Middlesex County, NJ vendor file deduplication and entity resolution project.
- NFL vendor file deduplication project.
- Staff and vendor file deduplication for a large hospital in Southern California.
- Duplicate payment detection for many accounts payable audits.
Fuzzy matching is one of Automated Auditors’ core strengths. It describes the ability to join text phrases that look or sound alike but are not spelled the same. For example, “Elizabeth Banks” and “Banks, Liz E.” are close enough to the human eye and ear that they should be counted as similar. How is fuzzy matching performed, and why is it important?
Benefits of Fuzzy Matching
Fuzzy matching is the art and science of linking disparate words and phrases with one another. Its benefits are too numerous to list, but it is most commonly used to merge disparate data files that have no common key. For example, if you need to match student loan data with student demographic data but do not have the student’s Social Security number or student ID, you have to apply fuzzy matching to the student’s name, address, or other identifying pieces of information. If you match only on exact first and last name, your match rate will suffer because of name variations (e.g., Robert O’Malley, Bob O Malley, Bobby O Malley). By using fuzzy matching, you can drastically increase your match rate.
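As a rough illustration of the idea (not our production matching code), the sketch below uses Python's standard-library difflib to score hypothetical candidate names against a loan record after light normalization:

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Score two names on a 0-1 scale after light normalization."""
    def norm(s: str) -> str:
        # lowercase, drop punctuation, collapse whitespace
        return " ".join(s.lower().replace(",", " ").replace("'", "").replace(".", "").split())
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

loan_record  = "Robert O'Malley"
demo_records = ["Robert O Malley", "Bob O Malley", "Robert Smith"]

for candidate in demo_records:
    print(f"{candidate:<16} {name_similarity(loan_record, candidate):.2f}")
```

An exact first/last-name join would miss every variant above, while a similarity score lets you rank candidates. Note that nickname variants (Bob or Bobby for Robert) usually also need a nickname lookup table layered on top of pure string similarity.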
Another benefit of implementing fuzzy-matching is to cleanse duplicate data. Cleansing and standardizing messy data is essential for maintaining a vendor file, for example. If you have a vendor or supplier file with duplicate records, your Accounts Payable department is more susceptible to making duplicate payments. By applying fuzzy-matching on the vendor’s name, address, phone number, email, and other identifying pieces of data, duplicate vendors can be identified and consolidated into one master vendor record, thereby preventing erroneous payments.
For example, suppose your vendor file contains several different spellings of Hewlett Packard. Using fuzzy matching, we consolidate these names into one standardized name so that an accurate aggregation can be made for this corporation.
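The spellings below are hypothetical stand-ins for the kind of variants we see, and the grouping logic is only a minimal sketch using Python's standard-library difflib, not our consolidation algorithm; the 0.75 threshold is an illustrative assumption.

```python
from difflib import SequenceMatcher

# Hypothetical vendor-file spellings of the same corporation
vendors = ["Hewlett Packard", "Hewlett-Packard Co.",
           "HEWLETT PACKARD COMPANY", "Hewlet Packard"]

def similarity(a: str, b: str) -> float:
    """0-1 similarity after lowercasing and stripping punctuation."""
    def norm(s: str) -> str:
        return " ".join(s.lower().replace("-", " ").replace(".", "").split())
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

# Greedy grouping: assign each name to the first cluster it resembles closely enough
clusters = []  # each cluster is a list of spellings believed to be the same vendor
for name in vendors:
    for cluster in clusters:
        if similarity(name, cluster[0]) >= 0.75:  # hypothetical threshold
            cluster.append(name)
            break
    else:
        clusters.append([name])

for cluster in clusters:
    print("master record:", cluster[0], "| duplicates:", cluster[1:])
```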
How Do We Perform Fuzzy Matching?
There are several ways to perform fuzzy matching. Below, we describe the methods Automated Auditors commonly uses to tie disparate data together.
1) Levenshtein Distance – The Levenshtein distance measures the “distance” between two phrases by counting the number of insertions, deletions, and substitutions it takes to make one phrase look like the other. As an example, consider the name of the deceased Boston bomber Tamerlan Tsarnaev; the sketch below shows how a small edit distance flags spelling variants, so misspellings can be caught and disparate databases joined, resulting in increased data completeness and accuracy. It is believed that a misspelling of “Tsarnaev” in a security database may have contributed to him slipping through JFK Airport security undetected. Perhaps, if fuzzy matching had been applied to critical security databases, Tsarnaev could have been caught. Read about the Tsarnaev story here.
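A minimal Python implementation of the textbook dynamic-programming formulation (illustrative only; the misspelled variant in the example is hypothetical) looks like this:

```python
def levenshtein(a: str, b: str) -> int:
    """Count the minimum insertions, deletions, and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))               # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                               # distance from prefix of a to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution (or match)
        prev = curr
    return prev[-1]

print(levenshtein("Tsarnaev", "Tsarnayev"))  # 1 edit for this hypothetical misspelling
```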
2) TriGram Function – The Trigram function, developed by Automated Auditors, returns the number of trigrams that two phrases have in common. A trigram is a consecutive 3-letter substring of a phrase. For example, the word “AUDITOR” has 5 trigrams:
AUD – UDI – DIT – ITO – TOR
Each word or phrase has (N-2) trigrams, where N is the length of the word or phrase. We also compare the percentage of trigrams that two phrases have in common. The PctTriGram function is particularly useful for phrase matching and address matching; the sketch below gives a simple representation of how the function works:
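Our TriGram and PctTriGram functions are implemented in SAS; the Python sketch below only illustrates the idea, and the choice of denominator (the smaller phrase's trigram count) is an assumption made for the sketch.

```python
def trigrams(phrase: str) -> list:
    """Return the consecutive 3-letter substrings of a phrase."""
    return [phrase[i:i + 3] for i in range(len(phrase) - 2)]

def shared_trigram_count(a: str, b: str) -> int:
    """Number of distinct trigrams the two phrases have in common."""
    return len(set(trigrams(a)) & set(trigrams(b)))

def pct_trigram(a: str, b: str) -> float:
    """Share of the smaller phrase's trigrams that also appear in the other phrase."""
    ta, tb = set(trigrams(a)), set(trigrams(b))
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / min(len(ta), len(tb))

print(trigrams("AUDITOR"))                 # ['AUD', 'UDI', 'DIT', 'ITO', 'TOR']
print(pct_trigram("AUDITOR", "AUDITORS"))  # 1.0: all of AUDITOR's trigrams appear in AUDITORS
```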
3) Generalized Edit Distance – The generalized edit distance algorithm is a variation of the Levenshtein algorithm and is widely used for comparing phrase similarity. SAS software includes a function called COMPGED that we use to match phrases. COMPGED returns the generalized edit distance between two strings: the minimum-cost sequence of operations for constructing the first string from the second.
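COMPGED itself is a built-in SAS function with its own documented cost table; the Python sketch below only illustrates the general idea of per-operation costs, and the three cost values are illustrative assumptions.

```python
# A toy weighted edit distance: each operation carries its own cost, so the
# result is the cheapest sequence of operations turning a into b.  The real
# SAS COMPGED cost table is richer (separate costs for append, truncate,
# swap, etc.); these three costs are assumptions for the sketch.
COSTS = {"insert": 100, "delete": 100, "replace": 120}

def weighted_edit_distance(a: str, b: str) -> int:
    prev = [j * COSTS["insert"] for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * COSTS["delete"]]
        for j, cb in enumerate(b, start=1):
            replace = prev[j - 1] + (0 if ca == cb else COSTS["replace"])
            curr.append(min(prev[j] + COSTS["delete"],
                            curr[j - 1] + COSTS["insert"],
                            replace))
        prev = curr
    return prev[-1]

print(weighted_edit_distance("HEWLETT", "HEWLET"))  # 100: one deletion
```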
4) Jaro-Winkler Function – The Jaro-Winkler function is commonly utilized for matching words, but not necessarily phrases. The version of the Jaro-Winkler function we have is tailored for SAS software, and very accurately identifies similarities between two words. This function returns a value between 0 and 1, with 1 representing an exact match.
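For illustration, a plain-Python implementation of the standard Jaro-Winkler formula (not our SAS-tailored version) looks like this:

```python
def jaro_winkler(a: str, b: str, p: float = 0.1) -> float:
    """Jaro-Winkler similarity: 1.0 is an exact match, 0.0 is no similarity."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(len(a), len(b)) // 2 - 1
    a_match, b_match = [False] * len(a), [False] * len(b)
    matches = 0
    for i, ca in enumerate(a):                      # find matching characters within the window
        lo, hi = max(0, i - window), min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_match[j] and b[j] == ca:
                a_match[i] = b_match[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    transpositions, k = 0, 0                        # count out-of-order matched characters
    for i, ca in enumerate(a):
        if a_match[i]:
            while not b_match[k]:
                k += 1
            if ca != b[k]:
                transpositions += 1
            k += 1
    transpositions //= 2
    jaro = (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3
    prefix = 0                                      # Winkler bonus for a shared prefix (up to 4 chars)
    for ca, cb in zip(a[:4], b[:4]):
        if ca != cb:
            break
        prefix += 1
    return jaro + prefix * p * (1 - jaro)

print(round(jaro_winkler("MARTHA", "MARHTA"), 3))   # ~0.961, the classic textbook example
```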
It is important to note that all of the above functions are useful for comparing single words, but not necessarily phrases. Our team has developed a series of fuzzy-matching algorithms that compare complex phrases, text, and addresses.
5) Phrase Matching – For intricate phrase and address matching, we leverage all of these functions together to provide comprehensive and accurate phrase and address fuzzy matching. Most of these functions, individually, are great for comparing single words, but not phrases. Our analysts harness the power of these functions in a unique way to arrive at the best phrase-matching algorithms; a sketch of one possible rollup appears below.
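Our phrase-matching algorithms are proprietary, but the sketch below shows one simple way word-level scores can be rolled up into a phrase score; the whitespace tokenization, the use of difflib as the word scorer, and the averaging rule are all assumptions made for the sketch.

```python
from difflib import SequenceMatcher

def word_score(a: str, b: str) -> float:
    """Word-level similarity on a 0-1 scale."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def phrase_score(p1: str, p2: str) -> float:
    """Average each token of p1 against its best-matching token in p2."""
    t1, t2 = p1.split(), p2.split()
    if not t1 or not t2:
        return 0.0
    best = [max(word_score(w, other) for other in t2) for w in t1]
    return sum(best) / len(best)

print(round(phrase_score("HEWLETT PACKARD CO", "HEWLET PACKARD COMPANY"), 2))
```

Note that this rollup is asymmetric (it scores the first phrase against the second); production phrase matching also has to handle reordered tokens, abbreviations, and address components, which is where the combined use of the functions above comes in.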
6) Soundex – SOUNDEX converts an alphanumeric string to a four-character code based on how the string sounds when spoken. The first character of the code is the first character of the input string, converted to upper case. The second through fourth characters of the code are numbers that represent the remaining letters. The letters A, E, I, O, U, H, W, and Y are ignored unless they are the first letter of the string. Zeroes are appended at the end if necessary to produce a four-character code.
SOUNDEX codes from different strings can then be compared to see how similar the strings sound when spoken.
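A simplified Soundex sketch in Python follows (implementations differ slightly in how H and W interact with the duplicate-code rule; this version illustrates the classic scheme described above).

```python
# Letter-to-digit table for the classic Soundex algorithm
SOUNDEX_CODES = {}
for letters, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")]:
    for letter in letters:
        SOUNDEX_CODES[letter] = digit

def soundex(word: str) -> str:
    """Return the four-character Soundex code for a word."""
    word = "".join(ch for ch in word.upper() if ch.isalpha())
    if not word:
        return ""
    code = word[0]                          # keep the first letter as-is
    prev = SOUNDEX_CODES.get(word[0], "")
    for ch in word[1:]:
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != prev:         # skip repeats of the same code
            code += digit
        if ch not in "HW":                  # H and W do not break a run of like codes
            prev = digit
    return (code + "000")[:4]               # pad with zeroes to four characters

print(soundex("Robert"), soundex("Rupert"))  # R163 R163: the names sound alike
print(soundex("Tymczak"))                    # T522
```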