Fuzzy matching is one of Automated Auditors’ core strengths. Fuzzy matching is the ability to join phrases that either look or sound alike but are not spelled the same. For example, “Elizabeth Banks” and “Banks, Liz E.” are close enough to the human eye and ear that they should be counted as similar. How is fuzzy matching performed, and why is it important?
Benefits of Fuzzy Matching
Fuzzy matching is the art and science of linking disparate words and phrases with one another. The benefits of utilizing fuzzy matching are too numerous to list, but here are the Top 3 Reasons to harness the power of Fuzzy Matching:
1-Increased ability to make linkages: It is widely believed that a misspelling of deceased Boston Bomber Tamerlan Tsarnaev’s name thwarted the FBI’s efforts to track Tsarnaev’s 2011 trip to Russia. Could fuzzy matching logic have helped flag Tsarnaev’s trip in intelligence files?
2-Increased ability to merge disparate files: In a recent accounts receivable project, our analysts had to merge several disparate accounting files that lacked a common identifier to link upon. Our data mining experts used fuzzy matching to link the files together by customer name. We were able to completely reconstruct the accounts receivable year-end reports using fuzzy matching as the key.
3-Increased ability to aggregate accurately: Aggregating by any kind of name is problematic when there are variations of the spelling of that name. For example, suppose you have several different Hewlett Packard vendors, with the following spellings:
Using fuzzy matching, we consolidate these names into one standardized name so that an accurate aggregation can be made for this corporation.
How Do We Perform Fuzzy Matching?
There are several ways to perform fuzzy matching. We describe several methods that Automated Auditors commonly uses to tie disparate data together.
1) Levenshtein Distance – The Levenshtein Distance measures the “distance” between two phrases by counting the minimum number of single-character insertions, deletions, and substitutions it takes to turn one phrase into the other. For example, see the Levenshtein distances calculated here between two phrases. We are using the name of deceased Boston Bomber Tsarnaev to show that misspellings can be caught and disparate databases can be joined, resulting in increased data completeness and accuracy. Could fuzzy matching have saved lives?
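The counting described above can be sketched in a few lines of Python. This is a minimal illustration of the standard dynamic-programming approach, not the code Automated Auditors runs in production. The function name levenshtein and the variant spelling "Tsarnayev" are assumptions chosen for the example.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    # previous[j] holds the distance between the first i-1 characters
    # of a and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


# Hypothetical variant spelling, used only for illustration.
print(levenshtein("Tsarnaev", "Tsarnayev"))  # 1 (one inserted letter)
```

A small distance relative to the length of the name suggests the two records may refer to the same person. The cutoff to use is a judgment call that depends on the data.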
2) TriGram Function – The Trigram function, developed by Automated Auditors, returns the number of trigrams that two phrases have in common. A trigram is a consecutive 3-letter substring of a phrase. For example, the word “AUDITOR” has 5 trigrams:
AUD – UDI – DIT – ITO – TOR
Each word or phrase has (N-2) trigrams, where N is the length of the word or phrase. We also compare the percentage of trigrams that two phrases have in common, as shown below. The PctTriGram function is particularly useful for phrase matching and address matching, but the example below shows a simple representation of how the function works:
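To make the trigram idea concrete, here is a rough Python sketch. It is not the TriGram or PctTriGram function itself, whose details are not described here. The names trigrams, common_trigrams, and pct_trigram are hypothetical, and two choices are assumptions: phrases are upper-cased before comparison, and the percentage is taken relative to the phrase with more trigrams.

```python
from collections import Counter


def trigrams(phrase):
    """Return every consecutive 3-character substring of phrase."""
    text = phrase.upper()  # assumption: comparison ignores case
    return [text[i:i + 3] for i in range(len(text) - 2)]


def common_trigrams(a, b):
    """Count the trigrams two phrases share, respecting repeats."""
    shared = Counter(trigrams(a)) & Counter(trigrams(b))
    return sum(shared.values())


def pct_trigram(a, b):
    """Share of trigrams in common, relative to the phrase with more
    trigrams (an assumed definition, chosen for illustration)."""
    most = max(len(trigrams(a)), len(trigrams(b)))
    return common_trigrams(a, b) / most if most else 0.0


print(trigrams("AUDITOR"))                     # ['AUD', 'UDI', 'DIT', 'ITO', 'TOR']
print(common_trigrams("AUDITOR", "AUDITORS"))  # 5
```

Because "AUDITORS" has six trigrams and shares five of them with "AUDITOR", pct_trigram returns 5/6 for this pair under the assumed definition.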
3) Generalized Edit Distance – The generalized edit distance algorithm is a variation of the Levenshtein algorithm, and is used widely for comparing phrase similarity. SAS software contains a function called COMPGED that we utilize to match phrases. The COMPGED function in SAS returns the generalized edit distance between string-1 and string-2. The generalized edit distance is the minimum-cost sequence of operations for constructing string-1 from string-2.
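The idea of a minimum-cost sequence of operations can be illustrated with a weighted edit distance in Python. This sketch is not COMPGED: the SAS function uses its own table of operations and costs, which is not reproduced here. The function name and the three cost parameters below are assumptions for illustration only.

```python
def generalized_edit_distance(string1, string2,
                              insert_cost=1.0,
                              delete_cost=1.0,
                              replace_cost=1.0):
    """Return the minimum total cost of constructing string1 from string2.

    The three costs are illustrative defaults, not the SAS cost table.
    """
    source, target = string2, string1
    # previous[j] = cost of building the first j characters of target
    # from the part of source processed so far (nothing, initially).
    previous = [j * insert_cost for j in range(len(target) + 1)]
    for i, source_char in enumerate(source, start=1):
        current = [i * delete_cost]  # delete everything seen so far
        for j, target_char in enumerate(target, start=1):
            keep_or_replace = previous[j - 1] + (
                0.0 if source_char == target_char else replace_cost
            )
            current.append(min(
                previous[j] + delete_cost,     # drop source_char
                current[j - 1] + insert_cost,  # add target_char
                keep_or_replace,
            ))
        previous = current
    return previous[-1]


# With unit costs this reduces to the Levenshtein distance.
print(generalized_edit_distance("cut", "cat"))  # 1.0
# If a replacement costs 3, a delete plus an insert (cost 2) is cheaper.
print(generalized_edit_distance("cut", "cat", replace_cost=3.0))  # 2.0
```

Assigning different costs to different operations lets the score reflect which kinds of errors are more likely in the data, which is the main difference from a plain Levenshtein count.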
4) Jaro-Winkler Function – The Jaro-Winkler function is commonly utilized for matching words, but not necessarily phrases. The version of the Jaro-Winkler function we have is tailored for SAS software, and very accurately identifies similarities between two words. This function returns a value between 0 and 1, with 1 representing an exact match.
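For readers who want to see how such a score is built, below is a Python sketch of the textbook Jaro-Winkler similarity. It is not the SAS-tailored version mentioned above. The prefix scale of 0.1 and the four-character prefix limit are the commonly cited defaults and are assumptions here, and the comparison is case-sensitive.

```python
def jaro(s1, s2):
    """Textbook Jaro similarity: 0.0 (nothing shared) to 1.0 (identical)."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Characters count as matching only if they are within this window.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, char in enumerate(s1):
        start = max(0, i - window)
        stop = min(len(s2), i + window + 1)
        for j in range(start, stop):
            if not matched2[j] and s2[j] == char:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are
    # counted as half-transpositions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len(s1)
            + matches / len(s2)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity, boosted for a shared prefix of up to 4 characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```

The prefix boost rewards words that start the same way, which suits names where errors tend to occur later in the word.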
It is important to note here that all of the above functions are useful for comparing single words, but not necessarily phrases. Our team has developed a series of fuzzy matching algorithms that compare complex phrases, text, and addresses.
5) Phrase Matching – For intricate phrase and address matching, we combine all of these functions to provide comprehensive and accurate phrase and address fuzzy matching. Most of these functions, individually, are great for comparing single words, but not phrases. Our analysts combine them in a proprietary way to arrive at the best phrase matching algorithms.
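Our phrase matching algorithms are not reproduced here, but the general idea of building a phrase score from word-level scores can be sketched in Python. Everything in this example is an assumption made for illustration: the names tokenize and phrase_similarity, the choice to ignore punctuation and word order, and the use of difflib.SequenceMatcher from the Python standard library as a stand-in for the word-level functions described above.

```python
import re
from difflib import SequenceMatcher


def tokenize(phrase):
    """Upper-case the phrase and split it into words, dropping punctuation."""
    return re.sub(r"[^A-Za-z0-9]+", " ", phrase).upper().split()


def phrase_similarity(a, b):
    """Score two phrases from 0.0 to 1.0, ignoring word order.

    Each word of the longer phrase is paired with its best-matching
    word in the shorter phrase, and the best scores are averaged.
    """
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    if not tokens_a or not tokens_b:
        return 0.0
    shorter, longer = sorted((tokens_a, tokens_b), key=len)
    best_scores = [
        max(SequenceMatcher(None, word, other).ratio() for other in shorter)
        for word in longer
    ]
    return sum(best_scores) / len(best_scores)


print(phrase_similarity("Banks, Elizabeth", "Elizabeth Banks"))  # 1.0
print(phrase_similarity("Elizabeth Banks", "Banks, Liz E."))
```

Averaging over the longer phrase means that extra or abbreviated words lower the score instead of being ignored. A production matcher would likely also handle nicknames, initials, and address abbreviations, which this sketch does not attempt.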