What is Entity Resolution?
Entity resolution, often called record linkage or de-duplication, is a set of algorithms and fuzzy-matching techniques that consolidate records referring to the same real-world thing under a single standard identifier. Identity resolution, for example, consolidates data from one or more sources so that all of it is tied to one person’s identity. Entity resolution applies the same concept, typically to an organization or corporation name.
For example, suppose you have several vendors with different spellings of HP (Hewlett-Packard), and you want to aggregate annual spend for the entire umbrella vendor.
After Entity Resolution is applied to the HP problem, all of the vendors are consolidated to one standard vendor name.
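The consolidation step can be sketched in a few lines of Python. This is only an illustration: the alias table, the vendor spellings, and the spend amounts below are hypothetical, and a real project would build the alias mapping with fuzzy matching rather than by hand.

```python
import re
from collections import defaultdict

# Hypothetical alias table: normalized spelling -> standard vendor name.
ALIASES = {
    "hp": "Hewlett-Packard",
    "hp inc": "Hewlett-Packard",
    "hewlett packard": "Hewlett-Packard",
    "hewlett packard co": "Hewlett-Packard",
}

def normalize(name):
    """Lowercase, drop periods and apostrophes, turn other punctuation into spaces."""
    lowered = name.lower().replace(".", "").replace("'", "")
    return re.sub(r"[^a-z0-9]+", " ", lowered).strip()

def standard_name(name):
    """Return the standard vendor name, or the trimmed original if no alias is known."""
    return ALIASES.get(normalize(name), name.strip())

def total_spend(invoices):
    """Sum invoice amounts per standard vendor name."""
    totals = defaultdict(float)
    for vendor, amount in invoices:
        totals[standard_name(vendor)] += amount
    return dict(totals)

# Hypothetical invoice lines: (vendor name as entered, amount).
invoices = [
    ("HP", 1200.0),
    ("Hewlett-Packard", 800.0),
    ("H.P. Inc.", 500.0),
    ("Hewlett Packard Co.", 250.0),
    ("Acme Supply", 100.0),
]

print(total_spend(invoices))
```

With the four HP spellings mapped to one name, annual spend rolls up to a single line for the umbrella vendor, while the unrelated vendor stays separate.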
Benefits of Entity Resolution
Entity resolution is useful in many industries. In accounting, cleansing and de-duping the master vendor file is a form of entity resolution that helps eliminate duplicate vendors. You can also use Entity Resolution to cleanse customer databases and mailing databases, keeping the most accurate and active record. This cleansing exercise saves time and money by eliminating potential duplicate payments and duplicate mailings.
But perhaps the most interesting use of Entity Resolution is to find FRAUD.
Entity Resolution can be implemented to uncover hidden relationships to detect fraud. For example, if you have the following table of vendor records, you can see with the human eye that pairs of records are related, but actually all of the records are related to one another indirectly.
Using your own eye, you can tell that rows A, B, and C are linked via address even though the addresses differ slightly. Rows D and E are related via address, and rows F and G are related via similar phone numbers. If family members purchase cell phone service at the same time, they are often given sequential cell phone numbers. This scenario could be an actual case where Dr. Abdel is the founder of “Happy Home Care”, a fictitious home health care clinic. Walker Associates is a legitimate business that is fronting for Happy Home Care, and owns suites A and B in the same building. John Walker owns Walker Associates, and his daughter is married to Mohammed Abdel; the couple live in the basement apartment of 237 S. Ridgeway, which is a single-family home.
This kind of linkage mechanism is very useful, especially when you can control the fuzzy matching underlying each individual match.
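One common way to turn pairwise matches into a single group is a union-find (disjoint-set) structure, which follows indirect links automatically. The sketch below assumes the fuzzy matching has already produced a list of matched pairs; the record IDs and the pairs themselves are hypothetical.

```python
def family_ids(record_ids, matched_pairs):
    """Assign each record a family ID by merging every matched pair (union-find)."""
    parent = {r: r for r in record_ids}

    def find(r):
        # Walk up to the root, shortening the path as we go.
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller ID as the root so family IDs are deterministic.
            if root_b < root_a:
                root_a, root_b = root_b, root_a
            parent[root_b] = root_a

    return {r: find(r) for r in record_ids}

# Hypothetical input: each pair is one address or phone match found earlier.
records = ["A", "B", "C", "D", "E", "F", "G", "H"]
pairs = [("A", "B"), ("B", "C"), ("D", "E"), ("F", "G"), ("C", "D"), ("E", "F")]

families = family_ids(records, pairs)
print(families)
```

No single pair links record A to record G, yet both end up in the same family because the chain of matches connects them. Record H, which matched nothing, keeps its own ID.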
Automated Auditors, LLC, has developed proprietary entity resolution software called AnyConnection. It employs several fuzzy matching techniques, including Jaro-Winkler and Levenshtein edit distance functions as well as proprietary edit distance matching functions, and combines the resulting matches into a broad umbrella “family” ID for the seemingly unrelated records above. This allows investigators to proactively identify fraud issues in any industry.
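Of the techniques named above, Levenshtein edit distance is the simplest to illustrate. The following is a textbook dynamic-programming version, not AnyConnection's implementation; the `similarity` helper and its normalization by the longer string's length are assumptions made for this sketch.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete from a
                current[j - 1] + 1,      # insert into a
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]

def similarity(a, b):
    """Scale the distance to 0..1, where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

print(levenshtein("Ridgeway", "Ridgway"))
print(similarity("237 S. Ridgeway", "237 S Ridgway"))
```

A match rule would then compare the similarity against a threshold chosen per field, which is one way to control the fuzziness of each individual match.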
World Bank Case Study
Automated Auditors, LLC was retained to complete a Vendor File de-duplication and cleansing project for The World Bank. The vendor file held over 300,000 records, many with foreign names and addresses and many duplicated, which made the project a perfect match for Automated Auditors’ fuzzy matching logic and entity resolution techniques. Using our proprietary TriGram and PctTriGram functions and fuzzy matching logic, we were able to identify that (for example) the following two addresses were the same, despite the different spellings and the different word order:
|Complex Foreign Address Matching|
|Calle 3 villa santa la central canovanas pr 00729 buzon 2311|
|La central canovanes Calle 3, Villa Sant, Buzon 2311 PR 00729|
The Vendor File cleansing and de-duplication effort resulted in the identification of many thousands of duplicate records and the consolidation of many records into a single record per vendor.
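The TriGram and PctTriGram functions are proprietary, so their internals are not shown here. As a generic stand-in, the sketch below scores two addresses by the Jaccard overlap of their character trigrams, computed per word so that word order does not matter. The function names and the unrelated comparison address are hypothetical.

```python
import re

def trigrams(text):
    """Set of character trigrams, taken word by word so word order is ignored."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    grams = set()
    for token in tokens:
        padded = f" {token} "  # pad so short words still yield a trigram
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams

def trigram_similarity(a, b):
    """Jaccard overlap of the two trigram sets, from 0.0 to 1.0."""
    grams_a, grams_b = trigrams(a), trigrams(b)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)

addr_1 = "Calle 3 villa santa la central canovanas pr 00729 buzon 2311"
addr_2 = "La central canovanes Calle 3, Villa Sant, Buzon 2311 PR 00729"
unrelated = "10 Example Street, Springfield"  # hypothetical address for contrast

print(round(trigram_similarity(addr_1, addr_2), 2))
print(round(trigram_similarity(addr_1, unrelated), 2))
```

The two spellings of the same address share most of their trigrams despite the reordering and misspellings, while the unrelated address shares almost none.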
Customs and Border Protection Case Study
Customs and Border Protection (CBP) maintains shipping manifests for inbound and outbound cargo. The city name field on the manifests is an open-text field in which the sender can write anything; there is no “drop down box” to help standardize the city name. CBP needed to standardize city names so it could tell whether a package was going to a large city or a rural area. Without name standardization, it couldn’t reliably identify the city or its size. For example, an extract of the shipping manifest data showed over 8,000 different variations of “Buenos Aires”! It sounds impossible that there could be this many variations, but because each province had its own mailing number, and because of misspellings, it was indeed possible. This bottleneck was delaying CBP’s efforts to develop accurate predictive models for isolating suspicious packages. Automated Auditors, LLC developed SAS algorithms to consolidate and standardize the city name, using fuzzy matching logic and entity resolution algorithms. Our proprietary software was able to standardize almost all of the variations to a single city name, allowing Customs and Border Protection to identify the shipping send-to address more accurately.
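The production work was done in SAS with proprietary logic. The Python sketch below only illustrates the general idea: strip mailing numbers and punctuation, then map each free-text entry to its closest name in a reference list using the standard library's `difflib`. The reference list, the sample entries, and the 0.8 cutoff are assumptions chosen for the example.

```python
import difflib
import re

# Hypothetical reference list of standard city names.
CANONICAL_CITIES = ["Buenos Aires", "Cordoba", "Rosario", "Mendoza"]

def clean(raw):
    """Lowercase, drop digits and punctuation (e.g. mailing numbers), squeeze spaces."""
    letters_only = re.sub(r"[^a-z ]+", " ", raw.lower())
    return " ".join(letters_only.split())

# Cleaned spelling -> standard name, built once.
_LOOKUP = {clean(city): city for city in CANONICAL_CITIES}

def standardize_city(raw, cutoff=0.8):
    """Return the closest standard city name, or None if nothing is close enough."""
    matches = difflib.get_close_matches(clean(raw), list(_LOOKUP), n=1, cutoff=cutoff)
    return _LOOKUP[matches[0]] if matches else None

# Hypothetical free-text entries from a manifest.
for entry in ["Buenos Aries", "BUENOS AIRES 1425", "C1425 Buenos Aires", "Montevideo"]:
    print(repr(entry), "->", standardize_city(entry))
```

Entries that fall below the cutoff return `None` instead of being forced onto the nearest city, so they can be routed to manual review.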