Using Entity Resolution and Record Linkage to Find Fraud
July 1, 2015
Using Entity Resolution and Record Linkage to Detect Fraud, Part 1
Entity resolution, otherwise known as “record linkage”, “master data management”, “de-duplication”, and “merge/purge”, is often a topic for academic consideration and research, but how useful is it for detecting fraud? I would argue: very useful.
What is it?
Entity resolution is the process by which multiple records are related to one another and consolidated into a cluster based upon similar attributes. You define what makes two records similar, which makes entity resolution flexible enough to fit your goal. If your goal is to standardize names and addresses, de-dupe your vendor file, and consolidate to one “golden record”, then entity resolution can identify matches on name, address, phone number, and other pertinent information, and link the records together into one vendor family. This exercise alone will 1) clean up your vendor file and 2) illuminate possible linkages between vendors.
To show more graphically what entity resolution can do, suppose you have several vendor name variations for Hewlett Packard (HP), and you want to aggregate all of their invoices together, but their names are all different, and each has a different vendor ID. See Figure 1 below. When you filter the rows on the left through Entity Resolution software, it will assign a “Family ID” or cluster number to each, assigning them all to the same “HP Family” and will standardize the vendor name to “Hewlett Packard”.
Figure 1. Entity Resolution Example
How to Find Fraud Using Entity Resolution
The above example may assist auditors in aggregating data to the top level for financial reporting purposes, but how can entity resolution be used to detect fraud?
Let’s use a real example: Harriette Walters, perpetrator of the largest tax fraud scheme in Washington, DC history. Harriette Walters embezzled over $48 million in property tax refund checks, which were directed to payees who included her brother, nephew, niece, and even her hairdresser and her personal shopper at Neiman Marcus. A close examination of the payee names shows that although the names are not exact replicas, they share similar strands. Samuel Earl Pope was the owner of “Head to Toe Salon”, and you can see variations of Pope and Walker throughout the payee names below:
According to the written indictment, Harriette Walters purposely varied the payee names so that it didn’t appear that the same person or entity was receiving multiple refunds in a given year. Using fuzzy name matching and entity resolution, these records would have been related by name matching alone. Intelligent name matching can filter out noise words like “Association” or “Assoc” or “Inc”, and identify key words that are rare. In this case “Pope” and “Walker” would have been linked using a good entity resolution solution.
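The noise-word idea can be sketched in a few lines of Python. This is a minimal illustration, not the matching logic of any particular product: the noise-word list, the function names, and the sample names are all made up for the example, and the weighting of rare key words mentioned above is left out.

```python
import re

# Hypothetical noise-word list; a real one would be tuned to the data at hand.
NOISE_WORDS = {"association", "assoc", "inc", "llc", "co", "company", "corp",
               "the", "and", "of"}

def key_tokens(name):
    """Lowercase the name, strip punctuation, and drop noise words."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return {t for t in tokens if t not in NOISE_WORDS}

def name_similarity(a, b):
    """Jaccard overlap of the key tokens of two names (0.0 to 1.0)."""
    ta, tb = key_tokens(a), key_tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)
```

With these assumptions, `name_similarity("Northgate Realty Assoc.", "Northgate Realty, Inc")` scores 1.0 because the two names differ only in noise words, while two names with no key word in common score 0.0.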
Ms. Walters used the name-variation theme again for her fictitious tax refund payees, as shown in the table below – again you can see the similarity in the payee names.
With Entity Resolution, the various payees could have been related to one another and readily analyzed, showing multiple large refunds per year. As Ms. Walters’ scheme grew, so did her appetite, and the tax refund check amounts grew to over $400,000 for a single check. If entity resolution had been applied to these similar-sounding payees, the upward trend might have been caught, as well as the multiple false tax returns per year.
Three Practical Applications to Detect Accounts Payable Fraud
1) Vendor Entity Resolution
The first and most practical application of entity resolution to detect Accounts Payable (A/P) fraud is to run entity resolution and record linkage software on the vendor file. One of the most common schemes in A/P fraud is to set up ghost/fictitious vendors and submit false invoices for payment. Sometimes the fraudster may utilize pieces of their own information (name, address, phone) to create the ghost vendor. Oftentimes, if the fraudster does this successfully once, the scheme will be repeated by the creation of many fictitious vendors. To identify this type of scenario proactively, run entity resolution or record linkage software on the vendor file. (We use the terms entity resolution and record linkage interchangeably.) Suppose you have four vendors with the information below. Entity resolution software can systematically link these four records together even though they are not all directly related to each other.
| Record # | Name | Address | Bank Acct # | Phone # |
|---|---|---|---|---|
| A | John Roberts | 101 S. Main St., Fairfax, VA | 25678150 | 703-356-1101 |
| B | Jon Roberts | 12235 Rezdec Circle, Fairfax, VA | 10122345 | 703-356-1101 |
| C | Kate Smith-Roberts | 789 Wheeler Way, Tulsa, OK | 10122345A | 954-227-0050 |
| D | David Miller | 2245-A Beach Rd, Miami, FL | 09773394 | 954-227-0050 |
With the human eye, you may be able to tell that all four of the above records are related:
- Records A and B are related via similar name and same phone number.
- Records B and C are related via same/similar bank account number.
- Records C and D are related via same phone number.
This is very effective for fraud detection, because you can now relate row “A” to row “D” even though they are not directly related at all; they are related only through mutual relationships. This is, in essence, link analysis, but instead of being shown graphically, it is shown in row and column format. Many link analysis tools are drill-down tools for suspects already under investigation, but proactive entity resolution and link analysis may be more effective.
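The chaining of A to D through B and C can be sketched with a small union-find structure over the four records from the table. The pairwise rule below (same phone digits, same bank account digits, or a name similarity of at least 0.9) is an assumption chosen to mirror the three bullet points, and addresses are ignored for brevity; a production matcher would use richer rules.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations

records = {
    "A": {"name": "John Roberts", "bank": "25678150", "phone": "703-356-1101"},
    "B": {"name": "Jon Roberts", "bank": "10122345", "phone": "703-356-1101"},
    "C": {"name": "Kate Smith-Roberts", "bank": "10122345A", "phone": "954-227-0050"},
    "D": {"name": "David Miller", "bank": "09773394", "phone": "954-227-0050"},
}

def digits(value):
    """Keep only the digits, so '10122345A' compares equal to '10122345'."""
    return re.sub(r"\D", "", value)

def linked(r1, r2):
    """Assumed rule: same phone, same bank digits, or near-identical name."""
    if digits(r1["phone"]) == digits(r2["phone"]):
        return True
    if digits(r1["bank"]) == digits(r2["bank"]):
        return True
    ratio = SequenceMatcher(None, r1["name"].lower(), r2["name"].lower()).ratio()
    return ratio >= 0.9

# Union-find: every record starts as its own family.
parent = {rid: rid for rid in records}

def find(rid):
    """Return the representative of the family that rid belongs to."""
    while parent[rid] != rid:
        parent[rid] = parent[parent[rid]]  # path compression
        rid = parent[rid]
    return rid

def union(a, b):
    """Merge the families of a and b."""
    parent[find(a)] = find(b)

for a, b in combinations(records, 2):
    if linked(records[a], records[b]):
        union(a, b)

# Records with the same representative share a family.
family_id = {rid: find(rid) for rid in records}
```

Under this rule A and D never match each other directly, yet all four records end up with the same `family_id`, which is the transitive linkage described above.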
It is also very effective to cross-walk your vendor file with the State Corporation Commission file of registered Corporations and LLCs. This can also be done using entity resolution and record linkage software. If one of the businesses in a family is not registered, this is a red flag and the vendor should be investigated further.
2) Vendor / Employee Entity Resolution
The second most practical application of entity resolution, and name/address matching in general, is to apply it to the vendor and employee files, finding vendors that have information in common with employees. This approach will identify common links between vendors and employees, as happens when an A/P clerk or manager fraudulently creates a ghost vendor using some of their own personal data.
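A simple version of this cross-check can be sketched as an index lookup: normalize each employee attribute, then look up each vendor's attributes in that index. Everything here is hypothetical, including the field names, the sample rows, and the choice to compare only address, phone, and bank account; exact matching after normalization stands in for the fuzzier matching a real solution would apply.

```python
import re
from collections import defaultdict

def norm_digits(value):
    """Keep digits only, for phone and bank account numbers."""
    return re.sub(r"\D", "", value)

def norm_text(value):
    """Lowercase and drop punctuation, for addresses."""
    return " ".join(re.findall(r"[a-z0-9]+", value.lower()))

# Assumed field names and the normalizer applied to each.
FIELDS = {"address": norm_text, "phone": norm_digits, "bank": norm_digits}

def shared_attributes(vendors, employees):
    """Return (vendor id, employee id, field) for every attribute in common."""
    index = defaultdict(set)
    for emp in employees:
        for field, norm in FIELDS.items():
            key = norm(emp[field])
            if key:  # skip blank values so empty fields never match
                index[(field, key)].add(emp["id"])
    hits = []
    for ven in vendors:
        for field, norm in FIELDS.items():
            for emp_id in sorted(index.get((field, norm(ven[field])), ())):
                hits.append((ven["id"], emp_id, field))
    return hits

# Made-up sample data for illustration only.
employees = [
    {"id": "E1", "address": "12 Elm St., Springfield", "phone": "(555) 010-0001", "bank": "000111222"},
    {"id": "E2", "address": "9 Oak Ave, Springfield", "phone": "555-010-0002", "bank": "000333444"},
]
vendors = [
    {"id": "V1", "address": "12 ELM ST SPRINGFIELD", "phone": "555-010-7777", "bank": "000999888"},
    {"id": "V2", "address": "PO Box 40, Shelbyville", "phone": "555-010-8888", "bank": "000333444"},
    {"id": "V3", "address": "77 Pine Rd, Capital City", "phone": "555-010-6666", "bank": "000555666"},
]

hits = shared_attributes(vendors, employees)
```

In this sample, vendor V1 shares an address with employee E1 and vendor V2 shares a bank account with employee E2, so both pairs would go on the review list.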
3) Employee / Employee Entity Resolution
A third application of entity resolution to detect potential A/P fraud is to find any links between employees. Ghost employees are another way to embezzle money from the company: they can receive real paychecks and travel and expense reimbursement checks. Cross-check employees using name, address, phone number, SSN, and bank account number. Any piece of data in common is suspect. The list generated in this scenario will include false positives where family members both work for the company: husband and wife, brothers, sisters, and parent/child relationships. Even for these, do a cursory review and include the active/inactive flag for each employee to see whether an inactive employee is still receiving checks from the company.
Case Study: Unemployment Benefits Fraud
In another real-life case, Jaqueline Kennedy and several accomplices filed false W-2 forms for employees who supposedly worked for fictitious shell companies, bilking the states of Illinois, Indiana, and Minnesota of more than $8.7 million in unemployment benefits and earned income credits. Some of the shell companies were registered in the defendants’ names, such as “Uniquely U Personal Services”, with Tara Cox as the registered agent, but most of the companies cannot be found anywhere in the State Corporation online search tool. The online search tool shows both active and inactive corporations, so most of these companies were evidently never registered with the state of Illinois.
We suggest two things to combat this type of fraud, using entity resolution and record linkage:
1) First, State Unemployment Agencies, or any entity wishing to validate employer existence, should verify that the employer/corporation exists by cross-checking employer information against the State Corporation list of registered Corporations and LLCs. Rather than relying on a manual check, entity resolution and record linkage software can perform name and address matching systematically to determine whether the employer listed on the unemployment benefits application is registered with the State.
2) Second, they should data-mine the State Corporation list to determine how many companies have the same registered agent listed and also how many active corporations share the same address or contact phone number. The Illinois State Corporation online search tool does not allow searching by registered agent name, but entity resolution software applied to the entire database would easily identify all companies registered by the same agent using fuzzy matching technology.
This is a recommendation that goes far beyond Unemployment Agencies – this is just one example.
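Both suggestions can be sketched against a registry extract. The sketch below assumes the registry is available as a list of rows with `company` and `agent` fields; those field names, the suffix list, the 0.9 threshold, and the sample rows are all assumptions for illustration. Agent names are grouped on a normalized, order-independent key rather than with true fuzzy matching, which keeps the example short.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical list of corporate suffixes to ignore when comparing names.
SUFFIXES = {"inc", "llc", "corp", "co", "ltd", "company", "corporation",
            "incorporated"}

def norm_company(name):
    """Lowercase, strip punctuation, and drop corporate suffixes."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(t for t in tokens if t not in SUFFIXES)

def is_registered(employer, registry_names, threshold=0.9):
    """True if the employer name closely matches any registered name."""
    target = norm_company(employer)
    if not target:
        return False
    return any(
        SequenceMatcher(None, target, norm_company(name)).ratio() >= threshold
        for name in registry_names
    )

def companies_by_agent(registry):
    """Group companies by agent name, ignoring case, punctuation, and word order."""
    groups = defaultdict(list)
    for row in registry:
        key = " ".join(sorted(re.findall(r"[a-z]+", row["agent"].lower())))
        groups[key].append(row["company"])
    # Keep only agents who registered more than one company.
    return {agent: names for agent, names in groups.items() if len(names) > 1}

# Made-up registry rows for illustration only.
registry = [
    {"company": "Blue Heron Staffing LLC", "agent": "Jane Q. Doe"},
    {"company": "Cedar Point Services Inc", "agent": "DOE, JANE Q"},
    {"company": "Maple Street Bakery Inc", "agent": "Sam Roe"},
]
registered_names = [row["company"] for row in registry]
multi_company_agents = companies_by_agent(registry)
```

Here an application listing "Blue Heron Staffing, LLC" would pass the existence check, one listing an unregistered name would fail it, and the two companies filed under variations of the same agent name would be grouped together for review.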
Entity Resolution: Software vs. Service?
I recommend building a customized entity resolution solution rather than purchasing a very expensive off-the-shelf software package (some cost more than $1 million). The strength and accuracy of entity resolution rely on several factors: 1) the underlying intelligence of the fuzzy name and address matching, 2) the software’s ability to be customized and modified, 3) the size of the database being processed, and 4) the power of the software and server to handle millions of iterations of matching in a relatively short period of time. There are several Entity Resolution and Master Data Management off-the-shelf software packages available now, but I recommend developing a solution around the data available. I have personally performed entity resolution and record linkage for the World Bank, the Centers for Medicare & Medicaid Services, and Customs and Border Protection, and I find it difficult to apply a “one size fits all” approach to each scenario. Here is why:
The World Bank: The World Bank wanted a vendor file clean-up and de-duplication process. The challenge was that their vendor addresses include just about every country on the planet, so addresses were not standardized. This led me to develop a very rich address-matching function in SAS that matches addresses accurately, independent of the order of the words. Entity Resolution allowed us to develop a much cleaner vendor database for The World Bank, consolidating duplicates and developing a “golden vendor” record.
Centers for Medicare & Medicaid Services (CMS): CMS wanted to match Medicaid providers to Medicare providers in order to better support fraud investigations. I developed customized matching algorithms that matched Medicaid providers with Medicare providers when a common identifier was not present. Additionally, similarly named providers were consolidated into one “family” so that fraud investigators could more readily analyze a provider. I included provider name, address, license number, phone number, specialty, address frequency, and last name frequency as factors in the matching and entity resolution. As anyone who deals with fraud knows, it is easy for fraudsters to fly under the radar if they divide their activity among several IDs. One provider may fly under the radar in both Medicaid and Medicare, but when you can review the provider’s combined Medicaid/Medicare activity, the provider may jump out as a clear outlier. By consolidating provider records into one family, eliminating variations due to typos or name changes, fraud analysts can analyze trends more accurately.
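One common way to make address comparison independent of word order is to sort the words before comparing. The Python sketch below illustrates that general idea only; it is not the SAS function described above, and the function names and sample addresses are invented for the example.

```python
import re
from difflib import SequenceMatcher

def sorted_tokens(address):
    """Lowercase, drop punctuation, and sort the words alphabetically."""
    return " ".join(sorted(re.findall(r"\w+", address.lower())))

def address_similarity(a, b):
    """Similarity (0.0 to 1.0) that ignores word order but tolerates small typos."""
    return SequenceMatcher(None, sorted_tokens(a), sorted_tokens(b)).ratio()
```

With this approach, "Avenida Central 45, Lima, Peru" and "Lima Peru - 45 Avenida Central" compare as identical, and a single-letter typo in one word lowers the score only slightly.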
Customs and Border Protection (CBP): For CBP, I developed SAS software and functions to consolidate cargo shipping city names so that TASPO (Targeting and Analysis Systems Program Office) could separate cargo going to urban areas versus rural areas in their predictive models. As an example, “Buenos Aires” had over 8,000 different variations because of the mail drop numbers, typos, and different provinces and sub-neighborhoods added to the address city name. There is no standardization in the shipping information, so it is very common to have this many variations. Entity Resolution was applied to map the 8,000 variations of Buenos Aires to the correct spelling.
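The city clean-up can be illustrated with a rough Python sketch: strip digits and punctuation, then look for a close match to a reference list, first on the whole string and then on shorter runs of words. This is not the SAS code described above. The reference list, the 0.8 cutoff, and the sample inputs are assumptions, and accented characters are not handled.

```python
import re
from difflib import get_close_matches

# Assumed reference list of correctly spelled city names.
CANONICAL_CITIES = ["Buenos Aires", "Rosario", "Mendoza"]

def canonical_city(raw, cutoff=0.8):
    """Map a messy city string to a canonical name, or None if nothing is close."""
    # Replace digits and punctuation with spaces, then collapse whitespace.
    cleaned = " ".join(re.sub(r"[^a-z ]", " ", raw.lower()).split())
    words = cleaned.split()
    lookup = {city.lower(): city for city in CANONICAL_CITIES}
    # Try the whole string first, then every two-word and one-word window.
    candidates = [cleaned] + [
        " ".join(words[i:i + n])
        for n in (2, 1)
        for i in range(len(words) - n + 1)
    ]
    for candidate in candidates:
        match = get_close_matches(candidate, list(lookup), n=1, cutoff=cutoff)
        if match:
            return lookup[match[0]]
    return None
```

Under these assumptions, inputs such as "BUENOS AIRES 1425" or the misspelled "Buenos Aries, Palermo" both map to "Buenos Aires", while a city that is not on the reference list returns `None` and can be routed to manual review.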
I recommend using SAS for this type of work because it has a very rich platform for complex programming and data manipulation, and also allows users to create their own functions.
How Does Entity Resolution Work?
Part 2 of this article will delve into the details of Entity Resolution and how it works. It will cover everything from using edit-distance functions such as Jaro-Winkler and Levenshtein to developing a process for clustering using SAS code.
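As a brief preview of the edit-distance idea, here is a standard dynamic-programming implementation of Levenshtein distance in Python. It is a generic textbook sketch, not the SAS code that Part 2 refers to.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]
```

For example, "Jon Roberts" and "John Roberts" are one edit apart, which is why name variations like these are easy to link with an edit-distance threshold.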