Using PROC FCMP to Improve Fuzzy Matching
Good morning! Here is an excerpt from a SAS paper I presented at SESUG on October 17, 2016.
If you are a SAS enthusiast, PROC FCMP can be a powerful tool in your arsenal. It lets you create your own functions and call them in DATA steps and PROC SQL. In this paper, I describe how to use PROC FCMP to create functions that increase the accuracy of fuzzy matching.
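To make that idea concrete, here is a minimal sketch (not code from the paper) of a user-defined function built with PROC FCMP and then called from both a DATA step and PROC SQL. The function name name_sim, the storage location work.funcs.fuzzy, the data set name scored, and the scoring formula (the Levenshtein distance from COMPLEV scaled by the longer string's length) are all assumptions chosen for illustration.

```sas
/* Illustrative sketch only: name_sim and work.funcs.fuzzy are hypothetical names. */
proc fcmp outlib=work.funcs.fuzzy;
    /* Returns a similarity score from 0 (nothing in common) to 1 (identical). */
    function name_sim(a $, b $);
        length x y $ 200;
        /* Normalize case and spacing before comparing. */
        x = compbl(strip(upcase(a)));
        y = compbl(strip(upcase(b)));
        maxlen = max(lengthn(x), lengthn(y));
        /* Two empty strings are treated as identical. */
        if maxlen = 0 then return (1);
        /* COMPLEV returns the Levenshtein edit distance; scale it by length. */
        return (1 - complev(x, y) / maxlen);
    endsub;
run;

/* Tell SAS where to find the compiled function. */
options cmplib=work.funcs;

/* Call the function in a DATA step... */
data scored;
    length name1 name2 $ 50;
    name1 = 'Buenos Aires';
    name2 = 'Buenos Ayres';
    score = name_sim(name1, name2);
run;

/* ...and in PROC SQL. */
proc sql;
    select name1, name2, name_sim(name1, name2) as score
    from scored;
quit;
```

Scaling the edit distance by the longer string's length is one simple way to make scores comparable between short and long names; other normalizations, or a different distance function such as COMPGED, may suit a given project better.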
Paper DM-126
October 17, 2016
Abstract
Fuzzy matching is the art and science of matching inexact phrases, names, addresses, numbers, and other text. The need for fuzzy matching to reconcile disparate databases is one of the most common problems in master data management, entity resolution, and data management in general. Aggregating financials for one entity or one person is nearly impossible when there are multiple records for that entity or person. Fuzzy matching is the foundation for solving the golden record problem, also known as entity resolution, record linkage, and master data management. Entity resolution is not covered in this paper, but it is covered very well by Glenn Wright in his SAS paper titled “Probabilistic Record Linkage in SAS”. My paper presents some creative ways to conduct fuzzy matching on person names, entity names, addresses, and other fields commonly used during matching. Hopefully you will find some of the code useful in your fuzzy matching endeavors!
Introduction
I have been a fuzzy matching enthusiast (aka “geek”) for years. The nuances of different edit distance functions intrigue me, and the problems I have faced as a programmer have inspired me to develop my own user-defined fuzzy matching functions. With the addition of PROC FCMP, SAS has given me the flexibility to develop user-defined functions that have proved invaluable in large matching and entity resolution projects. Some of the challenges I have faced in my career that have warranted enhanced fuzzy matching and entity resolution include:
- Resolving 8,000 different spellings of “Buenos Aires” for Customs and Border Protection (CBP), so that CBP could identify whether a package was sent to an urban or rural area, in the context of homeland security.
- Matching Medicaid providers and recipients to Medicare providers and beneficiaries on a national level, for the Centers for Medicare and Medicaid Services (CMS).
- Reconciling a client’s financial transaction database – payee names – with the OFAC (Office of Foreign Assets Control) Terrorist Watch List, also known as the Specially Designated Nationals (SDN) and Blocked Persons List.
- Matching vendor master files against a company’s employee file on name, address, phone, SSN/EIN, and email, to identify potential fraud and fictitious vendors or employees.
- Matching vendor master files against the Mail Drop File, which contains a list of all UPS Stores, FedEx, and Parcel Posts across the United States, to identify potentially fictitious or fraudulent entities.