Here is a problem I was faced with at work last week: Individuals who had two different kinds of relationships with an organization had two different contact records stored in two different files in a legacy database. The client wants to collapse those contact records down into one if possible, so we had to do a comparison between then.
- Names were concatenated in one file and stored separately in the other.
- Some addresses didn’t match (which is what we were checking in the first place), but even if they did, they were often entered differently between the two files (APT vs #, ST vs St., S vs South).
- In one file, all postal codes were stored in one column. In the other file, domestic postal codes were stored in one column and international postal codes were stored in another column.
- One file had about 30 duplicates for some reason. This means there were multiple entries in the database, but for some reason they all contained exactly the same information.
First things first: I got the information in a CSV file for comparison. I pulled this into Excel. Thankfully there was a unique ID between the two files that matched, so I was able to quickly deduplicate the first file and get the number of entries to match. Then I sorted the two files in the same order.
Here are the formulas I used for comparisons:
Compare cells on two different sheets and put out Yes if they match, No if they do not:
Compare the first 5 characters of two cells instead of the whole contents. This is how I solved the whether or not addresses “matched” even if they have slight issues (APT vs #, ST vs St., S vs South). This won’t work for every situation, but I spot checked the cells and didn’t find any issues in my data set.
Counting the number of NOs in a particular column:
Count the number of YESes or NOs in a particular row. This will give you a good indicator of whether or not a row matches overall: