I’ve been searching, but haven’t found a solution to what I need (I apologize if there is an answer already).
I’ve read about fuzzy matching, string similarity, the Levenshtein algorithm, but I don’t think they apply to my problem.
I have two tables with names, and I need to compare and find the matches between them.
The problem is that the names are not exactly the same between the databases: from one I can get the full name and surname, while the other has only the first letter, or maybe a middle name included, or vice-versa.
I join the tables from a common column (“Room” in the example workflow attached), and I need to filter the rows to give me only the results with the single name match, and the “Fare” columns from both databases, so I can compare the differences.
Also, in my task, both tables have millions of rows, so it would need to be a rather efficient method to do this.
Hi,
I don’t understand what exactly the surname and what the first name is in your example. Is that also mixed? If not, I think you need to bring the names to the “lowest common denominator”, for example by always taking only the first letter of the first name and remove any middle names. Then change everything to upper or lower case and you can compare. Does that help?
Kind regards
Alexander
Thanks for the quick response and sorry for taking this long to reply.
The name pattern is like this: “Surname / First Name”.
If there is a middle name, it sometimes goes with the surname, sometimes with the first name. There is no rule for this.
I have tried joining by matching the first letter of the names. But the problem is that, with many names, there will be people with both names starting with the same letters, generating duplicates.
I have removed any special characters and left all the names in upper case.
I guess I would need a solution that breaks down the names and compares if the string from one table is contained in the other, but not necessarily with an exact match (since there might be middle names or only part of the name in the mix).
@izaychik63, thanks for the suggestion, it really helped me!
I broke the strings down and compared separately the distances for the last names and then for the first names only (removing any middle names there might be).
This helped me find the best matches.