Ideas for solution to join multi-column names

Hello,

I wonder if anyone can help me with a workflow solution. I have two sets of data that I need to join by customer names, but in one set of data the names are not consistent.

The first set has first name and last name, one in each column. The second set has the full name in the first column, starting with the last name… sometimes with commas and sometimes without.

Some of the time people will use middle names, some of the time the main contact will appear under contact 2 instead of contact 1, while other times two contacts will be shown under contact 1.

There are of course duplicates of common names (e.g. two different ‘Smiths’ and two different ‘Johns’).

I split cells on set 2 by space after using the clean string node to remove punctuation and special characters, but that presented new problems

I’ve added some dummy data below to show what this looks like, of course not all of the data will necessarily connect - so I need to do a full join showing both outers, too. But much of it will connect if I knew how.

Any help with what to do here, would be very much appreciated.

Set1:

Column 1 Column 2 Column 3 Column 4
First Name Last Name First Name2 Last Name 2
John Smith
Dave Smith
John Jones
Robert Brown Jenny Brown
James King
Bertrand Russell Sarah Russell
Deborah Franklin Benjamin Franklin

Set2:

Column 1 Column 2 Column 3 Column 4
Contact 1 Contact 2
Smith, John
Smith, Dave
Jones John
Brown Campbell Robert
Foley-King, James
Russell Bertrand / Sarah
Franklin Benjamin

Set2 (split):

Column 1 Column 2 Column 3 Column 4 E
Contact 1 Contact1a Contact1b Contact1c
Smith John
Smith Dave
Jones John
Brown Campbell Robert Campbell Robert
King Foley James
Russell Bertrand / Sarah Bertrand Bertrand Sarah
Franklin Benjamin

option 1) go back to the data owner and tell them to get their ■■■■ sorted.

option 2) concat row splitters and trim down your data set to have finally a single rule per case covered. seeing those examples, i would first make everything uppercase and trim leading and trailing whitespaces because the dataset sucks and likely has more hidden surprises.

then you can start a rule based approach and nest it, e.g. splitting those that have only input in column 1, then split those that have a comma in the upper path, and likely split those that have input in the first two columns for the other path.

in the end, you will likely still have data that needs interpretation to figure out which is the first and which is the surname. A dict can help here but good luck figuring out if someones name is Paul Michael or Michael Paul

1 Like

Hi @BSLsteve below are hints/solution in the simple programing methodology not in Kinme’s.

1st step - Remove duplicate rows from the 1st set on the “First Name” and “Last Name” Basis.

2nd step - Then Break 2nd set column1 in to “First Name” and “Last Name” column and clean both the columns with the help of whichever node you like ( mainly “String Manipulation”).

3rd step - Join both the Data Sets on the “First Name” and “Last Name” both column basis. Keep in mind you must join both the data set using both columns.

Thanks, on step 2 - how do I get around that some have middle names? When I tried this, it split them into three columns - sometimes column 2 had a first name, but sometimes column 2 would be a middle name and column 3 had the first name

Hi @BSLsteve use “Cell Splitter Node” and specify Delimiter space or comma.

Try using a Column Aggegator node to count the number of columns in each row. Use an Expression node to rejoin either the 1st and 2nd Columns or 1st and 3rd depending on whether there are 2 or 3 columns. Then resplit the result.

@rfeigel Thank you, this sounds like what I’m in need of. Would you be so kind as to elaborate a little?

Try this workflow. It cleans and organizes the names from both sets and joins them where possible. Its pretty hardwired to the data you provided and may not work if the data format changes.

2 Likes

@BSLsteve Have you looked at the workflow I posted? Thoughts?

Ah, sorry, I did not see this - I will get it now, thank you

It works perfectly, thank you very much @rfeigel :slight_smile:

I have to say, this is such a kind, helpful community. I have asked a few questions on here, being a novice and all - and each time it has been met with solutions where people have gone above and beyond to help. It says a lot, I think.

You’re welcome. Glad I could help.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.