I wonder if anyone can help me with a workflow solution. I have two sets of data that I need to join by customer names, but in one set of data the names are not consistent.
The first set has first name and last name, one in each column. The second set has the full name in the first column, starting with the last name… sometimes with commas and sometimes without.
Some of the time people will use middle names, some of the time the main contact will appear under contact 2 instead of contact 1, while other times two contacts will be shown under contact 1.
There are of course duplicates of common names (e.g. two different ‘Smiths’ and two different ‘Johns’).
I split cells on set 2 by space after using the clean string node to remove punctuation and special characters, but that presented new problems
I’ve added some dummy data below to show what this looks like, of course not all of the data will necessarily connect - so I need to do a full join showing both outers, too. But much of it will connect if I knew how.
Any help with what to do here, would be very much appreciated.
option 1) go back to the data owner and tell them to get their ■■■■ sorted.
option 2) concat row splitters and trim down your data set to have finally a single rule per case covered. seeing those examples, i would first make everything uppercase and trim leading and trailing whitespaces because the dataset sucks and likely has more hidden surprises.
then you can start a rule based approach and nest it, e.g. splitting those that have only input in column 1, then split those that have a comma in the upper path, and likely split those that have input in the first two columns for the other path.
in the end, you will likely still have data that needs interpretation to figure out which is the first and which is the surname. A dict can help here but good luck figuring out if someones name is Paul Michael or Michael Paul
Hi @BSLsteve below are hints/solution in the simple programing methodology not in Kinme’s.
1st step - Remove duplicate rows from the 1st set on the “First Name” and “Last Name” Basis.
2nd step - Then Break 2nd set column1 in to “First Name” and “Last Name” column and clean both the columns with the help of whichever node you like ( mainly “String Manipulation”).
3rd step - Join both the Data Sets on the “First Name” and “Last Name” both column basis. Keep in mind you must join both the data set using both columns.
Thanks, on step 2 - how do I get around that some have middle names? When I tried this, it split them into three columns - sometimes column 2 had a first name, but sometimes column 2 would be a middle name and column 3 had the first name
Try using a Column Aggegator node to count the number of columns in each row. Use an Expression node to rejoin either the 1st and 2nd Columns or 1st and 3rd depending on whether there are 2 or 3 columns. Then resplit the result.
Try this workflow. It cleans and organizes the names from both sets and joins them where possible. Its pretty hardwired to the data you provided and may not work if the data format changes.
I have to say, this is such a kind, helpful community. I have asked a few questions on here, being a novice and all - and each time it has been met with solutions where people have gone above and beyond to help. It says a lot, I think.