I have a rather unclean dataset in which some people appear in several rows, with different IDs and differently-spelt names, even though they are actually the same person! Is there a way KNIME can investigate these and flag the ones that are potentially duplicates? For example, there may be someone called “Julie Smith”, “JuIie Smith” (capital “i”, not an L!) or “Julie Jane Smith” in the same dataset. I have tried matching on a merged field made of a short forename (first 3 characters) plus the surname, to account for longer and middle names, but that still does not catch names with swapped-out letters.
I feel like KNIME probably has some data cleaning tools to help with this, but I am not familiar with them. Can anyone point me in the right direction? Thanks in advance!
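To make the idea of "flag names that are nearly the same" concrete, here is a minimal sketch in Python (which could, for instance, be run in a Python script node). It is not how any particular KNIME node works internally; it simply assumes a plain Levenshtein edit distance scaled to a 0..1 similarity. The function names, the sample rows and the 0.6 threshold are all made up for illustration, and the threshold would need tuning on real data.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: counts insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance scaled to 0..1, where 1.0 means identical (case-insensitive)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a.lower(), b.lower()) / longest


def flag_duplicates(records, threshold=0.6):
    """Return (id_a, id_b, score) for each pair of different rows at or above threshold."""
    flagged = []
    for (id_a, name_a), (id_b, name_b) in combinations(records, 2):
        score = similarity(name_a, name_b)
        if score >= threshold:
            flagged.append((id_a, id_b, round(score, 3)))
    return flagged


# Hypothetical sample rows: (ID, name)
records = [
    (1, "Julie Smith"),
    (2, "JuIie Smith"),       # capital "I" instead of "l"
    (3, "Julie Jane Smith"),  # extra middle name
    (4, "Robert Brown"),
]
print(flag_duplicates(records))
```

With this sample, the three “Julie” rows are flagged against each other and “Robert Brown” is not. Comparing every pair gets slow on large tables, so on real data it may be worth comparing only rows that share something cheap, such as the first letter of the surname.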
Thanks for this. This looks like exactly what I need, but I cannot get the workflow to download. Any tips on the kinds of settings to use? I have not used these nodes before! I am getting a distance of 0 because NameColumn is obviously matching 100% perfectly with NameColumnDup! Thanks
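On the distance of 0: that is what you would expect whenever a name is compared with its own copy. A rough sketch of the situation, assuming NameColumnDup is simply a duplicate of NameColumn and every row is compared against every row. The `distance` helper here is a stand-in built on Python's `difflib`, not the measure the KNIME nodes use, and the sample rows are invented.

```python
from difflib import SequenceMatcher
from itertools import product


def distance(a: str, b: str) -> float:
    """Stand-in distance: 0.0 for identical strings, closer to 1.0 for unrelated ones."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


# Hypothetical rows: (ID, NameColumn)
rows = [(1, "Julie Smith"), (2, "JuIie Smith"), (3, "Anna Kowalski")]

# Every row against every row, including each row against itself.
all_pairs = list(product(rows, rows))

# Pairs where a row meets its own copy: the distance is always 0.
self_pairs = [(a, b) for a, b in all_pairs if a[0] == b[0]]

# Keep each pair of *different* IDs once; these are the comparisons that matter.
useful_pairs = [(a, b) for a, b in all_pairs if a[0] < b[0]]

for a, b in useful_pairs:
    print(a[1], "|", b[1], "|", round(distance(a[1], b[1]), 3))
```

So the zero itself is not a sign that anything is broken; the self-matches just need to be filtered out (for example by requiring the two IDs to differ) before looking for the smallest remaining distances.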
I have just realised I actually don’t know how to share a workflow either! I have one, but I have no idea how to put it here. My apologies. Any advice on what to put as the “column selection” on the string distance node and the representative column in the similarity node?
Every time I try and drag this to Knime, I get this message:
WARN HubURIImporter Hub request failed
WARN ExplorerURIDropUtil Object at URI ‘Dummy Data – KNIME Hub’ not found
Same thing as the first one. I’m not sure what is causing that so I will have to try and resolve that one first before I can investigate your suggestions. Thanks though! Hopefully, I can get it working soon.
Thanks, I finally got it. It was to do with a strange proxy setting when working in the office. Now that I am back home, it worked first time! Thanks for the help; I can see how it works and I will try to use it on my real data to flag the duplicates.
So nearly there! I assume it is something to do with the distance calculation method, but I am having trouble getting “-ski”/“-sky” names flagged as closely related. I assume the algorithms used are not treating the i and y as similar sounds, as they are in actual language. Is there any way to account for that? Otherwise, it’s working really well!
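One general way to make similar-sounding spellings agree is to compare a phonetic key instead of the raw text. The sketch below is a small hand-written version of American Soundex in Python. It is a swapped-in technique for illustration, not what the distance measures discussed above do, and the function name is hypothetical. Soundex keeps the first letter, turns the following consonants into digits and ignores vowels and “y”, so an i/y difference at the end of a surname disappears.

```python
# Consonant groups used by American Soundex.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Return a 4-character Soundex key, or "" if the name has no ASCII letters."""
    letters = [ch for ch in name.lower() if ch.isascii() and ch.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    last_code = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not break a run of same-coded consonants
        code = CODES.get(ch, "")  # vowels and y have no code
        if code and code != last_code:
            key += code
        last_code = code  # a vowel resets this, so a repeated code after it is kept
        if len(key) == 4:
            break
    return key.ljust(4, "0")


print(soundex("Kowalski"), soundex("Kowalsky"))
```

Both spellings give the same key, so rows could be grouped or pre-filtered on the surname key before the string distance is calculated. Soundex is quite coarse and English-oriented, so it will also group some surnames that are genuinely different; treat a shared key as a hint rather than proof.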
Maybe you could substitute all -ski with -sky from the beginning, so that you already know you won’t have this problem anymore.
Take this example: in Italian, there is no “ñ” in our alphabet. But if there is a Spanish name in the list, it might be written with or without it, depending on who wrote the name down. To be sure of identifying the name, I’d convert all “ñ” to plain “n”.
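The substitution idea can be sketched as a small normalisation step that runs before any distance is calculated. This is only an illustration in Python: the function names are hypothetical, the accent stripping relies on the standard `unicodedata` module, and the “-ski” to “-sky” rule is applied only at the end of a word, which is an assumption about where the variation occurs.

```python
import re
import unicodedata


def strip_accents(text: str) -> str:
    """Split accented letters into base letter + mark, then drop the marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalise_name(name: str) -> str:
    """Lower-case, remove accents, tidy spaces and unify the -ski/-sky ending."""
    cleaned = strip_accents(name).lower().strip()
    cleaned = re.sub(r"\s+", " ", cleaned)      # collapse repeated whitespace
    cleaned = re.sub(r"ski\b", "sky", cleaned)  # word-final -ski becomes -sky
    return cleaned


for raw in ["Muñoz", "Munoz", "Anna Kowalski", "Anna  Kowalsky"]:
    print(repr(raw), "->", repr(normalise_name(raw)))
```

It is probably safest to write the normalised value to a new column and run the comparison on that, so the original spelling stays available for checking the flagged rows.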