Flag similar rows

JWebb · July 19, 2022, 2:59pm

Hello everyone,

I have a rather unclean dataset in which various people appear in several rows with different IDs and differently-spelt names, but are actually the same person! Is there a way Knime can investigate these and flag those that are potentially duplicates? For example, there may be someone called “Julie Smith”, “JuIie Smith” (capital “i”, not an L!) or “Julie Jane Smith” in the same dataset. I have tried a merged field made of a short forename (first 3 characters) plus the surname to account for longer and middle names, but that still leaves some with swapped-out letters unmatched.

I feel like Knime probably has some data cleaning tools to help with this, but I am not familiar with them. Can anyone point me in the right direction? Thanks in advance!

lelloba · July 19, 2022, 3:09pm

Hello @JWebb,

what I’d do is the following.

1. Turn all names into lowercase with the first letter capitalised (JWebb → Jwebb; lelloba → Lelloba)
2. Calculate the similarity between names using something similar to this workflow: (string distance) matching problem – KNIME Hub
3. If the distance is low or very low (the threshold is up to you), then we have a match; otherwise the names are just similar
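For anyone who wants to see the idea outside of KNIME, the three steps above can be sketched in plain Python. This is only an illustration under assumptions: it uses a hand-written Levenshtein edit distance (not necessarily the measure the KNIME string distance node applies), the function names `normalise`, `levenshtein` and `flag_similar` are made up for this example, and the threshold of 2 is an arbitrary choice.

```python
from itertools import combinations


def normalise(name: str) -> str:
    # Step 1: collapse extra whitespace and capitalise each word
    # (all other letters become lowercase).
    return " ".join(part.capitalize() for part in name.split())


def levenshtein(a: str, b: str) -> int:
    # Step 2: classic dynamic-programming edit distance
    # (insertions, deletions and substitutions each cost 1).
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def flag_similar(rows, max_distance=2):
    # Step 3: compare every pair of distinct rows and keep the pairs
    # whose distance is at or below the (hypothetical) threshold.
    flagged = []
    for (id_a, name_a), (id_b, name_b) in combinations(rows, 2):
        distance = levenshtein(normalise(name_a), normalise(name_b))
        if distance <= max_distance:
            flagged.append((id_a, id_b, distance))
    return flagged


rows = [
    (1, "Julie Smith"),
    (2, "JuIie Smith"),       # capital "i" instead of "l"
    (3, "Julie Jane Smith"),
    (4, "Robert Brown"),
]
print(flag_similar(rows))  # [(1, 2, 1)]
```

With this threshold only the “Julie”/“JuIie” pair is flagged; “Julie Jane Smith” is 5 edits away from “Julie Smith”, so catching middle names would need a higher threshold or a different measure.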

Does it make sense to you?

RB

JWebb · July 20, 2022, 8:09am

Thanks for this. This looks like exactly what I need, but I cannot get the workflow to download. Any tips on the kinds of settings to use? I have not used these nodes before! I am getting a distance of 0 because NameColumn is obviously matching 100% perfectly with NameColumnDup! Thanks

lelloba · July 20, 2022, 8:14am

Can you share a workflow? If the data is sensitive, make a small dummy dataset with the Table Creator node.

JWebb · July 20, 2022, 10:48am

I might be a while because I will need to create some dummy data, but yes I can have a go!

JWebb · July 20, 2022, 12:54pm

I have just realised I actually don’t know how to share a workflow either! I have one, but I have no idea how to put it here. My apologies. Any advice on what to put as the “column selection” on the string distance node and the representative column in the similarity node?

lelloba · July 20, 2022, 1:07pm

This should help

JWebb · July 20, 2022, 2:08pm

Dummy Data.knwf (10.1 KB)

Sorry, that was actually really obvious and simple! Here you go

lelloba · July 20, 2022, 2:26pm

Try this:

RB

JWebb · July 20, 2022, 3:19pm

Every time I try and drag this to Knime, I get this message:

WARN HubURIImporter Hub request failed
WARN ExplorerURIDropUtil Object at URI ‘Dummy Data – KNIME Hub’ not found

Same thing as the first one. I’m not sure what is causing that so I will have to try and resolve that one first before I can investigate your suggestions. Thanks though! Hopefully, I can get it working soon.

lelloba · July 20, 2022, 3:41pm

Try to download it manually and then open it in KNIME.

mlauber71 · July 20, 2022, 8:49pm

@JWebb here is an approach to grouping addresses that are similar. Some additional aspects are also discussed.

JWebb · July 21, 2022, 7:16am

Thanks, I finally got it. It was to do with a strange proxy setting when working in the office. Now I am back home it worked first time! Thanks for the help; I can see how it works and I will try and use it in my real data to flag the duplicates

JWebb · July 21, 2022, 8:42am

So nearly there! I assume it is something to do with the distance calculation method, but I am having trouble getting “-ski”/“-sky” names flagged as being closely related. I assume the algorithms used do not treat the i and y as similar sounds, as they are in actual language. Is there any way to account for that? Otherwise, it’s working really well!

lelloba · July 22, 2022, 10:17am

Hello JWebb,

maybe you could replace all -ski with -sky right at the start, before calculating any distances, so that you know this problem cannot occur.

Take this example. The Italian alphabet has no “ñ”. But if there is a Spanish name in the list, it might be written with or without it, depending on who wrote the name. To be sure of identifying the name, I’d convert all “ñ” to plain “n”.
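As a rough sketch of that kind of pre-cleaning, here it is in plain Python rather than KNIME nodes. The function names `strip_accents` and `canonical_name` are made up for this example, and the rule of rewriting only a word-final “ski” is an assumption about what you would want. The result is meant to be used as a comparison key next to the original name, not as a replacement for it.

```python
import re
import unicodedata


def strip_accents(text: str) -> str:
    # Decompose accented letters ("ñ" becomes "n" plus a combining tilde)
    # and then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def canonical_name(name: str) -> str:
    # Build a comparison key: no accents, lowercase,
    # and a word-final "ski" rewritten as "sky".
    plain = strip_accents(name).lower()
    return re.sub(r"ski\b", "sky", plain)


print(canonical_name("Kowalski"))  # kowalsky
print(canonical_name("Kowalsky"))  # kowalsky
print(canonical_name("Muñoz"))     # munoz
```

After this step “Kowalski” and “Kowalsky” produce the same key, so the distance between them is 0 whichever distance function is used.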

Have a nice day,
RB
