Flag similar rows

Hello everyone,

I have a rather unclean dataset in which the same person can appear on several rows, with different IDs and differently spelt names. Is there a way KNIME can investigate these and flag the rows that are potentially duplicates? For example, “Julie Smith”, “JuIie Smith” (capital “i”, not an L!) and “Julie Jane Smith” may all appear in the same dataset. I have tried a merged key built from a short forename (first 3 characters) plus the surname to account for longer and middle names, but that still fails for names with swapped-out letters.

I feel like KNIME probably has some data cleaning tools to help with this, but I am not familiar with them. Can anyone point me in the right direction? Thanks in advance!

Hello @JWebb,

what I’d do is the following.

  • Turn all names into lowercase with the first letter capitalised (JWebb → Jwebb; lelloba → Lelloba)
  • Calculate the similarity between names with a string distance, using something similar to this workflow: (string distance) matching problem – KNIME Hub
  • If the distance is low or very low (the threshold is up to you), treat the pair as a match; otherwise the names are merely similar
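
Outside KNIME, the three steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not the linked workflow: the function names, the choice of Levenshtein distance and the default threshold of 2 are all assumptions. Note that each name is compared with every other row, never with itself, which is what avoids a distance of 0 everywhere.

```python
from itertools import combinations


def normalise(name: str) -> str:
    """Lowercase each word, capitalise its first letter, collapse spaces."""
    return " ".join(part.capitalize() for part in name.split())


def levenshtein(a: str, b: str) -> int:
    """Number of single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def flag_similar(names, max_distance=2):
    """Return (name_a, name_b, distance) for every pair of different rows
    whose normalised names are within max_distance edits (assumed threshold)."""
    cleaned = [normalise(n) for n in names]
    flagged = []
    for (i, a), (j, b) in combinations(enumerate(cleaned), 2):
        distance = levenshtein(a, b)
        if distance <= max_distance:
            flagged.append((names[i], names[j], distance))
    return flagged


if __name__ == "__main__":
    rows = ["Julie Smith", "JuIie Smith", "Julie Jane Smith", "Bob Jones"]
    for pair in flag_similar(rows):
        print(pair)
```

With this threshold “Julie Smith” and “JuIie Smith” are flagged, but “Julie Jane Smith” is five edits away from “Julie Smith”, so middle names would need a higher threshold or a separate rule.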

Does it make sense to you?

RB

Thanks for this. This looks like exactly what I need, but I cannot get the workflow to download. Any tips on the kinds of settings to use? I have not used these nodes before! I am getting a distance of 0 because NameColumn obviously matches NameColumnDup perfectly! Thanks

Can you share a workflow? If the data is sensitive, make a small dataset with the Table Creator node :slight_smile:

I might be a while because I will need to create some dummy data, but yes I can have a go!

I have just realised I actually don’t know how to share a workflow either! I have one, but I have no idea how to put it here. My apologies. Any advice on what to put as the “column selection” on the string distance node and the representative column in the similarity node?

This should help :slight_smile:

Dummy Data.knwf (10.1 KB)

Sorry, that was actually really obvious and simple! Here you go :slight_smile:

Try this:

RB

Every time I try to drag this into KNIME, I get this message:

WARN HubURIImporter Hub request failed
WARN ExplorerURIDropUtil Object at URI ‘Dummy Data – KNIME Hub’ not found

Same thing as the first one. I’m not sure what is causing that so I will have to try and resolve that one first before I can investigate your suggestions. Thanks though! Hopefully, I can get it working soon.

Try to download it manually and then open it in KNIME.

@JWebb here is an approach for grouping addresses that are similar. Some additional aspects are also discussed there.

Thanks, I finally got it. It was to do with a strange proxy setting when working in the office. Now I am back home it worked first time! Thanks for the help; I can see how it works and I will try and use it in my real data to flag the duplicates :slight_smile:

So nearly there! I assume it is something to do with the distance calculation method, but I am having trouble getting “-ski”/“-sky” names flagged as closely related. I assume the algorithms used do not treat the i and y as similar sounds, as they are in actual speech. Is there any way to account for that? Otherwise, it’s working really well!

Hello JWebb,

maybe you can replace every -ski with -sky up front, before calculating any distances, so that you know this problem cannot come up.

Take this example. The Italian alphabet has no “ñ”, so a Spanish name in the list might be written with or without it, depending on who typed the name. To be sure of identifying the name, I’d convert every “ñ” to a plain “n”.
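
As a rough sketch of that kind of pre-cleaning, here is how it could look in Python rather than in KNIME nodes. The helper names are made up for this example, and both rules (drop accents, rewrite a word-final “ski” as “sky”) are assumptions you would adapt to your own data. Here everything is lowercased as well, so case differences cannot affect the comparison.

```python
import re
import unicodedata


def strip_accents(text: str) -> str:
    """Remove diacritics, e.g. 'ñ' becomes 'n'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def canonical_name(name: str) -> str:
    """Build a comparison key: no accents, lowercase, '-ski' spelt '-sky'."""
    text = strip_accents(name).lower()
    # Assumed rule: only a 'ski' at the end of a word is rewritten.
    text = re.sub(r"ski\b", "sky", text)
    return " ".join(text.split())


if __name__ == "__main__":
    for raw in ["Kowalski", "Kowalsky", "Peña", "Pena"]:
        print(raw, "->", canonical_name(raw))
```

The distance is then calculated on the cleaned key instead of the raw name, while the original spelling stays in its own column.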

Have a nice day,
RB
