Fuzzy search on a set of names to find similar names by comparing each name with all others

Hello KNIME Experts!

I am fairly new to KNIME and have been working with the tool for the past 2 months.
I have a question about fuzzy matching (text processing).
I have a set of 600 names in a single column (first name, optional middle name, and last name).
I want to get all the sets of similar strings from this column. I did find a few workflows that do this, but they compare a name in one column with a name in a different column.
I have not found any discussion about getting sets of similar words from a single column.
The fuzzy search algorithm should preferably be one that gives a percentage of similarity, like cosine similarity.
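
To make "percentage of similarity" concrete, here is a minimal sketch of cosine similarity between two names in plain Python (not a KNIME node). Splitting each name into character bigrams and lowercasing it first are assumptions made for illustration, as are the function names `char_ngrams` and `cosine_similarity`; other tokenisations would give different scores.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Count overlapping character n-grams of a lowercased string."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a, b, n=2):
    """Cosine of the angle between the n-gram count vectors of a and b."""
    va, vb = char_ngrams(a, n), char_ngrams(b, n)
    # Only n-grams present in both names contribute to the dot product.
    dot = sum(va[g] * vb[g] for g in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a string shorter than n has no n-grams
    return dot / (norm_a * norm_b)


# Multiply by 100 to read the score as a percentage.
score_pct = 100 * cosine_similarity("John Smith", "Jon Smith")
```

The score lies between 0 (no shared bigrams) and 1 (identical bigram counts), so it maps directly onto a 0 to 100% scale.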

I do not have a separate list of words to compare the name field against. I have a single column with people's full names, and I need to get the sets of similar names from within it. So each name should be compared to all the other 599 names, the names that are similar should be marked as similar, and each match should then be reported with its percentage of similarity.
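
As a rough illustration of the all-pairs comparison described above, here is a minimal sketch in plain Python (not a KNIME workflow). It uses the standard library's `difflib.SequenceMatcher` as the similarity measure, which is an assumption and only one of many possible choices; the function name `similar_pairs`, the 0.8 threshold, and the sample names are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar_pairs(names, threshold=0.8):
    """Compare every name with every other name once.

    Returns (name_a, name_b, percent) tuples for pairs whose similarity
    is at or above the threshold, best matches first.
    """
    pairs = []
    # combinations() yields each unordered pair exactly once,
    # so 600 names give 600 * 599 / 2 = 179,700 comparisons.
    for a, b in combinations(names, 2):
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= threshold:
            pairs.append((a, b, round(score * 100, 1)))
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Hypothetical sample data, not taken from the attached file.
names = ["John Smith", "Jon Smith", "Mary Jones", "John Smyth"]
matches = similar_pairs(names)
```

Comparing each unordered pair once is enough because the score is treated as symmetric; for 600 names this is small enough to run as a simple loop.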

Kindly help me resolve this, or direct me to any document or discussion that covers it.

Thanks in advance!
Mahima

You can try HansS's solution here

Hi @mahima_goyal, can you provide a sample data set so that we can create an example for you?

Thank you

Hey @natanaeldgsantos
Thanks for the quick reply!
Sure, I am sending some sample data here (a very small data set).
I am attaching a screenshot from Excel along with the Excel file, which describes what I need as output:


Book1.xlsx (9.5 KB).

I hope this clarifies the problem I am trying to solve.

Thanks in advance!
Mahima.

@izaychik63 Thanks a lot for this discussion link. I will go through it to see if it resolves my problem 🙂

Hi @mahima_goyal,

I have attached an example workflow.

This workflow compares each record with all the others, calculating the distance or similarity between them.

It always returns all matches with a similarity above 50%.

Each of the algorithms is set up to take case sensitivity into account when performing the calculations.

I use a flow variable to force the comparison of each record with all others in the column, by assigning the number of records in the table to the “Neighbor Count” field of the Similarity Search node.
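
The idea of setting the neighbor count to the number of records and keeping only matches above 50% can be sketched outside KNIME as follows. This is not the workflow's implementation: the post does not say which distance the workflow uses, so normalised Levenshtein distance is an assumption here, and the names `levenshtein`, `similarity`, and `neighbors` are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance (insert, delete, substitute), case sensitive."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Turn the distance into a 0..1 similarity by dividing by the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def neighbors(names, neighbor_count=None, min_similarity=0.5):
    """For each name, list the other names with similarity above the cutoff."""
    # Defaulting to len(names) means no candidate is cut off by the count.
    k = len(names) if neighbor_count is None else neighbor_count
    result = []
    for i, name in enumerate(names):
        scored = [(other, similarity(name, other))
                  for j, other in enumerate(names) if j != i]
        scored = [(o, s) for o, s in scored if s > min_similarity]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        result.append((name, scored[:k]))
    return result


# Hypothetical sample data.
table = neighbors(["John Smith", "Jon Smith", "Mary Jones"])
```

With the neighbor count equal to the table size, the 50% cutoff alone decides which matches are kept.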

Please check whether the example is useful for your needs.

Thanks
new Fuzzy.knwf (79.7 KB)

@natanaeldgsantos, interesting solution. On the other hand, the dimension extraction looks too complex. Please see the alternative below

@natanaeldgsantos Thanks a lot for the solution, this helps too. The solution given by @izaychik63 is also useful for this and seems simpler to me 🙂
Thanks a lot to all you experts!!
