Renaming a string in a row based on another column

Gdjesi · April 17, 2019, 8:16pm

Hello all,

i am working on a small project at school. Actually i am at one point where i have no clue how to go further. I have a column with company names where due to spelling mistakes the company name can be spelled wrong. For example the company name has 2 instead of 1 space, a letter is missing or the dot notation is wrong. In this case i have done a count on the company names and the one with the most hits is the right name.

I would like to change all the other columns that are similiar to the one with the highest count to the right name of the row with the highest counts. Attached you can see example (the green fields should be renamed in the oranged one). Maybe some of you guys have a better idea how to solve this Every hint would be a lot of help for me. I have tried different things from Manipulation until Regex, but nothing works out…

52%20PM

Thanks in advance

ScottF · April 17, 2019, 8:27pm

Hi @Gdjesi -

This workflow on fuzzy matching might get you on the right track:

https://hub.knime.com/knime/workflows/Examples/08_Other_Analytics_Types/01_Text_Processing/09_Fuzzy_String_Matching*vZLbH1jBCR6FXmhR

Gdjesi · April 17, 2019, 8:46pm

Hello ScottF,

thank you for the advice. I already had a look on it, but my problem is that i only have one table and i dont know how to get the right names out of it to get into the fuzzy matching.

I only have the information that the right company name is the one with the hightest number of counts.

ScottF · April 17, 2019, 9:01pm

I think you just need to create a lookup table based on strings with large counts. There may be a bit of nuance to this - you’ll have to arbitrarily pick some cutoff based on your dataset - and it may not work 100% for company names with small n. But you could use the Numeric Row Splitter node to make such a lookup table.

izaychik63 · April 18, 2019, 2:13pm

See attached. I assumed that correct name will be grater than 10.
Correct Company Name.knwf (22.1 KB)

Corey · April 18, 2019, 6:12pm

Another option if you’re not confident all companies will have a count above your cut off or if they might have multiple spellings above your cut off would be to use a clustering approach. If you use hierarchical clustering you can set a string distance cut off instead of a cluster number.

After everything is clustered you can group them by the most common spelling and sum of counts. Attached an example below. You can see how close the spellings are in string distance to each other in the hierarchical cluster viewer.

Loving how many different ways we can tackle this problem!

Typo Correction.knwf (14.2 KB)

Gdjesi · April 19, 2019, 8:43am

Thank you all so much, you helped me more than i could expected!!

anon33357744 · April 23, 2019, 11:38am

Hi @Corey,

is there a way to do the same for big data sets, because in the node description there is the Information that it works for small data sets.

Thanks in Advance,

Canan