Hey,
I’m new to more extensive text processing in KNIME, so I’m not really sure where to start. I have the following issue:
I have a column that contains product names; however, these names might be misspelled or sometimes just written slightly differently.
For my task I am looking for a few specific products, but I also have to include all possible misspelled variants without knowing in advance how they are misspelled.
Example:
I am looking for Coca Cola and I might have the following entries:
C0caCola
Vanilla Cake
CocaCola 1l
Coca Cola
Fanta
Freeway Cola
Bourbon
Cocoa Powder
How can I filter for the products I want?
I’ve tried to do it with regex, but I think my patterns still miss quite a few rows.
I’ve found a workflow that uses the String Distance Node and the Hierarchical Clustering Node; however, I’m not quite sure how to use them to get what I want.
I also tried the String Matcher Node, but the biggest issue with it is that there is no option for a wildcard. Rows like “Coca Cola 1,5l” would therefore get a higher distance score and would most probably be filtered out.
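To make the wildcard problem concrete, here is a minimal Python sketch (plain Python, not a KNIME node configuration). It compares a whole-string similarity with a "partial" similarity that slides a window of the target's length over each entry, so a trailing "1,5l" no longer lowers the score. The function names, the normalization step, and the 0.8 threshold are assumptions chosen for illustration; the similarity measure is `difflib.SequenceMatcher` from the standard library, not the distance used by the String Matcher Node.

```python
from difflib import SequenceMatcher


def normalize(text):
    # Lowercase and keep only letters and digits, so "Coca Cola" == "CocaCola".
    return "".join(ch for ch in text.lower() if ch.isalnum())


def similarity(a, b):
    # Similarity in [0, 1]; 1.0 means the two strings are identical.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def whole_score(target, entry):
    # Compares the full entry with the target: extra text lowers the score.
    return similarity(normalize(target), normalize(entry))


def partial_score(target, entry):
    # Best similarity between the target and any window of the entry
    # that has the same length as the target (a crude "wildcard").
    t, e = normalize(target), normalize(entry)
    if len(e) <= len(t):
        return similarity(t, e)
    return max(
        similarity(t, e[start:start + len(t)])
        for start in range(len(e) - len(t) + 1)
    )


def filter_products(entries, target, threshold=0.8):
    # Hypothetical threshold; it would need tuning on real data.
    return [e for e in entries if partial_score(target, e) >= threshold]


entries = [
    "C0caCola", "Vanilla Cake", "CocaCola 1l", "Coca Cola",
    "Fanta", "Freeway Cola", "Bourbon", "Cocoa Powder",
]
matches = filter_products(entries, "Coca Cola")
print(matches)
```

With the example entries above, this keeps "C0caCola", "CocaCola 1l" and "Coca Cola" and drops the rest, including "Freeway Cola" and "Cocoa Powder". A pure-Python loop like this would likely be slow on millions of rows, so scoring only the distinct product names first would be a sensible precaution.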
One more detail: my current dataset has about 5,600,000 rows, so the solution should be able to handle large datasets.
All help is appreciated.
Thanks in advance!