Text Processing - Finding a set of words within a cell

Hey,
I’m new to more extensive text processing in KNIME, so I’m not really sure where to start. I have the following issue:

I have a column that contains product names. However, these names might be misspelled or sometimes just written slightly differently.
For my task I am looking for a few specific products, but I also have to include all possible misspelled variants without knowing in advance how they are misspelled.

Example:

I am looking for Coca Cola and I might have the following entries:
C0caCola
Vanilla Cake
CocaCola 1l
Coca Cola
Fanta
Freeway Cola
Bourbon
Cocoa Powder

How would I be able to filter for my wanted products?
I’ve tried to do it with regex, but I think I still miss quite a few rows that way.
I’ve found a workflow that uses the String Distance node and the Hierarchical Clustering node, but I’m not quite sure how to use them to get what I want.
I also tried the String Matcher node, but the biggest issue with it is that there is no option for a wildcard. So rows like “Coca Cola 1,5l” would get a higher distance score and would most probably be filtered out.

One more detail: my current dataset has about 5,600,000 rows, so the solution should be able to handle larger datasets.

All help is appreciated :slight_smile:
Thanks in advance!

I think I might have something for you :slight_smile:

@Lena_Schmid welcome to the KNIME forum. I have set up a sample workflow where a task of dealing with texts in chunks is being fed to a local LLM model. Maybe you can modify the input and the prompt and see if this can help.

Or maybe you could provide a complete example with a good amount of data.
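
If you want a simple non-LLM baseline to compare against, here is a minimal sketch in plain Python (standard library only) that could be adapted, for example, to a Python Script node. It is not taken from the sample workflow. The idea is to normalize each entry (lowercase, strip spaces and punctuation, map a look-alike character such as 0 to o) and then slide a window the length of the target over the entry, so that a trailing “1,5l” does not lower the score. The function names, the look-alike table, and the 0.8 threshold are all assumptions for illustration and would need tuning on real data.

```python
import re
from difflib import SequenceMatcher
from functools import lru_cache

# Hypothetical look-alike substitutions; extend this for your own data.
LOOKALIKES = str.maketrans({"0": "o"})


def normalize(text: str) -> str:
    """Lowercase, replace look-alike characters, keep only letters and digits."""
    text = text.lower().translate(LOOKALIKES)
    return re.sub(r"[^a-z0-9]", "", text)


@lru_cache(maxsize=None)  # repeated product strings are scored only once
def best_window_score(entry: str, target: str) -> float:
    """Best similarity between the target and any same-length slice of the entry."""
    needle, hay = normalize(target), normalize(entry)
    if not needle or not hay:
        return 0.0
    width = len(needle)
    if len(hay) <= width:
        # Entry is no longer than the target: compare the whole strings.
        return SequenceMatcher(None, needle, hay).ratio()
    # Slide a window over the entry so extra text ("1l", "1,5l") is ignored.
    return max(
        SequenceMatcher(None, needle, hay[i:i + width]).ratio()
        for i in range(len(hay) - width + 1)
    )


def filter_products(entries, target, threshold=0.8):
    """Keep entries whose best window score reaches the (assumed) threshold."""
    return [e for e in entries if best_window_score(e, target) >= threshold]


entries = [
    "C0caCola", "Vanilla Cake", "CocaCola 1l", "Coca Cola",
    "Fanta", "Freeway Cola", "Bourbon", "Cocoa Powder",
]
matches = filter_products(entries, "Coca Cola")
print(matches)
```

On the example list from the question this should keep “C0caCola”, “CocaCola 1l” and “Coca Cola” and drop the rest, including “Freeway Cola” and “Cocoa Powder”. With several million rows it is probably worth scoring only the distinct values of the column and joining the result back, since pure Python string comparison is slow; the cache above only helps when the same string repeats.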