Track similar words in a specific column and use the most frequent shared word for each row entry


I am trying to track similar words in a specific column and use the most frequently shared word for each row entry. In the picture below, you can see the specific column, called “applicants”.

Can you recommend a text processing method to filter terms more precisely?

I do not want any spaces, commas, or “[”, “]” characters in my column.

As an example, if I have the terms “AUDI AG [DE]”, “AUDI AG”, “AUDI NSU AUTO UNION AG”, …

I only want to keep AUDI for each of these row entries.

There are other cases where the values differ, so I cannot proceed by just saying “write the first term of the row entry”. That would only fit AUDI and a few others.

Can you recommend a logic where I can track the most frequently used terms and then proceed with the replacement, for example writing “AUDI” into all entries that contain AUDI as a term?

I know I have created a similar topic on this before, but I think I need some other logic besides just saying “take the first two terms of each row” within the column.

Thank you very much for your help!
VW_Patentdatenbank_2.xlsx (934.8 KB)

BR Bastian

Hi @8bastian8, it seems the picture you are referring to was not included.

Regarding your example:

What’s the logic that makes AUDI the chosen word as opposed to AG? AG repeats as many times as AUDI.
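To illustrate the tie, here is a minimal Python sketch (outside KNIME) that counts in how many of the three example entries each token appears. The `tokenize` helper and the `applicants` list are assumptions made for illustration; only the three strings come from the question.

```python
import re
from collections import Counter

# The three example entries from the question.
applicants = ["AUDI AG [DE]", "AUDI AG", "AUDI NSU AUTO UNION AG"]

def tokenize(entry):
    """Hypothetical helper: drop bracketed codes like "[DE]" and split into tokens."""
    without_codes = re.sub(r"\[[^\]]*\]", " ", entry)
    # Split on anything that is not a letter or digit; discard empty pieces.
    return [t for t in re.split(r"[^A-Za-z0-9]+", without_codes.upper()) if t]

# Count each token at most once per row, so the result is
# "number of rows that contain this token".
counts = Counter(tok for entry in applicants for tok in set(tokenize(entry)))

print(counts["AUDI"], counts["AG"])
```

Both AUDI and AG appear in all three rows, so frequency alone cannot pick AUDI over AG; some extra rule is needed.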


Hello @bruno29a ,

thanks for your answer!

The logic is to count the frequencies of known industry companies (terms) in the applicants column.

You are right that AUDI would be counted as often as AG, but I want to assign the column specific values. These values could be AUDI, VW, PORSCHE, SCANIA, SEAT, SKODA, etc. So I would like one term to replace every variant within the same word family, as in the AUDI example.
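The idea of a fixed list of known values can be sketched in plain Python as follows. This is only an illustration under assumptions, not a KNIME node: `KNOWN_BRANDS` and `canonical_applicant` are hypothetical names, and the list contains just the values mentioned above.

```python
import re

# Known target values taken from the post; extend as needed.
KNOWN_BRANDS = ["AUDI", "VW", "PORSCHE", "SCANIA", "SEAT", "SKODA"]

def canonical_applicant(entry, brands=KNOWN_BRANDS):
    """Return the first known brand that occurs as a whole token, else None."""
    # Splitting on non-alphanumerics also removes spaces, commas and brackets.
    tokens = set(re.split(r"[^A-Z0-9]+", entry.upper()))
    for brand in brands:
        if brand in tokens:
            return brand
    return None  # no known brand found; the entry needs manual review

for entry in ["AUDI AG [DE]", "AUDI AG", "AUDI NSU AUTO UNION AG"]:
    print(entry, "->", canonical_applicant(entry))
```

Matching whole tokens rather than substrings avoids false hits, for example SEAT inside a longer word. Entries that spell a company differently from the list would return `None` and would need an additional alias table.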

Is there a module in KNIME that can cluster the terms more precisely?

The Excel file is attached.

Many thanks, I hope the explanation is clearer.

Best regards

Among the text processing nodes there is an N Chars Filter node. Assuming you only want to take words with at least, say, 4 letters into account, you could apply this node in your workflow.
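As a rough illustration of what such a length filter does, here is a small Python sketch. It is not the KNIME node itself; `filter_short_terms` and its `min_chars` parameter are hypothetical names, and the threshold of 4 is the example value from above.

```python
def filter_short_terms(terms, min_chars=4):
    """Keep only the terms that have at least `min_chars` characters."""
    return [t for t in terms if len(t) >= min_chars]

terms = ["AUDI", "AG", "NSU", "AUTO", "UNION"]
print(filter_short_terms(terms))  # AG and NSU are dropped
```

One caveat: with a threshold of 4, a short company name such as VW would be filtered out as well, so the threshold may need to be lower, or short names may need to be handled separately.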

