Help with text processing

Hi there! I have recently started working with text processing.
I would like some guidance on something I no longer know how to approach.
I have a lot of data, including a description column to which I am applying text processing.
I already have the keywords for this data, but I would like to pivot them: keep all my rows and add one column per keyword that holds the number of times the keyword occurs in that row's description.
For example, I have columns for event number and description, and I want to add the keywords “fell” and “dangerous” as pivot columns. If “fell” appears twice in a description, its column should show 2, like this:

Event number | Description | Fell | Dangerous
0002 | “He fell down the stairs. Earlier, he fell from a very dangerous place” | 2 | 1

I would be very grateful for any help.

Hi,
Do you have the KNIME Textprocessing extension installed already? If yes, use a Strings To Document node to convert your description to a document. Then you can use the Bag Of Words Creator node to “flatten” the document. Now use Term To String to convert the term column to a string column and then GroupBy document and word and use the aggregation function “Count” on some other column (if there is none, create a dummy column with the Constant Value Column).
Kind regards,
Alexander
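
To make the counting idea above concrete outside of KNIME, here is a minimal Python sketch. It is not KNIME code and does not reproduce any node's exact behavior: the `events` data, the `keywords` list, and the `bag_of_words` helper are all hypothetical names chosen for illustration, and the regular expression is only a crude stand-in for a real tokenizer.

```python
import re
from collections import Counter

# Hypothetical sample data mirroring the example in the question.
events = {
    "0002": "He fell down the stairs. Earlier, he fell from a very dangerous place",
}
keywords = ["fell", "dangerous"]  # keywords to count, in lowercase


def bag_of_words(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


# Build one (event, word, count) row per keyword, which is the result of
# grouping the tokens by document and word and counting the rows per group.
long_counts = []
for event_id, description in events.items():
    counts = Counter(bag_of_words(description))
    for word in keywords:
        long_counts.append((event_id, word, counts[word]))

print(long_counts)
```

Each tuple is one row of a long-format table; pivoting on the word column would give one column per keyword, as in the table requested in the question.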

Hello Alexander. I already had all of that, but to get closer to what I wanted I used the Document Vector node. The problem is that its output is binary: 1 if the word is present and 0 if it is not. I want the number of times the word occurs in the text instead, so a word that occurs twice should give 2, not 1.

In that case, you can add a node that calculates term frequency ahead of the Document Vector node; it sounds like you want the TF node. Then uncheck the Bitvector option in the Document Vector node and select your frequency measure in the dropdown.
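
The difference between a bit vector and a frequency-valued vector can be sketched in a few lines of Python. This is only an illustration of the concept, not how the KNIME nodes are implemented: `tokenize`, `document_vector`, and the `as_bitvector` flag are hypothetical names, and the sketch uses absolute counts (a relative frequency would divide each count by the number of tokens in the document).

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def document_vector(text, vocabulary, as_bitvector=True):
    """Return one value per vocabulary word: 0/1 flags or absolute counts."""
    counts = Counter(tokenize(text))
    if as_bitvector:
        # Presence only: any number of occurrences collapses to 1.
        return [1 if counts[word] > 0 else 0 for word in vocabulary]
    # Absolute term frequency: how many times each word occurs.
    return [counts[word] for word in vocabulary]


description = "He fell down the stairs. Earlier, he fell from a very dangerous place"
vocabulary = ["fell", "dangerous"]

bit_row = document_vector(description, vocabulary, as_bitvector=True)
count_row = document_vector(description, vocabulary, as_bitvector=False)
print(bit_row, count_row)
```

With the example description, the bit vector records only that both words are present, while the count vector keeps the two occurrences of "fell".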

Thanks, @ScottF!
It worked for me.
