Bag of Words includes punctuation and excluded words

sccardais · December 4, 2015, 8:29pm

I am new to KNIME. I want to use KNIME to analyze Software Support Tickets containing unstructured text to identify patterns and trends.

To get started, I created a new workflow following instructions provided by Kilian. (Thank you, Kilian.)

I am getting unexpected results from the workflow shown below, which starts from a .csv file.

• File Reader
• Strings to Documents
• Punctuation erasure
• Stop word filter
• N chars filter
• Snowball Stemmer
• Case converter
• Bag of words creator
• TF
• Sorter
• Count Sorted

The results from the "Count Sorted" operator include words and punctuation marks that I would expect to be omitted, as shown below.

Should these words and punctuation marks be in the final results?

Should each word be followed by the double square brackets [ ] ?

Is it possible to edit the Stop Words file? If so, how?

Thanks for any and all help.

Scott C.

davidatverizon · December 8, 2015, 4:05pm

Hi Scott,

I believe that, in the output of the Bag of Words node, the first row with the [] is tuple notation. It is supposed to be there and you can't change it; it's a visual representation of the tuple.

Yes, you can change the Stop Words file: open the Configure screen, go to Filter Options, uncheck the box "Use built-in list", and then add your own list file. I am adding a screenshot.
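To make the idea of a custom list concrete, here is a rough Python sketch of what filtering against your own stop word list amounts to. It is not KNIME code. It assumes the list file holds one word per line and that matching ignores case, and the function names and sample words are made up for illustration.

```python
import io


def load_stop_words(lines):
    """Collect stop words from an iterable of lines (assumed: one word per line)."""
    return {line.strip().lower() for line in lines if line.strip()}


def filter_stop_words(tokens, stop_words):
    """Drop every token whose lowercase form is in the stop word set."""
    return [t for t in tokens if t.lower() not in stop_words]


# Stand-in for a custom list file; a real file could be opened with open(path).
custom_list = io.StringIO("the\nis\nplease\nticket\n")
stop_words = load_stop_words(custom_list)

tokens = ["The", "ticket", "is", "about", "login", "failure"]
filtered = filter_stop_words(tokens, stop_words)
print(filtered)  # ['about', 'login', 'failure']
```

Domain words that appear in nearly every support ticket (here "ticket" and "please") are typical candidates for such a list.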

Regards

David

stop_word_filter.jpg

kilian.thiel · December 9, 2015, 10:46am

Hi Scott,

David is right: the bag of words shows a term column and a document column. The term column shows the words and their tags. The tags are in brackets; if no tags are assigned to a term, the brackets are empty. To see only the words of the terms, you can transform the terms into strings using the Term to String node.
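As a loose illustration of why the brackets appear, the Python sketch below models a term as a word plus a list of tags. It is not how KNIME implements terms: the class, the tag name, and the exact display format are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """Hypothetical model of a term: a word plus zero or more tags."""
    text: str
    tags: list = field(default_factory=list)

    def __str__(self):
        # Display the word followed by its tags in brackets (empty if untagged).
        return f"{self.text}[{','.join(self.tags)}]"


def term_to_string(term):
    """Keep only the word, dropping the tag brackets."""
    return term.text


untagged = Term("login")
tagged = Term("knime", ["ORGANIZATION"])  # tag name is illustrative only

print(untagged)                  # login[]
print(tagged)                    # knime[ORGANIZATION]
print(term_to_string(untagged))  # login
```

The empty brackets therefore carry no information of their own; they only show that no tagger has assigned a tag to that term.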

Cheers, Kilian
