Bag of Words includes punctuation and excluded words

I am new to KNIME. I want to use KNIME to analyze Software Support Tickets containing unstructured text to identify patterns and trends.

To get started, I created a new workflow following instructions provided by Kilian. (Thank you, Kilian.)

I am getting unexpected results from the workflow shown below, which starts from a .csv file.

    •  File Reader  •  Strings to Documents  •  Punctuation erasure  •  Stop word filter  •  N chars filter  •  Snowball Stemmer  •  Case converter  •  Bag of words creator  •  TF  •  Sorter  •  Count Sorted

The results from the "Count Sorted" operator include words and punctuation marks that I would expect to be omitted, as shown below.

Should these words and punctuation marks be in the final results?

Should each word be followed by a pair of square brackets [ ]?

Is it possible to edit the Stop Words file? If so, how? 

Thanks for any and all help.

Scott C.

Hi Scott,

I believe the [] shown after each word in the first column of the Bag of Words node output is tuple notation. It is supposed to be there and you can't change it. It's a visual representation of the tuple.

Yes, you can change the Stop Words file. Open the Configure screen, go to Filter Options, uncheck the box "Use built-in list", and then add your own list file. I am adding a screenshot.
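In case it helps to see what a custom list does, here is a minimal Python sketch of stop word filtering outside KNIME. It assumes the list is a plain text file with one stop word per line and that matching ignores case; both are my assumptions for illustration, so check the node description for the exact format it expects. The function names and sample words are made up.

```python
import io

def load_stop_words(handle):
    """Read one stop word per line, skipping blank lines, lowercased."""
    return {line.strip().lower() for line in handle if line.strip()}

def filter_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in the stop word set."""
    return [t for t in tokens if t.lower() not in stop_words]

# Stand-in for a custom list file (assumed format: one word per line).
custom_list = io.StringIO("the\nplease\nthanks\n")
stop_words = load_stop_words(custom_list)

tokens = ["Please", "reset", "the", "password", "thanks"]
filtered = filter_stop_words(tokens, stop_words)
print(filtered)
```

With a real file you would pass `open("my_stop_words.txt")` (a hypothetical file name) instead of the `io.StringIO` stand-in.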

Regards

David

Hi Scott,

David is right. The bag of words shows a term column and a document column. The term column shows the words and their tags. The tags are shown in brackets; if no tags are assigned to a term, the brackets are empty. To see only the words of the terms, you can transform the terms into strings using the Term to String node.
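To illustrate the idea, here is a rough Python sketch of a term as a word plus a list of tags. This is not KNIME code; the `Term` class, the `term_to_string` helper and the way tags are printed are only assumptions meant to show why untagged terms appear with empty brackets.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Term:
    """Hypothetical stand-in for a term: a word plus zero or more tags."""
    word: str
    tags: tuple = ()

    def __str__(self):
        # Tags are printed inside brackets; no tags gives empty brackets.
        return f"{self.word}[{','.join(self.tags)}]"

def term_to_string(term):
    """Drop the tags and keep only the word."""
    return term.word

untagged = Term("ticket")
tagged = Term("ticket", ("NN(POS)",))

print(untagged)                  # ticket[]
print(tagged)                    # ticket[NN(POS)]
print(term_to_string(untagged))  # ticket
```

So the brackets are part of how the term is displayed, and converting the term to a string is what removes them.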

Cheers, Kilian
