Bag of Words includes punctuation and excluded words

I am new to KNIME. I want to use KNIME to analyze Software Support Tickets containing unstructured text to identify patterns and trends.

To get started, I created a new workflow following instructions provided by Kilian. (Thank you, Kilian.)

I am getting unexpected results from the workflow shown below, which starts from a .csv file.

    File Reader • Strings to Documents • Punctuation erasure • Stop word filter • N chars filter • Snowball Stemmer • Case converter • Bag of words creator • TF • Sorter • Count Sorted

The results from the "Count Sorted" operator include words and punctuation marks that I would expect to be omitted, as shown below.

Should these words and punctuation marks be in the final results?

Should each word be followed by a pair of square brackets, [ ]?

Is it possible to edit the Stop Words file? If so, how? 

Thanks for any and all help.

Scott C.

Hi Scott,

I believe the [] in the first row of the Bag of Words output is tuple notation. It is supposed to be there and you can't change it; it is just a visual representation of the tuple.

Yes, you can change the Stop Words file: open the Configure screen, go to Filter Options, uncheck the box "Use built-in list", and then add your own list file. I am adding a screenshot.
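In case it helps to see the idea outside of KNIME, here is a small Python sketch of what a custom stop word list does. This is not KNIME code. I am assuming the list is a plain text file with one stop word per line (please check the node description for the exact format it expects), and the file name, the words in it, and the function names are all made up for illustration.

```python
from pathlib import Path
import tempfile


def load_stop_words(path):
    """Read a stop word file, assuming one word per line (an assumption,
    not a confirmed KNIME format). Blank lines are skipped."""
    text = Path(path).read_text(encoding="utf-8")
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def filter_stop_words(tokens, stop_words):
    """Drop every token that appears in the stop word set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]


# Hypothetical custom list written to a temporary file for the demo.
with tempfile.TemporaryDirectory() as tmp:
    stop_file = Path(tmp) / "my_stop_words.txt"
    stop_file.write_text("the\nis\nplease\nthanks\n", encoding="utf-8")

    stop_words = load_stop_words(stop_file)
    tokens = ["Please", "reset", "the", "password"]
    kept = filter_stop_words(tokens, stop_words)
    print(kept)
```

Any word you add to the file is removed from the token list, so domain-specific noise words from your tickets could go in there.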

Regards

David

Hi Scott,

David is right, the bag of words shows a term column and a document column. The term column shows the words and their tags. The tags are in brackets; if there are no tags assigned to a term, the brackets are empty. To see only the words of the terms, you can transform the terms into strings using the Term To String node.
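As a rough illustration of the term/tag idea, here is a minimal Python sketch. It is only a model of the concept, not how KNIME implements terms internally; the `Term` class, the `display` and `term_to_string` names, and the example tag are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """Toy model of a term: the word text plus zero or more tags."""
    text: str
    tags: tuple = ()

    def display(self):
        # Render the word followed by its tags in brackets;
        # an untagged term shows empty brackets.
        return f"{self.text}[{','.join(self.tags)}]"


def term_to_string(term):
    """Keep only the word and drop the tag brackets."""
    return term.text


untagged = Term("ticket")
tagged = Term("server", ("NN(POS)",))  # illustrative tag value

print(untagged.display())        # ticket[]
print(tagged.display())          # server[NN(POS)]
print(term_to_string(untagged))  # ticket
```

So the empty brackets in your table just mean that no tagger node has assigned tags to those terms.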

Cheers, Kilian