Bag of Words creator - help please

Hey,

I've searched the forum but can't find anything relating to this. I'm running the Bag of Words Creator node and it's breaking up words that are pivotal to the sentiment analysis I am conducting.

So for example, didn't is becoming DID and N'T; wasn't is becoming WAS only.

Is there something I'm missing here?

Cheers

Macca

Hi Macca,

this is due to tokenization. Every document is tokenized when it is created. The Bag of Words node does not break words up; they are already tokenized. The node simply restructures the data table using the tokenization that has already been done. For word tokenization, the OpenNLP tokenizer model for the English language is used.
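To illustrate the kind of split described above, here is a minimal Python sketch. It is not the OpenNLP tokenizer, which is a trained statistical model; it is a rough rule-based imitation that assumes Penn Treebank-style contraction handling, which the DID / N'T split suggests. The function name `treebank_like_tokenize` and the regular expressions are made up for this example.

```python
import re

# Hypothetical helper, NOT the OpenNLP implementation: a rule-based
# approximation of Penn Treebank-style contraction splitting.
_CONTRACTION = re.compile(r"^(\w+)(n't|'s|'re|'ve|'ll|'d|'m)$", re.IGNORECASE)

def treebank_like_tokenize(text):
    tokens = []
    # Pull out words (optionally containing one apostrophe) and punctuation marks.
    for word in re.findall(r"\w+(?:'\w+)?|[^\w\s]", text):
        match = _CONTRACTION.match(word)
        if match:
            # Split the contraction into its stem and its clitic, e.g. did + n't.
            tokens.extend([match.group(1), match.group(2)])
        else:
            tokens.append(word)
    return tokens

print(treebank_like_tokenize("I didn't like it"))
# ['I', 'did', "n't", 'like', 'it']
```

The point is only that the negation ends up as its own token ("n't") at document creation time, before any later node sees the data.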

To see exactly how the tokenization has been applied, use the Bag of Words node directly after the node that creates the documents, e.g. the Strings to Documents node.
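As a rough sketch of why the bag of words shows the tokenization as-is: building a bag of words from documents that are already tokenized only regroups the existing tokens. The code below is an assumption-laden illustration in Python, not the node's actual implementation; the `documents` data and the `bag_of_words` function are hypothetical.

```python
from collections import Counter

# Hypothetical documents that were already tokenized when they were created.
documents = {
    "doc1": ["I", "did", "n't", "like", "it"],
    "doc2": ["it", "was", "n't", "bad"],
}

def bag_of_words(docs):
    # Restructure into (document, term, count) rows without re-tokenizing:
    # every term in the output is a token that already existed in the input.
    rows = []
    for doc_id, tokens in docs.items():
        for term, count in Counter(tokens).items():
            rows.append((doc_id, term, count))
    return rows

for row in bag_of_words(documents):
    print(row)
```

Any split such as "did" and "n't" visible in the output rows was therefore already present in the documents.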

Specifying other tokenizers is on our roadmap for the next major release, 3.3, in December.

Cheers, Kilian