When I use the Document Data Extractor to tell me the number of terms in each document, the value it generates is consistently 10-15% larger than the word count provided by either MS Word or Notepad++ for the same text.
The same thing happens with the Sentence Extractor. Even with a sentence of, say, six words, it reports the sentence length as seven or eight terms.
Can anyone explain why this is happening? Is KNIME using a different definition of a term from everyday use, or is this a bug?
I've observed the same result in versions 2.12 and 3.1.1.
This is most likely due to the different tokenization. With the basic word tokenization in KNIME, punctuation marks end up as tokens and are thus counted as terms/words.
To remove terms that are punctuation marks, use the Punctuation Erasure node and count again on the preprocessed document.
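To illustrate the effect outside of KNIME, here is a rough Python sketch. It is not KNIME's actual tokenizer; the regular expression and the function names are my own assumptions, chosen only to show how counting punctuation marks as tokens inflates the count compared with a whitespace-based word count like the one a word processor gives.

```python
import re

def whitespace_word_count(text):
    # Roughly what a word processor reports: chunks separated by whitespace.
    return len(text.split())

def tokenize_with_punctuation(text):
    # Hypothetical tokenizer: each word is a token, and each punctuation
    # mark is a separate token (an assumption, not KNIME's implementation).
    return re.findall(r"\w+(?:['-]\w+)*|[^\w\s]", text)

def erase_punctuation(tokens):
    # Keep only tokens that contain at least one word character,
    # similar in spirit to counting again after removing punctuation.
    return [t for t in tokens if re.search(r"\w", t)]

sentence = "Well, this sentence has six words."
tokens = tokenize_with_punctuation(sentence)

print(whitespace_word_count(sentence))   # whitespace-based word count
print(len(tokens))                       # count including punctuation tokens
print(len(erase_punctuation(tokens)))    # count after dropping punctuation
```

With this sentence the comma and the full stop each become a token, so the token count is two higher than the word count until the punctuation tokens are filtered out.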
Thanks Kilian, removing punctuation seems to have fixed it.
I can't be the only person to have made this 'mistake', though. I wonder if some clarifying remarks should be included in some of the relevant node descriptions. (Or perhaps they are already there and I didn't see them...)