Incorrect number of terms reported by Document Data Extractor and Sentence Extractor

AngusVeitch · March 18, 2016, 6:39am

When I use the Document Data Extractor to tell me the number of terms in each document, the value it generates is consistently 10-15% larger than the word count provided by either MS Word or Notepad++ for the same text.

The same thing happens with the Sentence Extractor. Even with a sentence of, say, six words, it reports the sentence length as seven or eight terms.

Can anyone explain why this is happening? Is KNIME using a definition of "term" that differs from everyday usage, or is this a bug?

I've observed the same result in versions 2.12 and 3.1.1.

kilian.thiel · March 18, 2016, 2:05pm

Hi Sugna,

this is most likely due to a difference in tokenization. With the basic word tokenization in KNIME, punctuation marks end up as tokens and are thus counted as terms/words.

To remove terms that are punctuation marks, use the Punctuation Erasure node and count again on the preprocessed document.
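
To illustrate the effect outside KNIME, here is a minimal Python sketch. It does not reproduce KNIME's tokenizer; the regular expression and the function names are assumptions chosen only to show how counting punctuation marks as tokens inflates the term count compared with a whitespace-based word count.

```python
import re

def whitespace_count(text):
    # Rough stand-in for an editor's word count: split on whitespace only.
    return len(text.split())

def tokens_with_punctuation(text):
    # Hypothetical tokenizer: each word is a token, and each punctuation
    # mark becomes a separate token of its own.
    return re.findall(r"\w+(?:['-]\w+)*|[^\w\s]", text)

def tokens_without_punctuation(text):
    # Drop tokens that contain no word character, roughly what a
    # punctuation-removal step would do before counting.
    return [t for t in tokens_with_punctuation(text) if re.search(r"\w", t)]

sentence = "Well, this sentence has six words."

print(whitespace_count(sentence))                 # 6
print(len(tokens_with_punctuation(sentence)))     # 8 (the comma and the full stop are counted)
print(len(tokens_without_punctuation(sentence)))  # 6
```

In this sketch a six-word sentence yields eight tokens until the punctuation tokens are filtered out, which matches the kind of discrepancy described above.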

Cheers, Kilian

AngusVeitch · March 20, 2016, 3:04pm

Thanks Kilian, removing punctuation seems to have fixed it.

I can't be the only person to have made this 'mistake', though. I wonder if some clarifying remarks should be included in the relevant node descriptions. (Or perhaps they are already there and I didn't see them...)

kilian.thiel · April 5, 2016, 7:26pm

Hi Sugna,

yes, you are right, thank you for pointing this out.

Cheers, Kilian
