Wrong DF count from unique term extractor?

#1

I was delighted to discover the Unique Term Extractor node recently - at least I assume it is a new addition! It seems to be a much quicker way to build a term list and calculate term frequencies than the usual method of starting with a bag of words.

However, it is giving me frequencies that can’t be right. The dataset I’m working with contains 479 documents, but according to the output of the node, the top 10 terms all have a DF (document frequency) higher than this. The highest DF (‘and’) is 3362, which is about seven times the number of documents. Since DF counts the documents that contain a term, it should never exceed the document count.

Am I misinterpreting this statistic, or is there something wrong with the node?

#2

Hey @AngusVeitch1,

You are correct. There is a bug in how the node counts the document frequency. Apparently, it currently counts the sentence frequency instead, i.e. the number of sentences that contain the term.
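To make the difference concrete, here is a minimal Python sketch. It is not KNIME's implementation: the helper names (`split_sentences`, `tokenize`, `document_frequency`, `sentence_frequency`), the naive sentence splitter, and the toy documents are all assumptions made for illustration. It shows how counting sentences instead of documents can push a count above the number of documents.

```python
import re

def split_sentences(text):
    # Naive splitter (assumption): break on whitespace after '.', '!' or '?'.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(text):
    # Lowercase word tokens; a real tokenizer would be more careful.
    return re.findall(r"[a-z']+", text.lower())

def document_frequency(docs, term):
    # Number of documents that contain the term at least once.
    return sum(1 for d in docs if term in tokenize(d))

def sentence_frequency(docs, term):
    # Number of sentences, across all documents, that contain the term.
    return sum(
        1
        for d in docs
        for s in split_sentences(d)
        if term in tokenize(s)
    )

# Toy corpus of three documents (made up for this example).
docs = [
    "Cats and dogs play. Birds and bees fly. Fish swim.",
    "Salt and pepper. Bread and butter.",
    "No conjunction here.",
]

df = document_frequency(docs, "and")
sf = sentence_frequency(docs, "and")
print(f"documents={len(docs)} DF={df} SF={sf}")
```

In this toy corpus the DF of ‘and’ is bounded by the three documents, while the sentence count is not, which matches the symptom reported above.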

Thank you for reporting it. I have created a ticket.

Cheers,

Julian

#3

Hey @AngusVeitch1 again,

A new version of KNIME (4.0.2) is available, and it includes the fixed version of the Unique Term Extractor. DF is now counted correctly.

Cheers,
Julian
