Wrong DF count from unique term extractor?

AngusVeitch · September 2, 2019, 10:49am

I was delighted to discover the Unique Term Extractor node recently - at least I assume it is a new addition! It seems to be a much quicker way to build a term list and calculate term frequencies than the usual method of starting with a bag of words.

However, it is giving me frequencies that can’t be right. The dataset I’m working with contains 479 documents, but according to the output of the node, the top 10 terms all have a DF higher than this. The highest DF (‘and’) is 3362, which is roughly seven times the number of documents.
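For reference, the sanity check behind this is simple: DF counts the documents that contain a term, so it can never exceed the number of documents. A minimal sketch in plain Python (outside KNIME; the function name `impossible_dfs` is made up for illustration, and only the 'and' value is taken from the node output):

```python
def impossible_dfs(df_by_term, n_docs):
    """Return the terms whose reported DF exceeds the number of documents."""
    return {term: df for term, df in df_by_term.items() if df > n_docs}


# The only figure quoted here is the DF reported for 'and'.
reported = {"and": 3362}
n_docs = 479

suspicious = impossible_dfs(reported, n_docs)
ratio = reported["and"] / n_docs  # how many times larger than the corpus size

print(suspicious, round(ratio, 2))
```

Any term that shows up in `suspicious` has a DF that cannot be a document count.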

Am I misinterpreting this statistic, or is there something wrong with the node?

julian.bunzel · September 3, 2019, 2:11pm

Hey @AngusVeitch,

You are correct, there is a bug in how the document frequency is counted. Apparently, the node currently counts the sentence frequency instead, i.e. the number of sentences a term occurs in.
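To illustrate the difference, here is a small sketch in plain Python. It is not the node's actual implementation; the tokenizer, the naive sentence splitter, and all function names are assumptions made for the example. It only shows why counting per sentence can push the number above the document count.

```python
import re
from collections import Counter


def tokenize(text):
    # Simplistic tokenizer (assumption): lowercase alphabetic tokens only.
    return re.findall(r"[a-z']+", text.lower())


def split_sentences(doc):
    # Naive splitter (assumption): break on '.', '!' or '?'.
    return [s for s in re.split(r"[.!?]+", doc) if s.strip()]


def document_frequency(docs):
    # Each term is counted at most once per document.
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return df


def sentence_frequency(docs):
    # Each term is counted at most once per sentence.
    sf = Counter()
    for doc in docs:
        for sentence in split_sentences(doc):
            sf.update(set(tokenize(sentence)))
    return sf


docs = [
    "Cats and dogs play. Birds and bees fly. Fish swim.",
    "Salt and pepper. Bread and butter.",
]

df = document_frequency(docs)
sf = sentence_frequency(docs)
print(df["and"], sf["and"])
```

With two documents, the DF of 'and' cannot exceed 2, while its sentence frequency grows with every sentence that contains it.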

Thank you for reporting it, I created a ticket.

Cheers,

Julian

julian.bunzel · October 2, 2019, 10:38am

Hey again @AngusVeitch,

A new version of KNIME (4.0.2) is available, and it includes the fixed version of the Unique Term Extractor. DF is now counted correctly.

Cheers,
Julian
