I was delighted to discover the Unique Term Extractor node recently - at least I assume it is a new addition! It seems to be a much quicker way to build a term list and calculate term frequencies than the usual method of starting with a bag of words.
However, it is giving me frequencies that can’t be right. The dataset I’m working with contains 479 documents, but according to the output of the node, the top 10 terms all have a document frequency (DF) higher than this. The highest DF (‘and’) is 3362, which is about seven times the number of documents. As I understand it, DF counts the documents that contain a term, so it should never exceed the total number of documents.
Am I misinterpreting this statistic, or is there something wrong with the node?
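
To show what I expect DF to mean, here is a minimal sketch in plain Python. It is not how the node works internally (I don't know that); the function name, the toy documents, and the lowercase-and-split-on-whitespace tokenization are all my own assumptions for illustration. It counts total occurrences of each term (TF across the corpus) separately from the number of documents containing the term (DF).

```python
from collections import Counter

def term_and_document_frequencies(docs):
    """Return (total term counts, document frequencies) for a list of strings.

    Hypothetical helper for illustration only; tokenization is a naive
    lowercase + whitespace split, which is an assumption.
    """
    tf = Counter()  # total occurrences of each term across all documents
    df = Counter()  # number of documents in which each term appears
    for doc in docs:
        tokens = doc.lower().split()
        tf.update(tokens)       # every occurrence counts
        df.update(set(tokens))  # each term counts once per document
    return tf, df

# Toy corpus, made up for this example
docs = [
    "cats and dogs and birds",
    "dogs and cats",
    "birds sing",
]
tf, df = term_and_document_frequencies(docs)

print("total occurrences of 'and':", tf["and"])
print("documents containing 'and':", df["and"])
print("max DF <= number of documents:", max(df.values()) <= len(docs))
```

With this definition, DF can never be larger than the number of documents, while the total occurrence count can be. A value of 3362 for 479 documents looks much more like the latter to me, but that is only a guess.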