I am struggling somewhat to understand how the "Frequency Filter" node actually works. A preceding calculation by the "TF" node, for example, results in term frequencies per document. However, the "Frequency Filter" node filters terms and not terms per document, if I understand correctly. So I am wondering how term frequencies are aggregated across documents before filtering. Is it the average value per term across documents?
The issue occurred to me as I was trying to keep only the, say, 1000 most frequent terms in my overall corpus. How can this be achieved?
I appreciate any pointers!
The frequencies are not aggregated during filtering by the "Frequency Filter" node. If the TF value of a term in a document is above the specified threshold, the term is kept for that document; otherwise it is filtered out. The filter simply works row by row on the values of the specified frequency column. To aggregate the frequencies of terms over documents, and thus compute corpus-wide frequencies, the "Group By" node can be used. Simply group by the terms and aggregate the TF values, which results in the TF values of the terms for the complete corpus. After grouping, the "Sorter" node, for example, can be used to sort the terms by their aggregated TF values, and then the "Row Filter" node can be used to extract the top N rows, representing the N most frequent terms in the corpus.
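To make the difference concrete, here is a minimal Python sketch of the logic described above. It is not KNIME code, only an illustration. The sample rows, the use of absolute TF counts, and the function names `filter_rows` and `top_n_terms` are assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical (document, term, TF) rows, similar to what a TF
# calculation might produce. Absolute counts are assumed here.
rows = [
    ("doc1", "data", 5),
    ("doc1", "mining", 2),
    ("doc1", "text", 3),
    ("doc2", "data", 4),
    ("doc2", "text", 2),
    ("doc3", "mining", 1),
    ("doc3", "workflow", 7),
]


def filter_rows(rows, threshold):
    """Row-wise filtering: keep each row whose TF is above the threshold.

    No aggregation across documents happens here.
    """
    return [row for row in rows if row[2] > threshold]


def top_n_terms(rows, n):
    """Group by term, sum the TF values, sort descending, keep the first n."""
    totals = defaultdict(int)
    for _doc, term, tf in rows:
        totals[term] += tf  # "Group By" step with Sum aggregation
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)  # "Sorter" step
    return ranked[:n]  # "Row Filter" step: first n rows


if __name__ == "__main__":
    print(filter_rows(rows, 3))
    print(top_n_terms(rows, 2))
```

Note that `filter_rows` can return the same term several times (once per document in which it passes the threshold), while `top_n_terms` returns each term once with its corpus-wide total.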
Btw.: Using the "Group By" node allows you to not only sum over the TF values, but also to compute Mean, Std. Dev., Min, Max etc. which might be useful as well.
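As a rough illustration of those other aggregations, the following Python sketch computes several statistics per term. Again, this is not KNIME code. The sample rows and the function name `term_stats` are hypothetical, and the population standard deviation is used here as an assumption so that terms occurring in a single document yield 0 instead of an error.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical (document, term, TF) rows with absolute counts.
rows = [
    ("doc1", "data", 5),
    ("doc1", "text", 3),
    ("doc2", "data", 4),
    ("doc2", "text", 2),
    ("doc3", "workflow", 7),
]


def term_stats(rows):
    """Group TF values by term and compute several aggregates per group."""
    grouped = defaultdict(list)
    for _doc, term, tf in rows:
        grouped[term].append(tf)
    return {
        term: {
            "sum": sum(values),
            # Mean over the documents in which the term occurs,
            # not over all documents in the corpus.
            "mean": mean(values),
            "std": pstdev(values),  # population std. dev. (assumption)
            "min": min(values),
            "max": max(values),
        }
        for term, values in grouped.items()
    }


if __name__ == "__main__":
    for term, stats in term_stats(rows).items():
        print(term, stats)
```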