Is it possible to apply Term Frequency (TF) both to individual documents and to a set of documents? If so, how?


(a) By 'individual documents' I mean that TF counts words in each document separately, so the same word can appear more than once in the output column 'term' (e.g. if the word exists in two different documents, it will appear twice).


(b) By 'set of documents' I mean that TF counts words across the whole set of documents, so each word appears just once in the output column 'term'.


For instance, in the attached workflow, which contains 3 documents, Term Frequency (TF) works per individual document (a), since the word 'institutional' appears 3 times in the column 'term'.


I would like to find a way for 'institutional' to appear just once, with the frequency calculated over the whole set of documents (b).


Many thanks in advance,



The best I can suggest is to follow the TF node with a GroupBy node in which you group by Term and aggregate the TF values by sum or average, depending on what you want.
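
To make the difference between sum and average concrete, here is a rough sketch of the same group-and-aggregate step in plain Python rather than in KNIME. The rows, the column layout (term, document, absolute count) and the helper name `aggregate_tf` are made up for illustration and are not taken from the attached workflow.

```python
from collections import defaultdict

# Hypothetical output of a per-document TF step: (term, document, absolute count).
rows = [
    ("institutional", "doc1", 2),
    ("institutional", "doc2", 1),
    ("institutional", "doc3", 4),
    ("policy", "doc1", 3),
]

def aggregate_tf(rows):
    """Group rows by term and return {term: (sum, mean)} of the per-document counts."""
    counts = defaultdict(list)
    for term, _doc, tf in rows:
        counts[term].append(tf)
    return {term: (sum(v), sum(v) / len(v)) for term, v in counts.items()}

corpus_tf = aggregate_tf(rows)
for term, (total, mean) in corpus_tf.items():
    # One line per term, however many documents it occurred in.
    print(term, total, mean)
```

Note that the mean here is taken only over the documents in which the term occurs, since documents without the term produce no row to group.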


Yes, that is right. (a) Use the TF node to compute the frequency of a term in a single document (select absolute counts in the dialog). (b) To compute the frequency of a term in the complete corpus, simply aggregate all TF values of that term using the GroupBy node. To compute the frequency of terms in subsets of the corpus, append a column containing the label of the subset and use the GroupBy node on the term column plus the subset label column.
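
For the subset case, the logic amounts to grouping on two keys instead of one. A minimal sketch in plain Python, assuming absolute counts; the document names, subset labels and the helper `tf_by_subset` are hypothetical and only stand in for the appended label column:

```python
from collections import Counter

# Hypothetical per-document absolute counts: (term, document, count).
tf_rows = [
    ("institutional", "doc1", 2),
    ("institutional", "doc2", 1),
    ("institutional", "doc3", 4),
    ("policy", "doc1", 3),
]

# Hypothetical subset label for each document (the appended column).
subset_of = {"doc1": "A", "doc2": "A", "doc3": "B"}

def tf_by_subset(tf_rows, subset_of):
    """Sum absolute counts per (term, subset label) pair."""
    totals = Counter()
    for term, doc, count in tf_rows:
        totals[(term, subset_of[doc])] += count
    return dict(totals)

subset_tf = tf_by_subset(tf_rows, subset_of)
for (term, label), total in sorted(subset_tf.items()):
    print(term, label, total)
```

Absolute counts matter here because relative frequencies are normalized by each document's length, so adding them up would not give a frequency for the subset as a whole.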


Cheers, Kilian

