Understanding the nGram creator node

Hello everyone,

I'm having trouble understanding the nGram creator node (output table = nGram frequencies):

Obviously, the document frequency states in how many documents the respective nGram occurs, and the sentence frequency does the same for sentences. Accordingly, the corpus frequency "should" state how many times the nGram occurs in the "corpus". However, which corpus does this refer to? Some predefined corpus like the Brown corpus? Or is the corpus equal to the whole "bag of words"?

Any explanation is much appreciated!


Hi 8mm,

The corpus consists of all the documents in the input table of the N Gram node. In other words, it is simply the list of input documents.
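
To make the three counts concrete, here is a minimal Python sketch of how they could be computed over a list of input documents. It is only an illustration, not the node's actual implementation: the function names, the toy corpus, and the assumption that nGrams do not cross sentence boundaries are all mine.

```python
from collections import Counter

def ngrams(tokens, n):
    # Build word nGrams from one tokenized sentence (hypothetical helper).
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_frequencies(corpus, n=2):
    # corpus: list of documents; each document is a list of sentences;
    # each sentence is a list of tokens.
    corpus_freq = Counter()  # total occurrences across all documents
    doc_freq = Counter()     # number of documents containing the nGram
    sent_freq = Counter()    # number of sentences containing the nGram
    for doc in corpus:
        seen_in_doc = set()
        for sentence in doc:
            grams = ngrams(sentence, n)
            corpus_freq.update(grams)       # count every occurrence
            sent_freq.update(set(grams))    # count each sentence once
            seen_in_doc.update(grams)
        doc_freq.update(seen_in_doc)        # count each document once
    return corpus_freq, doc_freq, sent_freq

# Toy corpus: two documents, three sentences in total.
corpus = [
    [["the", "cat", "sat"], ["the", "cat", "ran"]],
    [["the", "cat", "sat", "the", "cat"]],
]
corpus_freq, doc_freq, sent_freq = ngram_frequencies(corpus, n=2)
print(corpus_freq["the cat"], doc_freq["the cat"], sent_freq["the cat"])
```

In this toy example "the cat" occurs four times overall, in two documents and in three sentences, so the corpus frequency can be larger than the document frequency whenever an nGram repeats within a document.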

Cheers, Kilian

Hi Kilian,

thank you for your response! Could you briefly outline the difference between the corpus frequency and the document frequency then? 


Hi 8mm,

The term frequency (TF) is the number of times a term occurs in a document. The inverse document frequency (IDF) is based on the inverse of the number of documents (of a corpus) that contain a certain term, so it gets smaller the more documents contain the term. See: http://en.wikipedia.org/wiki/Tf%E2%80%93idf for details.
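
As a rough illustration, TF and IDF could be computed like this in Python. This sketch uses one common variant (raw count for TF, log(N / df) for IDF) from the Wikipedia article; the function name and the toy documents are made up for the example, and other weighting variants exist.

```python
import math
from collections import Counter

def tf_idf(docs):
    # docs: list of documents, each a list of terms.
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)  # term frequency: raw count within this document
        # IDF variant assumed here: log(N / df), no smoothing.
        scores.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return scores

docs = [["a", "b", "a"], ["a", "c"]]
scores = tf_idf(docs)
print(scores[0])
```

Here the term "a" occurs in every document, so its IDF is log(1) = 0 and its score is zero, while "b" occurs in only one of the two documents and gets a score of log(2).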

Cheers, Kilian