A question on the IDF computation

Hi, I am new to KNIME and I think this text processing feature is very useful. However, I am wondering why the formula for the IDF is

idf(t) = log[1 + (f(D) / f(d,t))]

instead of the usual formula used in the literature

idf(t) = log[f(D) / f(d,t)]

where f(D) is the number of all documents and f(d,t) is the number of documents containing term t.

I ask because I am interested in getting an aggregate measure of the IDF for different terms with the same "classification". For example, terms like "water", "soda", and "coffee" fall under the classification "drinks." Basically my goal is to get the TF-IDF of "drinks". Getting the aggregate TF for all drink terms is straightforward. But I also want to get the aggregate IDF, and I do not know if the formula used by KNIME allows me to add the IDFs of water, soda and coffee to get the IDF of drinks.

Thanks in advance!


Hi Vigile,

Basically, the +1 is simply there to avoid -Inf results for f(d,t) = 0. However, that case cannot occur in a bag-of-words representation anyway (http://de.wikipedia.org/wiki/Inverse_Dokumenth%C3%A4ufigkeit).
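
To make the difference between the two formulas concrete, here is a minimal Python sketch (not KNIME's code). The function names idf_plain and idf_smoothed and the corpus size are made up for illustration, and the natural log is an assumption, since the log base is not stated above.

```python
import math

def idf_plain(n_docs, n_docs_with_term):
    # Textbook form: log(f(D) / f(d,t))
    return math.log(n_docs / n_docs_with_term)

def idf_smoothed(n_docs, n_docs_with_term):
    # Form quoted in the question: log(1 + f(D) / f(d,t))
    return math.log(1 + n_docs / n_docs_with_term)

n_docs = 1000  # hypothetical corpus size
for df in (1, 10, 1000):
    # df is the number of documents containing the term
    print(df, idf_plain(n_docs, df), idf_smoothed(n_docs, df))
```

For a term that occurs in every document, the plain form gives 0 while the form with +1 gives log 2. For rare terms the two values are nearly identical.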

The IDF for terms in subsets of documents associated with classes is computed by the ICF (inverse category frequency). Assign the classification (class label) as a category to the documents, e.g. using the Strings To Document node.
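
On the question of adding IDFs: with either formula, the sum of the per-term IDFs is generally not the IDF of the class treated as a single term, because the log is not additive and one document can contain several of the terms. Below is a minimal Python sketch of the comparison. The toy corpus and the helper names term_idf and class_idf are made up for illustration, and this is not KNIME's ICF implementation.

```python
import math

# Toy corpus: each document is a set of terms (made-up data).
docs = [
    {"water", "bread"},
    {"soda", "water"},
    {"coffee"},
    {"bread", "cheese"},
]
drinks = {"water", "soda", "coffee"}

def idf(n_docs, df):
    # Form quoted in the question: log(1 + f(D) / f(d,t))
    return math.log(1 + n_docs / df)

def term_idf(term):
    # Document frequency of a single term
    df = sum(1 for d in docs if term in d)
    return idf(len(docs), df)

def class_idf(terms):
    # Document frequency of the class: documents with at least one of its terms
    df = sum(1 for d in docs if d & terms)
    return idf(len(docs), df)

sum_of_idfs = sum(term_idf(t) for t in drinks)
print(sum_of_idfs, class_idf(drinks))
```

The two printed numbers differ, so a class-level value has to be computed from the class-level document count rather than by summing the IDFs of water, soda and coffee.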

Cheers, Kilian