KNIME Version: KNIME 3.5.2
I think there might be bugs in the IDF- and TF-Nodes: I have three documents. I have a term that appears three times in one of the documents and not at all in the other two documents.
TF-Node
I expect TF_absolute to be 3 for one of the documents and 0 for the other two. Instead I get 14 for the one document and 0 for the others. My expectation stems from using the "Bag of Words Creator" Node. I am not sure
Since the document containing that term has 137 terms, I would expect TF_relative to be 3/137 = 0.022; instead I get 0.067. That is quite a difference.
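A minimal Python sketch of the expectation described above, for anyone who wants to check the arithmetic. The token list is made up (a hypothetical 137-term document in which the keyword occurs 3 times); it is not the real data, and it is not how the KNIME TF node is implemented.

```python
from collections import Counter


def term_frequencies(tokens, term):
    """Return (absolute, relative) frequency of `term` in a tokenized document."""
    counts = Counter(tokens)
    tf_abs = counts[term]
    # Relative frequency: occurrences divided by the number of terms in the document.
    tf_rel = tf_abs / len(tokens) if tokens else 0.0
    return tf_abs, tf_rel


# Hypothetical document: 137 terms in total, 3 of them the keyword.
tokens = ["keyword"] * 3 + ["filler"] * 134
tf_abs, tf_rel = term_frequencies(tokens, "keyword")
print(tf_abs, round(tf_rel, 3))  # 3 0.022
```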
IDF-Node
The formulas of the IDFs are (as written in the node's documentation)
idf_smooth(t) = log(1 + (f(D) / f(d, t)))
idf_normalized(t) = log(f(D) / f(d,t))
idf_probabilistic(t) = log((f(D) - f(d,t)) / f(d,t))
where f(D) is the number of all documents and f(d,t) is the number of documents containing term t.
So, here f(D) = 3 and f(d,t) =1, thus I would expect to get
idf_smooth(t) = log(4) = 0.602
idf_normalized(t) = log(3) = 0.477
idf_probabilistic(t) = log(2) = 0.301
instead I am getting
idf_smooth(t) = 0.301
idf_normalized(t) = 0
idf_probabilistic(t) = ?
for the term.
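The three formulas from the node documentation can be sketched in Python as follows. The base-10 logarithm is an assumption inferred from the expected values (log(4) = 0.602); the function and parameter names are made up for illustration. The sketch also evaluates the case f(d,t) = f(D) = 3, because the values actually returned (0.301, 0, and an undefined result) happen to be what the formulas give when the ratio f(D) / f(d,t) is 1. That is only an observation about the numbers, not a confirmed diagnosis.

```python
import math


def idf_smooth(n_docs, n_docs_with_term):
    return math.log10(1 + n_docs / n_docs_with_term)


def idf_normalized(n_docs, n_docs_with_term):
    return math.log10(n_docs / n_docs_with_term)


def idf_probabilistic(n_docs, n_docs_with_term):
    remaining = n_docs - n_docs_with_term
    if remaining == 0:
        # log(0) is undefined: every document contains the term.
        return None
    return math.log10(remaining / n_docs_with_term)


def fmt(value):
    return "undefined" if value is None else f"{value:.3f}"


# (3, 1): the situation described in the post; (3, 3): ratio of 1 for comparison.
for n_docs, df in [(3, 1), (3, 3)]:
    print(
        f"f(D)={n_docs}, f(d,t)={df}:",
        fmt(idf_smooth(n_docs, df)),
        fmt(idf_normalized(n_docs, df)),
        fmt(idf_probabilistic(n_docs, df)),
    )
# f(D)=3, f(d,t)=1: 0.602 0.477 0.301
# f(D)=3, f(d,t)=3: 0.301 0.000 undefined
```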
It is important to note, though: when I use the "Bag of Words Creator" Node directly on the documents and calculate the IDF_smooth for my term, I get the expected result. The issue appears in my setting, where I create my own keyword terms, cross join them with the documents and use IDF on that.
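One possible explanation, sketched in Python under a loud assumption: if the document frequency is derived by counting the (term, document) rows in the input table, a cross join would pair the keyword with all three documents, so f(d,t) would come out as 3 instead of 1. Whether the KNIME IDF node actually counts this way has not been verified; the data and the function names `df_by_rows` and `df_by_occurrence` are hypothetical.

```python
from itertools import product

# Hypothetical tokenized documents; only doc1 contains the keyword.
documents = {
    "doc1": ["keyword", "keyword", "keyword", "other"],
    "doc2": ["other", "words"],
    "doc3": ["more", "words"],
}
keywords = ["keyword"]

# Cross join: every keyword is paired with every document,
# whether or not the keyword occurs in that document.
cross_joined = list(product(keywords, documents))


def df_by_rows(rows, term):
    """Document frequency if each (term, document) row counts as a hit (assumption)."""
    return len({doc for t, doc in rows if t == term})


def df_by_occurrence(docs, term):
    """Document frequency counting only documents that really contain the term."""
    return sum(1 for tokens in docs.values() if term in tokens)


print(df_by_rows(cross_joined, "keyword"))     # 3
print(df_by_occurrence(documents, "keyword"))  # 1
```

With f(d,t) = 3 and f(D) = 3, the documented formulas give log(2) = 0.301 for idf_smooth and log(1) = 0 for idf_normalized, which would match the observed values.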
Also to note: I stripped all tags from the document with the "Tag Stripper" Node - just in case this might cause the issue. So that is not it. Although... there also might be something wrong with the "Tag Stripper" Node: https://www.knime.com/forum/knime-textprocessing/bug-in-tag-stripper-node-or-in-groupby-node
IDF-Definitions
As a side note: It seems that everyone defines idf_smooth differently, as http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html and https://en.wikipedia.org/wiki/Tf%E2%80%93idf define it differently than the KNIME node does. Not quite sure what to make of this.
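To illustrate how far the definitions can diverge, here is a small Python comparison of the KNIME formula quoted above with the smoothed IDF that scikit-learn documents for TfidfTransformer with smooth_idf=True, namely ln((1 + n) / (1 + df)) + 1. The base-10 logarithm for the KNIME variant is an assumption, and the function names are made up; scikit-learn itself is not imported here.

```python
import math


def idf_smooth_knime(n_docs, n_docs_with_term):
    # As written in the KNIME node documentation; base 10 is an assumption.
    return math.log10(1 + n_docs / n_docs_with_term)


def idf_smooth_sklearn(n_docs, n_docs_with_term):
    # scikit-learn's documented smoothed IDF (natural logarithm, plus 1).
    return math.log((1 + n_docs) / (1 + n_docs_with_term)) + 1


n_docs, df = 3, 1
print(round(idf_smooth_knime(n_docs, df), 3))    # 0.602
print(round(idf_smooth_sklearn(n_docs, df), 3))  # 1.693
```

The two values are not expected to agree: the formulas differ in where the smoothing term is added and in the logarithm base.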