BUGs in IDF- and TF-Nodes?

KNIME Version: KNIME 3.5.2

 

I think there might be bugs in the IDF and TF nodes. I have three documents. One term appears three times in one of the documents and not at all in the other two.

TF-Node

I expect TF_absolute to be 3 for one of the documents and 0 for the other two. Instead I get 14 for the one document and 0 for the others. My expectation stems from using the "Bag of Words Creator" Node. I am not sure

Since the document containing that term has 137 terms, I would expect TF_relative to be 3/137 = 0.022; instead I get 0.067. That is quite a difference.
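For reference, the expectation above can be written out as a minimal sketch. The token list is made up (3 occurrences among 137 tokens), and `term_frequencies` is a hypothetical helper, not anything the TF node exposes.

```python
from collections import Counter

def term_frequencies(tokens, term):
    """Return (absolute, relative) frequency of `term` in a token list."""
    counts = Counter(tokens)
    absolute = counts[term]
    relative = absolute / len(tokens) if tokens else 0.0
    return absolute, relative

# Hypothetical document: 3 occurrences of "keyword" among 137 tokens.
tokens = ["keyword"] * 3 + ["filler"] * 134
tf_abs, tf_rel = term_frequencies(tokens, "keyword")
print(tf_abs, round(tf_rel, 3))  # 3 0.022
```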

IDF-Node

The formulas of the IDFs are (as written in the node's documentation)

idf_smooth(t) = log(1 + (f(D) / f(d, t)))
idf_normalized(t) = log(f(D) / f(d,t))
idf_probabilistic(t) = log((f(D) - f(d,t)) / f(d,t))

where f(D) is the number of all documents and f(d,t) is the number of documents containing term t.

So, here f(D) = 3 and f(d,t) =1, thus I would expect to get

idf_smooth(t) = log(4) = 0.602
idf_normalized(t) = log(3) = 0.477
idf_probabilistic(t) = log(2) = 0.301

instead I am getting

idf_smooth(t) = 0.301
idf_normalized(t) = 0
idf_probabilistic(t) = ?

for the term.
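A small sketch of the three formulas, assuming a base-10 logarithm (inferred from the expected values above; the node's actual base is not confirmed here). It also shows that the observed 0.301 and 0 are what the smooth and normalized formulas return when f(D) / f(d,t) is 1. That is only an observation about the numbers, not a statement about what the node does internally.

```python
import math

# Assumption: base-10 logarithm, inferred from the expected values in the post.
def idf_smooth(f_D, f_dt):
    return math.log10(1 + f_D / f_dt)

def idf_normalized(f_D, f_dt):
    return math.log10(f_D / f_dt)

def idf_probabilistic(f_D, f_dt):
    # Undefined (log of 0) when every document contains the term.
    return math.log10((f_D - f_dt) / f_dt)

# Expected case: 3 documents, 1 of them contains the term.
expected = (idf_smooth(3, 1), idf_normalized(3, 1), idf_probabilistic(3, 1))
print([round(v, 3) for v in expected])   # [0.602, 0.477, 0.301]

# With a ratio f(D) / f(d,t) of 1, the first two formulas give the observed values.
ratio_one = (idf_smooth(1, 1), idf_normalized(1, 1))
print([round(v, 3) for v in ratio_one])  # [0.301, 0.0]
```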

Important to note, though: when I use the "Bag of Words Creator" Node directly on the documents and calculate IDF_smooth for my term, I get the expected result. The issue appears in my setting, where I create my own keyword terms, cross join them with the documents and use IDF on that.

Also to note: I stripped all tags from the document with the "Tag Stripper" Node - just in case this might cause the issue. So that is not it. Although... there also might be something wrong with the "Tag Stripper" Node: https://www.knime.com/forum/knime-textprocessing/bug-in-tag-stripper-node-or-in-groupby-node

IDF-Definitions

As a side note: it seems that everyone defines idf_smooth differently; http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html and https://en.wikipedia.org/wiki/Tf%E2%80%93idf both define it differently than the KNIME node. Not quite sure what to make of this.
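To make the difference concrete, here is a sketch comparing the KNIME formula as quoted above with the smoothed IDF that the scikit-learn documentation gives for `TfidfTransformer` with `smooth_idf=True`, i.e. ln((1 + n) / (1 + df)) + 1. The function names are hypothetical, the formulas are re-implemented by hand rather than calling either tool, and the base-10 log on the KNIME side is an assumption.

```python
import math

def knime_idf_smooth(n_docs, doc_freq):
    # Formula as quoted from the node documentation; base 10 is an assumption.
    return math.log10(1 + n_docs / doc_freq)

def sklearn_smooth_idf(n_docs, doc_freq):
    # Formula from the scikit-learn TfidfTransformer docs (smooth_idf=True),
    # which uses the natural logarithm.
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs, doc_freq = 3, 1
print(round(knime_idf_smooth(n_docs, doc_freq), 3))    # 0.602
print(round(sklearn_smooth_idf(n_docs, doc_freq), 3))  # 1.693
```

So the two values are not directly comparable: both the log base and the placement of the "+1" differ.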

Have you checked these threads already?

  • https://www.knime.com/forum/knime-textprocessing/problem-with-term-frequency
  • https://www.knime.com/forum/knime-textprocessing/understanding-the-bow-tf-and-idf-nodes-in-textprocessing

Hi,

have you applied a tagger node before? Terms are differentiated based on their tags, not only on their words. You also need to make sure to select the right document column. If you filter a document, e.g. by stop word filtering, you are changing the number of terms in that document. The number of distinct terms also changes when you apply stemming.

Cheers, Kilian

Yes, Kilian, as I wrote, I used a tagger node before. But then, after filtering tagged words, I used a Tag Stripper node to get rid of the tags. My TF / IDF / TF-IDFs work on this stripped version. I did not count the number of words based on the original version of the document, but on the filtered and stemmed version.

I did some more inspections with the Document Viewer node and checked with the Bag of Words Creator node that the tags were stripped. Maybe this is right after all... but then the Bag of Words Creator node that I mentioned in my first post might be erroneous, outputting too few instances of words. Does the Bag of Words Creator output only one instance (row) per word for each document, rather than each instance of each word?

I made a demo, seems that indeed the Bag Of Words Creator node (BOW) only creates one row for each word per document instance, no matter the number of occurrences of the word. (For what a "document instance" is check out https://www.knime.com/forum/knime-textprocessing/bug-in-tag-stripper-node-or-in-groupby-node#comment-29690.)

A lot of the following problems occurred because, in mathematics, a bag is another word for a multiset (https://en.wikipedia.org/wiki/Multiset), which can contain an element multiple times, unlike a set. Analogously, a bag of words counts how many times a word appears in a document (https://en.wikipedia.org/wiki/Bag-of-words_model). This is not what the Bag Of Words Creator node does, though. For that, an additional TF node after the BOW is required. On the one hand this adds a lot of flexibility (which I love); on the other hand, it should be documented, since it does not seem to be consistent with the literature (as far as I can see).
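The set versus multiset distinction can be sketched in a few lines. This only illustrates the two notions on a sample token list; it is not how either node is implemented.

```python
from collections import Counter

tokens = "Begriff Begriff Begriff Wort Wort".split()

# Set semantics: one entry per distinct word (what the BOW output looked like).
as_set = set(tokens)

# Multiset ("bag") semantics: distinct word -> number of occurrences
# (what BOW followed by an absolute TF would give).
as_bag = Counter(tokens)

print(sorted(as_set))          # ['Begriff', 'Wort']
print(sorted(as_bag.items()))  # [('Begriff', 3), ('Wort', 2)]
```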

Additionally, the BOW does consider the tags given, which makes a difference if you first tag the words:

Now an interesting effect can happen: if you stem the words before the BOW, the BOW will create one row per document for each unique "stemmed word plus tag" combination. If one now applies a Term To String node and checks the stemmed words, one might see more than one row for the same stemmed word per document. This can be confusing because it looks as if the BOW created multiple rows for this stemmed word, but the reason we see multiple rows is that the terms differed in their tags, which are no longer visible in the string column created by the Term To String node.
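A sketch of that effect, modelling a term as a (stemmed word, tag) pair. The words and tag names are invented for illustration, and the pair representation is an assumption about how terms behave, not the actual KNIME data type.

```python
# Hypothetical (stemmed word, tag) pairs found in one document.
terms = [("run", "VB"), ("run", "NN"), ("run", "VB"), ("hous", "NN")]

# One row per unique word-plus-tag combination.
unique_terms = sorted(set(terms))

# Dropping the tag (like converting the term to a plain string) hides
# the reason the rows were distinct, so "run" now appears twice.
words_only = [word for word, _tag in unique_terms]

print(unique_terms)  # [('hous', 'NN'), ('run', 'NN'), ('run', 'VB')]
print(words_only)    # ['hous', 'run', 'run']
```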

There is another important aspect: the BOW also considers the document title. If one does not want this, one would have to remove the title. One would think this could be done with the Document Data Assigner node, but no luck there. At this point, it is most likely going to get ugly: at the very beginning of the workflow one would have to extract the text of the document and the title into separate string columns, and turn the text back into a document column, leaving the title blank.

Whether the title also gets tagged and contributes to the whole BOW process as described above, I do not know yet. If it does not get tagged, one might end up with another instance (without tags) on one's hands.

To my knowledge, none of this is documented; I inferred it myself (thus I am not 100% confident that what I wrote is right, please correct me if I am wrong). This is not so great: it caused me a lot of headaches and cost a lot of development time.

Suggestions that might help working with these nodes:

  • The BOW could have a check box, whether tags should be considered or not.
  • The BOW could have an option to choose whether the title should be considered or not.
  • The BOW could document that only unique instances get rows and that the TF node should be used if the number of instances is required.
  • The TF node could have a check box for whether only words should be considered, or also tags.
  • The TF node could have an option to choose whether the title should be considered or not.
  • The Document Data Assigner node could have the option to set a title.

Adding the options should implicitly add most of the documentation needed.

 

Back to the original problem of the IDF and TF results: TF is now working as expected, but IDF is still not.

I am adding a demo: I create three documents and a keyword that I want to calculate the IDF for. On the one hand I calculate the IDF "by hand" with nodes (top), and on the other hand with the IDF node.

The documents are:

"Begriff Begriff Begriff Wort Wort"

"Begriff Wort"

"Begriff Begriff"

and the word is "Wort".

So I expect an IDF of log(1 + 3/2) = 0.39794 according to the documentation, but I get 0.301 from the IDF node.
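The "by hand" calculation for this demo can be reproduced as a minimal sketch, again assuming a base-10 logarithm and plain whitespace tokenization (both are assumptions, not confirmed behaviour of the nodes):

```python
import math

documents = [
    "Begriff Begriff Begriff Wort Wort",
    "Begriff Wort",
    "Begriff Begriff",
]
term = "Wort"

n_docs = len(documents)
# Number of documents that contain the term at least once.
n_docs_with_term = sum(1 for doc in documents if term in doc.split())

# idf_smooth as given in the node documentation; base 10 is an assumption.
idf_smooth = math.log10(1 + n_docs / n_docs_with_term)

print(n_docs, n_docs_with_term, round(idf_smooth, 5))  # 3 2 0.39794
```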

What am I still doing wrong?