Understanding the BoW, TF and IDF nodes in Textprocessing

Hi,

I just used the English version of the “Lorem ipsum”-text to understand the various KNIME nodes and their outcomes in the process of Text Processing. While doing so, I had some questions and I hope that somebody can answer them :-)

My text consists of two rows which you will find in the attached workflow.

1. “Bag of Words Creator”

I don’t get why the “Value Counter” doesn’t show the right values of the words after using the “Bag of Words Creator”. E.g. the word “pleasure” has an actual value of 9 and “pleasures” has 3. However the “Value Counter” shows this distribution:

pleasure[NN(POS)] = 2

pleasure[VB(POS)] = 1

pleasures[NNS(POS)] = 2

So the classification is different and it doesn’t show the actual value. Does anyone know why?

 

2. Term Frequency (TF)

With my first path I wanted to understand how the TF node works and how the outcomes differ between the TF node (relative and absolute) and the IDF node. If you take a look at the output table after the second TF node and sort by TF rel. (descending), you get the following outcome:

pleasure[NN(POS)] = 14

pain[NN(POS)] = 10

pain[NN(POS)] = 6

pleasures[NNS(POS)] = 6

All the other terms have a TF (abs.) value of at least 2. But isn’t this wrong? If a term only appears once in the document, this term should have a TF abs. of 1, or am I wrong? And additionally, why does the term “pleasure” suddenly receive a TF abs. of 14?

 

3. Inverse Document Frequency (IDF)

The IDF value gives information about the frequency of a term across all documents. As I understand it, the term with the lowest IDF value is the most frequent one. In my example I have two documents (two rows). However, why does the “Value Counter” show only two different values for the IDF, although there are more words with various frequencies? That is why I don’t understand Tag Cloud #3, which doesn’t show the most frequent terms in the centre and in bold, but looks very strange… Does anybody have a clue?

 

4. IDF with Frequency Filter

With the last path I wanted to figure out the most frequent words. However the Tag Cloud #4 shows the 6 most frequent words according to the IDF value but it doesn’t make any distinction of the frequencies. I was expecting that “pleasure” is in bold and the others are not.

I hope that somebody can help me and understand my issues with the mentioned nodes. Thanks a lot for a response. I appreciate it a lot :-)

Greetings from Jasmin

Hello Jasmin,

I cannot give deep insights, but as far as I understand from my experiments:

ad 1: The BoW node gives you the term-document relation without a frequency. Try using the GroupBy node: group by term and document and count. It seems you always get 1.
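To illustrate the point about ad 1, here is a rough Python sketch of the idea (not KNIME's actual implementation). The two toy documents and all variable names are made up. It assumes the bag of words holds each (term, document) pair exactly once, which is what the experiment above suggests.

```python
from collections import Counter

# Two toy documents standing in for the two rows of the workflow (made-up text).
documents = {
    "doc1": "pleasure pain pleasure pleasure",
    "doc2": "pain pleasure",
}

# Assumption: the bag of words keeps each (term, document) pair once,
# no matter how often the term occurs inside that document.
bow = {(term, doc_id) for doc_id, text in documents.items() for term in text.split()}

# Counting rows per term gives the number of documents containing the term,
# which is what a value count on the term column reports.
rows_per_term = Counter(term for term, _ in bow)

# Counting actual occurrences is a separate step (the job of a TF node).
occurrences = Counter(term for text in documents.values() for term in text.split())

print(rows_per_term)
print(occurrences)
```

So a count of 2 for a term in the BoW table would mean "appears in both documents", not "appears twice".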

ad 2:
The document object is created in Strings to Document with title = col0 and full text = col0.
This results in TF counting each term twice (title + full text).
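A small Python sketch of why the counts double, assuming (as observed above) that the absolute TF is counted over title and full text together. The function `absolute_tf` and the sample row are hypothetical and only mimic the effect.

```python
from collections import Counter

def absolute_tf(title: str, body: str) -> Counter:
    """Count terms over title and body together (assumed behaviour, see above)."""
    return Counter((title + " " + body).split())

row = "pleasure and pain and pleasure"

# Title and full text both come from the same column: every count doubles.
doubled = absolute_tf(title=row, body=row)

# A neutral title, e.g. a running number, leaves the body counts untouched.
clean = absolute_tf(title="1", body=row)

print(doubled["pleasure"], clean["pleasure"])
```

Using a column that does not contain the text itself as the title is therefore enough to get the expected counts.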

ad 3:
IDF calculates: idf(t) = log(1 + (f(D) / f(d, t))), where f(D) is the number of all documents and f(d, t) is the number of documents containing term t.
This means the number of occurrences of a term is not included, only the overall number of documents and the number of documents containing the term.
Since you have only two documents, you will end up with at most two values for IDF (1 document of 2 or 2 documents of 2).
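The formula can be checked with a few lines of Python. This is only a sketch: the natural logarithm is an assumption (the base is not stated above), and the function name `idf` is made up.

```python
import math

def idf(n_docs: int, n_docs_with_term: int) -> float:
    """idf(t) = log(1 + f(D) / f(d, t)); the log base is an assumption."""
    return math.log(1 + n_docs / n_docs_with_term)

n_docs = 2

# With two documents a term is contained in either 1 or 2 of them,
# so these are the only IDF values that can ever show up.
possible = {k: idf(n_docs, k) for k in range(1, n_docs + 1)}

for k, value in possible.items():
    print(f"term in {k} of {n_docs} documents -> idf = {value:.4f}")
```

How often a term occurs inside a document never enters the calculation, which is why all terms fall into one of the two groups.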

ad 4: The size of a term in the tag cloud depends on the value column setting in the Tag Cloud node.
This value is the same for all your terms. If you need absolute frequencies, you would need to use the result of the TF node as the value column.

hi Dnreb, 

thanks a lot for your reply. 

Ad 1 + 2: 

You gave me the right hint with the terms being counted twice via the "title" and the "document". I now just use numbers for the title, so that the text isn't used twice. With the GroupBy node I group by terms and sum up the TF values, and I now receive the "right" results, e.g. for the term "pleasure". However, if I only group by terms I cannot use the Frequency Filter, because this node needs the document column. So if I additionally group by document, it splits again ("pleasure" has a TF abs. of 7 and a second TF abs. of 2). 

When I apply the Tag Cloud to each of these, I receive the same outcome in this case (see TagCloud #1a and #1b). Why is that? E.g. for TagCloud #1a: is the Tag Cloud so "clever" that it uses the term "pleasure" just once (in the previous nodes the term "pleasure" appeared twice)? 

If I do so with another input file I get a different result. So my additional question is: how is it possible to apply the Frequency Filter node (or any other node) if I want to use the GroupBy node? 

Ad 3:

Thanks a lot for this explanation! I think I might understand now... 

Ad 4: 

So using the IDF doesn't make sense in this case, does it? Which node do you use most often: TF, IDF or TF*IDF? Is there any "guideline" for when to use which node?

 

Thanks a lot :) 

Jasmin

 

Hi Jasmin,

1.) why do you need the Frequency Filter node? This node is only useful if you want to filter the low frequency terms inside the documents as well. If you just want to filter the bow, e.g. because you are preparing words for a tag cloud, you can use the Row Filter as well.

In your workflow the Frequency Filter has not filtered any terms at all. The number of rows before and after the node are identical.

In your workflow the two data sets for #1a and #1b are quite similar in terms of the number of terms (rows) and scores. Thus, the tag clouds will be similar too.

Attached is your workflow with an additional branch that creates TagCloud #5. To group on terms, it is best to strip the tags before grouping by using the Term to String node. Group on the strings and sum up the absolute TF values. When you open the Tag Cloud view, select the tab "Font Style" and change the minimum and maximum font size.
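As a rough Python analogue of "strip the tags, then group on the strings and sum up the absolute TF values": the rows, the TF numbers and the helper `strip_tags` are made up for illustration and only imitate what the nodes do in the workflow.

```python
import re
from collections import defaultdict

# (tagged term, document, absolute TF) rows; values are invented for the example.
rows = [
    ("pleasure[NN(POS)]", "doc1", 7),
    ("pleasure[NN(POS)]", "doc2", 2),
    ("pleasure[VB(POS)]", "doc1", 1),
    ("pain[NN(POS)]", "doc1", 5),
]

def strip_tags(tagged_term: str) -> str:
    """Drop the trailing [TAG(TYPE)] part so differently tagged terms merge."""
    return re.sub(r"\[.*\]$", "", tagged_term)

# Group on the plain string and sum the absolute TF over all documents and tags.
totals = defaultdict(int)
for tagged_term, _doc, tf_abs in rows:
    totals[strip_tags(tagged_term)] += tf_abs

print(dict(totals))
```

Without the stripping step, "pleasure" tagged as noun and "pleasure" tagged as verb stay separate rows, and grouping by document as well splits them further.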

4.) TF, IDF, and TF*IDF are measures that reflect the importance of terms in documents and document corpora. The more documents contain a particular word, the smaller its IDF score will be. IDF only makes sense if you want to identify discriminative terms, i.e. terms that are contained in only a few documents of the corpus. If these words occur often in these documents (high TF score), the importance score TF*IDF is high. http://en.wikipedia.org/wiki/Tf%E2%80%93idf
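A minimal Python sketch of how the two scores combine. The toy corpus is made up, the IDF variant is the one quoted earlier in this thread (log(1 + f(D) / f(d, t))), and the natural logarithm is an assumption; other TF*IDF variants exist.

```python
import math
from collections import Counter

# Toy corpus: "toil" occurs only in doc1, "pleasure" and "pain" in both documents.
docs = {
    "doc1": "toil toil pleasure pain".split(),
    "doc2": "pleasure pain pain".split(),
}

def idf(term: str) -> float:
    """log(1 + number of documents / number of documents containing the term)."""
    containing = sum(1 for words in docs.values() if term in words)
    return math.log(1 + len(docs) / containing)

def tf_idf(term: str, doc_id: str) -> float:
    """Absolute term frequency in the document times the IDF of the term."""
    return Counter(docs[doc_id])[term] * idf(term)

for term in ("toil", "pleasure"):
    print(term, round(tf_idf(term, "doc1"), 4))
```

Here "toil" scores higher in doc1 than "pleasure": it is both frequent in that document and rare across the corpus.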

Usually for Tag Clouds TF is used.

 

Here is an introduction to the Textprocessing nodes and the order in which to use them:

https://www.knime.org/files/knime_text_processing_introduction_technical_report_120515.pdf

I hope this helps.

Cheers, Kilian

Hi Kilian, 

thank you so much for the extensive answer and the good explanation. This really has helped a lot!

Greetings Jasmin