TF-IDF by document

karthe · September 10, 2015, 7:32am

Hi all,

I want to calculate TF-IDF by document, but I don't know how to do this.

I am able to do this by term, but I don't know how to do it by document. Kindly advise me. This will help me compare one document with another.

Thanks,

Karthikeyan P

Iris · September 14, 2015, 11:04am

TF-IDF cannot be calculated for a document as a whole. It is calculated per term: it weighs how often a term occurs in a document against how many documents in the set contain that term.

For comparing documents, I would point you to our example server. Kilian made some very nice examples there; e.g. 009002_DocumentClustering generates a vector for each document and then clusters them.
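To illustrate how two per-document vectors can be compared directly, here is a minimal sketch in plain Java. It assumes each document is already represented as a map from term to weight (for example TF-IDF values). Cosine similarity is used here as one common choice; it is not necessarily what the clustering example uses, and the class and method names are hypothetical.

```java
import java.util.Map;

// Hypothetical helper, not part of KNIME: compares two documents that are
// represented as sparse vectors (term -> weight, e.g. TF-IDF values).
final class DocumentSimilaritySketch {

    // Euclidean length of a sparse vector.
    static double norm(Map<String, Double> vector) {
        double sumOfSquares = 0.0;
        for (double weight : vector.values()) {
            sumOfSquares += weight * weight;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Cosine similarity: dot product divided by the product of the lengths.
    // Returns 0.0 if either vector has zero length.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> entry : a.entrySet()) {
            Double other = b.get(entry.getKey());
            if (other != null) {
                dot += entry.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (normA * normB);
    }
}
```

With non-negative weights, a value close to 1 means the two documents weight the same terms in similar proportions, and 0 means they share no weighted terms.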

Best regards, Iris

kilian.thiel · September 27, 2015, 12:55pm

Hi,

In case you haven't found a solution: attached is an example workflow showing how to compute the TF and IDF values and multiply them using the Math Formula node.
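For readers without the workflow at hand, the same idea (compute TF, compute IDF, multiply) can be sketched in plain Java outside KNIME. This is only an illustration under stated assumptions: documents are already tokenized into lists of terms, TF is the relative frequency, and IDF is taken as ln(N / df), which may differ from the variants offered by the KNIME IDF node. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of KNIME.
final class TfIdfSketch {

    // Relative TF: occurrences of each term divided by the document length.
    static Map<String, Double> relativeTf(List<String> tokens) {
        Map<String, Double> tf = new HashMap<>();
        for (String token : tokens) {
            tf.merge(token, 1.0, Double::sum);
        }
        for (Map.Entry<String, Double> entry : tf.entrySet()) {
            entry.setValue(entry.getValue() / tokens.size());
        }
        return tf;
    }

    // IDF as ln(N / df), where df is the number of documents containing
    // the term. Assumption: other IDF variants exist and may be used instead.
    static Map<String, Double> idf(List<List<String>> docs) {
        Map<String, Integer> documentFrequency = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                documentFrequency.merge(term, 1, Integer::sum);
            }
        }
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Integer> entry : documentFrequency.entrySet()) {
            idf.put(entry.getKey(), Math.log((double) docs.size() / entry.getValue()));
        }
        return idf;
    }

    // One vector (term -> TF * IDF) per document, in input order.
    static List<Map<String, Double>> tfIdf(List<List<String>> docs) {
        Map<String, Double> idf = idf(docs);
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (List<String> doc : docs) {
            Map<String, Double> vector = new HashMap<>();
            for (Map.Entry<String, Double> entry : relativeTf(doc).entrySet()) {
                vector.put(entry.getKey(), entry.getValue() * idf.get(entry.getKey()));
            }
            vectors.add(vector);
        }
        return vectors;
    }
}
```

Each returned map is the TF-IDF vector of one document, which is the per-document representation the original question asks about. Note that with this IDF variant a term occurring in every document gets weight 0.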

Cheers, Kilian

tagcounting.zip

karelman · September 28, 2015, 12:16pm

I do it this way: you need three nodes, TF, IDF, and a Java Snippet in which you do the calculation.

// Enter your code here:
out_TFxIDF = c_TFabs * c_IDF;

Geo · September 28, 2015, 10:38pm

Thank you for the example. I have only two questions:

- Why do you use relative TF instead of absolute TF?

- Is there any way not to count the terms in the title?

kilian.thiel · September 29, 2015, 8:42pm

Hi Geo,

- For the well-known TF*IDF value (https://en.wikipedia.org/wiki/Tf%E2%80%93idf), the relative TF is used. The absolute value is not that meaningful, because it grows with document length.

- No. A workaround would be not to set the title, or to use a dummy title string, e.g. an ID.
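Regarding the first answer, a small sketch in plain Java can show why the relative TF is easier to compare across documents than the absolute count. It assumes tokenized documents; the class and method names are hypothetical and not part of KNIME.

```java
import java.util.List;

// Hypothetical helper, not part of KNIME.
final class TfComparisonSketch {

    // Absolute TF: raw number of occurrences of the term in the document.
    static int absoluteTf(List<String> tokens, String term) {
        int count = 0;
        for (String token : tokens) {
            if (token.equals(term)) {
                count++;
            }
        }
        return count;
    }

    // Relative TF: absolute TF divided by the document length,
    // or 0.0 for an empty document.
    static double relativeTf(List<String> tokens, String term) {
        if (tokens.isEmpty()) {
            return 0.0;
        }
        return (double) absoluteTf(tokens, term) / tokens.size();
    }
}
```

For the tokens "data", "mining" and for the same two tokens repeated three times, the absolute TF of "data" is 1 and 3 respectively, while the relative TF is 0.5 in both cases. The longer document does not look more relevant merely because it is longer.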

Cheers, Kilian

Geo · September 29, 2015, 11:50pm

Thank you, Kilian.

I guess the text processing features depend a lot on what one intends to do. I've used them to perform some transformations and mining on a single text column (no title, no authors) for a supervised classification task. I found that for this kind of exercise the current document class seems a tad too complex. Maybe it would be worthwhile to allow some options in the String To Document node (e.g. title, authors, etc.) to be set to "none", instead of having to choose an empty string variable in each case.

kilian.thiel · September 30, 2015, 12:39pm

Hi Geo,

thank you for the feedback. I agree that the document creation requires too many mandatory fields and I will think about making more fields optional.

Cheers, Kilian
