I want to calculate TF-IDF by document, but I don't know how to do this.
I am able to do it by term, but I don't know how to do it by document. Kindly advise me. This will help me compare one document with another.
TF-IDF cannot be calculated for a document as a whole. It is calculated for a term within a document: how often the term occurs in that document, weighted by how rare the term is across the whole set of documents.
For comparing documents I would point you to our example server. Kilian made some very nice examples there; e.g. the 009002_DocumentClustering workflow generates a vector for each document and then clusters them.
Best regards, Iris
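To make the "one vector per document" idea concrete, here is a minimal Java sketch, not the code of the 009002_DocumentClustering workflow. It assumes documents are already tokenized into lists of terms, uses relative TF and IDF = log(N / df), and compares two documents with cosine similarity. The class and method names (TfIdfVectors, tfIdfVectors, cosine) are made up for illustration, and the IDF node in KNIME may use a different IDF variant.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper class: builds one TF-IDF vector per document.
final class TfIdfVectors {

    // Relative term frequency: occurrences of a term divided by document length.
    static Map<String, Double> relativeTf(List<String> doc) {
        Map<String, Double> tf = new HashMap<>();
        for (String term : doc) {
            tf.merge(term, 1.0, Double::sum);
        }
        tf.replaceAll((term, count) -> count / doc.size());
        return tf;
    }

    // Assumed IDF variant: log(N / df), where df is the number of documents containing the term.
    static Map<String, Double> idf(List<List<String>> corpus) {
        Map<String, Double> df = new HashMap<>();
        for (List<String> doc : corpus) {
            for (String term : new HashSet<>(doc)) { // count each term once per document
                df.merge(term, 1.0, Double::sum);
            }
        }
        df.replaceAll((term, n) -> Math.log(corpus.size() / n));
        return df;
    }

    // One sparse vector (term -> TF*IDF weight) per document.
    static List<Map<String, Double>> tfIdfVectors(List<List<String>> corpus) {
        Map<String, Double> idfByTerm = idf(corpus);
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (List<String> doc : corpus) {
            Map<String, Double> vec = relativeTf(doc);
            vec.replaceAll((term, tf) -> tf * idfByTerm.get(term));
            vectors.add(vec);
        }
        return vectors;
    }

    // Cosine similarity between two sparse vectors; 0.0 if either vector has zero length.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double x : v.values()) {
            sum += x * x;
        }
        return Math.sqrt(sum);
    }
}
```

Calling TfIdfVectors.tfIdfVectors(corpus) and then TfIdfVectors.cosine(vectors.get(i), vectors.get(j)) gives a similarity between 0 and 1 for documents i and j. A term that occurs in every document gets weight 0 under this IDF variant, so it does not contribute to the comparison.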
In case you haven't found a solution, attached is an example workflow showing how to compute TF and IDF values and multiply them using the Math Formula node.
I do it this way: you need three nodes, TF, IDF, and a Java Snippet in which you do the calculation.
// Enter your code here:
out_TFxIDF = c_TFabs * c_IDF;
Thank you for the example. I have only two questions:
- Why do you use relative TF instead of absolute TF?
- Is there any way not to count the terms in the title?
- For the well-known TF*IDF value (https://en.wikipedia.org/wiki/Tf%E2%80%93idf), the relative TF is used. The absolute value is not that meaningful, because it grows with the length of the document.
- No. A workaround would be not to set the title, or to use a dummy title string, e.g. an ID.
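As a small illustration of the first answer, the following Java sketch contrasts absolute and relative TF. It assumes documents are plain lists of tokens; the class name TfComparison and its methods are hypothetical and are not part of KNIME.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical helper class contrasting the two TF definitions.
final class TfComparison {

    // Absolute TF: raw number of occurrences of the term in the document.
    static int absoluteTf(List<String> doc, String term) {
        return Collections.frequency(doc, term);
    }

    // Relative TF: occurrences divided by the total number of terms in the document.
    static double relativeTf(List<String> doc, String term) {
        return doc.isEmpty() ? 0.0 : (double) absoluteTf(doc, term) / doc.size();
    }
}
```

For a 4-term document containing a term twice, the absolute TF is 2 and the relative TF is 0.5. For a 12-term document containing the same term three times, the absolute TF is higher (3) but the relative TF is lower (0.25), which is why the relative value is easier to compare across documents of different length.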
Thank you, Kilian.
I guess the text processing features depend a lot on what one intends to do. I've used them to perform some transformations and mining on a single text column (no title, no authors) for a supervised classification task. I found that for this kind of exercise the current document class seems to be a tad too complex. Maybe it would be worthwhile allowing the possibility in the String To Document node to set some options to "none" (e.g. title, authors, etc.) instead of having to choose an empty string variable in each case.
Thank you for the feedback. I agree that document creation requires too many mandatory fields, and I will think about making more fields optional.