TF-IDF

I would like to calculate the TF-IDF for all words (except stopwords and punctuation) in various documents.

I have tried several things but have not gotten anywhere near what I want. I know that I probably have to use the TF and IDF nodes and multiply their outputs in a Java Snippet node, but I cannot get it to work.

Does anyone have a workspace example or a link to something that works? Or any tips/ideas?

Thanks!

After parsing your documents (or converting strings to documents), you need to use the BoW Creator node to create a bag of words. This BoW can then be filtered, e.g. by a stop word filter, punctuation erasure, or a number filter. The TF and IDF nodes can then be applied to the terms in the (filtered) BoW, and finally the Java Snippet (or Math) node multiplies the two resulting columns.
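Outside of KNIME, the same steps (bag of words, filtering, TF, IDF, multiplication) can be sketched in a few lines of Python. This is only an illustration under assumptions of my own: the stop word list is a made-up placeholder, TF is taken as the relative term frequency within a document, and IDF as log(N / df). The TF and IDF nodes may be configured to use other variants, so the numbers need not match the workflow's output.

```python
import math
import string
from collections import Counter

# Hypothetical stop word list, for illustration only (not KNIME's built-in list).
STOP_WORDS = {"the", "a", "is", "on", "and"}


def tokenize(text):
    """Lowercase, erase punctuation, split on whitespace, drop stop words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [term for term in cleaned.split() if term not in STOP_WORDS]


def tf_idf(documents):
    """Return one {term: tf * idf} dict per document.

    Assumed formulas: tf = count / terms in document, idf = ln(N / df).
    """
    # One bag of words (term -> count) per document.
    bags = [Counter(tokenize(doc)) for doc in documents]
    n_docs = len(bags)

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for bag in bags:
        doc_freq.update(bag.keys())

    scores = []
    for bag in bags:
        total = sum(bag.values())
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in bag.items()
        })
    return scores


docs = ["The cat sat on the mat.", "The dog sat on the log.", "Cats and dogs!"]
scores = tf_idf(docs)
for doc, doc_scores in zip(docs, scores):
    print(doc, doc_scores)
```

With this IDF variant, a term that occurs in every document gets a score of 0, and terms that are rare across the collection but frequent within one document score highest.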

Attached is an example workflow in which TF*IDF values are computed for the terms of the Document Clustering and Classification example from the examples section (see: http://tech.knime.org/examples). The workflow runs with the latest KNIME version, 2.6.1.

Cheers, Kilian
