The LDA (Topic Extractor) node seems to work well, but I would like to apply the topic definitions generated from one document collection (e.g. a corpus) to a new collection of documents.
I was wondering how the Topic Extractor node scores the documents to produce the 'Document table with topics' output, and how I might construct a workflow to replicate this process. Despite searching for a while, I haven't found a simple statement of how the document proportions are calculated -- though perhaps this is because I don't understand the statistical notation.
Is this feasible in KNIME? Or should I be considering other platforms (e.g. R) for this sort of thing with LDA?
Ok, so it seems this question has already been asked here. And the answer seems to be that it can't be done with the Topic Extractor Node.
Internally, the Mallet library is used for the topic extraction and assignment. Very roughly speaking, this is more or less based on an SVD of a probabilistic document-term matrix. Unfortunately, there are no matrix factorization nodes in KNIME so far that could reproduce something like this. You could try out the PCA nodes (e.g. PCA Compute) or reach out to the R integration.
You could try the package "topicmodels" in R, however, you'll need to get acquainted with the "tm" package first to put your documents into a Corpus and from there into a DocumentTermMatrix.
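To make the "scoring" step a bit more concrete, here is a minimal Python sketch of the general idea: keep the topic-word probabilities fixed and estimate only the topic proportions of a new document. It uses a simple smoothed EM "fold-in" and is not how Mallet or the Topic Extractor node does it (as far as I know, Mallet's inferencer uses Gibbs sampling, and in topicmodels the `posterior()` function with `newdata` is meant for this). The function name, the `smoothing` parameter, and the toy topics are all made up for illustration.

```python
def infer_topic_proportions(doc_counts, topic_word, n_iter=50, smoothing=0.01):
    """Estimate topic proportions for one new document, with topics held fixed.

    doc_counts: dict mapping word -> count in the new document
    topic_word: list of dicts, one per topic, mapping word -> P(word | topic)
    smoothing:  small pseudo-count (an assumption, not a Mallet setting)
    """
    n_topics = len(topic_word)
    theta = [1.0 / n_topics] * n_topics  # start from uniform proportions

    # Words that no topic knows about carry no information, so skip them.
    known = {w: c for w, c in doc_counts.items()
             if c > 0 and any(t.get(w, 0.0) > 0.0 for t in topic_word)}
    total = sum(known.values())
    if total == 0:
        return theta

    for _ in range(n_iter):
        expected = [0.0] * n_topics
        for word, count in known.items():
            # E-step: share each word's count among topics in proportion
            # to theta[k] * P(word | topic k).
            weights = [theta[k] * topic_word[k].get(word, 0.0)
                       for k in range(n_topics)]
            norm = sum(weights)
            for k in range(n_topics):
                expected[k] += count * weights[k] / norm
        # M-step: renormalise the expected counts (plus smoothing) to sum to 1.
        theta = [(expected[k] + smoothing) / (total + n_topics * smoothing)
                 for k in range(n_topics)]
    return theta


# Toy topic definitions, invented purely for this example.
toy_topics = [
    {"gene": 0.5, "cell": 0.4, "market": 0.1},   # a "biology"-like topic
    {"market": 0.6, "stock": 0.3, "gene": 0.1},  # a "finance"-like topic
]

bio_doc = infer_topic_proportions({"gene": 3, "cell": 2}, toy_topics)
fin_doc = infer_topic_proportions({"market": 4, "stock": 1}, toy_topics)
print(bio_doc, fin_doc)
```

In a real workflow the topic-word probabilities would come from the trained model (e.g. normalised topic-term weights), and the word counts from the document-term matrix of the new collection, restricted to the vocabulary seen during training.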
Thanks for your replies.
I've started dabbling with the topicmodels package in R, with some initial success. And rather than using the tm package for the preprocessing, I worked out how to import a document-term matrix generated in KNIME. So it's nice to know that I can use both to take advantage of their respective strengths.
That sounds really nice. Could you share a workflow with a bit of data (attached to this thread)? I would really like to see that solution.