Applying LDA topic terms to new documents

AngusVeitch · March 29, 2016, 9:06am

The LDA (Topic Extractor) node seems to work well, but I would like to apply the topic definitions generated from one document collection (e.g. a corpus) to a new collection of documents.

I was wondering how the Topic Extractor node scores the documents to produce the 'Document table with topics' output, and how I might construct a workflow to replicate this process. Despite searching for a while, I haven't found a simple statement of how the document proportions are calculated -- though perhaps this is because I don't understand the statistical notation.

Is this feasible to do in Knime? Or should I be considering other platforms (e.g. R) to do this sort of thing with LDA?

AngusVeitch · March 31, 2016, 9:57am

Ok, so it seems this question has already been asked here. And the answer seems to be that it can't be done with the Topic Extractor Node.

Damn.

kilian.thiel · April 5, 2016, 7:51pm

Hi Sugna,

internally, the Mallet library is used for topic extraction and assignment. Very roughly speaking, this is more or less based on an SVD of a probabilistic document-term matrix. Unfortunately, there are no matrix factorization nodes in KNIME so far that could reproduce something like this. You could try out the PCA nodes (e.g. PCA Compute) or turn to the R integration.

Cheers, Kilian
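
For readers wondering how topic proportions can be computed for a new document once the topics are fixed, here is a minimal sketch. It is not how the Topic Extractor node or Mallet does it (Mallet, as far as I know, relies on sampling-based inference); it swaps in a simpler EM-style "fold-in" with additive smoothing. The function name `infer_topic_proportions`, the toy `topic_word` matrix and the `alpha` value are all assumptions made for illustration.

```python
def infer_topic_proportions(doc_counts, topic_word, alpha=0.1, iterations=100):
    """Estimate topic proportions for one new document, keeping topics fixed.

    doc_counts: dict mapping word index -> count in the new document
    topic_word: list of K lists, topic_word[k][w] = P(word w | topic k)
    alpha: symmetric smoothing added to every topic (assumed value)
    """
    num_topics = len(topic_word)
    theta = [1.0 / num_topics] * num_topics  # start from a uniform mixture
    if sum(doc_counts.values()) == 0:
        return theta  # empty document: nothing to infer
    for _ in range(iterations):
        expected = [0.0] * num_topics
        for w, count in doc_counts.items():
            # E-step: how responsible is each topic for word w?
            weights = [theta[k] * topic_word[k][w] for k in range(num_topics)]
            norm = sum(weights)
            if norm == 0.0:
                continue  # word has zero probability under every topic
            for k in range(num_topics):
                expected[k] += count * weights[k] / norm
        # M-step: smoothed, renormalised expected topic counts
        denom = sum(expected) + alpha * num_topics
        theta = [(expected[k] + alpha) / denom for k in range(num_topics)]
    return theta


# Toy topics over a four-word vocabulary (made-up numbers).
topic_word = [
    [0.45, 0.45, 0.05, 0.05],  # topic 0 favours words 0 and 1
    [0.05, 0.05, 0.45, 0.45],  # topic 1 favours words 2 and 3
]
new_doc = {0: 3, 1: 2, 2: 1}  # word index -> count
theta = infer_topic_proportions(new_doc, topic_word)
```

The returned list sums to 1 and gives the larger share to topic 0 here, since most words in `new_doc` are ones that topic favours. With real data, `topic_word` would come from the trained model's topic-term weights, normalised per topic.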

Geo · April 6, 2016, 12:27am

You could try the package "topicmodels" in R. However, you'll need to get acquainted with the "tm" package first, to put your documents into a Corpus and from there into a DocumentTermMatrix.

AngusVeitch · April 6, 2016, 7:00am

Thanks for your replies.

I've started dabbling with the topicmodels package in R, with some initial success. And rather than using the tm package for the preprocessing, I worked out how to import a document-term matrix generated in KNIME. So it's nice to know that I can use both tools to take advantage of their respective strengths.
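
One step that matters when applying existing topics to a new collection is lining up the new document-term matrix with the vocabulary the model was trained on. A rough sketch of that step follows. The CSV layout (one row per document, one column per term, a `doc_id` column) is an assumption, not the format KNIME actually exports, and `load_aligned_counts` is a hypothetical helper.

```python
import csv
import io

# Hypothetical export: one row per document, one column per term.
DTM_CSV = """doc_id,topic,model,corpus,banana
d1,2,1,0,3
d2,0,4,1,0
"""

# Vocabulary of the already-trained model (assumed); "banana" is not in it.
TRAINED_VOCABULARY = ["corpus", "model", "topic"]


def load_aligned_counts(csv_text, vocabulary, id_column="doc_id"):
    """Return {doc_id: {vocab_index: count}}, skipping terms the model never saw."""
    index = {term: i for i, term in enumerate(vocabulary)}
    documents = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts = {}
        for term, value in row.items():
            if term == id_column or term not in index:
                continue  # unseen terms cannot be scored against the topics
            count = int(value)
            if count > 0:
                counts[index[term]] = count
        documents[row[id_column]] = counts
    return documents


aligned = load_aligned_counts(DTM_CSV, TRAINED_VOCABULARY)
```

Terms the trained model never saw are simply dropped in this sketch; whether that is acceptable depends on how much of the new collection's vocabulary falls outside the original one.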

kilian.thiel · April 21, 2016, 11:54am

Hi sugna,

that sounds really nice. Could you share a workflow with a bit of data (by attaching it to this thread)? I would really like to see that solution.

Cheers, Kilian
