I have a question about the Topic Extractor Node:
In the configuration of the node I can adjust the number of iterations.
Does this mean the number of iterations performed for each document?
Or is this the maximal number of topics that can/should be handled by this node?
Or is this something else?
Thank you for your help.
That is the number of iterations over all documents. Topics are extracted in several iterations from all documents, and this number specifies the maximal number of iterations.
The Topic Extractor node provides iteration statistics at its third output. There you can see a log likelihood value for each iteration, which can be interpreted as a convergence score of the algorithm. Attached is an example workflow with a Topic Extractor node and a Scatter Plot node. The plot shows the score values: there are two jumps of the log likelihood score, the first after ~10 iterations and the second after ~290 iterations. After 290 iterations the score does not change dramatically anymore. This means that ~300 iterations are enough to extract the topics from the given data set.
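To make "does not change dramatically anymore" concrete, here is a minimal sketch (not part of the node itself; the function name and tolerance are hypothetical) of how one could pick an iteration count from the log likelihood column of the third output:

```python
def converged_at(scores, tol=0.01):
    """Return the index of the first iteration from which on the log
    likelihood stays within a relative tolerance `tol` of its final
    value. `scores` is the per-iteration log likelihood trace
    (values are negative for LDA)."""
    final = scores[-1]
    bound = abs(tol * final)
    for i in range(len(scores)):
        # all remaining scores must already sit on the plateau
        if all(abs(s - final) <= bound for s in scores[i:]):
            return i
    return len(scores) - 1

# Example: a jump early on, then a plateau around -7.0
trace = [-10.0, -9.0, -8.5, -7.0, -7.0, -7.0, -7.0, -7.0]
print(converged_at(trace))  # -> 3
```

Applied to the attached workflow, such a check would flag the plateau after the second jump at ~290 iterations.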
I have tested your example workflow and read up on the theory of log likelihood and maximum likelihood.
Can you please explain why the log likelihood is negative in your example and what the log likelihood exactly tells us? Is it the likelihood that the extracted topics fit the analysed documents, or what is the relation between the topics, the iterations, the analysed documents and the log likelihood?
Thank you for your help! :-)
The node is based on the ParallelTopicModel implementation from Mallet. It is described in detail, including the parameters, in the paper Efficient Methods for Topic Model Inference on Streaming Document Collections.
You can also have a look at its source code, especially at the modelLogLikelihood() method, which explains at the beginning how the likelihood is computed. The KNIME column contains the result of this method divided by the total number of tokens.
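For intuition, here is a simplified sketch of what a per-token log likelihood means. This is NOT Mallet's exact modelLogLikelihood() (which also accounts for the Dirichlet priors); it just scores each observed token under the learned topic mixtures and normalizes by the token count, as the KNIME column does. Function and variable names are illustrative:

```python
import math

def per_token_log_likelihood(docs, theta, phi):
    """Simplified per-token log likelihood of a corpus under an
    LDA-style model (illustrative only, not Mallet's exact formula).
    docs:  list of documents, each a list of word ids
    theta: theta[d][k] = P(topic k | document d)
    phi:   phi[k][w]   = P(word w | topic k)
    """
    total_ll = 0.0
    n_tokens = 0
    for d, doc in enumerate(docs):
        for w in doc:
            # probability of this word as a mixture over topics
            p_w = sum(theta[d][k] * phi[k][w] for k in range(len(phi)))
            total_ll += math.log(p_w)  # log of a probability <= 1, hence negative
            n_tokens += 1
    return total_ll / n_tokens

# One document, one topic, two equally likely words:
# every token has probability 0.5, so the result is log(0.5) ~ -0.693
print(per_token_log_likelihood([[0, 0, 1]], [[1.0]], [[0.5, 0.5]]))
```

Because each token's probability is at most 1, every log term is at most 0, which is why the score in the example workflow is negative: higher (closer to 0) means the model assigns more probability to the observed corpus.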
Hi @tobias.koetter @kilian.thiel
Thank you for pointing me to these documents. I was just wondering: why does the KNIME implementation divide the log likelihood by the total number of tokens? I have noticed that, independent of the use case (LDA for online reviews, social media, books), I get a value ranging from -6 to -8.
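One way to see why the -6 to -8 range recurs (my own illustration, not from the thread): dividing by the token count turns a corpus-size-dependent sum into an average per token, so corpora of very different sizes become comparable, and the exponential of the negated value is the familiar perplexity measure:

```python
import math

# Per-token log likelihoods around -6 to -8 correspond to
# perplexities of roughly 400 to 3000, i.e. the model is about as
# uncertain per token as a uniform choice over that many words.
for ll in (-6.0, -7.0, -8.0):
    print(f"per-token LL {ll} -> perplexity {math.exp(-ll):.0f}")
```

The raw (unnormalized) log likelihood, by contrast, grows in magnitude with every additional token, so a 1,000-book corpus and a 40-review corpus would never land in the same range.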
And one more question, please. I have noticed that the “number of threads” value in the node configuration adjusts automatically depending on the computer I am using. Is it related to the number of cores or to RAM? Also, the node description for “number of threads” says that the document collection is divided by the desired number of threads. By document collection, does it refer to the number of documents I have (e.g., books)? I ask because even in a project in which I had 40 documents the number of threads was 80, which is more than the number of documents.
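For illustration only (this is a hypothetical sketch, not Mallet's or KNIME's actual partitioning code), "dividing the document collection by the number of threads" can be pictured as dealing documents out to per-thread chunks; with more threads than documents, some chunks simply stay empty:

```python
def split_into_chunks(docs, n_threads):
    """Hypothetical round-robin partition of a document collection
    among worker threads. With n_threads > len(docs), the surplus
    chunks are empty and the extra threads have no work."""
    chunks = [[] for _ in range(n_threads)]
    for i, doc in enumerate(docs):
        chunks[i % n_threads].append(doc)
    return chunks

# 40 documents across 80 threads: 40 chunks of one document, 40 empty
chunks = split_into_chunks(list(range(40)), 80)
print(sum(1 for c in chunks if c))  # -> 40
```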