Topic Quality Measures

Friends:

I have been using the LDA node to identify and understand latent topics.  As you are aware, it has options for both the number of topics and the number of terms per topic.

In this context, I was wondering:

1.  if the sensitivity of the LDA to the addition or deletion of terms could be determined interactively; and

2.  whether part of the sample statements can be forced into a specified topic, with the model altered accordingly so that it predicts appropriately; and

3.  whether there are any measures of topic quality, so that the right combination could be chosen.  If any of you use rules of thumb for this, those would also help, especially in the absence of quality measures.

More importantly:

4.  What is the sample size recommended for training and then prediction?  The population size of the statements is in the region of 60k.

5.  Which are the other nodes or methods you would consider?

Thanks in advance for sharing your knowledge.

Cheers

Hi Sridhar,

1. there is no way to change the parameters and inspect the results interactively. However, it is possible to loop over a parameter range and thus run the LDA with different parameters. The results can be collected and inspected later on. Furthermore, the Optimization Extension (labs) provides loop nodes that optimize parameter settings by exhaustive search, provided there is a target function to optimize.

2. this is not possible.

3. there is a log-likelihood measure for each iteration, which indicates the convergence of the iterative process. Additionally, each term is assigned a weight indicating its "membership" in a topic. There is no single measure of quality.

4. the Topic Extractor LDA node is not a classifier. What do you mean by training and prediction?

5. there are other nodes/ways to extract important keywords, such as the Keygraph Keyword Extractor or the Chi-Square Keyword Extractor. Additionally you could compute TF*IDF values for terms and filter those with the highest values using the Frequency Filter node.
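To make the TF*IDF idea in point 5 concrete, here is a minimal sketch in plain Python, outside of KNIME. It assumes whitespace tokenization, relative term frequency, and idf = log(N / document frequency); the nodes offer several variants, so their exact numbers may differ. The function names tf_idf and top_terms are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: tf*idf} dict per document.

    Assumed definitions (illustrative, not necessarily the nodes' formulas):
      tf  = term count / number of tokens in the document
      idf = log(number of documents / documents containing the term)
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def top_terms(doc_scores, k):
    """Keep the k highest-scoring terms; ties are broken alphabetically."""
    ranked = sorted(doc_scores, key=lambda term: (-doc_scores[term], term))
    return ranked[:k]


if __name__ == "__main__":
    statements = [
        "the engine is loud",
        "the brakes are worn",
        "the engine stalls often",
    ]
    for text, doc_scores in zip(statements, tf_idf(statements)):
        print(text, "->", top_terms(doc_scores, 2))
```

A term that occurs in every statement (here "the") gets a score of zero, so keeping only the top-scoring terms removes it. That is roughly the effect of filtering on the highest values.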

For details about the LDA algorithm please read:

http://people.cs.umass.edu/~lmyao/papers/fast-topic-model10.pdf

The node integrates the Mallet lib for LDA computation:

http://mallet.cs.umass.edu/

Cheers, Kilian
