Finding similarity using LDA.

alamsaqib · August 29, 2018, 2:56am

Hello.
I have 5 books in .txt format (Book1, Book2, Book3, Book4, and Book5). I would like to use topic extraction (LDA) so that, when I give the model a paragraph or two (taken from the books above), it predicts which book the text belongs to. I have already implemented the 17_TopicExtraction_with_the_ElbowMethod workflow, but the problem is: how should I feed in the test text so that it is compared with the books using topic extraction?

Best Regards

Alam

Vincenzo · September 7, 2018, 9:57am

Hi @alamsaqib,

The (LDA) Topic Extractor node assigns a topic to each document and generates keywords for each topic.
In other words, it is a probabilistic topic model (an unsupervised model) that detects topics in an unlabeled set of documents. The documents are represented as random mixtures of latent topics, where each topic is characterized by a probability distribution over a fixed vocabulary. The per-document topic mixtures are given a Dirichlet prior. The aim is to infer the topics. The paper that introduced LDA is:
Blei, Ng, and Jordan. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993-1022.
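
To make the "random mixtures of latent topics" idea concrete, here is a minimal Python sketch of the generative story behind LDA, using only the standard library. It is not what the KNIME node runs internally: the node does the reverse and infers topics from text. The vocabulary, the two hand-written topics, the `alpha` value, and the function names are all assumptions made up for illustration.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, vocabulary, alpha, length, rng):
    """Generate one document following the LDA generative story."""
    # Per-document topic mixture: theta ~ Dirichlet(alpha, ..., alpha).
    theta = sample_dirichlet([alpha] * len(topics), rng)
    words = []
    for _ in range(length):
        # Pick a topic for this word position, then a word from that topic.
        z = rng.choices(range(len(topics)), weights=theta)[0]
        words.append(rng.choices(vocabulary, weights=topics[z])[0])
    return theta, words


rng = random.Random(0)

# Hypothetical toy vocabulary and topics (assumptions, not taken from any book).
vocabulary = ["ship", "whale", "sea", "detective", "letter", "house"]
topics = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # a "sea" topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # a "mystery" topic
]

theta, words = generate_document(topics, vocabulary, alpha=0.5, length=10, rng=rng)
print("topic mixture:", theta)
print("document:", words)
```

Fitting LDA means going the other way: given only the words, estimate the topics and each document's mixture `theta`.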

How did you identify that a specific paragraph belongs to a specific book? Does the algorithm extract the titles of the books as specific topics and assign them to the corresponding paragraphs?
What’s the next objective of your analysis? Since LDA is an unsupervised model, you cannot use it directly to solve a classification problem. You should instead go for an algorithm that solves classification problems (decision tree, logistic regression, etc.), trained on paragraphs that are labeled with the book they come from.
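
As a rough illustration of the supervised route, here is a small Python sketch that learns from labeled paragraphs and predicts the book for a new one. To keep it dependency-free it uses a multinomial naive Bayes classifier with add-one smoothing rather than a decision tree or logistic regression. The training sentences, labels, and function names are hypothetical placeholders, not content from your books.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train_naive_bayes(labeled_paragraphs):
    """Collect word counts per book label from (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    doc_counts = Counter()              # label -> number of paragraphs
    vocabulary = set()
    for text, label in labeled_paragraphs:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        doc_counts[label] += 1
        vocabulary.update(tokens)
    return word_counts, doc_counts, vocabulary


def predict_book(text, model):
    """Return the label with the highest log posterior (add-one smoothing)."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)  # log prior
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # words never seen in training are ignored
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            )
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical labeled paragraphs; in practice, split each book into paragraphs.
training = [
    ("The whale surfaced near the ship and the crew lowered the boats", "Book1"),
    ("The captain watched the sea from the deck of the ship", "Book1"),
    ("The detective examined the letter found in the study", "Book2"),
    ("A stranger left the house before the detective arrived", "Book2"),
]
model = train_naive_bayes(training)

print(predict_book("the crew of the ship saw a whale", model))
print(predict_book("the detective read the letter", model))
```

The key difference from LDA is that the book label is part of the training data, so the model has something to predict.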

It could help if you could share the workflow you have been working on.

Hope that helps,
Best,
Vincenzo

alamsaqib · September 8, 2018, 2:31am

Hi @Vincenzo

Thanks for your reply. Yes, LDA is an unsupervised model, and I am using it for topic detection. I am still working on the workflow, and it needs some changes. Once it is complete, I will certainly share it.