Topic Extraction - How to get better results?

Hey there,

I'd like to share some ideas around Topic Extraction. I've been testing this node, but I'm not really satisfied with the results, so I'm looking for some tips.

When I look at the documents that are assigned a specific topic (and here I'm already keeping only documents with a weight of more than 0.9 for that topic), the assignment still doesn't look optimal. Right now I'm working with 4 words per topic, and here's an example:

word1, word2, word3, word4

I would expect all 4 words to appear in each document, but lots of documents contain only word1 or only word2.

Any tips for better results? More words per topic? Change the Alpha and Beta variables?

Thanks! :)

Gustavo Velho

Hi Gustavo,

The documents are assigned to a topic based on their similarity to other documents. This means that in your case, not every document has to include each of the four words that were extracted for each topic.

By increasing the words per topic, you will have a greater chance that the extracted words appear in a given document.
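To see how large the gap actually is, you could measure, per document, what fraction of a topic's words occur in it. This is only a minimal Python sketch outside the node: the function name `top_word_coverage` and the toy documents are made up for illustration, and it assumes the documents are already tokenized.

```python
def top_word_coverage(doc_tokens, topic_words):
    """Return the fraction of topic_words that occur at least once in doc_tokens."""
    present = set(doc_tokens)
    hits = [w for w in topic_words if w in present]
    return len(hits) / len(topic_words)


# Toy data (hypothetical): three tokenized documents assigned to the same topic.
docs = {
    "doc_a": ["word1", "foo", "bar", "word1"],
    "doc_b": ["word1", "word2", "word3", "word4", "baz"],
    "doc_c": ["word2", "qux"],
}
topic_words = ["word1", "word2", "word3", "word4"]

# Coverage per document: 1.0 means all topic words occur in the document.
coverage = {name: top_word_coverage(tokens, topic_words) for name, tokens in docs.items()}
```

Documents with a high topic weight but low coverage are the ones worth inspecting by hand.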

I hope that helps!

Best,

Roland

You may want to try this tool: http://elcid.demon.nl/form.html. It organizes a text into a tree of topics and subtopics, plus an automatic summary.

How clean is your data, and how did you preprocess it? That’s as important as the algorithm parameters.
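As a rough illustration of what basic cleaning can look like, here is a small Python sketch: lowercase the text, keep only alphabetic tokens, and drop short words and stop words. The stop-word list and the `min_length` threshold are placeholder assumptions, not recommended settings.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "for"}


def preprocess(text, min_length=3):
    """Lowercase, keep alphabetic tokens, drop short tokens and stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) >= min_length and t not in STOP_WORDS]


tokens = preprocess("The price of the stock rose 5% in 2 days, and it is STILL rising!")
```

Numbers and punctuation disappear here because only runs of letters are kept; whether that is appropriate depends on your data.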

Hi Gustavo,

If you want your topics to be represented by terms that occur in almost all documents of a topic, you can filter the documents before applying topic extraction:

1. standard preprocessing + topic extraction to find groups of documents that belong to the same topic

2. Loop over each group (= the documents belonging to one topic) and count term frequencies, then filter based on those frequencies.

3. Concatenate the filtered documents and apply topic extraction again.
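The filtering in step 2 could be sketched in Python roughly like this. It is one possible reading of the steps, not a definitive implementation: it assumes "frequency" means the share of a group's documents that contain a term, and that filtering means dropping rarer terms from the documents. The function name, the `min_doc_fraction` parameter and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def filter_terms_by_topic(docs, doc_topic, min_doc_fraction=0.5):
    """Keep, per document, only the terms that occur in at least
    min_doc_fraction of the documents assigned to the same topic."""
    # Group document ids by their assigned topic (the result of step 1).
    groups = defaultdict(list)
    for doc_id, topic in doc_topic.items():
        groups[topic].append(doc_id)

    filtered = {}
    for topic, doc_ids in groups.items():
        # Document frequency: number of documents in the group containing each term.
        df = Counter()
        for doc_id in doc_ids:
            df.update(set(docs[doc_id]))
        keep = {t for t, n in df.items() if n / len(doc_ids) >= min_doc_fraction}
        for doc_id in doc_ids:
            filtered[doc_id] = [t for t in docs[doc_id] if t in keep]
    return filtered


# Toy data (hypothetical): tokenized documents and their topic from the first run.
docs = {
    "d1": ["price", "market", "stock", "rain"],
    "d2": ["price", "market", "bond"],
    "d3": ["goal", "match", "team"],
    "d4": ["goal", "team", "coach"],
}
doc_topic = {"d1": "topic_0", "d2": "topic_0", "d3": "topic_1", "d4": "topic_1"}

filtered = filter_terms_by_topic(docs, doc_topic, min_doc_fraction=1.0)
```

The filtered documents would then go back into topic extraction (step 3). A threshold of 1.0 is very strict; a lower value keeps terms that occur in most, but not all, documents of a group.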

Cheers, Kilian

Thanks, guys! I'll try your suggestions and see how the results look. I appreciate your help!

Gustavo