I'd like to share some ideas around Topic Extraction. I've been testing this node, but I'm not really satisfied with the results, so I'm looking for some tips.
When I look at the documents that are assigned to a specific topic (and here I'm already filtering for documents that have a weight of more than 0.9 for that topic), the result still doesn't look optimized. Right now I'm working with 4 words per topic, and here's an example:
word1, word2, word3, word4
I would expect all 4 words to appear in the document, but there are lots of documents that contain only word1, or only word2.
Any tips for better results? More words per topic? Change the Alpha and Beta variables?
The documents are assigned to a topic based on their similarity to other documents. This means that, in your case, not every document has to include all four of the words that were extracted for its topic.
By increasing the words per topic, you will have a greater chance that the extracted words appear in a given document.
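To see how far the assigned documents are from containing all topic words, you could measure the coverage per document. This is only a minimal sketch in plain Python, outside the node itself; the function name `topic_word_coverage` and the sample documents are made up for illustration, and it assumes the documents are already tokenized.

```python
def topic_word_coverage(doc_tokens, topic_words):
    """Return the fraction of topic_words that occur in doc_tokens."""
    unique_topic_words = set(topic_words)
    if not unique_topic_words:
        return 0.0
    present = set(doc_tokens) & unique_topic_words
    return len(present) / len(unique_topic_words)


# Hypothetical documents that were all assigned to the same topic.
docs = {
    "doc_a": ["word1", "word3", "other"],
    "doc_b": ["word2", "filler", "text"],
    "doc_c": ["word1", "word2", "word3", "word4"],
}
topic = ["word1", "word2", "word3", "word4"]

# Coverage per document: 1.0 means every topic word appears in the document.
coverage = {name: topic_word_coverage(tokens, topic) for name, tokens in docs.items()}
for name, value in coverage.items():
    print(name, value)
```

Comparing this value for 4 words per topic against, say, 10 words per topic shows whether the extra words actually occur in the documents or whether the same one or two terms carry the assignment.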
If you want your topics to be represented by terms that occur in almost all documents of one topic, you can filter the documents before applying topic extraction.
1. Standard preprocessing + topic extraction to find groups of documents that belong to the same topic.
2. Loop over each group (= the documents belonging to one topic) and count term frequencies, then filter based on those frequencies.
3. Concatenate the filtered documents again and apply topic extraction again.
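Step 2 could look roughly like the sketch below. It is plain Python rather than a workflow, and it rests on assumptions: "filter based on frequencies" is read here as keeping only terms that occur in at least a given fraction of the documents in a group, and the function name `filter_by_group_doc_frequency`, the `min_fraction` threshold and the sample data are hypothetical. Steps 1 and 3 (the topic extraction itself) are not shown.

```python
from collections import Counter, defaultdict


def filter_by_group_doc_frequency(assigned_docs, min_fraction=0.5):
    """Keep only terms that occur in at least min_fraction of a group's documents.

    assigned_docs is a list of (topic_id, tokens) pairs, i.e. the output of a
    first topic extraction run. Returns the filtered token lists of all groups,
    concatenated, ready for a second run.
    """
    # Group the documents by the topic they were assigned to.
    groups = defaultdict(list)
    for topic_id, tokens in assigned_docs:
        groups[topic_id].append(tokens)

    filtered = []
    for group in groups.values():
        # Document frequency: in how many documents of the group a term occurs.
        doc_freq = Counter()
        for tokens in group:
            doc_freq.update(set(tokens))
        keep = {term for term, n in doc_freq.items() if n / len(group) >= min_fraction}
        # Drop all terms that are too rare within this group.
        for tokens in group:
            filtered.append([term for term in tokens if term in keep])
    return filtered


# Hypothetical result of step 1: (topic_id, tokens) per document.
assigned = [
    (0, ["a", "b", "c"]),
    (0, ["a", "b"]),
    (0, ["a", "d"]),
    (1, ["x", "y"]),
    (1, ["x", "z"]),
]
filtered_docs = filter_by_group_doc_frequency(assigned, min_fraction=0.6)
print(filtered_docs)
```

A higher `min_fraction` pushes the second run towards terms shared by almost all documents in a group, at the cost of throwing away more of each document's text.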