Parallel LDA - understanding results

Hello, I am currently using Parallel LDA in my automatic topic detection workflow. In general it strikes me as really useful, and with the right settings it is able to understand the given Documents and deduce their topics. However, there is one thing I am not entirely sure I understand correctly.

One of the results I get after running the node is the topic weight. How should I interpret this metric? I suspect that if I use a batch of 1,000 documents, the maximum weight is 1,000 - is this correct? If so, and one of the topics Parallel LDA detects has a weight of 700, does that mean this particular topic is likely to appear in 700 of the given documents?

Hopefully I understand the concept of the Weight correctly, but if I don’t, could anyone explain it to me better?

Thank you very much,
Daniel.

Hi Daniel,

If I understood LDA correctly, the individual topic weight is the probability that an individual document is about a specific topic. The reason for this is that LDA regards Documents as a mixture of topics, so each document has a varying degree of affiliation with a specific topic.
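To make the "mixture of topics" idea a bit more concrete, here is a minimal sketch in Python using gensim rather than the KNIME node itself; the toy documents, the topic count and all parameter values are invented, so please take it purely as an illustration of per-document topic probabilities:

```python
# Minimal illustration of "a document is a mixture of topics" (gensim, not KNIME).
from gensim import corpora, models

docs = [["princess", "dragon", "sword", "hero"],
        ["wizard", "potion", "magic", "villain"],
        ["princess", "hero", "magic", "wizard"]]

dictionary = corpora.Dictionary(docs)           # term <-> id mapping
corpus = [dictionary.doc2bow(d) for d in docs]  # bag-of-words per document

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, random_state=0)

# Each document gets a probability distribution over topics that sums to 1,
# e.g. [(0, 0.78), (1, 0.22)] -- this is the "degree of affiliation".
for bow in corpus:
    print(lda.get_document_topics(bow, minimum_probability=0.0))
```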

Hope this helps.

Thank you very much for your answer. That’s what I thought; however, I have a problem understanding the final number, which can be, as I’ve written before, for example 700 in a project with 1,000 documents. A probability should be on a scale from 0 to 1, or from 0 to 100 in the case of percentages, if I’m not mistaken.

So, if each document is assigned a probability of containing the found topic, does this mean that these probabilities are then summed into one number? For example, that I’ve found 700 documents, each with a probability of 100 %?

Hi Daniel,

I am not sure which numbers you are talking about - do you have a screenshot of your results, maybe?
Thanks.

Hello, unfortunately I don’t have access to the computer with my results right now, but I will give you an example. Hopefully it will be sufficient.

I am working on a project with book summaries, which in general are sorted into groups of approximately 1,000 documents - according to genres, years, languages etc. I’ve prepared the data to be used with Parallel LDA and I am searching for four words per topic. After applying the node, I get results like:

Topic_0: cumulative weight = 700

  • princess, weight = 190
  • dragon, weight = 130
  • sword, weight = 180
  • hero, weight = 200

How should I interpret the weights? Does this mean that in general there is a 70 % chance of finding Topic_0 among the Documents, and that at least 13 % of all Documents contain dragons?

Thank you once again, you are a big help to me :slight_smile:

Edit: The groups contain approximately 1,000 documents, not 10,000. I’ve edited this mistake out.

Hi Daniel,

So the individual weights that are assigned to words describe the significance each word has when generating a specific topic (so just for Topic_0 in your case). In your example, when a book summary is mainly about princesses and heroes, these words will get a high score, since they are prevalent in the summary. These words will then be associated with Topic_0.

However, as far as I understand LDA, you cannot derive a percentage distribution of words and topics over documents from these scores.
For this you could count the frequency of words and topics in your document corpus.
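If it helps, here is a rough sketch of that workaround in plain Python (outside KNIME, with invented example documents), counting how often terms occur and in what share of the documents they appear:

```python
# Count raw term frequencies and document shares for a tiny invented corpus.
from collections import Counter

documents = [
    "the princess and the dragon",
    "the hero raised his sword",
    "a dragon fought the hero",
]

term_freq = Counter()
for doc in documents:
    term_freq.update(doc.lower().split())

def doc_share(term):
    """Fraction of documents that contain the term at least once."""
    return sum(term in doc.lower().split() for doc in documents) / len(documents)

print(term_freq["dragon"])  # absolute term frequency: 2
print(doc_share("dragon"))  # share of documents mentioning it: ~0.67
```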

I hope this helps, because explaining LDA is a bit tricky in my opinion.

Hello, I understand that the bigger the weight is, the more significant the word is in the context of that topic. What I still don’t get (and thanks again for trying to explain it to me) is what that number really means.

  • How is it calculated?
    I believed it meant that the resulting number is the number of Documents about dragons and heroes, even though some Documents are only 20 % about dragons and heroes, whereas others are more like 80 % about them.
    This way all Documents could mention dragons and heroes, but in general you could say that 70 % of their content (because of the weight of 700 out of 1,000) is about dragons, princesses, swords and heroes, while the rest is about something else completely.
    Also, it doesn’t mean that the same 70 % could not be about magic, wizards, villains and potions as well, because topics can overlap.

  • What is its max value?
    Is it the total number of documents? If so, it would make more sense to me.

  • Why is it summarised to get the topic weight?

Basically my biggest problem with LDA is understanding what the weight really means. The topics I got make sense to me, and the bigger their weight, the more important and prevalent they are in the summaries. The number itself is still a mystery to me, though.

Hi @RIPR87, I happened to come across your post as I was searching the forum for anything related to LDA. I’m learning about this from scratch, and I had a similar question about weight too. But I think I have found the answer, coming from this official explanatory video from KnimeTV

I believe the built-in KNIME Parallel LDA node’s 2nd output uses the TF-IDF algorithm, but I will have to double-check that myself later for my own knowledge. (Or perhaps someone can verify this for us.)

As for the calculation of TF-IDF weight score, and the reasoning behind it, you can simply Google the answer as I am no expert, but I understand the logic behind it from what I’ve read so far.


Some additional clarification from KNIME would be helpful.

Hey all,

I had a look at the code, and as far as I understand it, the weight is actually a count of how often a word was assigned to a specific topic.

Based on the example above, the word princess was assigned to Topic_0 190 times. If we sum up all the weights for a specific term across topics, we should get the absolute term frequency of that term. (Summing the weights per term is not actually possible from the output, though, since the table only gives us the top k terms per topic, so we won’t see all the weights for each term.)

Since the weight is based on the frequency, it might make sense to normalize it, for example by dividing the weight of a term by the total term frequency of that term.
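As a toy illustration of both points (the token assignments below are invented, not output from the node, and the normalization is just one reading of the idea above):

```python
# Every word token carries a topic assignment; a term's weight in a topic is
# simply how many of its tokens were assigned to that topic.
from collections import Counter

# (term, assigned_topic) pairs for every token in a tiny invented corpus
assignments = [
    ("princess", 0), ("princess", 0), ("dragon", 0),
    ("dragon", 1), ("sword", 0), ("hero", 0), ("hero", 1),
]

weight = Counter(assignments)                         # weight[(term, topic)]
term_freq = Counter(term for term, _ in assignments)

print(weight[("dragon", 0)])                          # weight of "dragon" in Topic_0 -> 1
print(term_freq["dragon"])                            # absolute term frequency -> 2

# Summing a term's weights over all topics reproduces its term frequency:
assert sum(weight[("dragon", k)] for k in (0, 1)) == term_freq["dragon"]

# One possible normalization: divide the weight by the term's total frequency,
# giving the share of this term's occurrences that were assigned to the topic.
print(weight[("dragon", 0)] / term_freq["dragon"])    # 0.5
```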

Best,

Julian

