Topic modelling

I have “trained” a topic model on news headlines from the first 11 months of 2019 (the ABC “A Million News Headlines” Kaggle dataset). Now I want to test how well the December headlines fit the 10 topics discovered for the first 11 months. These scores should indicate, for each headline, whether it fits the existing topics well or not. If not, this may be an indicator that a new topic is emerging — sort of an early warning system. What would be a way to do this? I cannot find an obvious node for it. In R, I would use a measure such as “perplexity”: how “perplexed” the existing topics are by new, previously unseen headlines.
Any suggestions?

Hey @MvBreemen,

sorry for the late response. Unfortunately, we don’t have a dedicated node to do this, but I will try to create a component that calculates perplexity. I will get back to you in the next few days.

Best,

Julian


Hi @julian.bunzel , I too need a node for perplexity. I am using the LDA topic modeling node. I am now reading journal articles on how to calculate perplexity and make a graph like the elbow method. I don’t know how to code, so it would be really helpful if there were a node for this that takes the LDA node’s output as its input.

Hey @badger101,

it’s still on my list to create a component to calculate perplexity, but it’s also a good idea to have a dedicated node for that in the future. I will create a ticket for that.

Cheers,

Julian

Hey again,

after digging up some old threads, it seems we can simply use the Math Formula node to calculate the perplexity.

MALLET usually does not report perplexity directly, but it does report the LL/token (log likelihood per token) on standard output (you can see this in the KNIME console if you have set the console log level to Debug in the preferences). Luckily, KNIME captures these values (per iteration) in the third output table of the Parallel Topic Model node.

Based on the LL/token values, we can calculate the perplexity using the Math Formula node with the expression 2^(-$Log likelihood$) (1, 2). Lower perplexity means the model is less “surprised” by the data, so a rise in perplexity on new headlines could hint at an emerging topic.
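For anyone who wants to do the same step outside KNIME, here is a minimal Python sketch of the calculation. The LL/token values below are made-up illustration numbers, not real MALLET output; the formula mirrors the Math Formula expression 2^(-$Log likelihood$):

```python
# Perplexity from log-likelihood-per-token values, as captured per iteration
# in the third output table of the Parallel Topic Model node.
# The values here are hypothetical, just for illustration.
ll_per_token = [-9.84, -9.12, -8.77, -8.65, -8.61]

# Mirror of the Math Formula expression 2^(-$Log likelihood$):
# lower perplexity = the model is less "surprised" by the data.
perplexities = [2 ** (-ll) for ll in ll_per_token]

for it, (ll, ppl) in enumerate(zip(ll_per_token, perplexities), start=1):
    print(f"iteration {it}: LL/token = {ll:.2f}, perplexity = {ppl:.1f}")
```

As the LL/token improves (becomes less negative) over the iterations, the perplexity should drop accordingly.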

Cheers,

Julian

