Ngram Creator in Document Vector and LDA

sschacht · November 28, 2020, 8:14am

Hi Everybody,

I have a question regarding the nodes “NGram Creator”, “Document Vector” and “LDA”.
I am trying to cluster documents with LDA using the text processing nodes of KNIME. At the moment I am trying to get the LDA to also use compound words. To do that, I would like to create 2-grams and feed them to the LDA node. The LDA should not use only the 2-grams, but all other words plus the 2-grams.
At the moment I have no idea how to get these two nodes to work together. Normally (outside KNIME) I would create a document vector and feed it into LDA. In that case there are ways to include n-grams when creating the document vectors. But how can this be done in KNIME?
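
For reference, the approach outside KNIME mentioned above could look roughly like the following. This is only a minimal sketch in plain Python, not KNIME functionality: the function names `unigrams_and_bigrams` and `document_vectors`, the whitespace tokenization, and the example sentences are all assumptions made for illustration. It builds one term-count vector per document over a vocabulary that contains the single words plus the 2-grams.

```python
from collections import Counter

def unigrams_and_bigrams(tokens):
    """Return the single words plus all adjacent word pairs joined by '_'."""
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams

def document_vectors(docs):
    """Build a shared vocabulary and one term-count vector per document."""
    # Naive tokenization (lowercase + whitespace split) is an assumption.
    counts = [Counter(unigrams_and_bigrams(doc.lower().split())) for doc in docs]
    vocabulary = sorted(set().union(*counts))
    vectors = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, vectors

# Made-up example documents.
docs = ["machine learning is fun", "deep learning is machine learning"]
vocabulary, vectors = document_vectors(docs)
```

Each row of `vectors` could then be passed to an LDA implementation that accepts term counts.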

Another question about the LDA node: does the node create a document vector by itself during execution? And if so, what does the node do if we feed it an already created document vector (from the Document Vector node)?

Thanks for your help.
BR

julian.bunzel · November 30, 2020, 9:56am

Hey @sschacht,

the LDA node does not need a document vector as input. You only need a table that has a document column. To create compound words, you could use the Dictionary Tagger with a predefined list of compound words to combine them, so that the LDA node can handle them. You can also use the NGram Creator to build your dictionary. However, you would need to filter its output, for example by keeping only the top X most frequent n-grams.
The single words inside a compound word would then no longer be part of the LDA calculations, only the compound word itself. This way, the LDA works with the remaining single words plus all the compound words that were created by the Dictionary Tagger. It is not possible to use all single words plus all n-grams for the calculations, though.
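
To illustrate the idea outside of KNIME, here is a minimal sketch in plain Python. It is not the implementation of the NGram Creator or the Dictionary Tagger: the function names `top_bigrams` and `merge_compounds`, the `top_x` parameter, and the example documents are hypothetical. It counts 2-grams, keeps only the top X most frequent ones as the dictionary, and merges each match into a single token, so the individual words of a matched pair no longer appear on their own.

```python
from collections import Counter

def top_bigrams(token_docs, top_x):
    """Count adjacent word pairs over all documents and keep the top_x most frequent."""
    counts = Counter()
    for tokens in token_docs:
        counts.update(zip(tokens, tokens[1:]))
    return {pair for pair, _ in counts.most_common(top_x)}

def merge_compounds(tokens, compounds):
    """Replace every occurrence of a dictionary pair with one joined token."""
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in compounds:
            merged.append("_".join(pair))  # compound replaces both single words
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Made-up, already tokenized example documents.
token_docs = [
    "topic models need machine learning".split(),
    "machine learning needs data".split(),
    "good data helps machine learning".split(),
]
compounds = top_bigrams(token_docs, top_x=1)
merged_docs = [merge_compounds(tokens, compounds) for tokens in token_docs]
```

With these example documents, "machine learning" is the only 2-gram that occurs more than once, so it is the one that ends up in the dictionary and gets merged.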

BR,

Julian
