Bigram for LDA

Dear all,

I am trying to use bigrams for LDA. I used the NGram Creator node to create bigrams; you can see the results in the following screenshot.
Question: I added the Topic Extractor (Parallel LDA) node right after the NGram Creator node. However, I am not allowed to choose the column “Ngrams” as the Document Column when configuring the Topic Extractor node. Is there any way I can get the NGram Creator and Topic Extractor nodes to work with each other?

Hi @kwjKNIME -

Is there a reason why you want to create topics based on Ngrams, and not the text of the document itself? As you’ve noticed, the Topic Extractor node expects a document type as input.

It’s not clear to me why creating multiple Ngrams for each document, and then doing topic extraction based on those, is going to provide additional information useful for the topic extraction to work with. If anything, it will muddy the waters considerably.

If your dataset is not confidential, maybe post your workflow and we could have a discussion about your input data vs your desired output.

Hi ScottF,

I appreciate your willingness to help. I am new to KNIME, and I am trying to use it to replicate what I got from a Jupyter Notebook (see below):

However, I do not know how I can get bigram working for LDA. Any thoughts will be very helpful.

As currently implemented, the Topic Extractor (Parallel LDA) node is just going to return single-word terms that correspond to topics. If you want to get bigrams involved, there is some fairly involved manipulation that needs to happen first, so that a phrase like “convenience store” is converted into a hyphenated “convenience-store” and treated as a single term.
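Since you are coming from a Jupyter Notebook, here is a rough Python sketch of that hyphenation idea, just to illustrate the preprocessing step. It is not what the KNIME nodes do internally; the function name `hyphenate_bigrams`, the `min_count` threshold, and the whitespace tokenization are all assumptions made for the example.

```python
from collections import Counter


def hyphenate_bigrams(docs, min_count=2):
    """Join adjacent word pairs seen at least `min_count` times into one token.

    Hypothetical helper: a plain frequency threshold stands in for whatever
    bigram scoring a real pipeline would use.
    """
    # Naive tokenization: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]

    # Count every adjacent word pair across the whole corpus.
    pair_counts = Counter(
        pair for tokens in tokenized for pair in zip(tokens, tokens[1:])
    )
    frequent = {pair for pair, n in pair_counts.items() if n >= min_count}

    merged_docs = []
    for tokens in tokenized:
        merged, i = [], 0
        while i < len(tokens):
            # Greedy left-to-right merge of a frequent pair into "w1-w2".
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in frequent:
                merged.append(tokens[i] + "-" + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        merged_docs.append(" ".join(merged))
    return merged_docs


docs = [
    "the convenience store was closed",
    "a convenience store near the station",
]
result = hyphenate_bigrams(docs)
print(result)
```

The resulting strings could then be turned into documents and passed to a topic model, which would see “convenience-store” as one term, provided the tokenizer used downstream does not split on hyphens.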

Here is a fairly comprehensive workflow posted by @fvillarroel on the KNIME Hub that demonstrates how you might approach integration of bigrams into the rest of your LDA workflow:

If you are interested in plotting a reduced-dimension version of your dataset, you could take a look at this example workflow. It’s not an exact replica of what you have posted above, but maybe it gives you a place to start. Among other things, it plots clusters by topic based on t-SNE dimensions:


I’m just wondering whether the N-Gram Extractor node (NodePit) may work for you, as it does not require a document, just a regular string.
It also works in tandem with the Corpus Creator node (NodePit).


I appreciate your input, Scott! I will definitely look into this workflow.

Thank you so much, izaychik63, I will check it out.
