Detect known topics and find new topics in existing corpus and new documents


We are trying to analyze a text corpus of news articles related to specific keywords.

So far I have extracted the most frequent terms (TF, IDF), checked the most relevant terms per document, and created a tag cloud as well as a network representation of connected terms based on a co-occurrence counter.

Now we would like to detect which main topics are discussed in these news articles (e.g. "quality of service", "warranty", etc.). Each document could cover several topics, and a topic should be more than just a single term or keyword.

We regularly gather new documents in this context and would like to follow how intensively the detected topics are discussed.

We would like to tag new documents with the topics in order to see trends in these known topics.

In the best case we can also detect new topics over time.

For this I tried to use document classification (learning, predicting), but could not work out the best way to do it. I am not sure whether KNIME is built to support this kind of task.

I would appreciate comments, tips, or examples of how this could be solved.


Hi Bernd,

If you start with a predefined set of topics that you want to monitor, classifying documents into these topics is a good way to start. To do this, you have to transform the documents into document vectors and then use any learner node to build a model.
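To make the idea concrete outside KNIME, here is a minimal plain-Python sketch of "document vectors plus a learner". It assumes a bag-of-words representation and a multinomial Naive Bayes model with add-one smoothing; both are illustrative choices, not what any particular KNIME learner node does. The training sentences and the function names (tokenize, train, predict) are made up for the example. Note that this sketch predicts a single topic per document; for several topics per document one would typically train one yes/no model per topic.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(labeled_docs):
    """Collect per-topic term counts from (text, topic) pairs."""
    term_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, topic in labeled_docs:
        term_counts[topic].update(tokenize(text))
        doc_counts[topic] += 1
    vocab = {term for counts in term_counts.values() for term in counts}
    return term_counts, doc_counts, vocab


def predict(model, text):
    """Return the topic with the highest log-probability for the text."""
    term_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for topic in doc_counts:
        total_terms = sum(term_counts[topic].values())
        score = math.log(doc_counts[topic] / total_docs)
        for term in tokenize(text):
            if term in vocab:  # terms never seen in training are ignored
                score += math.log(
                    (term_counts[topic][term] + 1) / (total_terms + len(vocab))
                )
        scores[topic] = score
    return max(scores, key=scores.get)


# Toy training data (invented for illustration).
training = [
    ("the repair was covered by the warranty", "warranty"),
    ("warranty claim rejected after two years", "warranty"),
    ("the support hotline was slow and unfriendly", "quality of service"),
    ("fast and friendly service at the counter", "quality of service"),
]
model = train(training)
predicted = predict(model, "my warranty claim was accepted")
print(predicted)
```

In practice the labeled examples would come from a manually tagged subset of the corpus, and the model would be evaluated on held-out documents before it is used to tag new ones.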

You could also assign keywords to these topics, represent them as document vectors as well, and compute similarities between documents and topics (Similarity Search node). Assign a document to a topic if the similarity is higher than a certain threshold (a document can be assigned to multiple topics), and track how the number of assigned documents changes over time to compute the topic coverage.
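The keyword-similarity approach can be sketched in plain Python as follows. This assumes raw term-count vectors and cosine similarity; the keyword lists, the threshold of 0.2, the month labels, and the helper names (vectorize, cosine, assign_topics, topic_coverage) are all hypothetical values chosen for the example, not settings of the Similarity Search node.

```python
import math
import re
from collections import Counter, defaultdict


def vectorize(text):
    """Turn text into a term-count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical keyword lists describing each known topic.
TOPIC_KEYWORDS = {
    "warranty": "warranty claim repair replacement",
    "quality of service": "service support hotline friendly slow",
}
THRESHOLD = 0.2  # assumed value; needs tuning on real data


def assign_topics(text, topics=TOPIC_KEYWORDS, threshold=THRESHOLD):
    """Return every topic whose similarity to the text reaches the threshold."""
    doc_vec = vectorize(text)
    return [
        name
        for name, keywords in topics.items()
        if cosine(doc_vec, vectorize(keywords)) >= threshold
    ]


def topic_coverage(dated_docs):
    """Count assigned documents per topic for each time period."""
    coverage = defaultdict(Counter)
    for period, text in dated_docs:
        coverage[period].update(assign_topics(text))
    return coverage


# Toy documents with invented month labels.
docs = [
    ("2015-01", "warranty claim rejected, support hotline was slow"),
    ("2015-01", "new store opened downtown"),
    ("2015-02", "repair done under warranty"),
]
coverage = topic_coverage(docs)
for period in sorted(coverage):
    print(period, dict(coverage[period]))
```

Documents that match no topic, like the second one here, are worth collecting separately: a growing pile of unassigned documents is a hint that a new topic may be emerging.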

For unsupervised topic detection the Topic Extractor node is useful. The node extracts a certain number of topics (the number can be specified in the dialog) as well as the terms that describe each topic.
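For readers who want to see what unsupervised topic extraction does under the hood, here is a small plain-Python sketch. It assumes an LDA-style model fitted with collapsed Gibbs sampling, which is a common technique for this task; it is not the node's implementation. The function names (lda_gibbs, top_terms), the hyperparameters alpha and beta, and the toy documents are assumptions made for the example.

```python
import random
import re
from collections import Counter


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit a small LDA model with collapsed Gibbs sampling."""
    rng = random.Random(seed)
    tokens = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    vocab_size = len({t for doc in tokens for t in doc})
    doc_topic = [[0] * num_topics for _ in tokens]        # tokens per topic, per doc
    topic_term = [Counter() for _ in range(num_topics)]   # term counts per topic
    topic_total = [0] * num_topics                        # tokens per topic overall
    assignments = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(tokens):
        z_doc = []
        for term in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_term[z][term] += 1
            topic_total[z] += 1
        assignments.append(z_doc)
    # Repeatedly resample each token's topic given all other assignments.
    for _ in range(iterations):
        for d, doc in enumerate(tokens):
            for i, term in enumerate(doc):
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_term[z][term] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_term[k][term] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_term[z][term] += 1
                topic_total[z] += 1
    return topic_term, doc_topic


def top_terms(topic_term, n=3):
    """Return up to n most frequent terms for each topic."""
    return [[t for t, c in counts.most_common(n) if c > 0] for counts in topic_term]


# Toy corpus (invented for illustration).
corpus = [
    "warranty claim repair warranty replacement",
    "repair under warranty claim accepted",
    "support hotline slow service unfriendly",
    "friendly service fast support hotline",
]
topic_term, doc_topic = lda_gibbs(corpus, num_topics=2)
for k, terms in enumerate(top_terms(topic_term)):
    print("topic", k, terms)
```

The result depends on the random seed and on the number of topics, and real corpora need stop-word removal and far more documents than this toy example. One possible way to look for new topics over time is to rerun the extraction on each new batch of documents and compare the extracted terms with those of the known topics.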

Hope this helps.

Cheers, Kilian

Hello Kilian,

thanks for this good input. It helps me to move to the next steps!