I am new to KNIME, so excuse me if this question already has an answer. I want to create a workflow that automatically detects topics in a set of scientific abstracts retrieved from Scopus. What would you recommend? Thanks in advance!
We have an example of how to do topic detection here: https://www.knime.com/nodeguide/other-analytics-types/text-processing/topic-detection-lda
We also have a video explaining how to build such a workflow on our YouTube channel: https://www.youtube.com/watch?v=upAwDcw9ra4
Thank you so much for your answer, but it doesn’t help me. Suppose I want to preprocess a bunch of abstracts that I got from the Scopus database, and I found a lot of meaningless words such as “Elsevier,” “Ltd,” “Mary Ann Liebert,” “Wiley,” “Taylor and Francis,” etc. How can I remove these specific words from the texts?
I’d recommend creating a ‘stop word’ list that includes such terms, which add no value to your topics. You can bring in your list via the ‘Stop Word Filter’ node. The list should be a .txt file and, ideally, should use stemmed stop words where possible. If your stop words are not global, you may want to create separate lists for separate categories: e.g. for medical articles ‘medic’ may be a stop word, but it would not be one for technology articles.
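To make the idea concrete outside KNIME, here is a minimal Python sketch of custom stop word removal. It is only an illustration, not how the ‘Stop Word Filter’ node works internally: the term list is copied from the question, the function names are made up for this example, and it matches whole phrases with a regular expression so that multi-word names such as “Mary Ann Liebert” are removed as a unit.

```python
import re

# Illustrative terms taken from the question; extend the list as needed.
PUBLISHER_TERMS = ["Elsevier", "Ltd", "Mary Ann Liebert", "Wiley", "Taylor and Francis"]

def build_pattern(terms):
    # Put longer terms first so multi-word phrases match before their parts.
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(t) for t in ordered)
    # \b keeps "Ltd" from matching inside a longer word.
    return re.compile(r"\b(?:" + alternatives + r")\b", flags=re.IGNORECASE)

def remove_terms(text, pattern):
    # Replace each match with a space, then collapse repeated whitespace.
    cleaned = pattern.sub(" ", text)
    return re.sub(r"\s+", " ", cleaned).strip()

pattern = build_pattern(PUBLISHER_TERMS)
abstract = "We study gene expression. Copyright 2019 Elsevier Ltd. All rights reserved."
print(remove_terms(abstract, pattern))
```

Leftover punctuation from the removed terms is not handled here; a later punctuation-removal step would take care of it.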
You can also use the NE Tagger (OpenNLP and Stanford NLP) nodes to identify named entities, and then eliminate these entity tags (Tag Filter node) before doing your topic modelling.
Hope this helps.
You could remove such terms based on term frequency, the assumption being that given names in abstracts would not be frequent (e.g. TF = 1).
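As a rough illustration of frequency-based filtering, the Python sketch below drops every term that occurs fewer than a chosen number of times. It assumes the abstracts are already tokenized and that frequency is counted across the whole collection rather than per document; the function name and the threshold parameter are made up for this example.

```python
from collections import Counter

def drop_rare_terms(documents, min_count=2):
    # Count each term over the whole collection of tokenized documents.
    counts = Counter(tok for doc in documents for tok in doc)
    # Keep only the terms that occur at least min_count times.
    return [[tok for tok in doc if counts[tok] >= min_count] for doc in documents]

docs = [
    ["protein", "folding", "wiley"],
    ["protein", "structure", "folding"],
    ["protein", "structure"],
]
filtered = drop_rare_terms(docs)
print(filtered)
```

Keep in mind that a term which shows up in many abstracts, for example a publisher name in a copyright line, would survive this kind of filter.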
Naturally, stop word filtering is useful for removing frequent but insignificant words. A custom stop word list would work for your specific task above, provided such a list is easy to obtain.
I just realised that you actually want to perform topic detection. IMO there is no way around the KNIME text processing nodes.
Proper text preprocessing, such as removing punctuation, converting to lower case, stemming and removing stop words, is recommended to separate the signal from the noise.
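For reference, these preprocessing steps can be sketched in a few lines of plain Python. This is only an illustration of the order of operations, not a replacement for the KNIME nodes: the stop word set is a tiny made-up sample, and naive_stem is a toy suffix stripper standing in for a real stemmer such as Porter.

```python
import string

# Tiny illustrative stop word set; a real list would be much longer.
STOP_WORDS = {"the", "of", "and", "a", "in", "is", "are"}

def naive_stem(token):
    # Toy suffix stripping, a stand-in for a real stemmer such as Porter.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    # 1. Remove ASCII punctuation.
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    # 2. Convert to lower case and split on whitespace.
    tokens = no_punct.lower().split()
    # 3. Remove stop words.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 4. Stem what is left.
    return [naive_stem(t) for t in tokens]

print(preprocess("The models are trained, and the topics extracted."))
```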