I am new to KNIME, so excuse me if I am posting a question that already has an answer. I want to create a workflow that automatically detects topics in a set of scientific abstracts retrieved from Scopus. What would you recommend? Thanks in advance!
Thank you so much for your answer, but it doesn't help me. Suppose I want to preprocess a bunch of abstracts that I got from the Scopus database, and I found a lot of meaningless words such as "Elsevier," "Ltd," "Mary Ann Liebert," "Wiley," "Taylor and Francis," etc. How can I remove these specific words from the texts?
I'd recommend creating a "stop word" list: this list would include terms that add no value to your topics. You can bring in your stop word list via the "Stop Word Filter" node. Ideally you should use stemmed stop words where possible. The stop word list should be a .txt file. If your stop words are not global, you may want to create separate stop word lists for separate categories, e.g. for medical articles "medic" may be a stop word, but it is not relevant for technological articles.
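To make the stop word list idea concrete outside KNIME, here is a minimal Python sketch of the same logic. It is not what the Stop Word Filter node does internally; the list contents, `load_stop_phrases`, and `remove_stop_phrases` are hypothetical names chosen for illustration, and the list is inlined as a string where a real workflow would read a .txt file.

```python
import re

# Hypothetical stop list: one entry per line, as it would appear in a .txt file.
STOP_LIST_TXT = """Elsevier
Ltd
Mary Ann Liebert
Wiley
Taylor and Francis
"""

def load_stop_phrases(text):
    """Return non-empty, lower-cased entries, longest first so that
    multi-word names are tried before any shorter entry."""
    phrases = {line.strip().lower() for line in text.splitlines() if line.strip()}
    return sorted(phrases, key=len, reverse=True)

def remove_stop_phrases(abstract, phrases):
    """Delete every whole-word occurrence of a stop phrase, ignoring case."""
    if not phrases:
        return abstract
    pattern = r"\b(?:" + "|".join(re.escape(p) for p in phrases) + r")\b"
    cleaned = re.sub(pattern, " ", abstract, flags=re.IGNORECASE)
    # Collapse the whitespace left behind by the removed phrases.
    return re.sub(r"\s+", " ", cleaned).strip()

phrases = load_stop_phrases(STOP_LIST_TXT)
example = "Deep learning improves diagnosis. Copyright 2020 Elsevier Ltd"
print(remove_stop_phrases(example, phrases))
```

Matching on whole words keeps an entry like "Ltd" from eating part of a longer word, and matching multi-word entries as phrases means "Mary Ann Liebert" is removed as a unit without "Mary" or "Ann" having to be listed separately.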
You can also use the NE Tagger (OpenNLP and Stanford NLP) nodes to identify named entities, and then eliminate these entity tags (Tag Filter node) before doing your topic modelling.
You could remove such terms based on term frequency, the assumption being that given names in abstracts would not be frequent (e.g. TF = 1).
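The frequency-based idea can be sketched in a few lines of Python. This is only an illustration under the assumption that the abstracts are already tokenized; `drop_rare_terms` and the `min_count` threshold are hypothetical names, not KNIME settings.

```python
from collections import Counter

def drop_rare_terms(tokenized_docs, min_count=2):
    """Remove every term whose total count across all documents is below min_count."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    return [[tok for tok in doc if counts[tok] >= min_count] for doc in tokenized_docs]

# Toy corpus: "smith" and "jones" each occur once and are dropped.
docs = [
    ["gene", "expression", "smith"],
    ["gene", "therapy", "expression"],
    ["therapy", "jones", "gene"],
]
print(drop_rare_terms(docs))
```

Note that this only catches rare terms. A name that recurs across many abstracts would survive the filter, so it complements a stop word list and does not replace one.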
Naturally, stop word filtering is useful for removing frequent insignificant words. A custom stop word list would be useful for the specific task you describe above, if such a list can be obtained easily...
I just realised that you actually want to perform topic detection. IMO there is no way around the KNIME text processing nodes.
Proper text preprocessing, such as removing punctuation, converting to lower case, stemming, and removing stop words, is recommended to separate the signal from the noise.
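For readers who want to see these four steps as code, here is a rough Python sketch of the chain. It is an assumption-laden illustration, not the behaviour of the KNIME nodes: `crude_stem` is a deliberately simple suffix stripper standing in for a real stemmer such as Porter, and `STOP_WORDS` is a tiny made-up list.

```python
import string

STOP_WORDS = {"the", "of", "and", "a", "in", "is", "are", "for"}  # tiny illustrative list
SUFFIXES = ("ing", "ed", "es", "s")  # tried in this order

def crude_stem(word):
    """Strip one common suffix; a rough stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Remove punctuation, lower-case, stem, then drop (stemmed) stop words."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.lower().split()
    stemmed = [crude_stem(t) for t in tokens]
    # Stem the stop words too, so they are compared in the same form as the tokens.
    stemmed_stops = {crude_stem(w) for w in STOP_WORDS}
    return [t for t in stemmed if t not in stemmed_stops]

print(preprocess("The models are trained for detecting topics in abstracts."))
```

The stop words are stemmed before comparison because the tokens have already been stemmed at that point, which is the same reason a stemmed stop word list is advised earlier in the thread.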