Does anyone have tips on how to run the Keyword Extractor node efficiently? I am running it on a subset of my data (~500 documents) and it crashes the program every time. Is there something I need to do to get it to run smoothly?
Also, should I run my preprocessing nodes, such as "Punctuation Erasure" and "N Chars Filter", after the "Bag of Words" node? I see that if I run the preprocessing nodes independently of the Bag of Words node, all my nodes run much faster.
Appreciate all the help. Thanks.
The KeyGraph algorithm is unfortunately computationally expensive. What are you trying to achieve with the extracted terms?
If you want to build a document vector out of the extracted words, I suggest a different approach to extracting them. Apply the preprocessing nodes directly to the documents (as they are created, e.g. by the Strings To Document node), then create a bag of words (after preprocessing) and filter the terms by document frequency. Filter out terms that occur in fewer than 1% of the documents. Then create the vector. See, e.g., the sentiment analysis example workflow on the KNIME Example Server.
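Outside of KNIME, the same idea (count in how many documents each term appears, drop rare terms, then build vectors over the remaining vocabulary) can be sketched in plain Python. The toy corpus and the threshold of 2 documents are illustrative only; for a real corpus you would use the 1%-of-documents rule described above.

```python
from collections import Counter

# Toy corpus standing in for the preprocessed bag-of-words output
docs = [
    ["knime", "workflow", "node"],
    ["node", "filter", "terms"],
    ["workflow", "node", "vector"],
    ["rare", "node", "terms"],
]

# Document frequency: in how many documents each term appears
df = Counter()
for doc in docs:
    df.update(set(doc))

# Keep terms that occur in at least min_df documents.
# For a real corpus this would be derived from the 1% rule,
# e.g. max(1, int(0.01 * len(docs))); 2 is just for the toy data.
min_df = 2
vocab = sorted(t for t, c in df.items() if c >= min_df)

# Binary document vectors over the filtered vocabulary
vectors = [[1 if t in doc else 0 for t in vocab] for doc in docs]
```

Filtering before building the vectors keeps the vector space small, which is exactly why this route is much cheaper than running the Keyword Extractor on every document.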
If you want to extract these words, e.g. to create a tag cloud, you could first apply the preprocessing nodes, then create a bag of words, compute TF, IDF, and TF-IDF values, and filter based on the TF-IDF values. Use the result for your tag cloud. This is faster than the KeyGraph node.
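As a rough illustration of the TF-IDF route, here is a minimal Python sketch using the common tf * log(N/df) weighting. The toy corpus and the cutoff value are assumptions for demonstration, not part of the KNIME workflow itself.

```python
import math
from collections import Counter

# Toy corpus standing in for the preprocessed bag-of-words output
docs = [
    ["knime", "workflow", "node"],
    ["node", "filter", "terms"],
    ["workflow", "node", "vector"],
    ["rare", "node", "terms"],
]

N = len(docs)

# Document frequency per term
df = Counter()
for doc in docs:
    df.update(set(doc))

# TF-IDF: terms frequent in a document but rare in the corpus score high
tfidf = []
for doc in docs:
    tf = Counter(doc)
    scores = {t: (tf[t] / len(doc)) * math.log(N / df[t]) for t in tf}
    tfidf.append(scores)

# Keep only terms above a cutoff for the tag cloud
# (hypothetical threshold; tune for your corpus)
cutoff = 0.2
keywords = {t for scores in tfidf for t, s in scores.items() if s >= cutoff}
```

Note that a term appearing in every document (here "node") gets an IDF of zero and drops out, which is the behavior you want for a tag cloud of distinctive terms.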
Does that help?