Combining ngrams and Keyword extractor?

Hi,

I'm building a workflow to extract relevant terms out of documents.

On the one hand, I use the keyword extractor, which extracts single terms only. On the other hand, I work with the ngram creator and term/ngram frequencies to extract both terms and ngrams. The second approach performs worse than the keyword extractor. So, is there a way to combine the two, to extract single-term keywords as well as multi-term keywords?

thank you!

sehom

I'm currently reading the paper "KeyGraph: Automatic Indexing by Co-occurrence Graph based on Building Construction Metaphor" by Yukio Ohsawa et al., which is the basis for the KeyGraph keyword extractor.

Generating phrases (relevant sequences of words, as mentioned in my first post) is described there as part of the preprocessing that happens before the actual keyword extraction.

I guess this step would greatly improve my workflow. Any suggestions on how to build it in KNIME, prior to applying the KeyGraph keyword extractor to a document?

thank you

sehom

Hi Sehom,

I see your point. So far there is no dedicated node to detect multi-word expressions and combine them into one term. A possible way to do this is to compute ngrams with frequencies and see which ngrams occur often enough to be considered multi-word terms. Then extract a list (a data table with one column containing these multi-word terms) and use this list as the dictionary for the Dictionary Tagger node. Use this node to tag the original documents based on the list. This groups the individual words together into terms. Now you can apply the KeyGraph Keyword Extractor, since it operates on the terms inside the documents.
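
To make the logic of these steps concrete, here is a minimal Python sketch. It is not KNIME code and does not reproduce what the nodes do internally; it only illustrates the idea of counting ngrams, keeping the frequent ones as a dictionary, and merging them into single terms. The function names, the `min_count` threshold and the sample sentences are assumptions made up for illustration.

```python
from collections import Counter

def frequent_ngrams(docs, n=2, min_count=2):
    """Return the set of n-grams (as tuples) that occur at least min_count times."""
    counts = Counter()
    for tokens in docs:
        # zip over n shifted copies of the token list yields all n-grams
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return {gram for gram, c in counts.items() if c >= min_count}

def merge_terms(tokens, dictionary, n=2):
    """Greedily join dictionary n-grams into single terms, left to right."""
    merged, i = [], 0
    while i < len(tokens):
        gram = tuple(tokens[i:i + n])
        if len(gram) == n and gram in dictionary:
            merged.append(" ".join(gram))  # multi-word term becomes one term
            i += n
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Hypothetical sample documents, already tokenized on whitespace
docs = [
    "neural network models learn features".split(),
    "a neural network needs training data".split(),
]

dictionary = frequent_ngrams(docs, n=2, min_count=2)
tagged_docs = [merge_terms(tokens, dictionary) for tokens in docs]
print(tagged_docs)
```

In the workflow, the dictionary corresponds to the one-column table fed into the Dictionary Tagger, and the merged token lists correspond to the tagged documents that the keyword extractor then sees. The threshold that works well depends on the corpus.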

Cheers, Kilian

Hi Kilian,

Thank you very much! It works great.