The Keygraph keyword extractor node is really torturing me. Although I make sure that I apply deep preprocessing to the documents when I filter out numbers, punctuation, short terms, stop words, etc., it always outputs terms like 'and' and 'in' as the most significant ones!

It does this even in the Document Classification example that you host here.

Have you seen anything like it? 

Hi George,

If deep preprocessing is checked (it is checked by default), the preprocessing is applied to the terms inside the documents as well. The bag of words contains two document columns, one with the preprocessed documents and one with the original documents. Are you sure that you are using the preprocessed documents as the input document column for the keyword extractor? You can specify which document column to use in the dialog of the keyword extractor. The column "Document" contains the preprocessed documents (if deep preprocessing is applied); the column "Orig Documents" contains the original documents.
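
To illustrate why the choice of column matters, here is a minimal Python sketch. It is not KNIME code and does not implement the Keygraph algorithm: the stop-word list, the function names, and the frequency-based "extractor" are all assumptions made up for this example. Only the column names "Document" and "Orig Documents" come from the bag of words described above.

```python
from collections import Counter

# Hypothetical stop-word list for illustration only; not KNIME's built-in list.
STOP_WORDS = {"and", "in", "the", "of"}


def keep(term):
    """Toy filter: drop stop words, short terms, and anything non-alphabetic."""
    return term.isalpha() and len(term) >= 3 and term.lower() not in STOP_WORDS


def bag_of_words(text, deep=True):
    """Build toy bag-of-words rows with 'Document' and 'Orig Documents' columns.

    With deep=True the filter is also applied to the terms inside the
    document itself; with deep=False only the term column is filtered.
    """
    original = text.split()
    filtered = [t for t in original if keep(t)]
    document = filtered if deep else original
    return [
        {"Term": t, "Document": document, "Orig Documents": original}
        for t in filtered
    ]


def top_terms(rows, column, n=2):
    """Naive stand-in for a keyword extractor: most frequent terms in a column."""
    counts = Counter(t.lower() for t in rows[0][column])
    return [term for term, _ in counts.most_common(n)]


text = "cats and dogs and birds in trees and in parks"
rows = bag_of_words(text, deep=True)

from_preprocessed = top_terms(rows, "Document")       # stop words already removed
from_original = top_terms(rows, "Orig Documents")     # stop words still present
print(from_preprocessed, from_original)
```

In this toy setup, an extractor that reads the original documents (or documents that were not deep-preprocessed) still sees 'and' and 'in', while one that reads the deep-preprocessed column does not.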

I just tried the keyword extractor node on a preprocessed bag of words (stop word filter, etc.), and the filtered terms were not extracted.

Hi Kilian,

Thanks for the quick reply. This is the result I get from that node when I run the DocumentClassification / Clustering examples (see attached).

I have not changed anything in the workflow; I just clicked run. I also made sure that the node works on the preprocessed documents, not the original ones (orig_document).

I have tested this in KNIME 2.7.1 on both Mac and Win.

Hi George,

This is really weird. I cannot reproduce this behavior with KNIME 2.7.1 on Win using the Document Classification workflow from the website. Like you, I just executed the Keygraph keyword extractor node. The extracted terms contain no stop words etc. Could you please send me the executed and exported workflow (right click workflow, export, with data) so I can investigate this further?

Hi Kilian,

Check your mailbox.

Thanks, I see your point and will dig into this. It seems to be a bug in the document deep preprocessing. The extracted terms should have been filtered.

Cheers, Kilian