keygraph keyword extractor

madgpap · January 29, 2013, 5:55pm

Hi there,

The Keygraph keyword extractor node is really torturing me. Although I make sure that deep preprocessing is enabled when I filter out numbers, punctuation, short terms, stop words, etc., it always outputs terms like 'and' and 'in' as the most significant ones!

It does this even in the Document Classification example that you host here.

Have you seen anything like it?

Many thanks!

George

kilian.thiel · January 29, 2013, 7:29pm

Hi George,

If deep preprocessing is checked (it is checked by default), the preprocessing is applied to the terms inside the documents as well. The bag of words contains two document columns: one with the preprocessed documents and one with the original documents. Are you sure that you are using the preprocessed documents as the input document column for the keyword extractor? You can specify the document column to use in the dialog of the keyword extractor. The column "Document" contains the preprocessed documents (if deep preprocessing is applied); the column "Orig Documents" contains the original documents.
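
To make the distinction concrete, here is a minimal Python sketch. It is not KNIME code: the stop word list, the `preprocess` and `top_terms` helpers, and the frequency count standing in for keyword extraction are all assumptions for illustration only. The dictionary keys simply mirror the column names mentioned above.

```python
from collections import Counter

# Hypothetical, tiny stop word list (illustration only).
STOP_WORDS = {"and", "in", "the", "of"}


def stop_word_filter(terms):
    """Drop every term that appears in the stop word list."""
    return [t for t in terms if t.lower() not in STOP_WORDS]


def preprocess(orig_terms, deep=True):
    """Return one simplified 'bag of words' record for a single document.

    With deep=True the filter is also applied to the terms stored in the
    "Document" column; "Orig Documents" is always left untouched.
    """
    document = stop_word_filter(orig_terms) if deep else list(orig_terms)
    return {
        "Terms": stop_word_filter(orig_terms),
        "Document": document,
        "Orig Documents": list(orig_terms),
    }


def top_terms(document_terms, n=2):
    """Stand-in for a keyword extractor: the n most frequent terms.

    This is a plain frequency count, not the Keygraph algorithm.
    """
    counts = Counter(t.lower() for t in document_terms)
    return [term for term, _ in counts.most_common(n)]


text = "the cat and the dog and the bird in the garden".split()
row = preprocess(text, deep=True)

# Extracting from the preprocessed column yields no stop words,
# while the original column still contains them.
print(top_terms(row["Document"], 3))
print(top_terms(row["Orig Documents"], 3))
```

In this sketch, picking "Orig Documents" (or running with `deep=False`) is the only way stop words can reach the extractor, which is why the choice of document column matters.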

I just tried the keyword extractor node on a preprocessed bag of words (stop word filter, etc.), and the filtered terms were not extracted.

Cheers, Kilian

madgpap · January 29, 2013, 8:45pm

Hi Kilian,

Thanks for the quick reply. This is the result I get from that node when I run the DocumentClassification / Clustering examples (see attached).

I have not changed anything in the workflow; I just clicked run. I also made sure that the node works on the preprocessed documents, not the original ones (orig_document).

I have tested this in KNIME 2.7.1 on both Mac and Win.

Thanks again,

George

screen_shot_2013-01-29_at_19.40.37.png

kilian.thiel · January 30, 2013, 7:06pm

Hi George,

This is really weird. I cannot reproduce this behavior with KNIME 2.7.1 on Win using the Document Classification workflow from the website. Like you, I just executed the Keygraph keyword extractor node. The extracted terms contain no stop words etc. Could you please send me the executed and exported workflow (right-click the workflow, export, with data) to Kilian.Thiel ( a t ) uni-konstanz.de so that I can investigate this further? The workflow archive with data will be several MB, so it would be best to use a service like https://www.transferbigfiles.com/ .

Thanks, Kilian

madgpap · January 30, 2013, 10:07pm

Hi Kilian,

Check your mailbox.

Many thanks,

George

kilian.thiel · February 1, 2013, 11:58am

Thanks, I see your point and will dig into this. It seems like a bug in the document deep preprocessing. The extracted terms should have been filtered.

Cheers, Kilian
