I successfully used the workflow that you suggested for text mining a PDF directory at https://tech.knime.org/forum/knime-users/performing-text-mining-through-pdf-files
Although it works well for text mining a directory of PDF files of my choice, the end result is a tag cloud that includes a wide range of words (terms). The problem is that some of those words are not relevant to my research topic: the directory includes journal articles, so the text mining picks up words like 'journal', 'review', 'issue n.', 'JSTOR' and so on.
Hence, I would need a filter node that lets me filter out irrelevant words from the tag cloud (something like the Stop Word Filter node, but with the option to set my own words to be filtered out).
Is there such a node?
Thnx in advance,
You can use the "Stop Word Filter" node to load your own list of words that are not relevant for you. You'll find this under "file options".
I use my own stop word list with every word in its own cell; maybe this works for you too?
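In case it helps, here is a rough Python sketch of how such a list could be generated outside KNIME. It assumes the node accepts a plain text file with one term per line (a single-column .csv looks the same); the file name `custom_stop_words.txt` and the helper `write_stop_word_file` are made up for illustration, and the terms are the ones from the question above.

```python
from pathlib import Path

# Terms taken from the question above; extend as needed.
custom_stop_words = ["journal", "review", "issue", "JSTOR"]


def write_stop_word_file(words, path):
    """Write one term per line, skipping blanks and case-insensitive duplicates.

    Assumption: one term per line is the layout the filter expects.
    """
    seen = set()
    lines = []
    for word in words:
        term = word.strip()
        key = term.lower()
        if term and key not in seen:
            seen.add(key)
            lines.append(term)
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


# Hypothetical file name; point the node's file setting at this file.
written = write_stop_word_file(custom_stop_words, "custom_stop_words.txt")
```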
Why use the "Stop Word Filter" node again? You already applied it in the Preprocessing part.
Hope I could help you
Thanks for the reply Jasmin!
I am aware that the list of words to be filtered out should be in comma-separated values (.csv) format; however, this does not work for me: the filtering does not happen. Is that the correct format for listing the words?
And would using the 'Stop Word Filter' node that is already in the workflow do, or should I create a new one just before the 'Tag Cloud' node?
Hi again Jasmin,
So I figured it out by trial and error. First of all, the .csv file format is correct.
A single 'Stop Word Filter' node doesn't do the trick because I need to use the built-in filter for English stop words. When you check the box 'Use built-in list', the node won't take into account the file that you want to use to customise the filtering.
Therefore a second 'Stop Word Filter' node was required, with the 'Use built-in list' box unchecked, for the terms of my choice to be filtered out.
No idea why it works for you to use the built-in filter and the customised filter at the same time. For me it just didn't. Hope this helps anyone else with a similar question.
Thanx for your help Jasmin!
The Stop Word Filter node uses only one list at a time: either the built-in list or the stop words from the specified file. If you want to filter with two lists, you need to apply two nodes.
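To picture why two nodes are needed, here is a small Python sketch of the same idea. It is not how KNIME implements the node; `stop_word_filter` is a hypothetical stand-in that, like the node, takes exactly one list per pass, and `BUILT_IN_STOP_WORDS` is a tiny made-up sample rather than the real built-in English list.

```python
# Tiny stand-in for the built-in English list (assumption, not KNIME's actual list).
BUILT_IN_STOP_WORDS = {"the", "a", "of", "and", "in"}
# Custom terms from the question, lowercased for case-insensitive matching.
CUSTOM_STOP_WORDS = {"journal", "review", "issue", "jstor"}


def stop_word_filter(terms, stop_words):
    """Drop every term whose lowercase form is in stop_words (one list per pass)."""
    return [t for t in terms if t.lower() not in stop_words]


terms = ["The", "review", "of", "urban", "planning", "in", "JSTOR", "journal"]

# First pass: built-in list only, like a node with 'Use built-in list' checked.
after_built_in = stop_word_filter(terms, BUILT_IN_STOP_WORDS)
# Second pass: custom list only, like a second node with the box unchecked.
after_custom = stop_word_filter(after_built_in, CUSTOM_STOP_WORDS)

print(after_custom)
```

With only the first pass, 'review', 'JSTOR' and 'journal' would still reach the tag cloud; the second pass is what removes them.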