Custom stop list for different occasions

supersharp · May 3, 2018, 9:03pm

Hi everyone,

New to the forum so please excuse any problems/gaps with my question.

I am looking for a way to automate the selection of different stop word lists, depending on the situation at hand.

For example, if I am analyzing weather situations, ‘weather’ itself has the highest frequency and hence needs to be excluded. But if I am analyzing property damage, ‘weather’ should not appear in the stop list.

I understand that the custom stop word list is a .txt file that can contain only one column, so how could I achieve such dynamic stop word selection?

Thanks for your help!

RAPosthumus · May 4, 2018, 6:34am

I use two stop word files: one specific to the task at hand and one generic list (in my case Dutch).
The file for the specific case is selected by a string input node, so switching tasks is simply a matter of pointing to the right file. See the diagram below.
[Image: knime_2018-05-04_08-31-49]
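
For readers who want to see the idea outside of KNIME, here is a minimal Python sketch of the same approach: a generic list plus a task-specific list, where the task-specific file is picked by a string. It is not the KNIME node's implementation. The file names (`generic.txt`, `weather.txt`, `property_damage.txt`), the helper names, and the `task` variable are all hypothetical and only serve the illustration.

```python
import tempfile
from pathlib import Path


def load_stop_words(path):
    """Read a one-column stop word file: one term per line, blank lines ignored."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {line.strip().lower() for line in lines if line.strip()}


def remove_stop_words(tokens, *stop_word_files):
    """Drop every token that appears in any of the given stop word files."""
    stop_words = set()
    for stop_word_file in stop_word_files:
        stop_words |= load_stop_words(stop_word_file)
    return [t for t in tokens if t.lower() not in stop_words]


# Demo data: hypothetical stop word files written to a temporary folder.
workdir = Path(tempfile.mkdtemp())
(workdir / "generic.txt").write_text("the\nand\nof\n", encoding="utf-8")
(workdir / "weather.txt").write_text("weather\n", encoding="utf-8")
(workdir / "property_damage.txt").write_text("damage\n", encoding="utf-8")

tokens = ["The", "weather", "caused", "damage"]

# 'task' plays the role of the string input: it only selects the file name.
task = "weather"
kept = remove_stop_words(tokens, workdir / "generic.txt", workdir / f"{task}.txt")
print(kept)
```

Changing `task` to `"property_damage"` keeps ‘weather’ in the output, because only the task-specific file differs between the two runs.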

RolandBurger · May 4, 2018, 7:40am

Hi supersharp,

In addition to RAPosthumus’ solution, you could also try frequency-based filtering. Please have a look at this workflow: https://www.knime.com/nodeguide/other-analytics-types/text-processing/topicextraction-with-the-elbowmethod

The second “Preprocessing” metanode in the center of the workflow contains a routine to filter terms that occur either very rarely or very often. Since you’re going for terms that occur in most documents, you’ll be able to catch them using this metanode.
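
To illustrate the general idea of frequency-based filtering (not the metanode's actual implementation), here is a rough Python sketch. It assumes documents are already tokenized, and it drops terms whose share of documents falls outside a range. The function name and the `min_share` and `max_share` thresholds are assumptions chosen for the example.

```python
from collections import Counter


def filter_by_document_frequency(docs, min_share=0.0, max_share=0.8):
    """Keep only terms whose share of documents lies in [min_share, max_share]."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document, however often it occurs there.
        doc_freq.update(set(doc))
    keep = {
        term
        for term, df in doc_freq.items()
        if min_share <= df / n_docs <= max_share
    }
    return [[term for term in doc if term in keep] for doc in docs]


docs = [
    ["weather", "storm", "roof"],
    ["weather", "flood", "roof"],
    ["weather", "hail", "car"],
    ["weather", "storm", "car"],
]

# 'weather' occurs in every document, 'flood' and 'hail' in only one each.
filtered = filter_by_document_frequency(docs, min_share=0.5, max_share=0.8)
print(filtered)
```

With these thresholds, ‘weather’ is removed because it appears in all documents, and the terms that appear in only one document are removed as too rare.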

BTW, RAPosthumus’ solution uses the new Stop Word Filter node that is currently only available in the nightly build, in case you were wondering why your node doesn’t have a second input port.

Cheers,
Roland

RAPosthumus · May 4, 2018, 8:08am

Thanks Roland, learning something new every day here at the forum.
I use a two-pass method: the first pass runs over all the words and calculates a rank for every word.
The second pass calculates a rank range (e.g. the words with the two or three highest ranks) and filters on those for further processing.
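
A rough Python sketch of such a two-pass approach is shown here. It rests on assumptions: the rank is taken to be a dense rank by frequency (1 = most frequent), and the words in the selected rank range are removed as stop word candidates. The function names are hypothetical, and keeping the selected words instead of removing them would be an equally valid reading.

```python
from collections import Counter


def rank_words(tokens):
    """Pass 1: give every word a dense rank by frequency (1 = most frequent)."""
    counts = Counter(tokens)
    distinct_counts = sorted(set(counts.values()), reverse=True)
    rank_of_count = {c: r for r, c in enumerate(distinct_counts, start=1)}
    return {word: rank_of_count[c] for word, c in counts.items()}


def words_in_rank_range(ranks, lowest=1, highest=2):
    """Pass 2: select the words whose rank falls inside the given range."""
    return {word for word, rank in ranks.items() if lowest <= rank <= highest}


tokens = "weather storm weather roof storm weather hail".split()

ranks = rank_words(tokens)
top_ranked = words_in_rank_range(ranks, lowest=1, highest=2)

# Assumption: the top-ranked words are treated as stop words and removed.
remaining = [t for t in tokens if t not in top_ranked]
print(remaining)
```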

supersharp · May 4, 2018, 5:09pm

Thanks so much RAPosthumus and Roland - both solutions would work well!

Out of interest, the ‘new Stop Word Filter’ RAPosthumus used is almost exactly what I was hoping for. It would allow me to loop through the file reader and select the task-specific stop word list needed for the case (without knowing Java coding). Please advise how best to get this two-input Stop Word Filter. I am not sure what it means for it to be available only in the nightly build.

Thanks!

RAPosthumus · May 4, 2018, 7:22pm

Read here on nightly builds.

julian.bunzel · May 5, 2018, 11:55am

There is also a possibility to get the nightly build without using the update site.
https://www.knime.com/form/nightly-build
Please read the disclaimer.

Cheers,
Julian
