Text Processing Filters Not Working

sajjadhaider · January 14, 2016, 8:47am

Hi,

I have used KNIME for a long time and never had any problem with the Text Processing filters. But recently I installed 3.1 and 3.2 on different machines, and in both cases the filters (Punctuation, Numbers, N-Char, etc.) are not working: the number of rows after BoW creation remains the same and nothing is removed. The deprecated filters of the previous version work, but the new ones do not. Am I the only one facing this issue?

Regards,

Sajjad

kilian.thiel · January 15, 2016, 11:02am

Hi Sajjad,

with version 3.1 all preprocessing nodes of the Textprocessing extension have changed. The old nodes are still available for backwards compatibility as "deprecated" nodes. However, the new nodes no longer support BoW filtering. The filtering is applied directly to the documents, because this is much faster than filtering a BoW. Direct filtering means that the filtered terms are removed from the document itself.

This means that you need to apply the filters before creating the BoW, e.g. Strings To Docs -> Stop Word Filter -> Number Filter -> ... Filter -> Bag of Words Creator

This will result in a BoW containing only the terms that were not filtered out.
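
To illustrate the ordering, here is a minimal Python sketch of the idea, not KNIME code. The tokenizer, the stop word list, and all function names are assumptions made up for this example; they only mimic what the filter nodes do conceptually. The point is that the filters run on each document's terms first, and the bag of words is built from the already filtered documents.

```python
import re

# Hypothetical stop word list, for illustration only.
STOP_WORDS = {"the", "a", "of", "and"}

def tokenize(text):
    # Split into lowercase word tokens and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text.lower())

def stop_word_filter(terms):
    return [t for t in terms if t not in STOP_WORDS]

def number_filter(terms):
    # Drop terms that consist of digits only.
    return [t for t in terms if not re.fullmatch(r"\d+", t)]

def punctuation_filter(terms):
    # Drop terms that contain no word character at all.
    return [t for t in terms if re.search(r"\w", t)]

def preprocess(terms):
    # Apply the filters to the document itself, one after another.
    for f in (stop_word_filter, number_filter, punctuation_filter):
        terms = f(terms)
    return terms

def bag_of_words(docs):
    # One (document id, term) row per distinct term in each document.
    rows = []
    for doc_id, terms in docs.items():
        for term in sorted(set(terms)):
            rows.append((doc_id, term))
    return rows

docs = {"d1": tokenize("The price of 3 apples, and a pear.")}

# Filter first, then create the bag of words.
filtered_docs = {doc_id: preprocess(terms) for doc_id, terms in docs.items()}
bow_filtered = bag_of_words(filtered_docs)

# Creating the bag of words from the unfiltered documents keeps every term.
bow_raw = bag_of_words(docs)

print(bow_filtered)
print(len(bow_raw))
```

In this sketch the filtered bag of words has three rows, while the unfiltered one still contains the stop words, the number, and the punctuation marks.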

Cheers, Kilian

sajjadhaider · March 22, 2016, 8:32am

Thanks Kilian,

The sequence works as you have suggested but I have this feeling that the "stop word filter" node, when used with the provided list, is not working. I have provided different words in a text file but none of them is removed.

Regards,

Sajjad

kilian.thiel · April 5, 2016, 7:33pm

Hi,

on April 1 we released a new version. Please update to that version (3.1.2) and try this out again. Custom stop word files can be used in the Stop Word Filter node, which now also supports the knime:// protocol. Make sure that the words to filter out have not been set unmodifiable beforehand by a tagger node. You can also ignore this unmodifiable flag in the filter node.
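
The effect of the unmodifiable flag can be sketched in Python as follows. This is a conceptual illustration only, not the KNIME implementation: the `Term` class, the `ignore_unmodifiable` parameter name, and the stop word list are assumptions invented for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Term:
    text: str
    # Assumed to be set by an earlier tagging step in this sketch.
    unmodifiable: bool = False

def stop_word_filter(terms, stop_words, ignore_unmodifiable=False):
    kept = []
    for term in terms:
        is_stop_word = term.text.lower() in stop_words
        # A flagged term is protected unless the flag is ignored.
        protected = term.unmodifiable and not ignore_unmodifiable
        if is_stop_word and not protected:
            continue
        kept.append(term)
    return kept

# Hypothetical custom stop word list (e.g. read from a text file).
stop_words = {"may", "the"}

terms = [Term("The"), Term("May", unmodifiable=True), Term("report")]

# Default behaviour: the flagged term survives although it is in the list.
default_result = [t.text for t in stop_word_filter(terms, stop_words)]

# Ignoring the flag removes it as well.
forced_result = [
    t.text for t in stop_word_filter(terms, stop_words, ignore_unmodifiable=True)
]

print(default_result)
print(forced_result)
```

If a word from a custom list is not removed, a term flagged like "May" above is one possible cause worth checking.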

Cheers, Kilian

alkopop79 · May 21, 2018, 11:45am

I have the exact same problem. None of the filters work at all, despite copying the pre-processing part of an example file. Nothing has been removed: numbers, stop words, punctuation, nothing. Not even the regex filter does anything. The pre-processing bit starts with a string-to-document node and ends in the BoW node. What am I doing wrong?

ENRON_dateset 2.knar.knwf (32.4 KB)

InsilicoConsulting · May 21, 2018, 12:10pm

Me too, especially with the stop words filter. It does not work with either the built-in list or an external list.

julian.bunzel · May 24, 2018, 2:46pm

Hey @alkopop79, @InsilicoConsulting,

The preprocessing pipeline you have used creates a new column called Preprocessed Documents, but the Bag Of Words Creator node still uses the initial unfiltered Documents column to create the bag of words. Open the node dialog of the Bag of Words Creator node and select the Preprocessed Documents column as document column. Then it should work.
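
As a rough Python analogy (not KNIME code), the table below has both the original and the preprocessed column, and the bag of words depends entirely on which column is selected. The column names follow the description above; the `bag_of_words` function and the sample terms are assumptions for this sketch.

```python
# One row per document; both columns travel through the workflow together.
table = [
    {
        "Documents": ["the", "price", "of", "3", "apples"],
        "Preprocessed Documents": ["price", "apples"],
    },
]

def bag_of_words(table, document_column):
    # One (row id, term) pair per distinct term in the selected column.
    rows = []
    for row_id, row in enumerate(table):
        for term in sorted(set(row[document_column])):
            rows.append((row_id, term))
    return rows

# Selecting the original column ignores all preprocessing.
bow_wrong = bag_of_words(table, "Documents")

# Selecting the preprocessed column gives the filtered bag of words.
bow_right = bag_of_words(table, "Preprocessed Documents")

print(len(bow_wrong))
print(bow_right)
```

The filters did their job in both cases; only the column choice decides whether the result shows up in the bag of words.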

@InsilicoConsulting, can you provide an example? Maybe it is the same problem @alkopop79 has. Please have a look at the node configurations / node dialogs to check whether you have selected the correct document column for preprocessing and bag of words creation.

Cheers,

Julian

alkopop79 · May 24, 2018, 5:37pm

Thank you! Changing the order of nodes helped in the end.
