I have used KNIME for a long time and never had any problem with the Text Processing filters. But recently I installed 3.1 and 3.2 on different machines, and in both cases the filters (Punctuation, Numbers, N-Char, etc.) are not working: the number of rows after BoW creation remains the same and nothing is removed. The deprecated filters from the previous version work, but the new ones do not. Am I the only one facing this issue?
With version 3.1, all preprocessing nodes of the Textprocessing extension have changed. The old nodes are still available for backwards compatibility as "deprecated" nodes. However, the new nodes do not support BoW filtering anymore. The filtering is applied directly to the documents, because this is much faster than filtering a BoW. Direct filtering means that the filtered terms are removed from the document itself.
This means that you need to apply the filters before creating the BoW, e.g. Strings To Docs -> Stop Word Filter -> Number Filter -> ... Filter -> Bag of Words Creator
This will result in a BoW containing only the terms that were not filtered out.
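To illustrate the ordering described above, here is a minimal sketch in plain Python. It is only a conceptual analogy, not KNIME code or the KNIME API: documents are modelled as lists of terms, the three filter functions and the `STOP_WORDS` set are hypothetical stand-ins for the filter nodes, and `bag_of_words` stands in for the Bag of Words Creator. The point is that the filters change the documents, and the BoW is built afterwards from what is left.

```python
import string
from typing import Callable, List, Set, Tuple

Document = List[str]  # assumption: a document is just a list of terms

STOP_WORDS = {"the", "a", "of", "and"}  # hypothetical stop word list


def stop_word_filter(doc: Document) -> Document:
    # Remove terms that appear in the stop word list (case-insensitive).
    return [t for t in doc if t.lower() not in STOP_WORDS]


def number_filter(doc: Document) -> Document:
    # Simplification: remove terms that consist only of digits.
    return [t for t in doc if not t.isdigit()]


def punctuation_filter(doc: Document) -> Document:
    # Remove terms that consist only of punctuation characters.
    return [t for t in doc if not all(ch in string.punctuation for ch in t)]


def preprocess(docs: List[Document],
               filters: List[Callable[[Document], Document]]) -> List[Document]:
    # Apply every filter to every document, in the given order.
    for f in filters:
        docs = [f(d) for d in docs]
    return docs


def bag_of_words(docs: List[Document]) -> Set[Tuple[int, str]]:
    # One (document index, term) pair per distinct term in each document.
    return {(i, t) for i, doc in enumerate(docs) for t in doc}


docs = [["The", "price", "of", "3", "apples", "."]]

# Filters first, BoW last: the BoW only ever sees the surviving terms.
filtered = preprocess(docs, [stop_word_filter, number_filter, punctuation_filter])
bow = bag_of_words(filtered)
```

If the BoW were created first and the filters applied to the documents afterwards, the existing BoW rows would not change, which matches the "number of rows remains the same" symptom.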
The sequence works as you have suggested, but I have the feeling that the "stop word filter" node, when used with the provided list, is not working. I have provided different words in a text file but none of them is removed.
On 1 April we released a new version. Please update to that version (3.1.2) and try this out again. Custom stop word files can be used in the Stop Word Filter node, which now also supports the knime:// protocol. Make sure that the words to filter out have not been set unmodifiable beforehand by a tagger node. You can also ignore this unmodifiable flag in the filter node.
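A rough sketch of how the unmodifiable flag interacts with a filter, again in plain Python and not the KNIME API. The `Term` class, the `stop_word_filter` function and its `ignore_unmodifiable` parameter are hypothetical names chosen for illustration; they only mimic the behaviour described above.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Term:
    text: str
    unmodifiable: bool = False  # assumption: set earlier by a tagger step


def stop_word_filter(doc: List[Term], stop_words: Set[str],
                     ignore_unmodifiable: bool = False) -> List[Term]:
    kept = []
    for term in doc:
        # A protected term survives even if it is on the stop word list.
        protected = term.unmodifiable and not ignore_unmodifiable
        if term.text.lower() in stop_words and not protected:
            continue
        kept.append(term)
    return kept


doc = [Term("will"), Term("Will", unmodifiable=True), Term("travel")]
stop_words = {"will"}

# Default: the tagged term "Will" is protected and stays in the document.
default_result = [t.text for t in stop_word_filter(doc, stop_words)]

# Ignoring the flag: the tagged term is filtered like any other.
ignoring_result = [t.text for t in stop_word_filter(doc, stop_words,
                                                    ignore_unmodifiable=True)]
```

So if a word from your custom list is still there after filtering, it is worth checking whether a tagger earlier in the workflow has marked it as unmodifiable.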
I have the exact same problem. None of the filters work at all, despite copying the pre-processing part of an example file. Nothing has been removed: numbers, stop words, punctuation, nothing. Not even the regex filter does anything. The pre-processing part starts with a string-to-document node and ends in the BoW node. What am I doing wrong? ENRON_dateset 2.knar.knwf (32.4 KB)
The preprocessing pipeline you have used creates a new column called Preprocessed Documents, but the Bag Of Words Creator node still uses the initial, unfiltered Documents column to create the bag of words. Open the node dialog of the Bag of Words Creator node and select the Preprocessed Documents column as the document column. Then it should work.
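The effect can be pictured with a small plain-Python sketch (a conceptual analogy only, not KNIME code). Each row is modelled as a dict with the two columns named above; `number_filter_appending` and `bag_of_words` are hypothetical helpers, and the `document_column` parameter plays the role of the column selection in the node dialog.

```python
from typing import Dict, List

Row = Dict[str, List[str]]  # assumption: column name -> list of terms


def number_filter_appending(rows: List[Row]) -> List[Row]:
    # The filtered document is appended as a NEW column;
    # the original "Documents" column is left untouched.
    return [
        {**row,
         "Preprocessed Documents": [t for t in row["Documents"] if not t.isdigit()]}
        for row in rows
    ]


def bag_of_words(rows: List[Row], document_column: str) -> List[str]:
    # Collect the distinct terms of whichever column was selected.
    return sorted({term for row in rows for term in row[document_column]})


rows = number_filter_appending([{"Documents": ["call", "me", "at", "555"]}])

# Wrong column selected: the BoW still contains the number.
bow_unfiltered = bag_of_words(rows, "Documents")

# Correct column selected: the filtered term is gone.
bow_filtered = bag_of_words(rows, "Preprocessed Documents")
```

The filters did their job in both cases; only the second call actually reads their output.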
Can you provide an example? Maybe it is the same problem @alkopop79 has. Please have a look at the node configurations / node dialogs to check if you have selected the correct document column for preprocessing and bag of words creation.