Text pre-processing fails after updating deprecated 2.9 nodes to corresponding 3.3 nodes

I want to successfully run a text processing workflow developed in KNIME 2.9 under KNIME 3.3. When I run the v2.9 workflow in the v3.3 environment, it runs fine, but KNIME tells me that the majority of the text pre-processing nodes are deprecated. When I replace the deprecated v2.9 nodes with the corresponding v3.3 versions, some of the replacements run extremely slowly (about 70-fold slower) and in some cases fail to perform their function (no filtering). Let me illustrate the problem via the following example (attached), which shows:

(1) Execution of the v2.9 Punctuation Erasure node (deprecated) takes about 3 sec. Execution with the corresponding v3.3 node takes about 200 seconds.

(2) The output of the Bag of Words Creator node shows 220103 rows. The output of the v2.9 Punctuation Erasure node (deprecated) shows 209271 rows. This is expected and shows that filtering has occurred. By contrast, the output of the corresponding v3.3 node shows 220103 rows, identical to the input. Thus no filtering has occurred.

(3) The problem is not restricted to just the Punctuation Erasure node. Execution of the v2.9 N Chars Filter node (deprecated) takes only 3 seconds. Execution with the corresponding v3.3 node takes about 215 seconds (about 70-fold slower).

(4) Filtering is similarly affected. With the same input as before, the output of the v2.9 N Chars Filter node shows 164532 rows, indicating that extensive filtering has occurred. By contrast, the corresponding v3.3 node output shows 220103 rows (same as the input), indicating that no filtering has occurred.

What is going on here? 

--Paul

Hey Paul,

I am not 100% sure about this, but it seems that the row processing for term columns no longer exists in later versions. So basically, the term column will not be filtered by the latest nodes. The "deep preprocessing" option of the deprecated nodes no longer exists either, since it is now always turned on internally. That explains the longer execution time: terms get filtered directly inside the documents, and the documents are then rebuilt. If you use another BoW after the 3.3 nodes, you can see that filtering happened.

Due to these changes, I would recommend applying the preprocessing nodes before the BoW; otherwise you have to use the BoW twice.
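
To make the behaviour described above concrete, here is a minimal Python sketch. It is not KNIME code: the functions bag_of_words and erase_punctuation and the toy documents are made up purely for illustration, under the assumption that the new nodes only rewrite the documents and leave an existing term table alone.

```python
import string

# Toy stand-ins for the KNIME concepts; none of these names are KNIME APIs.

def bag_of_words(documents):
    """Return one (term, doc_id) row per distinct term in each document."""
    rows = []
    for doc_id, terms in documents.items():
        for term in dict.fromkeys(terms):  # de-duplicate, keep first-seen order
            rows.append((term, doc_id))
    return rows


def erase_punctuation(documents):
    """Document-level preprocessing: rebuild every document without punctuation."""
    table = str.maketrans("", "", string.punctuation)
    rebuilt = {}
    for doc_id, terms in documents.items():
        cleaned = [t.translate(table) for t in terms]
        rebuilt[doc_id] = [t for t in cleaned if t]  # drop terms that became empty
    return rebuilt


docs = {
    "d1": ["Hello", ",", "world", "!"],
    "d2": ["world", ".", "world", "..."],
}

bow_before = bag_of_words(docs)       # BoW created before preprocessing
clean_docs = erase_punctuation(docs)  # only the documents are rewritten
bow_after = bag_of_words(clean_docs)  # BoW re-created after preprocessing

print(len(bow_before))  # rows in the term table built too early
print(len(bow_after))   # rows once the BoW is rebuilt from the cleaned documents
```

In this sketch the first term table keeps its original row count because nothing touches it after it was built, while the table created from the rebuilt documents is smaller. That mirrors why the row count only drops once the BoW is created after the preprocessing nodes.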

I wish you a merry Christmas!

Cheers,

Julian

Hi Paul,

Julian is right. The preprocessing nodes have been changed since 2.9. The old nodes still work with the new version, but they are deprecated, which means it is recommended to use the new versions of the nodes. The new versions do not apply preprocessing to the bag of words but directly to the documents. The benefit is that larger document sets can be processed and handled in KNIME.

What you need to do is:

  • Replace the old nodes with the new node versions.
  • Create the bag of words after all preprocessing nodes (not before).

Cheers, Kilian