TextProcessor more efficient Regex handling

Hi,

May I ask the Text Processing developers to revisit the regular expression handling? I noticed that, for example, the Punctuation Erasure node (in KNIME 3.6.1) causes heavy GC pressure (with simple streaming), while the other nodes in a simple preprocessing workflow were fine. I checked the sources, and it seems the pattern is recompiled every time before it is matched. I would suggest something along the lines of the following change:

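// Compile the pattern once and reuse it for every call.
private static final Pattern punctMarks = Pattern.compile("[!#$%&'\"()*+,./\\:;<=>?@^_`{|}~\\[\\]]+");

private String punctuationFilter(final String str) {
    // Only a lightweight Matcher is created per input string.
    Matcher m = punctMarks.matcher(str);
    return m.replaceAll(replacement);
}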
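
To make the difference concrete, here is a small self-contained sketch (not the actual node code) that puts the two variants side by side. The class and method names are made up for illustration, and I am assuming the current implementation behaves like String.replaceAll, which compiles the regex on every call.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the KNIME sources.
public class PunctuationEraseSketch {
    // Same character class as in the suggestion above.
    private static final String REGEX = "[!#$%&'\"()*+,./\\:;<=>?@^_`{|}~\\[\\]]+";

    // Compiled once when the class is loaded.
    private static final Pattern PUNCT_MARKS = Pattern.compile(REGEX);

    // Assumed "before" variant: String.replaceAll compiles REGEX on each call.
    static String eraseRecompiling(final String str, final String replacement) {
        return str.replaceAll(REGEX, replacement);
    }

    // Suggested variant: reuse the compiled Pattern, create only a Matcher.
    static String erasePrecompiled(final String str, final String replacement) {
        return PUNCT_MARKS.matcher(str).replaceAll(replacement);
    }

    public static void main(final String[] args) {
        String input = "Hello, world! (test)";
        // Both variants produce the same result; only the allocation behavior differs.
        System.out.println(eraseRecompiling(input, ""));
        System.out.println(erasePrecompiled(input, ""));
    }
}
```

Both methods return the same string, so the change should be invisible to users; the second one just avoids building a new Pattern object for every term.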

Similarly, the Replacer node’s performance could also be improved, in my opinion.
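
For a node where the regex is not fixed, one possible approach is to cache compiled patterns by their regex string. This is only a rough sketch under the assumption that the regex arrives as a plain string from the node configuration; the class and method names are hypothetical, and compiling once when the node is configured would work just as well.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the KNIME sources.
public class PatternCacheSketch {
    // Maps regex source strings to their compiled patterns (unbounded, for brevity).
    private static final Map<String, Pattern> CACHE = new ConcurrentHashMap<>();

    // Returns the cached Pattern, compiling it only on first use.
    static Pattern compiled(final String regex) {
        return CACHE.computeIfAbsent(regex, Pattern::compile);
    }

    // Replaces every match of regex in input, reusing the cached Pattern.
    static String replace(final String input, final String regex, final String replacement) {
        return compiled(regex).matcher(input).replaceAll(replacement);
    }

    public static void main(final String[] args) {
        System.out.println(replace("baaad", "a+", "o"));
    }
}
```

The map is unbounded here to keep the sketch short; a real implementation would probably want to limit its size or tie it to the node's lifetime.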
Thanks, gabor

Hey gabor,

thank you for the hint. I will have a look and create a ticket for this.

Cheers,

Julian
