TextProcessor: more efficient regex handling




May I ask the Text Processing developers to revisit the regular expression handling? I noticed that, for example, the Punctuation Erasure node (in KNIME 3.6.1) causes heavy GC pressure (with simple streaming), while the other nodes in a simple preprocessing workflow were fine. I checked the sources, and it seems the pattern is recompiled every time before it is matched. I would suggest something along the lines of the following change:

private static final Pattern punctMarks = Pattern.compile("[!#$%&'\"()*+,./\\:;<=>?@^_`{|}~\\[\\]]+");

private String punctuationFilter(final String str) {
    // Reuse the precompiled pattern; only a Matcher is created per call.
    Matcher m = punctMarks.matcher(str);
    return m.replaceAll(replacement);
}

Similarly, I think the Replacer node’s performance could also be improved.
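
For reference, here is a minimal self-contained sketch of the idea. It is not the actual KNIME source: the class name, the `erase` and `eraseRecompiling` methods, and the empty `REPLACEMENT` constant are hypothetical stand-ins. The second method shows one common way a pattern ends up recompiled per call (`String.replaceAll` compiles its regex argument each time); I am only assuming the node does something comparable.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the KNIME Text Processing code base.
final class PunctuationFilterSketch {

    // Compiled once at class load; Pattern instances are immutable and thread-safe.
    private static final Pattern PUNCT_MARKS =
        Pattern.compile("[!#$%&'\"()*+,./\\:;<=>?@^_`{|}~\\[\\]]+");

    // Assumed stand-in for the replacement string configured in the node.
    private static final String REPLACEMENT = "";

    /** Erases punctuation using the shared, precompiled pattern. */
    static String erase(final String str) {
        // Only a lightweight Matcher is allocated per call.
        return PUNCT_MARKS.matcher(str).replaceAll(REPLACEMENT);
    }

    /** For comparison: String.replaceAll compiles the regex again on every call. */
    static String eraseRecompiling(final String str) {
        return str.replaceAll(PUNCT_MARKS.pattern(), REPLACEMENT);
    }

    public static void main(final String[] args) {
        // Prints: Hello world KNIME 361
        System.out.println(erase("Hello, world! (KNIME 3.6.1)"));
    }
}
```

Both methods return the same result; the difference is only that the first one avoids allocating a new compiled pattern for every input string, which is what should reduce the garbage produced when streaming many rows.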
Thanks, gabor


Hey gabor,

Thank you for the hint. I will have a look and create a ticket for this.


