Hi,
May I ask the Text Processing developers to revisit the regular expression handling? I noticed that, for example, the Punctuation Erasure node (in KNIME 3.6.1) causes heavy GC pressure (with simple streaming), while the other nodes in a simple preprocessing workflow were fine. I checked the sources, and it seems the pattern is recompiled every time before it is matched. I would suggest something along the lines of the following change:
// Compile once and reuse; Pattern instances are immutable and thread-safe.
private static final Pattern punctMarks = Pattern.compile("[!#$%&'\"()*+,./\\:;<=>?@^_`{|}~\\[\\]]+");

private String punctuationFilter(final String str) {
    Matcher m = punctMarks.matcher(str);
    return m.replaceAll(replacement);
}
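To make the idea easier to try out, here is a minimal, self-contained sketch of the same approach outside of KNIME. The class name PunctuationFilterSketch and the empty-string REPLACEMENT are my own assumptions for illustration; the real node presumably reads its replacement from the node settings, and I have not reproduced the actual node code here.

```java
import java.util.regex.Pattern;

public final class PunctuationFilterSketch {

    // Compiled once at class load time and shared by all calls.
    // Same character class as in the suggestion above.
    private static final Pattern PUNCT_MARKS =
        Pattern.compile("[!#$%&'\"()*+,./\\:;<=>?@^_`{|}~\\[\\]]+");

    // Assumption: punctuation is replaced with the empty string.
    private static final String REPLACEMENT = "";

    static String punctuationFilter(final String str) {
        // Only a lightweight Matcher is created per call; the Pattern is reused.
        return PUNCT_MARKS.matcher(str).replaceAll(REPLACEMENT);
    }

    public static void main(final String[] args) {
        System.out.println(punctuationFilter("Hello, world! (test)"));
    }
}
```

The point of the design is that only a Matcher is allocated per input string, instead of a whole compiled Pattern, which should reduce the garbage produced per row.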
I think the Replacer node's performance could be improved in a similar way.
Thanks, gabor