TextProcessor more efficient Regex handling

text-processing

#1

Hi,

May I ask the Text Processing developers to revisit the regular expression handling? I noticed that, for example, the Punctuation Erasure node (in KNIME 3.6.1) causes heavy GC pressure (with simple streaming), while the other nodes in my simple preprocessing were fine. I checked the sources, and it seems the pattern is recompiled every time before it is matched. I would suggest something along the lines of the following change:

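// Compiled once and shared by all calls.
private static final Pattern punctMarks = Pattern.compile("[!#$%&'\"()*+,./\\:;<=>?@^_`{|}~\\[\\]]+");


// Reuses the precompiled pattern instead of recompiling it for every string.
private String punctuationFilter(final String str) {
    Matcher m = punctMarks.matcher(str);
    return m.replaceAll(replacement);
}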
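
To make the difference concrete, here is a minimal, self-contained sketch of the two approaches side by side. The class name PunctuationFilterSketch, the method names, and the empty replacement string are my own assumptions for illustration; this is not the actual KNIME source. It relies only on the documented behavior that String.replaceAll compiles its regex argument on every call.

```java
import java.util.regex.Pattern;

// Hypothetical illustration class, not part of KNIME.
final class PunctuationFilterSketch {

    // Same character class as above, compiled once when the class is loaded.
    private static final Pattern PUNCT_MARKS =
        Pattern.compile("[!#$%&'\"()*+,./\\:;<=>?@^_`{|}~\\[\\]]+");

    // Assumption: punctuation is replaced by the empty string.
    private static final String REPLACEMENT = "";

    // Only a Matcher is created per call; the compiled Pattern is reused.
    static String erasePrecompiled(final String str) {
        return PUNCT_MARKS.matcher(str).replaceAll(REPLACEMENT);
    }

    // For comparison: String.replaceAll compiles the regex on every call,
    // producing a new Pattern object per processed string.
    static String eraseRecompiling(final String str) {
        return str.replaceAll(PUNCT_MARKS.pattern(), REPLACEMENT);
    }
}
```

Both methods return the same result for the same input; the precompiled variant just avoids allocating a new Pattern for each string, which is where I would expect the garbage to come from when many terms are processed.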

Similarly, I think the Replacer node’s performance could also be improved.
Thanks, gabor


#2

Hey gabor,

thank you for the hint. I will have a look and create a ticket for this.

Cheers,

Julian

