Preprocessing - 'Replacer' and 'Punctuation Erasure' nodes (whitespace)

Hi guys,

Referring to https://tech.knime.org/forum/knime-textprocessing/punctuation-erasure, I have run into some odd behaviour.

A little backstory: I'm not satisfied with the 'Punctuation Erasure' node. For example, once 'Punctuation Erasure' has been applied, a term like "disk-swapping" becomes "diskswapping", which makes the string useless in some cases. Likewise, "connection-180-800-200" becomes "connection180800200", so the 'Number Filter' no longer works on it, and so on. What is needed is to replace the characters with whitespace instead of just removing them.

The basic solution (which Kilian came up with in the thread linked above) is to use the 'Replacer' node with either the regular expression [!#$%&'\"*+,.\?:;]+ or, even better (since it covers most characters), "[!#$%&'\"()*+,./\\:;<=>?@^_`{|}~\\[\\]]+"
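To illustrate the difference between erasing punctuation and replacing it with whitespace, here is a minimal sketch in plain Python, outside of KNIME. The character class is modelled on the long expression above, with "-" added as an assumption so that the hyphenated examples are split too. The function names erase and replace_with_space are made up for this illustration and are not part of any KNIME node.

```python
import re

# Assumption: character class modelled on the long expression in the post,
# with "-" added at the end so hyphenated terms are split as well.
PUNCT = re.compile(r"""[!#$%&'"()*+,./\\:;<=>?@^_`{|}~\[\]-]+""")


def erase(text):
    """Drop punctuation entirely (the behaviour described as unsatisfying)."""
    return PUNCT.sub("", text)


def replace_with_space(text):
    """Replace each punctuation run with one space, then tidy up whitespace."""
    spaced = PUNCT.sub(" ", text)
    # split() + join() collapses repeated spaces and trims both ends
    return " ".join(spaced.split())


print(erase("disk-swapping"))               # diskswapping
print(replace_with_space("disk-swapping"))  # disk swapping
print(replace_with_space("connection-180-800-200"))  # connection 180 800 200
```

With the whitespace variant the numbers end up as separate tokens, which is what a downstream number filter would need.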

Interestingly enough, only the first expression works with the 'Replacer' node. If I use the longer expression, no punctuation character is removed at all. The same happens if I combine, let's say, only < and >. That means that for every single punctuation character I want to remove (apart from those covered by [!#$%&'\"*+,.\?:;]+ ), I have to create an extra node, which I find weird.

Am I doing something wrong or is it due to a bug?

Also, a "use whitespace instead of a void" checkbox option in the 'Punctuation Erasure' node would be nice.

 

Thanks,

Manu

Hi Manu,

Can you please provide a workflow with example data? I cannot really reproduce the problem, or maybe I do not fully understand what exactly you are describing.

I tried the 'Replacer' node with the long expression "[!#$%&'\"()*+,./\\:;<=>?@^_`{|}~\\[\\]]+" and it works fine on the example data (see attachment). The special characters are all replaced.

Cheers, Kilian

Reminds me a bit of the issues I had due to the tokenization when documents are created:

http://tech.knime.org/forum/knime-textprocessing/french-language-and-knime-text-preprocessing#comment-41694

I just want to point out that not all issues come from the regex; sometimes they come from the underlying data, so it may be useful to look at the actual terms using the BoW transformation (Kilian's advice in the thread linked above).
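As a rough illustration of that debugging idea, the following Python sketch lists the terms first and only then checks which of them the pattern touches. The whitespace-only tokenizer is a hypothetical stand-in and not how KNIME tokenizes documents. The pattern covers only the < and > characters mentioned earlier in the thread, and the function names are made up for this example.

```python
import re

# Only the two characters mentioned in the thread
PUNCT = re.compile(r"[<>]+")


def whitespace_terms(text):
    """Hypothetical tokenizer: split on whitespace only."""
    return text.split()


def inspect_terms(terms):
    """Pair each term with a flag telling whether the pattern matches inside it."""
    return [(term, bool(PUNCT.search(term))) for term in terms]


for term, hit in inspect_terms(whitespace_terms("if a <b> c")):
    print(repr(term), "matches" if hit else "no match")
```

If a character you expect to be replaced never shows up in any listed term, the tokenization or the data is the more likely culprit than the expression.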