Detect all sorts of white spaces in a column

Berti1989 · February 5, 2021, 5:29pm

Hello KNIME community,

I have a column in one of my tables that contains thousands of domain strings (e.g. knime.com). The external input to this column (to which I do not have access) sometimes does not filter whitespace properly (tab, carriage return, etc.), and that messes up some of the automated analysis we do afterward.

My first approach to tackle this problem was to use a simple String Manipulation node with replace(string, " ", ""). The problem is that, again, this handles only one case. I should really be removing all of the following: ' ', ' ', '\b', '\t', '\n', '\v', '\f', '\r', '"', ''', '\', '\u0008', '\u0009', '\u000A', '\u000B', '\u000C', '\u000D', '\u0020', '\u0022', '\u0027', '\u005C', '\u00A0', '\u2028', '\u2029', '\uFEFF'.

So I tried a Java snippet, but I am new to Java so that’s a bit more difficult for me to tackle. My code so far reads as follows:

char[] unicodeToRemove = { ' ', ' ', '\b', '\t', '\n', '\v', '\f', '\r', '"', '\'', '\\', '\u0008', '\u0009', '\u000A', '\u000B', '\u000C', '\u000D', '\u0020', '\u0022', '\u0027', '\u005C', '\u00A0', '\u2028', '\u2029', '\uFEFF' };

int i, x, result = -1;
for (i = 0; i < unicodeToRemove.length; i++) {
    x = unicodeToRemove[i];
    result = c_domain.indexOf(x);
}
out_new_result = result;

If result != -1 in the resulting column, I know some type of whitespace has been detected. The problem is that this code always returns the same number (65500). There must be something stupid I am not seeing, but I cannot figure it out.
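
For reference, in the loop as posted `result` is overwritten on every pass, so its final value reflects only the last array entry ('\uFEFF'), whatever the earlier entries matched. A minimal sketch of a check that keeps the earliest hit instead is given here. The class and method names are hypothetical, the character set is a deduplicated version of the list above, and the vertical tab is written as 0x0B because Java has no '\v' escape. Line feed, carriage return, quotes and backslash are written as '\n', '\r', '"', '\'' and '\\', since Java translates Unicode escapes such as \u000A before parsing and rejects them inside character literals.

```java
final class UnwantedCharFinder {
    // Characters to flag. Assumption: this mirrors the list in the post,
    // with duplicates removed and only escapes that Java accepts.
    private static final char[] UNWANTED = {
        ' ', '\b', '\t', '\n', (char) 0x0B, '\f', '\r',
        '"', '\'', '\\', '\u00A0', '\u2028', '\u2029', '\uFEFF'
    };

    /** Returns the smallest index of any unwanted character, or -1 if none occurs. */
    static int firstUnwantedIndex(String s) {
        int first = -1;
        for (char c : UNWANTED) {
            int idx = s.indexOf(c);
            // Keep the earliest match instead of overwriting on every pass.
            if (idx != -1 && (first == -1 || idx < first)) {
                first = idx;
            }
        }
        return first;
    }

    public static void main(String[] args) {
        System.out.println(firstUnwantedIndex("knime.com"));    // prints -1
        System.out.println(firstUnwantedIndex("knime\t.com"));  // prints 5
    }
}
```

In a Java Snippet node, the body of firstUnwantedIndex would presumably operate on c_domain and assign its result to out_new_result, the two field names used in the code above.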

This is more of a Java question than a strictly KNIME question; however, if you have alternative solutions to the problem I described above, feel free to share them. The Java Snippet node seemed the most appropriate way to tackle this problem thoroughly and consistently. But, again, I am open to any other approach.

Thank you!
Kind regards,
Berti

Daniel_Weikert · February 6, 2021, 11:47am

Hi,
What about the regex replace option in the String Manipulation node, keeping only what you want?
Something like
"[^a-zA-Z0-9]" with "\s"
Best regards
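
A minimal sketch of that whitelist idea in plain Java, offered as one possible reading of the suggestion rather than a definitive implementation. It assumes the domains contain only ASCII letters, digits, dots and hyphens, so the dot and hyphen are added to the character class (a class of "[^a-zA-Z0-9]" alone would also strip the dot in knime.com). Internationalized domain names with non-ASCII characters would need a wider class. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class DomainCleaner {
    // Matches every character that is NOT a letter, digit, dot or hyphen.
    // Inside a character class the dot is literal, and a trailing hyphen is literal too.
    private static final Pattern NOT_ALLOWED = Pattern.compile("[^A-Za-z0-9.-]");

    /** Removes every character outside the whitelist. */
    static String clean(String domain) {
        return NOT_ALLOWED.matcher(domain).replaceAll("");
    }

    /** Returns true if the string contains at least one character outside the whitelist. */
    static boolean isDirty(String domain) {
        return NOT_ALLOWED.matcher(domain).find();
    }

    public static void main(String[] args) {
        System.out.println(clean(" knime.com\t\r\n"));  // prints knime.com
        System.out.println(isDirty("knime.com"));       // prints false
        System.out.println(isDirty("knime\u00A0.com")); // prints true
    }
}
```

Whitelisting what is allowed avoids having to enumerate every whitespace variant. In a String Manipulation node the same pattern could presumably be used as regexReplace($domain$, "[^A-Za-z0-9.-]", ""), where $domain$ is a placeholder column name.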
