regexMatcher in Column Expressions to identify fields with non-alphanumeric chars

J_Knime_Work · March 14, 2024, 4:45pm

I’m struggling with something that should be really simple. I want to use regexMatcher() in a Column Expressions node to evaluate true for records that contain chars other than alphanumeric, and evaluate false for fields with only regular alphanumeric chars.

Eg “This field 1234” → false
But “This - field” → true

ChatGPT is adamant that this expression should work, but unfortunately every record returns false even when I can see non alphanumeric chars:

regexMatcher(column("WORK_GROUP_DESC"), "[^a-zA-Z0-9]")

So what am I doing wrong? As usual I find regex functions in KNIME utterly baffling, and chatgpt is so often helpless. Typically, the function-specific help for regexMatch is also pretty useless, as is so often the way with this wonderful but patchily-documented platform - check out the total lack of explanation below

J_Knime_Work · March 15, 2024, 10:22am

I’ve been doing some more experimenting and would be grateful for assistance.

First up I can’t find any help on the regexMatcher function in the column expressions node - I searched https://docs.knime.com/ but found nothing. A search like regexmatcher site:knime.com just shows up some forum posts. So where is the actual KNIME documentation that lists what functions are available in each node? This seems pretty fundamental so I’m surprised I haven’t found it yet. I also googled regexmatcher but it doesn’t seem to be a standard javascript function, so where has it come from? How can we take the guesswork out of learning KNIME?!?

I created a test field called Text, with the following 5 records:

apple
a
a b
ball
bingo

This expression only yields true for the second record, even though chatgpt suggests it should match any instance of ‘a’.
regexMatcher(column("Text"), "a")

So I can modify it like this to yield true for any string containing an instance of an ‘a’
regexMatcher(column("Text"), ".*a.*")

But when I have a go in regex101, the expression “a” does match my string “apple”. So please can someone explain how the mysterious function regexMatcher works please, and where the documentation lives? Thanks

takbb · March 15, 2024, 10:53am

Hi @J_Knime_Work ,

From your list, the regular expression “a” will only match “a”. It won’t match “apple” because it contains other characters.

When I tried this on regex101.com, it showed the regular expression “a” as matching only the “a” in “apple”, and not the entire string. Since it doesn’t match the entire string, regexMatcher will also return false for your other strings.

A regular expression to match “a plus other characters” would be “a.*”.

With reference to your first example, where you expected matching results for “[^a-zA-Z0-9]” to be as follows:

“This field 1234” → false
“This - field” → true

Your statement is incorrect.
– That regular expression is looking for a string that is a single character that is not alphanumeric. Therefore it will return false because neither of those strings matches that pattern. (i.e. they are both multi character strings, and they also both contain spaces, which aren’t alphanumeric.)

To find strings that ARE alphanumeric (and include spaces), you could use the following:
[a-zA-Z0-9 ]+
which would return true for the first and false for the second.

To return true where the string contains at least one character not in the required set, you could write logic (e.g. with an If statement) in Column Expressions to return the opposite.
or you could use the following:

regexMatcher(column("column1"),".*[^a-zA-Z0-9 ].*")

which returns true if the string consists of any number of characters (including zero characters) followed by a character that is not in the required range, followed by any number of characters. This would return true if the string is anything other than “AlphaNumeric or Spaces”

In terms of chatGPT being “adamant”… it is never adamant about anything and it can and will often make mistakes. It probably depends on the specific wording of the question posed to it, and of course its response is based entirely on statistical analysis of its training model, so whilst it is a very useful guide, it is not guaranteed to be factually accurate and can often “hallucinate”.

For regular expression matching though, the whole point is that you write a regular expression to match the entire string. I would be disappointed if I wrote a regex match expecting a string sequence to match [A-Z]{3} (i.e. exactly 3 capital letters) and discovered that it also considered ABCD to be a match for that simply because it found 3 capital letters somewhere in the string.

Regular expression matching is different from regular expression replacement. With regex replacement, tools may replace subsets of regular expressions found within a string, and therefore not have to match the entire string.

J_Knime_Work · March 15, 2024, 1:21pm

Thank you @takbb this is an exceptionally useful reply, I’m very grateful.

Please can you explain why we need the square brackets inside the .* wildcards?

".*[^a-zA-Z0-9 ].*"

What would be the best way to extract the ‘offending’ (non-alphanumeric or space) chars into a separate column which I can then summarise and consider how to deal with? Or perhaps an inverse approach is required: replace all allowed characters with a blank, which should leave only the disallowed chars.

And similarly how would I adapt your concise expression to simply remove the unallowed chars from the string? It could be easy to do a string replace (so & becomes and) but others could just be eliminated fairly easily I guess? Thanks

takbb · March 15, 2024, 2:23pm

Hi @J_Knime_Work , the square brackets are a regular expression construct that effectively means “any character from the set of characters or character ranges defined here”, so the square brackets identifies a range of characters or classes of characters.

They aren’t “inside” the wildcards as such.

. means any (.) character
‘*’ the thing that comes before (i.e. the . reference ), occurring means 0, 1 or many times
So .* means none, one or many occurrences of any characters

For stripping out or keeping just the non alpha-numeric-etc characters, you are on exactly the right lines with replacing specific characters with empty strings. Here you can match just “substrings” (because as mentioned, regex replace usually works on substrings whereas regex matching usually works on whole strings)

so to remove all the non alphanumeric/space chars
regexReplace(column("column1"),"[^a-zA-Z0-9 ]","")

and to keep only the non alphanumeric/space chars:
regexReplace(column("column1"),"[a-zA-Z0-9 ]","")

If you are looking to “clean” certain types of characters such as emoji, you can also check out the new String Cleaner node available since KNIME 5.2.

You can also check out some older components I wrote such as:

system · March 22, 2024, 2:24pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.