String Manipulation (Multi Column): Regex PatternSyntaxException: Illegal repetition

Hi,

while working on a universal solution to detect Unicode Character Classes, building upon the solution from @takbb in

I happened to notice that when a Regex is defined via a variable, with or without utilizing loops, the following exception is thrown:

WARN  String Manipulation (Multi Column) 5:75       Evaluation of expression failed for row "Row0": java.util.regex.PatternSyntaxException: Illegal repetition near index 7
".*\\p{M}.*"
       ^
WARN  String Manipulation (Multi Column) 5:75       Evaluation of expression failed for row "Row0": java.util.regex.PatternSyntaxException: Illegal repetition near index 7
".*\\p{Co}.*"
       ^
WARN  String Manipulation (Multi Column) 5:75       Evaluation of expression failed for row "Row0": java.util.regex.PatternSyntaxException: Illegal repetition near index 7
".*\\p{Pc}.*"
       ^
WARN  String Manipulation (Multi Column) 5:75       Evaluation of expression failed for row "Row0": java.util.regex.PatternSyntaxException: Illegal repetition near index 7
".*\\p{C}\\p{Cc}.*"
       ^
WARN  String Manipulation (Multi Column) 5:75       Evaluation of expression failed for row "Row0": java.util.regex.PatternSyntaxException: Illegal repetition near index 7
".*\\p{Pd}.*"
       ^
WARN  String Manipulation (Multi Column) 5:75       Evaluation of expression failed for row "Row0": java.util.regex.PatternSyntaxException: Illegal repetition near index 7
".*\\p{Me}.*"
       ^
WARN  String Manipulation (Multi Column) 5:75       Evaluation of expression failed for row "Row0": java.util.regex.PatternSyntaxException: Illegal repetition near index 7
".*\\p{Cf}.*"
       ^
WARN  String Manipulation (Multi Column) 5:75       Evaluation of expression failed for row "Row0": java.util.regex.PatternSyntaxException: Illegal repetition near index 7
".*\\p{Cs}.*"
       ^
WARN  String Manipulation (Multi Column) 5:75       Evaluation of expression failed for row "Row0": java.util.regex.PatternSyntaxException: Illegal repetition near index 7
".*\\p{L}.*"
       ^
WARN  String Manipulation (Multi Column) 5:75       Evaluation of expression failed for row "Row0": java.util.regex.PatternSyntaxException: Illegal repetition near index 7
".*\\p{Ll}.*"
       ^
WARN  String Manipulation (Multi Column) 5:75       Evaluation of expression failed for row "Row0": java.util.regex.PatternSyntaxException: Illegal repetition near index 7
".*\\p{Lu}.*"
       ^
WARN  String Manipulation (Multi Column) 5:75       Evaluation of expression failed for row "Row0": java.util.regex.PatternSyntaxException: Illegal repetition near index 7
".*\\p{Zl}.*"
       ^
WARN  String Manipulation (Multi Column) 5:75       Evaluation of expression failed for row "Row0": java.util.regex.PatternSyntaxException: Illegal repetition near index 7
".*\\p{Sm}.*"
       ^
WARN  String Manipulation (Multi Column) 5:75       Evaluation of expression failed for row "Row0": java.util.regex.PatternSyntaxException: Illegal repetition near index 7
".*\\p{Mn}.*"
       ^
WARN  String Manipulation (Multi Column) 5:75       Evaluation of expression failed for row "Row0": java.util.regex.PatternSyntaxException: Illegal repetition near index 7
".*\\p{N}.*"
       ^
WARN  String Manipulation (Multi Column) 5:75       Evaluation of expression failed for row "Row0": java.util.regex.PatternSyntaxException: Illegal repetition near index 7
".*\\p{No}.*"
       ^
WARN  String Manipulation (Multi Column) 5:75       Evaluation of expression failed for row "Row0": java.util.regex.PatternSyntaxException: Illegal repetition near index 7
".*\\p{Zp}.*"
       ^
WARN  String Manipulation (Multi Column) 5:75       Evaluation of expression failed for row "Row0": java.util.regex.PatternSyntaxException: Illegal repetition near index 7
".*\\p{P}.*"
       ^
WARN  String Manipulation (Multi Column) 5:75       Evaluation of expression failed for row "Row0": java.util.regex.PatternSyntaxException: Illegal repetition near index 7
".*\\p{Po}.*"
       ^
WARN  String Manipulation (Multi Column) 5:75       Evaluation of expression failed for row "Row0": java.util.regex.PatternSyntaxException: Illegal repetition near index 7
".*\\p{Z}.*"
       ^
WARN  String Manipulation (Multi Column) 5:75       Evaluation of expression failed for row "Row0": java.util.regex.PatternSyntaxException: Illegal repetition near index 7
".*\\p{Zs}.*"
       ^
WARN  String Manipulation (Multi Column) 5:75       Evaluation of expression failed for row "Row0": java.util.regex.PatternSyntaxException: Illegal repetition near index 7
".*\\p{S}.*"
       ^
WARN  String Manipulation (Multi Column) 5:75       Evaluation of expression failed for row "Row0": java.util.regex.PatternSyntaxException: Illegal repetition near index 7
".*\\p{Sc}.*"
       ^
WARN  String Manipulation (Multi Column) 5:75       Evaluation of expression failed for row "Row0": java.util.regex.PatternSyntaxException: Illegal repetition near index 7
".*\\p{So}.*"
       ^
WARN  String Manipulation (Multi Column) 5:75       Evaluation of expression failed for row "Row0": java.util.regex.PatternSyntaxException: Illegal repetition near index 7
".*\\p{Sk}.*"
       ^
WARN  String Manipulation (Multi Column) 5:75       Evaluation of expression failed for row "Row0": java.util.regex.PatternSyntaxException: Illegal repetition near index 7
".*\\p{Cn}.*"
       ^

Best
Mike

Hi @mwiegand ,

I haven’t tried this out but I suspect that the extra backslash is potentially causing the problem here. The double “escape” is needed when entering directly from the keyboard in the node configuration because it needs to tell the node to not treat the backslash in the entered pattern as an escape.

So when you type the pattern
".*\\p{Cn}.*"

The "\\p" is interpreted during configuration as “put a literal "\" followed by a "p" into the resulting string”. The actual value of the string, as far as the final call to the regex function is concerned, is then ".*\p{Cn}.*".

So this is an interesting “gotcha” because what it means is that if you are supplying the regex string in an already populated variable, then you shouldn’t include the additional \ character because it needs to be in the form that is directly usable for regex, and not in the form that would be used just to get the required typed string through the node configuration.
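To illustrate the point in plain Java: a minimal sketch, assuming the node's literal escaping behaves like a Java string literal and that the value ends up in java.util.regex (which the exception in the log suggests). The class name EscapeDemo and the helper compiles are made up for this example.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class EscapeDemo {
    // Returns true if the string compiles as a java.util.regex pattern.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Typed as a literal: the compiler turns "\\" into ONE backslash.
        String typed = ".*\\p{Cn}.*";
        System.out.println(typed);            // prints .*\p{Cn}.*
        System.out.println(compiles(typed));  // prints true

        // Simulates a variable whose VALUE really contains two backslashes.
        // The regex engine reads "\\" as a literal backslash, then "p",
        // then "{Cn}" as a malformed repetition.
        String fromVariable = ".*\\\\p{Cn}.*";
        System.out.println(fromVariable);            // prints .*\\p{Cn}.*
        System.out.println(compiles(fromVariable));  // prints false
    }
}
```

The second case is the one that should raise "Illegal repetition", since the braces no longer belong to a \p{...} class.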

I hope that makes sense but I’m not sure what I’ve written is very clear :joy: and further I haven’t checked if I’m right :thinking:.


Good morning @takbb,

I thoroughly tested this before submitting the ticket :wink: Whilst it doesn’t throw an exception, the regexMatch doesn’t work. That is why I added the simple examples above. I updated the workflow to make this tiny detail more clear.

Cheers
Mike

Good morning @mwiegand ,

This morning I’ve actually loaded the workflow to take a look. Yesterday I was guessing (even more than I normally do!) :wink:

In this node :

… you are doing the replacement of single backslashes with double-backslashes. But you shouldn’t do it. The content of the variable needs to be the actual regex pattern (with single backslashes). The double-backslash is only a mechanism for actually putting a single backslash into the variable when entered directly as a literal. If the content to be put in the variable is already in the single backslash form, then it should be left that way.

Additionally, in both of the String Manipulation nodes, you shouldn’t be concatenating the additional double quotes at the beginning and end. Double quotes around a string are, again, simply the mechanism used to type in a String; they do not form part of the value contained in the string variable.

If you look at the regex patterns in the table coming out of the above node you will see this:

e.g. the “Letters” pattern is shown as

".*\\p{L}.*"

but the actual regex pattern you really want is simply this:
.*\p{L}.*

(i.e. no double quotes and no double-backslash).

The secondary issue with the lower branch is that you haven’t included a row filter as you did with the upper branch, so the Table Row to Variable is picking up the first of the regex patterns (Character Marks) and not the Letter pattern

So, to fix the primary issue, the expression in the String Manipulation nodes that create the regex pattern should simply be this:

join(".*", $charRegex$, ".*")

Then the output will be correct (I think!), because the regex pattern in the variable will be as expected.
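A rough Java equivalent of that expression, as a sketch only: wrap is a hypothetical stand-in for join(".*", $charRegex$, ".*"), and I am assuming the variable value is handed to java.util.regex unchanged.

```java
import java.util.regex.Pattern;

class PatternFromVariable {
    // Hypothetical stand-in for join(".*", $charRegex$, ".*"): plain concatenation.
    static String wrap(String charRegex) {
        return ".*" + charRegex + ".*";
    }

    public static void main(String[] args) {
        // Value as it should sit in the variable: one backslash, no quotes.
        // ("\\p{L}" is only the Java literal for the 5 characters \p{L}.)
        String charRegex = "\\p{L}";
        String regex = wrap(charRegex);
        System.out.println(regex);                            // prints .*\p{L}.*
        System.out.println(Pattern.matches(regex, "12a34"));  // prints true
        System.out.println(Pattern.matches(regex, "1234"));   // prints false

        // Same pattern with quotes concatenated around it: the quotes become
        // literal characters that the input would have to contain.
        String quoted = "\"" + regex + "\"";
        System.out.println(Pattern.matches(quoted, "12a34"));     // prints false
        System.out.println(Pattern.matches(quoted, "\"12a34\"")); // prints true
    }
}
```

This is why the extra quotes don't throw an exception but still prevent a match on ordinary input.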

I hope that helps, but let me know if further (or better) explanation is needed :slight_smile:


Hi @takbb,

Thanks for your feedback, and yes, the inner workings of Knime when utilizing variables can be quite confusing, especially when an approach was first derived and verified without variables.

I have cleaned up the workflow as it contained aspects of another two tests I am conducting which might make our exchange unnecessarily complex:

  1. RegEx does NOT match sub- but only entire string (no modifiers accepted)
  2. Multi-Line Strings are difficult to match as “global match” can’t be set as modifier. Not even “.*” works

Certainly that unnecessary complexity caused the disjoint, as I am now able to reproduce the expected results. Both RegEx functions, contrary to the replace function, do not support modifiers, and from the function description it is not clear that the entire string, NOT just a sub-string, is required to match.

Therefore, if you have a string like “abc” and use the Regex “\p{L}”, it will NOT match. But “\p{L}+” will.
Here is an example of how I’d expect RegEx to work in the String Manipulation node: similar to String Replace, at least, where either the entire string or all occurrences can match.
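The same difference shows up in plain java.util.regex, where Matcher.matches() requires the whole input to match and Matcher.find() accepts a sub-string. A small sketch; the class and helper names are made up, and that the node uses full-match semantics internally is my assumption based on the behaviour described above.

```java
import java.util.regex.Pattern;

class FullVersusPartial {
    // Whole input must match the pattern.
    static boolean fullMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // Any sub-string of the input may match the pattern.
    static boolean partialMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("\\p{L}", "abc"));          // prints false
        System.out.println(fullMatch("\\p{L}+", "abc"));         // prints true
        System.out.println(partialMatch("\\p{L}", "abc"));       // prints true
        // Wrapping in .* emulates a sub-string match under full-match rules.
        System.out.println(fullMatch(".*\\p{L}.*", "12a34"));    // prints true
    }
}
```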

Anyways, there still is another disjoint. As displayed in the above screenshot, “\\” is required. Otherwise the syntax becomes invalid.

When “\” is used in the variable, regardless of enclosing quotes being used, an exception is thrown. I ran all scenarios to use RegEx as a variable:

  1. Default – No Enclosing Quotes, w/o Double Backslash yields no match
  2. Enclosing Quotes, w/o Double Backslash throws exception
  3. Enclosing Quotes + w/ Doubled Backslash yields no match
  4. No Enclosing Quotes, w/ Doubled Backslash throws exception

Taken together, the disjoint from the actual function definition, the lack of modifiers, and the fact that Knime, for whatever reason, demands a full string match yet struggles with multi-line strings feel inherently broken, don’t they?
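For comparison, in plain java.util.regex the "." does not match line breaks by default, and modifiers can be embedded in the pattern itself, e.g. (?s) for dot-all or (?i) for case-insensitive. Whether the String Manipulation node passes such embedded flags through unchanged is an assumption I have not verified; the sketch below only shows the Java behaviour, with made-up class and helper names.

```java
import java.util.regex.Pattern;

class EmbeddedFlags {
    // Whole input must match the pattern.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String multi = "first line\nsecond line";

        // "." stops at the line break, so ".*" cannot cover the whole string.
        System.out.println(fullMatch(".*", multi));              // prints false

        // (?s) switches on dot-all mode from inside the pattern.
        System.out.println(fullMatch("(?s).*", multi));          // prints true
        System.out.println(fullMatch("(?s).*second.*", multi));  // prints true

        // (?i) switches on case-insensitive matching the same way.
        System.out.println(fullMatch("(?i)ABC", "abc"));         // prints true
    }
}
```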

Thanks for your guidance, which was tremendously helpful. What is your take on this? I hope this might get addressed by the Knime developers.

Cheers
Mike
