String Manipulation regular expression syntax for a fuzzy match

I am stuck on using regular expressions in the String Manipulation node. This will be an example of text analysis for new KNIME users. My limitation is that I need to use a standard KNIME node and not a language snippet, a Palladian node, or something similar.

The main idea is to detect “password,” “passcode,” or a similar term such as “passwd,” “passwrod,” or “pass code,” while excluding terms such as “passkey,” “pass ” or “unsurpassed.” On https://regex101.com, I constructed this regex, which does exactly what I need with the Java 8 flavor:

(\b)pass[ ]?[word]{2,4}(\b)|(\b)pass[ ]{0,1}[code]{2,4}(\b)

In the String Manipulation node in KNIME, my current configuration has this expression to match:

regexMatcher($Full Description$,
".*\bpass[ ]?[word]{2,4}(\b)|(\b)pass[ ]{0,1}[code]{2,4}\b.*")

I have tried different variations, such as with/without .* and with/without parentheses around \b. My sample data set contains plenty of “password” and “passcode” strings, but I only get False in the output table. What am I missing?

Thank you for your help!

I broke down your regex string piece by piece, then put it back together. I got the desired result using:

regexMatcher($column1$,"pass[ ]?[word]{2,4}|pass[ ]{0,1}[code]{2,4}")


Thanks! I neglected to mention that the terms are embedded in longer sentences, such as “I need to reset my password” and “Please change the pass code to the office door.” I tried your solution, but it did not work on the longer phrases.
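The behavior described here is consistent with regexMatcher requiring the pattern to match the entire cell value, the way Java's `Pattern.matches` does. The sketch below reproduces that in plain Java, outside KNIME; the class and method names are made up for illustration, and the assumption that regexMatcher uses whole-string matching is inferred from this thread rather than from the node's source.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of KNIME.
class FullMatchDemo {
    // elsamuel's pattern, unchanged.
    static final String BARE = "pass[ ]?[word]{2,4}|pass[ ]{0,1}[code]{2,4}";

    // Whole-string match: the pattern must cover the entire input.
    static boolean wholeMatch(String text) {
        return Pattern.matches(BARE, text);
    }

    // Substring search, shown only for comparison.
    static boolean containsMatch(String text) {
        return Pattern.compile(BARE).matcher(text).find();
    }

    public static void main(String[] args) {
        String sentence = "I need to reset my password";
        System.out.println(wholeMatch("password"));   // true
        System.out.println(wholeMatch(sentence));     // false
        System.out.println(containsMatch(sentence));  // true
    }
}
```

Under whole-string matching, the term has to be surrounded by something that can absorb the rest of the sentence, which is what leading and trailing `.*` wildcards do.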

Basically, you only need to add wildcards (.*) before and after elsamuel’s pattern, because regexMatcher only returns True when the whole string matches.

NB: word boundaries in String Manipulation nodes need to come with an additional backslash. If I had only used \b instead of \\b, I would have gotten False in every row.
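As a rough illustration of both points (wildcards and the doubled backslash), here is a plain-Java sketch. It assumes the String Manipulation node handles string literals the way Java source does, where `"\b"` is a backspace character and `"\\b"` is the regex word boundary; the class name, constants, and the exact grouping are illustrative, not a copy of the expression in the screenshot. The alternation is wrapped in parentheses so that the `.*` wildcards apply to both branches.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of KNIME.
class BoundaryDemo {
    // "\\b" in Java source reaches the regex engine as \b (word boundary).
    static final String ESCAPED =
        ".*(\\bpass[ ]?[word]{2,4}\\b|\\bpass[ ]?[code]{2,4}\\b).*";

    // "\b" in Java source is a literal backspace character (U+0008),
    // so this pattern only matches text that contains backspaces.
    static final String UNESCAPED =
        ".*(\bpass[ ]?[word]{2,4}\b|\bpass[ ]?[code]{2,4}\b).*";

    // Whole-string match, mirroring the behavior seen in this thread.
    static boolean matches(String pattern, String text) {
        return Pattern.matches(pattern, text);
    }

    public static void main(String[] args) {
        String[] samples = {
            "I need to reset my password",
            "Please change the pass code to the office door.",
            "passwd", "passwrod", "passkey", "pass ", "unsurpassed"
        };
        for (String s : samples) {
            System.out.println("\"" + s + "\" -> " + matches(ESCAPED, s));
        }
        // With a single backslash, even an obvious hit is missed.
        System.out.println(matches(UNESCAPED, "I need to reset my password")); // false
    }
}
```

The same pattern text, written with `\\b`, should be usable as the second argument of regexMatcher, although that has not been verified here against a KNIME installation.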


Thank you to @G47_2 and @elsamuel! The double backslash and additional parentheses solved the problem.
