Support for RegEx in Rule Based Nodes

Hi,

It would be particularly useful to have RegEx facilities in nodes like the Rule-Based Row Filter/Splitter nodes. Simple wildcard support is rather restrictive.

 

Simon.


Do you mean something beyond the MATCHES keyword? What is your use case? Would you like to refer to named groups with a syntax like ?<groupName> and compare them to other text? (That would only make sense after an AND or in an outcome, and only when no other group shares that name. I think supporting $number would be a bit tricky, since it would usually be ambiguous when there are multiple groups.) I guess this is a bit too hard to explain. Or do you need something like Java's find? That can be simulated with MATCHES by adding .*? as a prefix and suffix to the pattern.

(I think allowing computations like some of those in String Manipulation or Math Expression in outcomes would be nice too, and maybe in conditions as well, though I guess those are not supported by PMML RuleSet.)
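
To make the find versus MATCHES point concrete, here is a minimal plain-Java sketch (not KNIME code). It assumes MATCHES behaves like Java's whole-string matches(), and it uses a .*? prefix and a .* suffix as the wrapper. The class name, helper names and sample string are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names and sample input are illustrative only.
class FindVersusMatches {
    // find(): true if the pattern occurs anywhere in the input.
    static boolean findAnywhere(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // matches(): true only if the WHOLE input matches, so the pattern is
    // wrapped in ".*?" and ".*" to simulate find(). The non-capturing group
    // keeps alternations such as "a|b" intact inside the wrapper.
    static boolean matchesWrapped(String regex, String input) {
        return Pattern.compile(".*?(?:" + regex + ").*").matcher(input).matches();
    }

    public static void main(String[] args) {
        String input = "some text here";
        System.out.println(Pattern.matches("text", input)); // false: whole string must match
        System.out.println(findAnywhere("text", input));    // true
        System.out.println(matchesWrapped("text", input));  // true
    }
}
```

One caveat with this sketch: by default the dot does not match line terminators, so the wrapped form can differ from find() on multi-line input unless the (?s) flag is added.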

Hi,

It's really for being able to define matches like:

text\b to ensure you only match text as a whole word and not as part of the word texting.

or

simon[0-9]{1,3} to say simon must be followed by 1 to 3 digits.

 

Unless I have misunderstood, I don't think this is possible in these types of nodes. Can LIKE or MATCHES do this?

Simon. 

Hi Simon,

   I am sorry for not being clear enough. I hope the attached workflow provides a satisfactory answer (I was not sure whether you are on 3.x, so it was created with 2.12, but there have been no changes that should affect this). It should work the same way in the filter nodes too, though PMML RuleSet does not support regular expressions, so MATCHES is not available there.

   For the record, here are the rules:

$Input$ MATCHES "(?i).*?\btext\b.*" => "text"
$Input$ MATCHES "(?i).*?simon\d{1,3}.*" => "Simon"
TRUE => "default"

and here they are with slash delimiters, in case you need escaping (for example for quotes, though in this case it is not necessary):

$Input$ MATCHES /(?i).*?\\btext\\b.*/ => "text"
$Input$ MATCHES /(?i).*?simon\\d{1,3}.*/ => "Simon"
TRUE => "default"

The input table, with the outcome of the rules for each row, looks like this:

Input | Outcome
no textual information or Simon in this column | default
texting is not allowed, but Simon007 is welcome | Simon
Every text should have a meaning. | text
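
If you want to check the patterns outside KNIME, a rough plain-Java equivalent of the three rules might look like this. It is only a sketch: it assumes the first matching rule wins and that MATCHES behaves like Java's matches(). The class and method names are made up. Note that a Java string literal needs doubled backslashes, much like the slash form of the rules.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for the three rules; not KNIME code.
class RuleSketch {
    // Same patterns as in the rules, written as Java string literals.
    private static final Pattern TEXT = Pattern.compile("(?i).*?\\btext\\b.*");
    private static final Pattern SIMON = Pattern.compile("(?i).*?simon\\d{1,3}.*");

    // Assumption: rules are tried in order and the first match decides.
    static String classify(String input) {
        if (TEXT.matcher(input).matches()) {
            return "text";
        }
        if (SIMON.matcher(input).matches()) {
            return "Simon";
        }
        return "default"; // corresponds to TRUE => "default"
    }

    public static void main(String[] args) {
        String[] rows = {
            "no textual information or Simon in this column",
            "texting is not allowed, but Simon007 is welcome",
            "Every text should have a meaning."
        };
        for (String row : rows) {
            System.out.println(row + " -> " + classify(row));
        }
    }
}
```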

I hope this helps. Cheers, gabor

What I tried to explain: You cannot have a rule like this:

$Input$ MATCHES "(?i).*?simon(?<agent>\d{3}).*" AND ?<agent> > "107" => "Simon"
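
For comparison, in plain Java (outside the rule nodes) the intent of such a rule could be sketched roughly as below. This is an illustration only: the class name is made up, and comparing the group as an integer rather than as text is my assumption about what the rule is meant to do.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the rule's intent; not something the rule nodes support.
class NamedGroupSketch {
    // Same pattern as in the rule, with a named group for the three digits.
    private static final Pattern AGENT =
            Pattern.compile("(?i).*?simon(?<agent>\\d{3}).*");

    static String classify(String input) {
        Matcher m = AGENT.matcher(input);
        // Assumption: the captured digits are compared numerically against 107.
        if (m.matches() && Integer.parseInt(m.group("agent")) > 107) {
            return "Simon";
        }
        return "default";
    }

    public static void main(String[] args) {
        System.out.println(classify("Simon007 is welcome")); // default (7 is not above 107)
        System.out.println(classify("Simon108 is welcome")); // Simon
    }
}
```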

 


Brilliant Gabor, sorry for not understanding your initial response, I never knew this was possible. This will make things much easier. 

Simon.

Gabor, one more question.

So is the MATCHES term using standard RegEx syntax then, or a slight modification of it?

The reason I ask is that I am unfamiliar with the (?i) term you use. What is it for?

Simon.

 

Yes, it is just the basic Java regex (though the double-quoted strings are not like Java Strings: they do not support escaping, which is better for regexes). The (?i) just turns on case-insensitive matching (you might notice that the pattern had simon in lowercase while the input had it in title case, and with this option it still matched). You can check the other possible flags in the Javadoc of the Pattern class.
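
A small plain-Java sketch of what (?i) does, in case it helps: the inline flag should behave like passing Pattern.CASE_INSENSITIVE when compiling the pattern. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo of the inline (?i) flag; names are illustrative only.
class CaseFlagSketch {
    // Inline flag at the start of the pattern.
    static boolean inlineFlag(String input) {
        return Pattern.compile("(?i)simon\\d{1,3}").matcher(input).find();
    }

    // Same effect using the compile-time flag instead.
    static boolean compileFlag(String input) {
        return Pattern.compile("simon\\d{1,3}", Pattern.CASE_INSENSITIVE)
                .matcher(input).find();
    }

    // No flag: matching is case sensitive.
    static boolean noFlag(String input) {
        return Pattern.compile("simon\\d{1,3}").matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "Simon007";
        System.out.println(inlineFlag(input));  // true
        System.out.println(compileFlag(input)); // true
        System.out.println(noFlag(input));      // false
    }
}
```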
 

Ah, it's OK, Google tells me that in RegEx it turns on case insensitivity for all succeeding characters.

You learn something every day. And that will be quite useful.

In which case MATCHES is just RegEx, whoopee!

 

Thanks as always Gabor.

a fountain of information!

simon.