Rule Engine Node

Is it possible to write a specific/custom rule with the Rule Engine node?

For example, let's say I have the following sentence in a cell: "to select your favorite color is not so hardOn the other hand, choosing your favorite painting might be"

Can we write a rule to split the cell where there is no whitespace character between two separate sentences but the second sentence starts with an uppercase letter?

Another version of my question involves any type of punctuation between two merged words (such as "hard.On").

Hi boraster,

Your "hardOn" example made me laugh quite a lot (but I am *very* easily pleased!).

For both cases, I would recommend the RegEx Split node. This is very powerful, but you do need to familiarise yourself with regular expressions... I have attached an example for your first use case (i.e. splitting at the first capital letter encountered). Hopefully you can extend this to your second "hard.On" use case with a little bit of Googling.
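For readers without the attachment, here is a minimal Java sketch of the idea. It is not the attached workflow. The pattern, the class name MergedSentenceSplit and the helper split are my own illustrations, and I am assuming the RegEx Split node turns each capturing group into an output column. Group 1 is the shortest prefix that ends in a lowercase letter (optionally followed by ., ! or ?), and group 2 is the rest, which has to start with an uppercase letter.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MergedSentenceSplit {
    // Group 1: shortest prefix ending in a lowercase letter, plus optional . ! or ?
    // Group 2: the remainder, which must begin with an uppercase letter.
    // (Hypothetical pattern for illustration, not the one from the attachment.)
    static final Pattern MERGED =
            Pattern.compile("(.*?\\p{Ll}[.!?]?)(\\p{Lu}.*)");

    // Returns both parts, or the whole cell as a single part if nothing matches.
    static String[] split(String cell) {
        Matcher m = MERGED.matcher(cell);
        if (m.matches()) {
            return new String[] { m.group(1), m.group(2) };
        }
        return new String[] { cell };
    }

    public static void main(String[] args) {
        String cell = "to select your favorite color is not so hardOn the other hand, "
                + "choosing your favorite painting might be";
        for (String part : split(cell)) {
            System.out.println("[" + part + "]");
        }
    }
}
```

Be aware that a pattern like this also splits inside words such as "iPhone" or "McDonald", so it only suits text where a lowercase-to-uppercase jump really does mark a merged sentence.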

Kind regards

James

PS I don't recommend Googling "solving hard on problems"...

Made me laugh too....

Simon.

Good example, isn't it :)))

I have very limited knowledge of regex and almost none of java.

Is there any specific site you may recommend for beginners?

Hi boraster,

I have zero knowledge of Java, and had none of RegEx or of any programming. I feel I'm reasonably competent with RegEx now. I learnt everything from these two websites:

http://www.regular-expressions.info

http://en.m.wikipedia.org/wiki/Regular_expression

Simon.

Some people may be very disappointed when they are directed to KNIME to solve their HardOn problems. I know KNIME can do many things, but that's going too far.

 

Thanks a lot, Simon.

Dear James,

I forgot to thank you for your kind help with my "HardOn" problem :))))

Hi Simon,

I am really sorry to bother you, but I need your help. I am starting to get the hang of regex a little, but I couldn't find any way to test my expressions other than the RegEx Split node. All my trials failed, and I don't know where to look for answers. I am sending you the table I use for practice, with just one column and 4 rows. I want to extract numbers and strings into separate groups, following this pattern:

(Aylık sabit ücret: )(25)( TLTeklif içeriği: )(Her yöne )(1500)( Dakika)

In fact, what I want to know is: is it possible to extract any number or any string from rows that do not share a common pattern?

I forgot to add my last failed regex attempt for the above problem:

(\w*\s*\w*\s*\w*:\s*)(\d*)(\s*\w*\s*\w*:\s*)(\w*\s*\w*\s*)(\d*)(\s*\w*\s*)

Hi Boraster,

Try this:

(\p{L}+\s\p{L}+\s\p{L}+):[^A-Za-z](\d+)[^A-Za-z](\p{L}+[^A-Za-z]\p{L}+):[^A-Za-z](\p{L}+\s\p{L}+)\s(\d+)[^A-Za-z](\p{L}+)[\s\S]*

 

That was an immensely complex set of text, and it required some unusual regex constructs. Some of the text appears to contain spaces, but they are not really ordinary spaces, so \s was not working. I therefore had to use [^A-Za-z], which means any character that is not an ASCII letter. Also, the text is not made up of standard English characters (I claim ignorance, I don't know which language it is), and because of this you need \p{L} to capture any possible type of letter, as \w only covers A-Z, a-z, 0-9 and _.

Finally, to make sure you still capture your third row, you need to capture the extra text at the end. Normally .* would capture this, but here it doesn't, as you must have line breaks in there. Therefore [\s\S] is needed, which covers every possibility.
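The three points above can be checked in isolation with a small Java sketch. The class name RegexClassChecks and the helper fullMatch are made up for this illustration, and I am assuming the odd "spaces" are non-breaking spaces (U+00A0), which I have not verified against the actual table. The non-ASCII characters are written as \u escapes so the file does not depend on the source encoding.

```java
import java.util.regex.Pattern;

class RegexClassChecks {
    // True only if the whole text matches the regex (same as Pattern.matches).
    static boolean fullMatch(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        String aylik = "Ayl\u0131k"; // "Aylik" with a dotless i

        // By default \w is ASCII-only, so the dotless i is not matched.
        System.out.println(fullMatch("\\w+", aylik));          // false
        // \p{L} matches a letter from any script.
        System.out.println(fullMatch("\\p{L}+", aylik));       // true

        // Assumption: the odd "space" is U+00A0, which default \s does not cover.
        System.out.println(fullMatch("\\s", "\u00A0"));        // false
        System.out.println(fullMatch("[^A-Za-z]", "\u00A0"));  // true

        // "." does not match line breaks by default, [\s\S] matches anything.
        System.out.println(fullMatch(".*", "a\nb"));           // false
        System.out.println(fullMatch("[\\s\\S]*", "a\nb"));    // true
    }
}
```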

Hope this helps, but you've really thrown yourself in at the deep end with this one.

One of the links I sent covers some of this stuff: http://www.regular-expressions.info/unicode.html

 

Simon.

Also, there are many nodes in KNIME now which support RegEx, so you would benefit from learning more of it.

Nodes include RegEx Split, Column Rename (RegEx), String Replacer, Row Filter, Row Splitter, and many other nodes which have advanced column handling support such as Column Filter etc.

For your last question, I guess there needs to be some pattern, otherwise how will the algorithm know where you want to split the text?

Hope this has helped.

Simon.

In fact, thinking more about this, if you only want a split whenever the text switches from letters to numbers and back again, a much simplified RegEx would be something like:

[^0-9]+[0-9]+[^0-9]+[0-9]+ etc..
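To get separate pieces out of that simplified expression, each run needs its own capturing group. Below is a rough Java sketch for a row with exactly two numbers. The class name DigitRunGroups and the method groups are hypothetical, and the fifth group is my own addition to pick up any trailing text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DigitRunGroups {
    // text, number, text, number, optional trailing text
    static final Pattern ALTERNATING =
            Pattern.compile("([^0-9]+)([0-9]+)([^0-9]+)([0-9]+)([^0-9]*)");

    // Returns the captured groups, or an empty list if the row does not fit.
    static List<String> groups(String row) {
        List<String> out = new ArrayList<>();
        Matcher m = ALTERNATING.matcher(row);
        if (m.matches()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                out.add(m.group(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Bora's sample row, with non-ASCII letters written as \\u escapes.
        String row = "Ayl\u0131k sabit \u00fccret: 25 TLTeklif i\u00e7eri\u011fi: "
                + "Her y\u00f6ne 1500 Dakika";
        for (String g : groups(row)) {
            System.out.println("[" + g + "]");
        }
    }
}
```

Note that this yields five groups for the sample row, not the six in the requested pattern: a digits-only rule has no way to separate " TLTeklif içeriği: " from "Her yöne ", so that split still needs the colon as an extra delimiter.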

Simon.

Thanks a lot Simon. This makes better sense now.

I hope I can figure out the whole thing, because I never thought of \w as not covering all Unicode characters.

I really appreciate you spending so much time answering my questions.

Thanks again my friend.

Bora

You might want to familiarize yourself with the specific details that Java adds to Regex, too. You can, for example, enable Unicode mode with "(?U)" or use classes like "\p{Graph}" (all visible characters) or "\p{Digit}" (all digits).

For your example, using the latter, "(?U)\p{Digit}+" selects all numbers, "(?U)[^\p{Digit}]+" selects all non-numbers, and if you need an alternating list of both, you could for example replace every "(?U)(\p{Digit}(?=[^\p{Digit}])|[^\p{Digit}](?=\p{Digit}))" with "$1#" and then split at "#". This last one matches every border between a number and a non-number and vice versa and also demonstrates look-ahead, which is a really powerful feature.
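The replace-then-split trick just described could look like this in plain Java. This is only a sketch: the class name DigitBorderSplit and the method split are made up, and it assumes the marker "#" never occurs in the input text.

```java
import java.util.Arrays;
import java.util.List;

class DigitBorderSplit {
    // Matches the last character before every digit/non-digit border:
    // a digit followed by a non-digit, or a non-digit followed by a digit.
    // (?U) switches \p{Digit} to its Unicode-aware meaning.
    static final String BORDER =
            "(?U)(\\p{Digit}(?=[^\\p{Digit}])|[^\\p{Digit}](?=\\p{Digit}))";

    static List<String> split(String text) {
        // Re-insert the matched character ($1) and append the marker.
        String marked = text.replaceAll(BORDER, "$1#");
        // Assumption: "#" does not appear in the original text.
        return Arrays.asList(marked.split("#"));
    }

    public static void main(String[] args) {
        System.out.println(split("abc123def45")); // [abc, 123, def, 45]
    }
}
```

The look-ahead is what keeps this working: it checks the next character without consuming it, so that character is still available to be matched as the start of the following border.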

If you want to learn more: I'm constantly looking up details in this documentation, it's really helpful.

Thanks a lot, Martin. This was very educational. I agree with you about getting familiar with the Java regex capabilities and options.

I have added the documentation link to my favorites for future reference.

 

Bora

Hi,

I have a different problem with the Rule Engine node. When I use it on my data and then use the Decision Tree nodes, the tree contains only the 3 values of my target variable. It is disconnected from the other variables, has no other branches, and reports an accuracy of 100%.

I use my Rule Engine node just after the File Reader node. Is that the correct place?

This is my code for filtering:

$TEN$ =1 => "T"
$TEN$ =2 => "F"
TRUE => "C"

I have 6 values (0, 1, 2, 3, 4, 5) for this TEN variable.

Any advice?

Laleh