Regex in KNIME

And_Z · June 5, 2019, 11:41am

Dear all,

I hope it’s fine that I just hijack the topic. I also have a problem with a regex.

I tested the regex on several websites in order to fit it to my needs. However, once I apply it to my data in KNIME, I don’t get the same filter results.

The regex is:
^(([a-zA-Z0-9]+(?![^$]*$)% )([a-z]+).*)|([a-zA-Z0-9]*(?![^$]*$)%)

and a subset of the data:
20% phenylephrine HCl cream, Solvay/SLA Pharma
10% lidocaine (vaginal gel, BDS, dysmenorrhea/pelvic pain), Juniper
oxybutynin (3% gel, overactive bladder), Antares/Allergan
2% diltiazem HCl cream, Solvay/SLA Pharma
0.005% latanoprost (ophthalmic formulation, glaucoma), Senju Pharmaceutical
econazole nitrate 1% (topical foam, tinea pedis), Exeltis USA Dermatology
econazole nitrate 1% (topical foam, tinea pedis), Quinnova
PreM80%E

The ones in bold are supposed to be hits, whereas oxybutynin isn’t supposed to be a hit.

I included the regex in a Rule-based Row Splitter node using the MATCHES condition, but I don’t get all the terms in the filter, e.g. PreM80%E and 0.005% latanoprost are missing.

Does anybody have an idea what the problem might be?

many thanks in advance

best

Andreas

ipazin · June 5, 2019, 1:44pm

Hi Andreas,

it is (was) fine, but it is probably better to have this in its own topic.

Anyway, when I tried checking your regex I got a pattern error, so I couldn’t test it. Can you check it?

Br,
Ivan

quaeler · June 5, 2019, 5:22pm

I notice your regex magically becomes italicized in the middle of it below, perhaps because it contains a _. Please make sure to properly mark up any accuracy-critical information in forum posts by surrounding it with back quotes (the ` character).

armingrudd · June 5, 2019, 6:14pm

Hi,

First of all, please make clear what your input and desired output are. Although a part of the text is provided, I cannot understand what you want to do with it. If you want to split the string, let us know which parts should be separated and what the rule is.
Second, to split strings using regex, you have to use the Regex Split node.
Third, pay attention to what @quaeler has mentioned.
And the last point: you can test your regex here.

Best,
Armin

And_Z · June 6, 2019, 8:34am

Dear all,

Firstly, thanks for moving/creating a new topic.
Secondly, I checked the regex again and set it in back quotes (`).

Concerning the Rule-based Row Splitter, the node description says:
[screenshot of the node description, not preserved]

To me this sounds as if you could use a regex with that node, and it also seems to work, but only partially.

What I’m trying to do is to separate all the terms matching the regex.
In my case that would mean that as hits of the regex I want to get:

20% phenylephrine HCl cream, Solvay/SLA Pharma
10% lidocaine (vaginal gel, BDS, dysmenorrhea/pelvic pain), Juniper
2% diltiazem HCl cream, Solvay/SLA Pharma
0.005% latanoprost (ophthalmic formulation, glaucoma), Senju Pharmaceutical
econazole nitrate 1% (topical foam, tinea pedis), Exeltis USA Dermatology
econazole nitrate 1% (topical foam, tinea pedis), Quinnova
PreM80%E

separated from:
oxybutynin (3% gel, overactive bladder), Antares/Allergan

Concerning the regex: I tested it on the example data and it does match them.

However, some matches are only partial, e.g.

PreM80%E
0.005% latanoprost (ophthalmic formulation, glaucoma), Senju Pharmaceutical
econazole nitrate 1% (topical foam, tinea pedis), Exeltis USA Dermatology
econazole nitrate 1% (topical foam, tinea pedis), Quinnova

These are also missing from the Rule-based Row Splitter output, whereas the others are included.

Therefore I would guess that the explanation is that only full matches are split off.

I’ll try to get the regex good enough to match all the ones I need. Still, it would be interesting to know whether only full matches are included by the node.

best

Andreas

ipazin · June 6, 2019, 9:07am

Hi Andreas,

Your guess is correct. The MATCHES operator is TRUE only if the whole input is matched.

If you want, try explaining the logic behind your regex/splitting and someone might be able to help or advise you.
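For readers who want to see the difference between a partial and a whole-input match, here is a minimal sketch in Python. It is only an analogy: `re.fullmatch` stands in for the whole-input behaviour described above, `re.search` for what many online regex testers show, and the two small patterns are made up for illustration (they are not the regex from this thread).

```python
import re

# Hypothetical illustration patterns, not the regex from the thread.
PARTIAL = r"[0-9.]+%"        # describes only the "<number>%" fragment
PADDED = r".*[0-9.]+%.*"     # same fragment, padded to cover the whole string


def partial_match(pattern: str, text: str) -> bool:
    """True if the pattern is found anywhere in the text."""
    return re.search(pattern, text) is not None


def whole_input_match(pattern: str, text: str) -> bool:
    """True only if the pattern consumes the entire text."""
    return re.fullmatch(pattern, text) is not None


if __name__ == "__main__":
    for row in ["PreM80%E", "20% phenylephrine HCl cream, Solvay/SLA Pharma"]:
        print(
            row,
            partial_match(PARTIAL, row),
            whole_input_match(PARTIAL, row),
            whole_input_match(PADDED, row),
        )
```

A pattern that only highlights part of a row in a tester will therefore not satisfy a whole-input condition unless it is extended (for example with `.*`) to cover the rest of the row.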

Br,
Ivan

And_Z · June 6, 2019, 9:16am

Hi @ipazin

thanks for the confirmation.

well… logic …

I guess I got what I wanted by the following regex:

^(([a-zA-Z0-9\.\s]+%(?![^$]*$)).*)
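As a quick check outside KNIME, the pattern can be run against the sample rows with a whole-input match. This is only a sketch in Python rather than KNIME's own regex engine, and it rests on an assumption: the `$` characters in the lookahead as displayed above look like escaped parentheses lost in forum rendering, so the code uses `(?![^(]*\))` (a percent sign that is not followed by a closing parenthesis without an opening one in between). Taken literally with `$`, the lookahead would reject every sample row, since none of them contains a `$`.

```python
import re

# ASSUMPTION: "(" and "\)" are used where the post displays "$";
# this reconstruction is not confirmed by the original author.
PATTERN = re.compile(r"^(([a-zA-Z0-9\.\s]+%(?![^(]*\))).*)")

ROWS = [
    "20% phenylephrine HCl cream, Solvay/SLA Pharma",
    "10% lidocaine (vaginal gel, BDS, dysmenorrhea/pelvic pain), Juniper",
    "oxybutynin (3% gel, overactive bladder), Antares/Allergan",
    "2% diltiazem HCl cream, Solvay/SLA Pharma",
    "0.005% latanoprost (ophthalmic formulation, glaucoma), Senju Pharmaceutical",
    "econazole nitrate 1% (topical foam, tinea pedis), Exeltis USA Dermatology",
    "econazole nitrate 1% (topical foam, tinea pedis), Quinnova",
    "PreM80%E",
]


def split_rows(rows):
    """Separate rows that match the whole pattern from those that do not."""
    hits, rest = [], []
    for row in rows:
        (hits if PATTERN.fullmatch(row) else rest).append(row)
    return hits, rest


if __name__ == "__main__":
    hits, rest = split_rows(ROWS)
    print("hits:", hits)
    print("rest:", rest)
```

Under that assumption, only the oxybutynin row ends up in `rest`, which is the separation asked for earlier in the thread.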

thanks a lot for the input guys

best

Andreas

ipazin · June 6, 2019, 9:21am

Glad you solved it!

Have a nice day,
Ivan

armingrudd · June 6, 2019, 9:22am

Right! I missed that. Sorry. I have edited my last reply so it won’t mislead anymore.

Do you have each line of the text here in a separate row?
Row splitters split rows into two output tables.
If this is the case here, then let me know what the rule is for identifying the rows to exclude.
In the Rule-based Row Splitter you can match either group and send it to either port. So I think in your case it is easier to match rows like
oxybutynin (3% gel, overactive bladder), Antares/Allergan
and send them to the second port, for example.
Here is the expression you need in the Rule-based Row Splitter (based on your own regex, which excluded any line with a % inside parentheses):
$column1$ MATCHES ".*$.*%.*$.*" => TRUE
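
To see which rows a rule like this would pick out, here is a small Python sketch of the same idea. It assumes the `$` characters inside the quoted pattern above are escaped parentheses that were lost in rendering, i.e. that the intended pattern is `.*\(.*%.*\).*`. `re.fullmatch` stands in for MATCHES, and the function name is hypothetical.

```python
import re

# ASSUMPTION: reconstructed pattern, "a % somewhere inside parentheses".
PERCENT_IN_PARENS = re.compile(r".*\(.*%.*\).*")

SAMPLE_ROWS = [
    "20% phenylephrine HCl cream, Solvay/SLA Pharma",
    "10% lidocaine (vaginal gel, BDS, dysmenorrhea/pelvic pain), Juniper",
    "oxybutynin (3% gel, overactive bladder), Antares/Allergan",
    "2% diltiazem HCl cream, Solvay/SLA Pharma",
    "0.005% latanoprost (ophthalmic formulation, glaucoma), Senju Pharmaceutical",
    "econazole nitrate 1% (topical foam, tinea pedis), Exeltis USA Dermatology",
    "econazole nitrate 1% (topical foam, tinea pedis), Quinnova",
    "PreM80%E",
]


def rule_based_split(rows):
    """Return (matched, unmatched) lists, mimicking a two-table split."""
    matched, unmatched = [], []
    for row in rows:
        if PERCENT_IN_PARENS.fullmatch(row):
            matched.append(row)
        else:
            unmatched.append(row)
    return matched, unmatched


if __name__ == "__main__":
    matched, unmatched = rule_based_split(SAMPLE_ROWS)
    print("matched:", matched)
    print("unmatched:", unmatched)
```

Matching the single row to exclude keeps the pattern short, because it no longer has to describe every shape the wanted rows can take.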

And here is an example workflow:
Splitter.knwf (13.6 KB)

Best,
Armin

system · December 5, 2019, 9:22pm

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.