Regex split question

aseeber · February 12, 2020, 10:48pm

Hi,

I have a relatively simple question that I’ve been banging my head against for a while.

I would like to use the Regex Split node to split text like the following:
gH2AX_694;53BP1_159

Ideally, I would like a search-and-capture expression that finds “gH2Axxxx” and “53BP1xxxx” and splits them into two columns.

I know how to do this with the Cell Splitter node, but I would really like to understand how regex works in KNIME. I’ve tried regex101, but the solutions I get there don’t seem to work in KNIME.

Does anyone have a suggestion on the best way to go about this?

qqilihq · February 13, 2020, 6:24am

Hi aseeber,

this is a perfect case for the brand new Regex Extractor node in Palladian 2.0 – especially if you’re used to more intuitive tools such as Regex101 you’ll feel right at home. See here for the announcement:

You can find an example workflow on NodePit: regex-split-question-20968 — NodePit

Here’s the node configuration for your data:

I used the following regex:

(?<firstValue>gH2A[A-Z0-9_]+);
(?<secondValue>53BP1[A-Z0-9_]+)

It uses the “named capture groups” firstValue and secondValue, which give the output columns their names. Anyway, when editing the expression you’ll always see a preview of the results, as you’re used to from Regex101.
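
For anyone who wants to try the named-group idea outside KNIME, here is a minimal sketch in Python. It is not the node's implementation. Two assumptions: Python's re module spells named groups as (?P<name>...) rather than (?<name>...), and the two parts of the expression are joined on a single line. The helper name split_markers is made up for illustration.

```python
import re

# Same expression as in the post, adapted to Python's (?P<name>...) spelling
# and written on one line (both are assumptions of this sketch).
PATTERN = re.compile(
    r"(?P<firstValue>gH2A[A-Z0-9_]+);(?P<secondValue>53BP1[A-Z0-9_]+)"
)

def split_markers(text):
    """Return {group name: matched text}, or None if the pattern does not match."""
    match = PATTERN.search(text)
    return match.groupdict() if match else None

result = split_markers("gH2AX_694;53BP1_159")
print(result)  # {'firstValue': 'gH2AX_694', 'secondValue': '53BP1_159'}
```

Each group name becomes a key here, much like it becomes a column name in the node.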

Any feedback welcome!

– Philipp

PS: An alternative approach could be to define a “tokenization expression”. This makes sense if you have a variable number of items separated by a semicolon (;):

(?:\w+|[^;]+)

It will basically create one match for each value between the semicolons.
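
The tokenization idea can be checked in plain Python as well. This is only a sketch of how the expression behaves with re.findall, not of how the node works internally; the third item in the sample string (RAD51_12) is invented to show a variable number of values.

```python
import re

# Tokenization expression from the post: each match is one value between semicolons.
TOKEN = re.compile(r"(?:\w+|[^;]+)")

def tokenize(text):
    """Return every non-overlapping match of the tokenization expression."""
    # The group is non-capturing, so findall returns the full matches.
    return TOKEN.findall(text)

two_items = tokenize("gH2AX_694;53BP1_159")
three_items = tokenize("gH2AX_694;53BP1_159;RAD51_12")
print(two_items)    # ['gH2AX_694', '53BP1_159']
print(three_items)  # ['gH2AX_694', '53BP1_159', 'RAD51_12']
```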

aseeber · February 14, 2020, 3:39am

Thanks so much, this is fantastic and just what I was looking for!

I’m testing it out and so far it works perfectly.
