Extract certain numbers from heterogeneous inputs

Hello KNIME Community!

I have a problem where I need to extract numbers and phrases from heterogeneous inputs (no uniform format). This is assay description data from ChEMBL. In the screenshot you can see the different inputs. One of the things I need to extract and append to a new column is the concentration, so I was thinking of just extracting the numbers, but I cannot use the index method. I know some rows do not have numbers at all and some have multiple numbers, and I’m not sure how to differentiate between those, if I can at all. Thanks in advance!!

Hello @Coral_OBrien,

Regular expressions are usually the way to go in these cases. However, to avoid guessing, can you share some input data and expected output? (Keep in mind that we cannot copy from screenshots :wink: )

Br,
Ivan

Hi @ipazin,

For sure. Here are some inputs and expected outputs.

Input 1 - “Millipore: Percentage of residual kinase activity of NEK3 at 1uM relative to control. Control inhibitor: Phosphoric acid* at 0.3uM. Buffer: 8 mM MOPS pH 7.0, 0.2 mM EDTA”
Expected output 1 - “1uM”

Input 2 - “Inhibition of NEK3 (unknown origin) using [gamma-33P]ATP assessed as residual activity at 3 uM”
Expected output 2 - “3 uM”

Input 3 - “Inhibition of wild-type human full length NEK3 (M1 to R506 residues) expressed in mammalian expression system assessed as residual activity at 50 nM by Kinomescan method relative to control”
Expected output 3 - “50 nM”

Input 4 - “Inhibition of human NEK3 at 10 uM after 60 mins by TR-FRET assay”
Expected output 4 - “10 uM”

Some inputs do not contain a concentration, so I am OK with a missing value for those.

Thank you,
Coral

Hello @Coral_OBrien,

Thanks for the examples. Is it safe to say that the logic is to extract one or more digits followed by uM or nM, with an optional space between the two?

If so, then this regular expression in the Regex Split node could do the trick:

.*(\d+\s?[u|n]M).*

Br,
Ivan

Hi @ipazin,

It worked for the most part. However, I think it’s only capturing the one digit directly before the uM or nM. For example, 10 uM came out as 0 uM, and 0.3 uM came out as 3 uM. How do I adjust the regular expression so that it extracts the correct number? Sorry, I’ve never seen these regular expressions before, and it’s like a foreign language and magic to me :smiley:

Thank you,
Coral

You could try adjusting @ipazin's solution like this:

.*?(\d?\.?\d+\s?[u|n]M).*
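
For readers following along outside KNIME, here is a minimal Python sketch of why the first pattern drops digits and the adjusted one does not. It assumes the Regex Split node matches the pattern against the whole string (approximated here with re.fullmatch) and that Python's regex engine behaves the same way for these two patterns. The helper name first_group is made up for illustration.

```python
import re

# Patterns copied from the thread.
GREEDY = r".*(\d+\s?[u|n]M).*"        # first suggestion
LAZY = r".*?(\d?\.?\d+\s?[u|n]M).*"   # adjusted suggestion

# Sample inputs taken from the examples posted above.
SAMPLE_1 = ("Millipore: Percentage of residual kinase activity of NEK3 at 1uM "
            "relative to control. Control inhibitor: Phosphoric acid* at 0.3uM. "
            "Buffer: 8 mM MOPS pH 7.0, 0.2 mM EDTA")
SAMPLE_4 = "Inhibition of human NEK3 at 10 uM after 60 mins by TR-FRET assay"


def first_group(pattern, text):
    """Return capture group 1 if the whole text matches, else None.

    Hypothetical helper; None stands in for a missing value.
    """
    match = re.fullmatch(pattern, text)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(first_group(GREEDY, SAMPLE_4))  # 0 uM  (the leading .* eats the "1")
    print(first_group(LAZY, SAMPLE_4))    # 10 uM
    print(first_group(GREEDY, SAMPLE_1))  # 3uM   (last match, first digits lost)
    print(first_group(LAZY, SAMPLE_1))    # 1uM
```

The leading `.*` is greedy, so it consumes as much text as it can and leaves only the last digit for the group. The lazy `.*?` stops at the first place where the group can match. As a side note, `[u|n]` is a character class that also accepts a literal `|`; `[un]` would be enough.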

Hello @Coral_OBrien,

You are right. I wrote the expression here before testing it and then didn't modify it afterwards :slight_smile: You can try @Daniel_Weikert’s solution, but keep in mind that it will extract only the first occurrence of the pattern. (Here is a good page to learn regular expressions: https://www.regular-expressions.info/)

Br,
Ivan
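
If every concentration in a row is needed rather than only the first, one option is to collect all matches instead of capturing a single group. The sketch below does this in Python with re.findall. The pattern is an assumption and was not posted in the thread: it is a slightly stricter variant that requires digits on both sides of a decimal point, so a value such as 12.5 uM is kept whole (the adjusted pattern above would return 2.5 uM for it). The function name all_concentrations is made up for illustration.

```python
import re

# Assumed pattern, not from the thread: digits, optional decimal part,
# optional single whitespace character, then uM or nM.
CONCENTRATION = re.compile(r"\d+(?:\.\d+)?\s?[un]M")

# Sample input taken from the examples posted above.
SAMPLE_1 = ("Millipore: Percentage of residual kinase activity of NEK3 at 1uM "
            "relative to control. Control inhibitor: Phosphoric acid* at 0.3uM. "
            "Buffer: 8 mM MOPS pH 7.0, 0.2 mM EDTA")


def all_concentrations(text):
    """Return every uM/nM concentration found in text, in order of appearance."""
    return CONCENTRATION.findall(text)


if __name__ == "__main__":
    print(all_concentrations(SAMPLE_1))                        # ['1uM', '0.3uM']
    print(all_concentrations("residual activity at 12.5 uM"))  # ['12.5 uM']
    print(all_concentrations("Inhibition of NEK3"))            # []
```

An empty list corresponds to a row without a concentration. The mM values in the buffer description are skipped on purpose, because only uM and nM were asked for.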

Hi @ipazin,

Thanks for the explanation and resource. I used @Daniel_Weikert 's regular expression and it is working great! Thank you both!

Kind regards,
Coral

Great that we were able to help. Please mark the solution to ensure others will find it in case similar issues occur in the future.
br
