regex in String Manipulation Node KNIME

Hi All,

Hope you can help.

I have a large data set with a column of chemical formulas. Some of them have square brackets and charges (+ and -) that I want to remove. For example (totally made-up formulas), it may look something like this:

[C6H6O27P7]13-
C6H19O27P7
[C6H6O23P7]2-
C36H38N2O6
C20H18O7
[C5H8N3O]+
[C5H8N3O2]2+
[C5H4N4O2]6+
C22H26O8
[C22H27O8]+

I cannot for the life of me figure out how to remove the brackets and charges in these locations. Would really welcome some help/advice!

The data set is large (a few hundred thousand compounds) so manual removal is not an option!

Many thanks!

Hi @rinawm

You could use the String Manipulation node with the configuration:

removeChars($formula$, "[]+-")

where formula is the name of the column containing the formulas from which you want to remove the brackets and charge signs.
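For anyone who wants to check the effect outside KNIME, here is a minimal Python sketch of the same idea. It assumes `removeChars` deletes every occurrence of each listed character; the helper name `remove_chars` and the sample list are made up for illustration. Note that only the brackets and signs go away, so any charge magnitude (the `13` in `[C6H6O27P7]13-`) stays attached to the end of the formula.

```python
# Characters to delete, mirroring the second argument of removeChars above.
CHARS_TO_REMOVE = "[]+-"


def remove_chars(text: str, chars: str = CHARS_TO_REMOVE) -> str:
    """Delete every occurrence of each character in `chars` from `text`.

    Hypothetical stand-in for the KNIME expression, for illustration only.
    """
    return text.translate(str.maketrans("", "", chars))


# A few of the made-up formulas from the question.
samples = ["[C6H6O27P7]13-", "C6H19O27P7", "[C5H8N3O]+", "[C5H8N3O2]2+"]
cleaned = [remove_chars(s) for s in samples]

for before, after in zip(samples, cleaned):
    print(f"{before} -> {after}")
```

If the trailing charge digits also need to go, a plain character removal like this is probably not enough on its own.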

Hope this helps!


Are you sure that you want to keep the numbers outside the square brackets?
How will you distinguish which numbers were inside the brackets and which were outside?

If you just want the content in the square brackets, then the Regex Extractor node (part of the Palladian node collection) will work if you use the expression:
[A-Z]\w+

These are the results I get:
regex
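As a rough way to see what the expression `[A-Z]\w+` picks up, here is a small Python sketch run on some of the made-up formulas from the question. It is not the Regex Extractor node itself: it assumes the first match in each string is the one you want, and the names `FORMULA` and `extract_formula` are made up for illustration.

```python
import re

# Same expression as above: a capital letter followed by one or more
# word characters (letters, digits, underscore). The match stops at "]".
FORMULA = re.compile(r"[A-Z]\w+")


def extract_formula(text):
    """Return the first match of the expression, or None if there is none."""
    match = FORMULA.search(text)
    return match.group(0) if match else None


samples = [
    "[C6H6O27P7]13-",
    "C6H19O27P7",
    "[C5H8N3O]+",
    "[C5H4N4O2]6+",
    "C22H26O8",
]
extracted = [extract_formula(s) for s in samples]

for before, after in zip(samples, extracted):
    print(f"{before} -> {after}")
```

Because the match ends at the closing bracket, the charge digits and sign outside the brackets are dropped, while formulas without brackets come through unchanged.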


Are you sure you don’t want to do a proper chemical operation on these structure representations - perhaps using RDKit or the Infocom wrapped ChemAxon libraries?

Yes, I don’t need the charges. The key piece is the mol structure (in another column). Normally I’d keep them, but there’s a bit of software that gets upset with the charges in the formulae!
