Workflow for reading SMILES and cleaning compounds for molecule descriptor calculation


I am looking for advice on a suitable workflow for reading in compounds in SMILES format and preparing them for descriptor calculation. I usually use MOE for this, but I am now exploring the free nodes in KNIME to carry out this procedure.

So my questions are:

What are the best and most suitable nodes for reading in SMILES and then cleaning them ready for 2D and 3D descriptor calculation? I would need to de-salt the compounds and optimise the geometry, and I am unsure whether to include explicit hydrogens, etc. Nodes that error-check the compounds as well would be helpful. I have looked into RDKit and CDK, but I would like to see what others are doing.

What molecular descriptor nodes would people recommend? I have looked at the PaDEL, RDKit, CDK and Indigo descriptors. Again, I would like free nodes. However, I found that some of the calculated descriptors, for example the number of rotatable bonds, were incorrect, so I think this is because my pipeline for reading and cleaning my compounds is not correct.

Any help will be much appreciated,

Danielle Newby

I can second the PaDEL usage; there is every descriptor you can imagine in it.

However, take care not to install the CDK and PaDEL nodes together. There are long-standing conflicts that make it nearly impossible to reliably save and load workflows if both PaDEL and CDK are installed.



To start with, you can read the SMILES as a regular tab-separated text file with an ID column followed by the SMILES column. Use the normal File Reader node to do this: tick the 'read column headers' box, right-click on the SMILES column header, and change 'type' to Smiles. Now it will recognize them as structures.
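If you want to sanity-check the same input outside KNIME, here is a rough Python sketch of the reading step using the RDKit toolkit directly. It is not what the File Reader node does internally; the column names `ID` and `SMILES`, the sample rows, and the helper `read_smiles_table` are all made up for illustration. It also covers the error-checking part of the question, since RDKit returns `None` for a SMILES it cannot parse.

```python
import csv
import io

from rdkit import Chem

# Hypothetical stand-in for a tab-separated file with a header row.
# The third row has an unclosed ring on purpose, to show the error check.
SAMPLE = "ID\tSMILES\ncmpd1\tCCO\ncmpd2\tc1ccccc1C(=O)O\ncmpd3\tC1CC\n"


def read_smiles_table(handle):
    """Split rows into parsed molecules and rows that RDKit rejected.

    Returns (parsed, failed): parsed is a list of (id, mol) tuples,
    failed is a list of (id, smiles) tuples.
    """
    parsed, failed = [], []
    for row in csv.DictReader(handle, delimiter="\t"):
        # MolFromSmiles returns None when parsing or sanitization fails.
        mol = Chem.MolFromSmiles(row["SMILES"])
        if mol is None:
            failed.append((row["ID"], row["SMILES"]))
        else:
            parsed.append((row["ID"], mol))
    return parsed, failed


parsed, failed = read_smiles_table(io.StringIO(SAMPLE))
print(len(parsed), "parsed; failed rows:", failed)
```

With a real file you would pass `open("compounds.tsv", newline="")` (a placeholder file name) instead of the `StringIO` object.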

I tend to do most cheminformatics stuff with the RDKit nodes. In this case I would go through:

File Reader -> RDKit From Molecule -> RDKit Salt Stripper -> RDKit Add Hs -> RDKit Generate Coords -> RDKit Optimize Geometry -> RDKit Descriptor Calculation

In each step I remove the structure source column. To my knowledge, none of the RDKit descriptors are 3D-dependent, so it may be better to have a look at the PaDEL node, which has a number of 3D descriptors. In that case you should use the 'Molecule to CDK' node that comes with the PaDEL nodes, and put it after the RDKit Optimize Geometry node, followed by the PaDEL-Descriptor node.
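For comparison, the RDKit chain above (salt stripping, adding hydrogens, generating coordinates, optimizing, then descriptors) can be sketched with the RDKit Python API. This is only a sketch under some assumptions: the helper name `prepare_and_describe`, the choice of the UFF force field, the fixed random seed, and the three descriptors are mine, and the KNIME nodes may use different defaults.

```python
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors, rdMolDescriptors
from rdkit.Chem.SaltRemover import SaltRemover


def prepare_and_describe(smiles, seed=42):
    """Strip salts, build an optimized 3D structure, compute a few 2D descriptors.

    Returns (mol_3d, descriptors). Hypothetical helper for illustration only.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")

    # Remove counter-ions using RDKit's default salt definitions.
    mol = SaltRemover().StripMol(mol)

    # 2D descriptors are computed on the hydrogen-suppressed molecule.
    descriptors = {
        "MolWt": Descriptors.MolWt(mol),
        "NumRotatableBonds": rdMolDescriptors.CalcNumRotatableBonds(mol),
        "TPSA": rdMolDescriptors.CalcTPSA(mol),
    }

    # Explicit hydrogens are needed for a sensible 3D geometry.
    mol_3d = Chem.AddHs(mol)
    if AllChem.EmbedMolecule(mol_3d, randomSeed=seed) == -1:
        raise RuntimeError(f"3D embedding failed for {smiles!r}")

    # UFF is an assumption here; MMFF94 is another option in RDKit.
    AllChem.UFFOptimizeMolecule(mol_3d, maxIters=500)
    return mol_3d, descriptors


# Butylamine hydrochloride as a small example with a counter-ion.
mol_3d, desc = prepare_and_describe("CCCCN.Cl")
print(mol_3d.GetNumAtoms(), "atoms including hydrogens")
print(desc)
```

Note that the sketch calculates the 2D descriptors before adding hydrogens. Descriptors that count atoms or neighbours can change once hydrogens are explicit atoms in the graph, so it is worth checking which form of the molecule each descriptor node receives. Whether that explains the rotatable bond discrepancy in the question is only a guess.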

There are many variants of this, and many parameters to play with under the different nodes, so other people probably have other approaches. It's best to fiddle around a bit yourself to find out what works best. It's good fun!