PAINS filter workflow.

unknown_user · August 10, 2011, 2:01pm

Hi Mikhail,

When I used the parallel nodes in the Indigo workflow, KNIME wasn't 'unstable' for me. It just didn't use more than 1 core.

I think the Indigo nodes worked really well. It seems that possibly as few as 21 SMARTS strings need to be edited to match the SLN set.

David Lagorce recently updated the FAF-Drugs2 server to include the PAINS filters, implemented using the OpenBabel libraries. He found that 26 SMARTS strings needed to be hand-edited to obtain the same matches as the SLN strings, but didn't report what the changes were (doi:10.1093/bioinformatics/btr333).

David has in the past made the FAF source code available on his site, so hopefully we will eventually be able to check what changes were made to those strings, and run them through the KNIME workflow. If the edited strings reproduce the outcome of the original SLN filters, that would confirm the mismatches we observed were due to the SLN-to-SMARTS conversion rather than to the toolkits.
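Once edited strings are available, the comparison itself could be sketched roughly as below. This is only an illustration in Python: the function name `diff_filter_matches`, the dictionary layout (filter name mapped to the set of compound IDs it matched), and the toy data are all my assumptions, not anything taken from FAF-Drugs2 or the KNIME workflow.

```python
def diff_filter_matches(reference, candidate):
    """Compare per-filter match sets from two implementations.

    Both arguments map a filter name to the set of compound IDs it matched
    (a hypothetical layout). Only filters that disagree are returned.
    """
    report = {}
    for name in sorted(set(reference) | set(candidate)):
        ref_hits = reference.get(name, set())
        cand_hits = candidate.get(name, set())
        missed = ref_hits - cand_hits  # matched by the reference only
        extra = cand_hits - ref_hits   # matched by the candidate only
        if missed or extra:
            report[name] = {"missed": missed, "extra": extra}
    return report


# Toy data, made up for illustration only.
sln = {"filter_a": {"c1", "c2"}, "filter_b": {"c3"}}
smarts = {"filter_a": {"c1"}, "filter_b": {"c3"}}
mismatches = diff_filter_matches(sln, smarts)
```

Any filter that shows up in the report is a candidate for a hand-edited SMARTS string; an empty report would mean the two sets of filters agree on that test set.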

I am more concerned with the RDKit outcome. It could just be the integration with KNIME. I'm planning on compiling the RDKit source code from scratch and testing the filters outside of KNIME (if anybody has done this already, do tell!).
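For anyone who wants to try the same thing, a standalone test might look roughly like the sketch below. It assumes the RDKit Python wrappers are installed and that the filters are kept as tab-separated lines of name and SMARTS. That layout, the function names `parse_filters` and `run_filters`, and the toy pattern are hypothetical; only `Chem.MolFromSmiles`, `Chem.MolFromSmarts` and `HasSubstructMatch` are actual RDKit calls.

```python
def parse_filters(text):
    """Parse 'name<TAB>SMARTS' lines (an assumed layout) into a list of pairs."""
    filters = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        name, smarts = line.split("\t", 1)
        filters.append((name, smarts))
    return filters


def run_filters(smiles_list, filters):
    """Return ({filter name: set of matched SMILES}, list of rejected SMILES)."""
    # Imported here so parse_filters stays usable without RDKit installed.
    from rdkit import Chem

    patterns = []
    for name, smarts in filters:
        patt = Chem.MolFromSmarts(smarts)
        if patt is None:
            raise ValueError(f"SMARTS could not be parsed: {name}")
        patterns.append((name, patt))

    hits = {name: set() for name, _ in patterns}
    rejected = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:  # RDKit could not parse or sanitize this SMILES
            rejected.append(smi)
            continue
        for name, patt in patterns:
            if mol.HasSubstructMatch(patt):
                hits[name].add(smi)
    return hits, rejected


# Toy filter text, made up for illustration; not one of the PAINS SMARTS.
example = "# name\tSMARTS\ntoy_nitro\t[N+](=O)[O-]\n"
toy_filters = parse_filters(example)
```

Keeping track of the rejected SMILES separately matters here, because a structure that fails to parse would otherwise look like a structure that simply passed every filter.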

And then there were the CDK nodes. A lot of the SMILES strings in the test set were rejected by the Molecule-to-CDK node. It was suggested to me that this was because of the old version of the CDK being used in KNIME 2.2 and 2.3. But the same problem persists with the updated version of CDK in KNIME 2.4. As I couldn't use the complete test set, I didn't proceed with the CDK version (but it would be nice to have that too).

Mikhail, what would be truly helpful would be some additional functions to implement the same pre-processing used in the original publication. Canonicalization (is that a word?), aromatization, and de-salting are there. Neutralizing ions in the de-salting output (CO2- to CO2H, etc.), functional group standardization, and adding explicit H only to N would allow users to exactly match the pre-processing conditions on their own data sets.

Regards,

(the other) Simon

[edited after re-reading Lagorce's paper to better reflect his findings]

P.S. Yes, we noticed the irony (too late) of using open-source software and open data, but publishing in a closed journal behind a paywall.