PAINS filter workflow.


We have made available a KNIME workflow on the MyExperiment.org site implementing the WEHI pan-assay interference compounds (PAINS) filters using the Indigo substructure nodes.

http://www.myexperiment.org/workflows/2164.html

The workflow contains the PAINS filters in SMARTS format and a reference set of 10k SMILES strings from WEHI, which is used if you provide no other input. These can be used to compare results between chemistry packages, or between versions of the PAINS filters.
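For anyone who wants a quick sanity check outside KNIME, this is roughly the kind of cross-package comparison we mean, using the Python layers of RDKit and Indigo on a single pattern/molecule pair. The pattern shown is an illustrative rhodanine-like core, not one of the actual WEHI SMARTS, and the Indigo calls are written from memory of its Python bindings, so treat the whole thing as a sketch:

# Compare one SMARTS filter against one SMILES string in RDKit and Indigo.
# The pattern below is an illustrative rhodanine-like core, NOT one of the
# actual WEHI PAINS SMARTS.
from rdkit import Chem
from indigo import Indigo

smarts = "S1C(=S)NC(=O)C1"   # illustrative placeholder pattern
smiles = "O=C1CSC(=S)N1"     # rhodanine itself, as a test molecule

# RDKit
rd_mol = Chem.MolFromSmiles(smiles)
rd_query = Chem.MolFromSmarts(smarts)
rd_hit = rd_mol.HasSubstructMatch(rd_query)

# Indigo (API calls assumed from its Python bindings)
indigo = Indigo()
in_mol = indigo.loadMolecule(smiles)
in_mol.aromatize()                       # match against the aromatized form
in_query = indigo.loadSmarts(smarts)
in_hit = indigo.substructureMatcher(in_mol).match(in_query) is not None

print("RDKit match:", rd_hit, "| Indigo match:", in_hit)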

We had hoped that our forthcoming paper describing this workflow would be available on the journal's web site by now (it was accepted a while ago), but in anticipation of the KNIME workshop later this week, we're drawing the community's attention to the workflow.

The paper compares the original Sybyl/SLN system to RDKit/SMARTS and Indigo/SMARTS using KNIME.

It also re-emphasizes the need to pre-process your structures as described in the original PAINS publication before using the filters. The built-in reference set has been pre-processed according to that paper.

(the other) Simon

Hi Simon,

First of all I wanted to say "good job" on running these studies and sharing the results/workflows/PAINS filters!  Also, sorry it's taken so long to look at the workflows and give some feedback - certainly not due to a lack of interest; just a lack of time!  : )

I would be very keen to read the associated paper, particularly to see your assessment of the differences between the RDKit and Indigo matching - which journal has this been submitted to?

When I first ran the workflow, I thought that one potential downside is the (lack of) speed when processing relatively large input files.  It then occurred to me that one could maybe parallelise the loop if the input file were chopped up...

I initially tried this with the Indigo workflow, but there seems to be a memory issue related to the Indigo nodes that causes KNIME to become unstable (at least in my hands) when attempting the 'parallelisation'.  Also, when the parallel substructure matching was running, I could not see much evidence of more cores being employed.

In contrast, the RDKit searching parallelises quite well.  I have attached a modified form of your workflow, and will explain a little below:

 

All I did was split the 10000 input rows into 5 sets (using Java Snippet Row Filter nodes), then passed each set into its own Substructure Filter node and combined the output with Concatenate nodes.  The parallel Substructure Filters are all running inside the one loop (i.e. all 5 chunks have to process the same PAINS filter before moving onto the next one).  I have not checked to see if there is a difference running 5 individual loops...
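For what it's worth, the same chunk-and-concatenate idea can be sketched outside KNIME in plain Python with RDKit and a process pool. The file names and chunk count below are placeholders, so this only illustrates the pattern rather than reproducing the workflow:

# Chunk the input SMILES, filter each chunk in a separate process, then
# concatenate the results -- the same idea as the parallel Substructure
# Filter branches in the KNIME workflow. File names are placeholders.
from concurrent.futures import ProcessPoolExecutor
from rdkit import Chem

def flag_chunk(args):
    smarts_list, smiles_chunk = args
    # parse the SMARTS inside each worker to avoid shipping Mol objects
    queries = [Chem.MolFromSmarts(s) for s in smarts_list]
    flagged = []
    for smi in smiles_chunk:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None and any(q is not None and mol.HasSubstructMatch(q)
                                   for q in queries):
            flagged.append(smi)
    return flagged

if __name__ == "__main__":
    with open("pains_smarts.txt") as fh:           # placeholder file name
        smarts_list = [l.strip() for l in fh if l.strip()]
    with open("reference_10k.smi") as fh:          # placeholder file name
        smiles_list = [l.strip() for l in fh if l.strip()]
    n_chunks = 5
    chunks = [smiles_list[i::n_chunks] for i in range(n_chunks)]
    with ProcessPoolExecutor(max_workers=n_chunks) as pool:
        results = pool.map(flag_chunk, [(smarts_list, c) for c in chunks])
    flagged = [smi for chunk in results for smi in chunk]   # the 'Concatenate' step
    print(len(flagged), "compounds flagged")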

I am running 32-bit KNIME on 64-bit Windows 7, with an i7 Q740 CPU (8 logical cores @ 1.73 GHz).  When running the single RDKit loop, the workflow takes 11 min 30 s and I observe ~15-20% CPU usage by KNIME.  When running the 'parallel' version, this drops to 4 min 50 s, and I see 60-70% CPU usage (with no increase in RAM).

 

Anyway, I doubt this competes with 'Pervasive DataRush' (!), but I thought the observation was worth sharing - particularly for anyone with more cores (and compounds!) than me.

 

Kind regards

James

James, thank you for your kind review.

The paper has just been published online overnight:

doi:10.1002/minf.201100076

The parallelizing loop nodes appeared in KNIME well after our submission to the journal, but when they were 'released' I tried them with both the Indigo and RDKit versions of the workflow and can confirm your observations: I didn't see more than one core being used.

I won't be changing the two workflows on myExperiment.org (other than bug fixes), as they are referred to in the paper. But the workflow (and contained data) can be branched on myExperiment.org to add in extra functionality - such as parallelization.

The myExperiment.org web site also lets us track downloads and derivative workflows, plus explicitly state that the workflow and data can be used commercially, re-packaged, and derivatized (CC-BY).

Hopefully, with additional updates to the parallelizing nodes, these workflows can be made more efficient.

Regards,
(the other) Simon

 

Hi Simon,

Thank you for implementing the PAINS filter workflow with Indigo. It is very interesting to see a comparison between RDKit and Indigo. We will use your workflow to carefully check the results and possible speed improvements.

James,

Currently the Indigo nodes do not run in parallel, and we are working to add support for that. But they should still work even if they are run in parallel in KNIME, because they execute sequential code internally; that is why you didn't observe more cores being used. Could you explain how to reproduce the KNIME instability with the Indigo nodes? This information would be very helpful.

With best regards,
Mikhail Rybalkin
GGA Software Services LLC

Hi Mikhail,

When I used the parallel nodes in the Indigo workflow, KNIME wasn't 'unstable' for me. It just didn't use more than 1 core.

I think the Indigo nodes worked really well. It seems that possibly as few as 21 SMARTS strings need to be edited to match the SLN set.

David Lagorce recently updated the FAF-Drugs2 server to include the PAINS filters implemented using the OpenBabel libraries. He found that 26 SMARTS strings needed to be hand-edited to obtain the same matches as the SLN strings, but didn't report what the changes were. doi:10.1093/bioinformatics/btr333

David has in the past made the FAF source code available on his site, so hopefully we will eventually be able to check what changes were made to those strings and run them through the KNIME workflow. If those edits let us match the outcome of the original SLN filters, it would confirm that the mismatches we observed were due to the SLN-to-SMARTS conversion.

I am more concerned with the RDKit outcome. It could just be the integration with KNIME. I'm planning on compiling the RDKit source code from scratch and testing the filters outside of KNIME (if anybody has done this already, do tell!).
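In case it helps anyone trying the same thing, the bare-bones version of that test with the RDKit Python layer would look something like the sketch below (the file names are placeholders). It just counts, per SMARTS, how many of the reference SMILES are hit, which is the number to compare against the SLN results:

# Count, for each PAINS SMARTS, how many reference SMILES it matches,
# entirely outside KNIME. File names are placeholders.
from rdkit import Chem

with open("pains_smarts.txt") as fh:            # placeholder file name
    smarts_strings = [l.strip() for l in fh if l.strip()]

mols = []
with open("reference_10k.smi") as fh:           # placeholder file name
    for line in fh:
        mol = Chem.MolFromSmiles(line.strip())
        if mol is not None:
            mols.append(mol)

for smarts in smarts_strings:
    query = Chem.MolFromSmarts(smarts)
    if query is None:
        print("could not parse:", smarts)
        continue
    hits = sum(1 for mol in mols if mol.HasSubstructMatch(query))
    print(hits, smarts)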

And then there were the CDK nodes. A lot of the SMILES strings in the test set were rejected by the Molecule-to-CDK node. It was suggested to me that this was because of the old version of the CDK being used in KNIME 2.2 and 2.3. But the same problem persists with the updated version of CDK in KNIME 2.4. As I couldn't use the complete test set, I didn't proceed with the CDK version (but it would be nice to have that too).

Mikhail, what would be truly helpful would be some additional functions to implement the same pre-processing used in the original publication. Canonicalization (is that a word?), aromatization, and de-salting are there. Neutralizing ions in the de-salted output (CO2- to CO2H, etc.), functional group standardization, and adding explicit hydrogens to nitrogen only would allow users to exactly match the pre-processing conditions on their own data sets.
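To make the neutralization step concrete, here is a rough RDKit sketch that handles only the carboxylate case (CO2- to CO2H). The pre-processing in the original publication covers more functional groups than this, so treat it as an illustration only:

# Neutralize carboxylate anions (CO2- -> CO2H) as one example of the
# pre-processing step; the published protocol covers more groups than this.
from rdkit import Chem

def neutralize_carboxylates(mol):
    patt = Chem.MolFromSmarts("[O-;$([O-]C=O)]")   # carboxylate oxygen
    for (idx,) in mol.GetSubstructMatches(patt):
        atom = mol.GetAtomWithIdx(idx)
        atom.SetFormalCharge(0)
        atom.SetNumExplicitHs(1)
    Chem.SanitizeMol(mol)
    return mol

mol = neutralize_carboxylates(Chem.MolFromSmiles("CC(=O)[O-]"))
print(Chem.MolToSmiles(mol))   # expect the neutral acid: CC(=O)O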

Regards,

(the other) Simon

 

[edited after re-reading Lagorce's paper to better reflect his findings]

 

P.S. Yes, we noticed the irony (too late) of using open-source software and open data, but publishing in a closed journal behind a paywall.


Hi Mikhail,

I have attached the 'parallelised' version of Simon's workflow where I was observing errors.  On my system the workflow eventually halts (on the last run this occurred when Row59 was passed by the Chunked Loop Start), with the following error in the Console:

ERROR Substructure Matcher Execute failed: array: reserve(): no memory
ERROR Substructure Matcher Execute failed: array: reserve(): no memory
ERROR Substructure Matcher Execute failed: array: reserve(): no memory
ERROR Substructure Matcher Execute failed: array: reserve(): no memory
ERROR Substructure Matcher Execute failed: array: reserve(): no memory

While the workflow is running, I also see other errors if I happen to double-click on the Substructure Matcher nodes:

ERROR NodeContainerEditPart The dialog pane for node 'Substructure Matcher 0:2:42:33' has thrown a 'OutOfMemoryError'. That is most likely an implementation error.

And other areas of KNIME give signs of being out of memory (e.g. some connecting lines disappearing, etc.).

With exactly the same workflow I have also observed KNIME crashing with a Microsoft Visual C++ Runtime Error...

I am running on Windows 7 with 32-bit KNIME 2.4.1 and the latest version of the Indigo nodes, with -Xmx1024m in knime.ini.

 

Kind regards

James

James,

I've run your workflow without encountering errors:

KNIME 2.4.1
Indigo 1.0.0.965
Xmx1500m

Tested on Mac OS 10.6.8, SuSE-64 11.2, and Win XP-64 SP2 (using KNIME-32).

You can try my version of the parallel workflow using the new KNIME Labs parallel nodes (attached). Of course, only 1 core seems to be used at the moment.

(the other) Simon

Hi Simon,

Thanks - I will post back in a little while and update on how I get on with your version of the workflow.

In case the issues I see are related to the Xmx setting, I tried setting it to 1500m, but I couldn't get KNIME to start with that value, so I lowered it to 1400m. I tried the same workflow and it errored straight away!

I have attached the (partial) logfile in case it gives Mikhail any hints on the issue (of course it could be entirely related to my PC, and not the nodes at all!)

 

Kind regards

James

Hi again, Simon.

Well, I just tried your workflow - first of all it was good to see the new parallel nodes in context; I hadn't really appreciated how to use them before!

Now the bad news (for me at least): your workflow gives my PC new and robust ways of crashing KNIME! (Again, it looks like it's running out of memory.)

The errors in the logfile look the same as before - other than the following (which I hope may be of use to Mikhail):

2011-08-11 08:50:40,996 DEBUG main WorkflowRootEditPart : part: SubworkflowEditPart( (WFM) Parallel Chunks 0:2:42 (MARKEDFOREXEC) )
2011-08-11 08:50:41,012 DEBUG main NodeContainerEditPart : File Reader 0:2:24 (EXECUTED)
2011-08-11 08:50:41,246 ERROR KNIME Sync Exec Dispatcher-1 SyncExecQueueDispatcher : Uncaught exception while queuing events into main thread
2011-08-11 08:50:41,308 DEBUG KNIME Sync Exec Dispatcher-1 SyncExecQueueDispatcher : Uncaught exception while queuing events into main thread
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.tryTerminate(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.workerDone(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
2011-08-11 08:50:43,336 DEBUG KNIME-Worker-1 Molecule to Indigo : reset
2011-08-11 08:50:43,336 DEBUG KNIME-Worker-1 Molecule to Indigo : clean output ports.
2011-08-11 08:50:43,336 ERROR KNIME-Worker-1 Molecule to Indigo : Execute failed: Cannot allocate 51 bytes
2011-08-11 08:50:43,336 DEBUG KNIME-Worker-1 Molecule to Indigo : Execute failed: Cannot allocate 51 bytes
java.lang.OutOfMemoryError: Cannot allocate 51 bytes
at com.sun.jna.Memory.<init>(Memory.java:80)
at com.sun.jna.NativeString.<init>(NativeString.java:62)
at com.sun.jna.Function.convertArgument(Function.java:498)
at com.sun.jna.Function.invoke(Function.java:258)
at com.sun.jna.Library$Handler.invoke(Library.java:216)
at $Proxy0.indigoLoadMoleculeFromString(Unknown Source)
at com.ggasoftware.indigo.Indigo.loadMolecule(Indigo.java:154)
at com.ggasoftware.indigo.knime.convert.molloader.IndigoMoleculeLoaderNodeModel.execute(IndigoMoleculeLoaderNodeModel.java:169)
at org.knime.core.node.NodeModel.execute(NodeModel.java:668)
at org.knime.core.node.NodeModel.executeModel(NodeModel.java:524)
at org.knime.core.node.Node.execute(Node.java:873)
at org.knime.core.node.workflow.SingleNodeContainer.performExecuteNode(SingleNodeContainer.java:840)
at org.knime.core.node.exec.LocalNodeExecutionJob.mainExecute(LocalNodeExecutionJob.java:100)
at org.knime.core.node.workflow.NodeExecutionJob.run(NodeExecutionJob.java:166)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at org.knime.core.util.ThreadPool$MyFuture.run(ThreadPool.java:124)
at org.knime.core.util.ThreadPool$Worker.run(ThreadPool.java:239)

 

Kind regards

James

PS: For completeness I checked, and I too am on the .965 version of the Indigo nodes.

The Indigo nodes have been updated for KNIME 3.0, so I have updated the workflows over on myexperiment.org to the new versions. http://www.myexperiment.org/workflows/2164.html

The Indigo 1.1.1300.201511201230 nodes now filter 825 of 861 PAINS in the reference set.

Although new functionality would allow for simpler workflows, I have left the two original workflows the same, only updating deprecated nodes to their KNIME 3.0 equivalents.

It also looks like the Indigo libraries are now parallelizable.

(the other) Simon