I need to compute PubChem fingerprints for at least 50 million molecules. On a normal desktop (4-8 GB RAM) with the KNIME CDK nodes, only 30% completed in 24 hours, and the run appears to be stalling.
How can I make it faster or more scalable? I have already used the parallel chunk nodes. Is there another PubChem fingerprinter I can call?
Hi InsilicoConsulting,
You could try streaming execution and, in its settings, reduce the number of rows processed at a time to a reasonable value (depending on your machine).
Here is more information on streaming in KNIME Analytics Platform:
https://www.knime.com/blog/streaming-data-in-knime
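The idea behind streaming (only a bounded number of rows in memory at any time) can also be sketched outside KNIME. The following is a minimal Python sketch, not KNIME's implementation. It assumes SMILES input with one molecule per line; `placeholder_fingerprint` is a hypothetical stand-in (a hash, not a PubChem or CDK fingerprint), and the chunk size is an arbitrary value you would tune to your machine.

```python
import hashlib
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Tuple


def chunked(rows: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield lists of at most `size` rows, so only one chunk is held in memory."""
    if size <= 0:
        raise ValueError("size must be positive")
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def placeholder_fingerprint(smiles: str) -> str:
    """Hypothetical stand-in for a real fingerprinter (NOT a PubChem fingerprint)."""
    return hashlib.sha1(smiles.encode("utf-8")).hexdigest()


def stream_fingerprints(
    rows: Iterable[str],
    fingerprint: Callable[[str], str] = placeholder_fingerprint,
    chunk_size: int = 10_000,
) -> Iterator[Tuple[str, str]]:
    """Lazily yield (smiles, fingerprint) pairs, one chunk at a time."""
    for chunk in chunked(rows, chunk_size):
        for line in chunk:
            smiles = line.strip()
            if smiles:  # skip blank lines
                yield smiles, fingerprint(smiles)


if __name__ == "__main__":
    demo = ["CCO\n", "c1ccccc1\n", "\n", "CC(=O)O\n"]
    for smi, fp in stream_fingerprints(demo, chunk_size=2):
        print(smi, fp[:8])
```

Because `rows` can be any iterable, an open file handle can be passed directly, so a 50-million-line file would never need to be loaded in full.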
Are you using the Fingerprints node from the CDK community extension? On my machine (8 GB allocated to KNIME Analytics Platform), it took 9 minutes to generate circular fingerprints (radius 2) for 2 million compounds.
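For a rough comparison, the figures quoted in this thread can be turned into rates. This is a back-of-the-envelope sketch in Python. It assumes throughput scales linearly and that "30%" refers to 30% of 50 million molecules; the two runs also used different fingerprint types and machines, so the numbers are only indicative.

```python
def rate_per_second(molecules: float, seconds: float) -> float:
    """Average throughput in molecules per second."""
    return molecules / seconds


def hours_for(total: float, rate: float) -> float:
    """Hours needed to process `total` molecules at a constant `rate`."""
    return total / rate / 3600


TOTAL = 50_000_000

# Figures from the question: 30% of the set in 24 hours (assumed 30% of 50 million).
observed_rate = rate_per_second(TOTAL * 30 // 100, 24 * 3600)

# Figures from the reply: 2 million circular fingerprints in 9 minutes.
reported_rate = rate_per_second(2_000_000, 9 * 60)

if __name__ == "__main__":
    print(f"observed: {observed_rate:.0f} mol/s, "
          f"{hours_for(TOTAL, observed_rate):.2f} h for the full set")
    print(f"reported: {reported_rate:.0f} mol/s, "
          f"{hours_for(TOTAL, reported_rate):.2f} h for the full set")
```

Under these assumptions the observed run would need about 80 hours for the full set, while the reported rate would finish in under 4 hours.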
Best,
Daria
The normal CDK PubChem fingerprint works, as do looping and parallel looping, although it is nowhere near as fast as in your experience. I am using SMILES input.
I tried streaming after wrapping the nodes in a metanode. On both Linux and Windows (KNIME version 36), it threw the following error:
Execution failed: Incorrect implementation; the execute method in FingerprintNodeModel returned a null data table at port: 0
ERROR Fingerprints 3:1:11:0:2 (IllegalStateException): Invalid result. Execution failed, reason: data at output 0 is null.
ERROR Cardinality 3:1:11:0:4 (DataContainerException): Adding rows to table was interrupted
Another error:
:11:0:2 Execution failed: Incorrect implementation; the execute method in FingerprintNodeModel returned a null data table at port: 0
ERROR Fingerprints 3:1:11:0:2 (IllegalStateException): Invalid result. Execution failed, reason: data at output 0 is null.
ERROR Cardinality 3:1:11:0:4 (RuntimeException): java.lang.InterruptedException
ERROR WrappedNode Output 3:1:11:0:6 (DataContainerException): Adding rows to buffer was interrupted
Hi InsilicoConsulting,
CDK nodes do not support streaming.
If you are not tied to a particular fingerprint type, you could compute Morgan fingerprints with either the CDK or the RDKit nodes. The RDKit nodes support streaming.
Hope it helps.
Daria
Thanks Daria! I was primarily using the PubChem fingerprint because it gave good results compared to the Indigo or RDKit ones, but I will revisit them.