I have a large data set of molecules (about 500,000). I convert the structures into fingerprints (ECFP4) using the RDKit fingerprint node or the CDK fingerprint node. Then I want to calculate the Tversky similarity, all against all, using the Indigo fingerprint similarity node.
The problem is the calculation time. I have tried many workflow changes to speed it up, but none of them made a real difference.
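For a sense of scale, here is a rough back-of-the-envelope estimate in Python. The function name and the rate of one million comparisons per second are assumptions made up for illustration, not measured figures for the Indigo node, and the 8 bytes per value assumes every similarity is stored as a double.

```python
def estimate_all_vs_all(n_molecules: int, comparisons_per_second: float):
    """Rough cost of an all-against-all similarity run (hypothetical helper)."""
    # Tversky is not symmetric when its two weights differ,
    # so (A, B) and (B, A) are counted as separate comparisons.
    ordered_pairs = n_molecules * (n_molecules - 1)
    # Wall-clock time at the ASSUMED comparison rate.
    hours = ordered_pairs / comparisons_per_second / 3600
    # Size of a full n x n matrix of 8-byte doubles, in terabytes (10**12 bytes).
    matrix_terabytes = n_molecules ** 2 * 8 / 1e12
    return ordered_pairs, hours, matrix_terabytes


pairs, hours, terabytes = estimate_all_vs_all(500_000, 1_000_000)
print(f"{pairs:,} comparisons, ~{hours:.0f} h, ~{terabytes:.0f} TB for the full matrix")
```

So for 500,000 molecules there are roughly 2.5 × 10^11 ordered pairs, and a full result matrix of doubles would need about 2 TB, far more than a 6 GB heap can hold.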
What I have tried:
- Connected the output of the fingerprint node to both inputs of the similarity node
- Increased the KNIME heap size to 6 GB
- Built a loop that takes one row of the set as the reference for the similarity node and compares the whole set to it
- A few other small changes that I can't list in full and that clearly did not work
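To make the loop idea concrete outside of KNIME, here is a minimal plain-Python sketch of what I am trying to compute. It is only an illustration under assumptions: fingerprints are represented as Python integers used as bit sets, the names `tversky` and `all_vs_all` are made up, and the `threshold` filter is my own addition so that the full matrix never has to be kept in memory. It is not how the Indigo node works internally.

```python
from typing import Iterator, List, Tuple


def tversky(a: int, b: int, alpha: float, beta: float) -> float:
    """Tversky similarity of two fingerprints stored as integer bit sets."""
    common = bin(a & b).count("1")   # bits set in both fingerprints
    only_a = bin(a & ~b).count("1")  # bits set only in the reference
    only_b = bin(b & ~a).count("1")  # bits set only in the compared molecule
    denominator = common + alpha * only_a + beta * only_b
    # Assumed convention: two empty fingerprints get similarity 0.
    return common / denominator if denominator else 0.0


def all_vs_all(
    fingerprints: List[int], alpha: float, beta: float, threshold: float
) -> Iterator[Tuple[int, int, float]]:
    """Yield (reference index, other index, similarity) for pairs at or above threshold."""
    for i, reference in enumerate(fingerprints):
        for j, other in enumerate(fingerprints):
            if i == j:
                continue  # skip comparing a molecule with itself
            similarity = tversky(reference, other, alpha, beta)
            if similarity >= threshold:
                yield i, j, similarity


# Tiny made-up example: three 4-bit "fingerprints".
hits = list(all_vs_all([0b1110, 0b0111, 0b0001], alpha=1.0, beta=1.0, threshold=0.5))
```

With alpha = beta = 1 this reduces to the Tanimoto coefficient. The nested loop still does every comparison, so it only shows the memory side of the problem, not a way to make the run itself faster.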
Does anyone have an idea? Or is it simply impossible to do this in a reasonable time with this node, or with KNIME in general?
Thank you in advance for any answers.