The RDKit Diversity Picker node uses the MaxMin method to select a diversity set. This method is computationally intensive both when the source dataset is large and when the subset to pick is large. I want to build a diversity set starting from a large dataset (let's say 1M compounds), reducing its size by 2 orders of magnitude (obtaining a diversity set of 100K compounds). When I tried to do this with the RDKit Diversity Picker, I ended up filling the RAM and swap of my system and couldn't complete the task.
Can anybody suggest what sizes of source dataset and of diversity subset this node can reasonably handle?
If it uses the well-known MaxMin method, then the "reasonable size" should be the same as for the method in general, and not dependent on this specific node?
You say your RAM gets filled up - how much RAM does your machine have, and how much RAM did you assign to KNIME?
A quick Google search for RDKit (in general, not KNIME-specific) turns up discussions of memory usage and some potentially necessary workarounds involving pre-computation.
Thank you for your answer. You're right, it is not useful to tell you I filled up my RAM without specifying its size. :-)
Currently my system has 32 GB of RAM (of which 25 GB are assigned to KNIME). The experiment I mentioned filled it up entirely and also took 8 GB of swap before I killed it.
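For a rough sense of scale, here is a back-of-the-envelope calculation. It assumes (this is not confirmed for the KNIME node) that a picker keeps every pairwise distance in memory as an 8-byte float; the function name is just illustrative.

```python
def condensed_matrix_bytes(n_items: int, bytes_per_value: int = 8) -> int:
    """Bytes needed to store all n*(n-1)/2 pairwise distances.

    Assumption: one value per unordered pair (a "condensed" matrix)
    and no other overhead. Real implementations may differ.
    """
    n_pairs = n_items * (n_items - 1) // 2
    return n_pairs * bytes_per_value


# Print the estimate for a few dataset sizes, in GiB (2**30 bytes).
for n in (10_000, 100_000, 1_000_000):
    gib = condensed_matrix_bytes(n) / 2**30
    print(f"{n:>9,} compounds -> {gib:,.1f} GiB")
```

Under that assumption, 100,000 compounds already need about 37 GiB, more than the 32 GB in this machine, and 1M compounds need several TiB. So if the distances are cached in full, running out of RAM at this scale is expected.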
Thanks for your suggestion. Indeed, googling around I found some general RDKit material, not KNIME-specific, on diversity selection and memory usage (e.g. http://rdkit.blogspot.com.es/2014/08/optimizing-diversity-picking-in-rdkit.html, or http://rdkit-discuss.narkive.com/gZOi1lGi/maxmin-picker-and-python).
Anyway, my main purpose here was not to implement a new procedure using RDKit classes, but only to ask RDKit users for suggestions about the dataset and diverse-subset sizes that are manageable with the KNIME RDKit Diversity Picker node.
I need to look into it, but I believe the KNIME MaxMinPicker implementation could, theoretically, use the same LazyBitVectorPicker that is discussed in that blog post in order to allow working with very large datasets.
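To illustrate why a lazy picker helps with memory, here is a minimal pure-Python sketch of MaxMin selection that computes distances on the fly and keeps only one value per compound. This is not the RDKit or KNIME implementation: the fingerprints are plain sets of on-bit indices, and the function names and the `first` parameter are assumptions made for the example.

```python
from typing import Callable, List, Sequence, Set


def tanimoto_distance(a: Set[int], b: Set[int]) -> float:
    """1 - Tanimoto similarity for fingerprints stored as sets of on bits."""
    if not a and not b:
        return 0.0
    common = len(a & b)
    return 1.0 - common / (len(a) + len(b) - common)


def maxmin_pick(
    fps: Sequence[Set[int]],
    n_pick: int,
    first: int = 0,  # hypothetical: index of the starting compound
    dist: Callable[[Set[int], Set[int]], float] = tanimoto_distance,
) -> List[int]:
    """Pick n_pick indices so each new pick is farthest from those picked."""
    n = len(fps)
    if not 0 < n_pick <= n:
        raise ValueError("n_pick must be between 1 and len(fps)")
    picked = [first]
    # min_dist[i]: distance from item i to its nearest already-picked item.
    # This O(N) list is the only state kept; no distance matrix is stored.
    min_dist = [dist(fps[i], fps[first]) for i in range(n)]
    min_dist[first] = -1.0  # negative marks "already picked"
    while len(picked) < n_pick:
        # The next pick is the item whose nearest picked neighbour is farthest.
        nxt = max(range(n), key=min_dist.__getitem__)
        picked.append(nxt)
        min_dist[nxt] = -1.0
        # Update the remaining items against the newly picked one only.
        for i in range(n):
            if min_dist[i] >= 0.0:
                d = dist(fps[i], fps[nxt])
                if d < min_dist[i]:
                    min_dist[i] = d
    return picked


example = [{1, 2, 3}, {1, 2, 3, 4}, {7, 8, 9}, {1, 2}]
print(maxmin_pick(example, 3))
```

Memory use here grows linearly with the dataset size, but the number of distance evaluations is roughly the dataset size times the number of picks. For 1M compounds and 100K picks that is on the order of 10^11 evaluations, so even a lazy picker can be expected to take a long time at this scale.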