Performing Distance Matrix calculations on large numbers of fingerprints

I'm trying to cluster a set of over 100,000 data entries based on their fingerprint bitstrings.

Using the Distance Matrix Calculate node, I compute Tanimoto distances and then use those for clustering, but even on well-specced i7 machines I run out of memory and disk space before the calculations complete.

 

Is there some alternative approach I should be using for handling such a dataset?

You are building a triangular matrix of 100,000 x 100,000, which comes to around 5,000,000,000 pairwise distances.
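To put a number on that, here is a rough back-of-the-envelope calculation. It assumes each distance is stored as an 8-byte double with no overhead; that storage size is my assumption, not something I know about how the node stores its output.

```python
# Number of entries in the question.
n = 100_000

# A triangular distance matrix holds one value per unordered pair.
pairs = n * (n - 1) // 2

# Assumption: 8 bytes per distance (a double), no per-row overhead.
bytes_per_distance = 8
gigabytes = pairs * bytes_per_distance / 1e9

print(f"{pairs:,} pairwise distances")  # 4,999,950,000 pairwise distances
print(f"{gigabytes:.1f} GB")            # 40.0 GB
```

Any real table format will add overhead on top of that, so the actual footprint is likely larger.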

A job of that size will overwhelm any workstation, so it is not worth attempting as is.

You need to go back to the reason for clustering, and see if you can do some pre-filtering.

 

Another approach could be to use other properties of the compounds and bin them on those properties, for example logP, MolWeight, TPSA, etc.
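A minimal sketch of that binning idea, assuming the property values have already been calculated elsewhere. The function names, the bin widths, and the example values are all made up for illustration.

```python
from collections import defaultdict

def bin_key(props, widths):
    # One integer bin index per property, in a fixed (sorted-by-name) order.
    return tuple(int(props[name] // width) for name, width in sorted(widths.items()))

def bin_compounds(records, widths):
    # records: iterable of (compound_id, {property_name: value}).
    bins = defaultdict(list)
    for compound_id, props in records:
        bins[bin_key(props, widths)].append(compound_id)
    return dict(bins)

# Hypothetical bin widths and made-up property values.
widths = {"logP": 1.0, "MolWeight": 50.0}
records = [
    ("a", {"logP": 1.2, "MolWeight": 180.2}),
    ("b", {"logP": 1.7, "MolWeight": 195.0}),
    ("c", {"logP": 3.4, "MolWeight": 310.5}),
]

bins = bin_compounds(records, widths)
print(bins)  # {(3, 1): ['a', 'b'], (6, 3): ['c']}
```

Each bin can then be clustered separately, so the distance matrix only has to cover the compounds within one bin at a time.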

I found an existing method for this kind of clustering: Sphere Exclusion Clustering.

http://www.daylight.com/meetings/mug04/Delany/spherex.html

http://www.chemaxon.com/jchem/doc/user/SphereExclusion.html
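For what it's worth, the general idea can be sketched in a few lines. This is a simplified version, not necessarily the exact procedure described at either link: it takes the first unassigned entry as the next centre, and published variants differ in how they choose that. Fingerprints are represented here as plain Python integers, and the 0.7 similarity threshold is an arbitrary choice of mine.

```python
def popcount(x):
    # Number of set bits in a non-negative integer.
    return bin(x).count("1")

def tanimoto(a, b):
    # Tanimoto similarity between two bitstrings stored as integers.
    union = popcount(a | b)
    if union == 0:
        return 1.0  # assumption: two empty fingerprints count as identical
    return popcount(a & b) / union

def sphere_exclusion(fingerprints, threshold=0.7):
    # Returns clusters as lists of indices into `fingerprints`.
    unassigned = list(range(len(fingerprints)))
    clusters = []
    while unassigned:
        centre = unassigned[0]  # simplest choice: first entry still unassigned
        members, remaining = [], []
        for i in unassigned:
            if tanimoto(fingerprints[centre], fingerprints[i]) >= threshold:
                members.append(i)    # inside the sphere: excluded from later rounds
            else:
                remaining.append(i)
        clusters.append(members)
        unassigned = remaining
    return clusters

# Made-up 8-bit fingerprints, for illustration only.
fps = [0b11110000, 0b11100000, 0b00001111, 0b00000111, 0b11111111]
clusters = sphere_exclusion(fps, threshold=0.7)
print(clusters)  # [[0, 1], [2, 3], [4]]
```

The point is that only one centre is compared against the remaining entries at a time, so no full distance matrix is ever held in memory. The run time still depends on how many clusters come out.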

YMMV.