I'm trying to cluster over 100,000 data entries based on their fingerprint bitstrings.
Using the Distance Matrix Calculate node, I calculate Tanimoto distances and then use those for clustering, but even on well-spec'd i7 machines I run out of memory and disk space before the calculations complete.
Is there some alternative approach I should be using for handling such a dataset?
You are making a triangular matrix of 100,000 * 100,000, which comes to around 5,000,000,000 pairwise distances.
Such a job will kill any workstation, and is silly to try.
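To put a number on it, here is a quick back-of-the-envelope check in Python. It assumes each distance is stored as one 64-bit float (8 bytes); the actual storage format of the Distance Matrix Calculate node may differ, so treat the size as a rough estimate only.

```python
def triangular_pair_count(n):
    """Number of distinct pairs among n items (strict upper triangle)."""
    return n * (n - 1) // 2


n_entries = 100_000
pairs = triangular_pair_count(n_entries)

# Assumption: one 64-bit float (8 bytes) per stored distance.
bytes_per_value = 8
size_gb = pairs * bytes_per_value / 1e9

print(f"{pairs:,} distances, roughly {size_gb:.0f} GB")
```

Because the count grows with the square of the number of entries, halving the dataset cuts the matrix to about a quarter, which is why pre-filtering helps so much.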
You need to go back to the reason for clustering, and see if you can do some pre-filtering.
Another approach could be to use other properties of the compounds and bin them on those properties, for example logP, MolWeight, or TPSA.
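A minimal sketch of the binning idea, in plain Python. The entry IDs, property values, and bin widths below are made up for illustration, and `bin_key` / `bin_entries` are hypothetical helper names, not part of any KNIME node. The point is only that each bin can then be clustered on its own, so no single distance matrix has to cover all entries.

```python
from collections import defaultdict


def bin_key(props, widths):
    """Map property values to a tuple of integer bin indices (floor division)."""
    return tuple(int(props[name] // width) for name, width in widths.items())


def bin_entries(entries, widths):
    """Group entry IDs by the bin their properties fall into."""
    bins = defaultdict(list)
    for entry_id, props in entries.items():
        bins[bin_key(props, widths)].append(entry_id)
    return dict(bins)


# Hypothetical bin widths and property values, for illustration only.
widths = {"logP": 1.0, "MolWeight": 100.0}
entries = {
    "a": {"logP": 2.3, "MolWeight": 310.0},
    "b": {"logP": 2.9, "MolWeight": 350.0},
    "c": {"logP": -0.5, "MolWeight": 120.0},
}

bins = bin_entries(entries, widths)
for key, members in bins.items():
    print(key, members)
```

Entries that sit close to a bin boundary can end up separated from near neighbours, so the bin widths are a trade-off between matrix size and cluster quality.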
I found another existing method to do such a clustering: Sphere Exclusion Clustering.
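For anyone curious how that works, below is a rough sketch of the general idea in Python, not the implementation of any particular node or toolkit. It assumes each fingerprint is held as a Python integer used as a bitstring, and it uses a simple variant that takes the first unassigned entry as the next centroid; published variants pick centroids differently (for example by neighbour count). The names `sphere_exclusion` and `radius` are my own.

```python
def popcount(x):
    """Count the set bits in a non-negative integer."""
    return bin(x).count("1")


def tanimoto_distance(a, b):
    """Tanimoto distance between two bitstrings stored as integers."""
    union = popcount(a | b)
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - popcount(a & b) / union


def sphere_exclusion(fingerprints, radius):
    """Assign each fingerprint to the first centroid within `radius`.

    Returns (centroid indices, cluster label per fingerprint).
    Only one distance is held at a time; no full matrix is stored.
    """
    centroids = []
    labels = [-1] * len(fingerprints)
    for i, fp in enumerate(fingerprints):
        if labels[i] != -1:
            continue  # already inside an earlier sphere
        cluster_id = len(centroids)
        centroids.append(i)
        labels[i] = cluster_id
        # Exclude every still-unassigned entry inside this sphere.
        for j in range(i + 1, len(fingerprints)):
            if labels[j] == -1 and tanimoto_distance(fp, fingerprints[j]) <= radius:
                labels[j] = cluster_id
    return centroids, labels


# Tiny made-up example with 6-bit fingerprints.
fps = [0b001111, 0b001110, 0b000001, 0b110000]
centroids, labels = sphere_exclusion(fps, radius=0.3)
print(centroids, labels)
```

The memory use stays proportional to the number of entries, but the running time still depends on how many centroids are needed, so a very small radius on 100,000 entries can remain slow.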