Hierarchical Clustering (DistMatrix) node issue


I am using the Distance Matrix Calculate node to compute Tanimoto distances on 7500 molecule fingerprints (from RDKit), followed by the Hierarchical Clustering (DistMatrix) node.
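For reference, the Tanimoto distance the Distance Matrix Calculate node computes on bit-vector fingerprints is 1 minus the ratio of shared bits to total set bits. A minimal pure-Python sketch (using sets of "on" bit positions as an illustrative stand-in for RDKit fingerprints; RDKit and KNIME use their own optimized routines):

```python
def tanimoto_distance(a: set, b: set) -> float:
    # Tanimoto similarity = |A ∩ B| / |A ∪ B|; distance = 1 - similarity.
    common = len(a & b)
    union = len(a) + len(b) - common
    # Two empty fingerprints are treated as identical (distance 0).
    return 1.0 - (common / union if union else 1.0)

fp1 = {1, 4, 7, 9}
fp2 = {1, 4, 8}
print(tanimoto_distance(fp1, fp2))  # 2 common bits, 5 in the union -> 0.6
```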

However, unless the clustering node's memory policy is set to "write to disc", the node eventually consumes all of the PC's memory and KNIME dies (when using "keep small tables in memory").



Are you using the "Hierarchical Clustering (Distance Matrix)" node? This doesn't have a data output, so the memory option should have no effect. Are you sure that it is really related to the option?

Yes, the "Hierarchical Clustering (DistMatrix)" node is the node in question, with the viewing output port. I will recheck the "Write to Disc" option in case it only appeared to work by chance when I changed the setting. Either way, it has been causing significant problems on large datasets.

I'll let you know when I get a chance to try a few more things to narrow down the issue.



Hi, the error seems to occur whether "write to disc" is selected or not. Large data sets do not appear to be handled well in terms of memory usage. This is with KNIME 2.4.2.

I tried to cluster 7500 fingerprints and this worked with about 500MB of memory. The node stores the complete distance matrix in memory (well, actually only half of it), so it is quite easy to estimate the amount of memory needed: 7500 * 7500 / 2 * 8 bytes ≈ 214MB for the distance matrix. Then add some memory for the created dendrograms, so 500MB should be enough. It also shouldn't make a difference which version of KNIME you are using; nothing has changed between 2.4 and 2.5 in this respect.
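The estimate above is easy to reproduce as a back-of-the-envelope calculation (a sketch, assuming 8-byte doubles and half-matrix storage as described above):

```python
def half_matrix_bytes(n_rows: int, bytes_per_entry: int = 8) -> int:
    # Only one triangle of the symmetric distance matrix is stored,
    # i.e. roughly n * n / 2 entries of 8 bytes each.
    return n_rows * n_rows // 2 * bytes_per_entry

mb = half_matrix_bytes(7500) / 1024 / 1024
print(f"{mb:.1f} MB")  # roughly 214 MB, matching the estimate above
```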

I've tried this task in isolation (i.e. a fresh workflow) and it works much better. I suspect I just have too much going on in my workflow.

i.e. there were numerous clustering metanodes containing around 1 million rows, then Parallel Universe metanodes with 7500 rows, some Matched Pair metanodes (1 million rows), and then this Hierarchical Clustering metanode with 7500 rows. I think the workflow must be maxing out the memory, and I need to split it into separate workflows.