The nightly builds of the RDKit nodes (should) now include a node for picking a diverse subset of rows from a table. The node is currently fairly crude and can only use Tanimoto similarity on a fingerprint column, but I thought that it would be useful anyway.
That's excellent news - thanks! This is something that I have been wanting for some time, and I think it will be of real value to the community. Out of interest, is the node implementing the MaxMin functionality from rdkit.SimDivFilters.MaxMinPicker.Pick?
Yes, the node is doing the diversity picking using the same backend code as the MaxMinPicker. For those not familiar with the core RDKit functionality, it's the algorithm described in this paper: Ashton, M. et al., Quant. Struct.-Act. Relat., 21 (2002), 598-604. The practical advantage over standard clustering-based approaches is that the MaxMin algorithm scales much better with dataset size (you don't need to pre-compute the distance matrix).
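For anyone who wants to see the idea rather than the node, here is a minimal pure-Python sketch of MaxMin picking. It is not the RDKit implementation: it assumes fingerprints are given as sets of on-bit indices, the names tanimoto_distance and maxmin_pick are made up for illustration, and the first pick is fixed rather than random. The point is that only one running list of "distance to the nearest pick" is kept, so the full distance matrix is never stored.

```python
def tanimoto_distance(a, b):
    """Return 1 - Tanimoto similarity for two fingerprints given as sets of on-bits."""
    union = len(a | b)
    if union == 0:
        return 0.0  # assumption: two empty fingerprints count as identical
    return 1.0 - len(a & b) / union


def maxmin_pick(fps, n_pick, first=0):
    """Greedy MaxMin: keep picking the item whose nearest already-picked
    neighbour is farthest away. Distances are computed on demand."""
    n_pick = min(n_pick, len(fps))
    if n_pick == 0:
        return []
    picked = [first]
    picked_set = {first}
    # min_dist[i] = distance from item i to its closest picked item so far
    min_dist = [tanimoto_distance(fp, fps[first]) for fp in fps]
    while len(picked) < n_pick:
        candidates = (i for i in range(len(fps)) if i not in picked_set)
        nxt = max(candidates, key=lambda i: min_dist[i])
        picked.append(nxt)
        picked_set.add(nxt)
        # Only distances to the newest pick need to be computed
        for i, fp in enumerate(fps):
            d = tanimoto_distance(fp, fps[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return picked


# Toy fingerprints (hypothetical data, not real molecules)
fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 2, 3, 4}, {7, 8}]
picks = maxmin_pick(fps, 3)
```

Picking k items from N costs roughly N*k distance evaluations and O(N) memory in this sketch, compared with N*N/2 distances for a full matrix.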
I've found this node extremely useful too, thanks for making it available - one quick Q if I may: are there limits on the number of compounds it can handle/select? Is selecting a subset from a few hundred thousand problematic(/dumb)?
I want to use the Diversity Picker node essentially as a method to pick molecule cluster centroids, but then (maybe obviously) would like a way of assigning other molecules based on which centroid they are closest to.
I am currently struggling to accomplish the cluster assignment part in KNIME - I started by trying to loop over the full compound set once for each 'centroid' and used the Indigo Fingerprint Similarity node to generate a distance for each molecule to each centroid. With a column rename and a Loop End (Column Append) to finish, I can also get this distance information into a new set of columns, with each column now named after the centroid row ID. So I thought I was almost there - all I want to do is record a 'winner' out of all of the columns and place the column name as the result. I considered the Rule Engine to do this, but I think this will fall down for a different number of clusters / different row IDs.
Am I missing something - it seems I may be making a simple problem difficult?! If not, would the Diversity Picker node be the place to add such cluster assignment functionality, in addition to picking the diverse molecules?
For such problems I use the MOE Fingerprint calculation node and pipe the results into a "Distance Matrix Calculate" node. When you switch to Tanimoto you can select the fingerprint column. Then the k-Medoids node does the trick. It gives you the cluster centroids, the cluster assignment and the distances to the centroids.
Another option if you have the centroids already is the "Most Similar Molecule(s)" node in the MOE extensions. You pipe in the big database and the centroids as reference. Set the threshold to 0 and N to 1. This will give you a table where for each molecule you have the corresponding centroid molecule. This can easily be split into clusters if needed.
I was going to suggest the same. I use the RDKit or Indigo fingerprint node followed by the Distance Matrix Calculate node with Tanimoto distances. I find the Morgan fingerprint of RDKit serves me best.
You can then use the k-Medoids node to get clusters around a centroid.
Alternatively, instead of the k-Medoids node you can use the Hierarchical Clustering node followed by the Hierarchical Cluster Assigner node, which works rather well.
A further option is to use the BitVector converter node from Erl Wood after the fingerprint generation to create lots of binary columns and then use c-means for clustering. This is actually an example as part of the complex SAR workflow on the KNIME public server.
Yes, I am aware that the clustering (given fingerprints) can be accomplished via k-medoids or hierarchical clustering nodes. However, I was particularly interested in using distance from the Diversity Picker results as the method for assigning cluster membership - I will take a look at the suggested MOE node.
James - to solve your problem I’d try unpivoting the output from the Loop End (Column Append) node. You should then be able to sort and/or use the GroupBy node on the output to find the most similar reference per compound, rather than trying to do a cross-column comparison. Hope this helps, obviously I may be missing something without seeing your workflow.
Dave
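To make the unpivot-then-GroupBy idea concrete outside KNIME, here is a rough pure-Python sketch of the same logic. The table, the compound names and the centroid row IDs (Row3, Row17, Row42) are invented for illustration, and the values are assumed to be similarities, so the winner is the maximum; with distances you would take the minimum instead.

```python
# Wide table as it might come out of a column-append loop:
# one row per compound, one similarity column per centroid (hypothetical values).
wide = [
    {"compound": "mol_A", "Row3": 0.91, "Row17": 0.20, "Row42": 0.35},
    {"compound": "mol_B", "Row3": 0.15, "Row17": 0.40, "Row42": 0.78},
    {"compound": "mol_C", "Row3": 0.30, "Row17": 0.66, "Row42": 0.10},
]

# Step 1, unpivot: one (compound, centroid, similarity) record per cell.
long_rows = [
    (row["compound"], centroid, sim)
    for row in wide
    for centroid, sim in row.items()
    if centroid != "compound"
]

# Step 2, group by compound and keep the record with the highest similarity.
best = {}
for compound, centroid, sim in long_rows:
    if compound not in best or sim > best[compound][1]:
        best[compound] = (centroid, sim)
```

Because the centroid name travels with each value after the unpivot, nothing here depends on how many centroid columns there are or what they are called.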
It shouldn't be that difficult to add the ability to assign molecules to clusters based on their distances from centroids.
The question is what the output would look like... I guess the node would need to add two columns when that option is set: "cluster membership" and "is centroid?"
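As a rough picture of what that output could look like, here is a small pure-Python sketch. It is only an illustration of the proposal, not the node's code: fingerprints are assumed to be sets of on-bit indices, the function names are made up, and clusters are labelled by the row index of their centroid.

```python
def tanimoto_distance(a, b):
    """Return 1 - Tanimoto similarity for two fingerprints given as sets of on-bits."""
    union = len(a | b)
    if union == 0:
        return 0.0  # assumption: two empty fingerprints count as identical
    return 1.0 - len(a & b) / union


def assign_clusters(fps, centroid_indices):
    """Assign every row to its nearest centroid and flag the centroids themselves."""
    centroid_set = set(centroid_indices)
    rows = []
    for i, fp in enumerate(fps):
        # Ties go to the centroid listed first in centroid_indices
        nearest = min(centroid_indices,
                      key=lambda c: tanimoto_distance(fp, fps[c]))
        rows.append({
            "row": i,
            "cluster membership": nearest,
            "is centroid?": i in centroid_set,
        })
    return rows


# Toy fingerprints (hypothetical data); rows 0 and 2 play the picked centroids
fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 2, 3, 4}, {7, 8}]
table = assign_clusters(fps, [0, 2])
```

A third column holding the distance to the assigned centroid would be a natural addition, but that goes beyond the two columns suggested above.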
Hi Dave, thanks for the suggestion - I will look into it because I am interested(!), but actually the CDK FP similarity node offers a very quick and convenient solution - see http://tech.knime.org/forum/indigo/suggested-modification-to-fingerprint-similarity-node
The solution is 'generic' in the sense that any FP is ok, but I still think there are some easy improvements for a generic similarity node (that handles multiple references) - like offering different similarity metrics.