How to extract data from distance matrix


I would like to determine the chemical diversity within a library by calculating a self-similarity distance matrix (i.e. comparing all library members against all others) and then binning the similarities into a histogram. This by itself is straightforward: first calculate structural fingerprints (e.g. using RDKit nodes), then compute a Tanimoto distance matrix on the fingerprints.

However, the tricky bit is to extract, for each row, the shortest distance from the distance column. This would be straightforward if the Cell Splitter node could be used, but I see no way to convert the Distance column format into something else.

Could this be done with a simple Java script?

Any suggestions greatly appreciated!

I think what you need is NOT the distance matrix node.

Take the fingerprints as you mention, then use one of the fingerprint similarity nodes, available from the Indigo and Erl Wood community nodes.

In either of these nodes, connect your data set to both in-ports, then choose Tanimoto for the distance type, and choose Max similarity (i.e. min distance).

Hope this helps


I think Simon is correct here. We also have a Similarity Search node in the KNIME core which is relatively new and may help as well. In 2.9 we will have a distance matrix pair extractor which will do what you are asking for in your original post.

Regards, Aaron

Thx for the tips. I got the desired result with the Erl Wood Fingerprint Similarity node.

I also tried the Distance Matrix Similarity Search and the CDK Fingerprint Similarity nodes, but these give a maximum similarity of 1 for all rows (self-similarity of the matrix diagonal).

The results are, not surprisingly, highly dependent on the type of chemical fingerprints used.
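
For anyone who wants to check the numbers outside KNIME, here is a minimal Python sketch of the nearest-neighbour calculation with the diagonal excluded, followed by the histogram binning. It is not what any of the nodes above do internally. It assumes each fingerprint is given as a set of on-bit indices and that there are at least two molecules; the function names and the toy data are made up for illustration.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def nearest_neighbour_similarity(fps):
    """For each fingerprint, the highest similarity to any OTHER fingerprint."""
    best = []
    for i, fp_i in enumerate(fps):
        # Skip j == i so the diagonal (self-similarity of 1) is ignored.
        others = [tanimoto(fp_i, fp_j) for j, fp_j in enumerate(fps) if j != i]
        best.append(max(others))
    return best


def histogram(values, n_bins=10):
    """Count values in n_bins equal-width bins covering [0, 1]."""
    counts = [0] * n_bins
    for v in values:
        # A value of exactly 1.0 is placed in the last bin.
        counts[min(int(v * n_bins), n_bins - 1)] += 1
    return counts


# Toy fingerprints (made-up data, for illustration only).
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {7, 8, 9}, {7, 8, 10, 11}]
nn_sim = nearest_neighbour_similarity(fps)
print(nn_sim)
print(histogram(nn_sim, n_bins=5))
```

The shortest distance per row is then simply 1 minus the value returned for that row.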


I want to raise this subject again, and hope someone has an idea of how to deal with this:

I want to choose a diverse dataset (Tanimoto cutoff = 0.7) from a file. We used to use a script, but each time we had to prepare the files manually.

Using the nodes in KNIME makes it easier, up to the point where I have to choose my molecules. I calculate fingerprints with CDK, but if I use the "Fingerprint Similarity" node it gives me the "average" for each molecule against the entire file, and this value doesn't say much!

We used to compare pairs of molecules: if a pair has a Tanimoto > 0.7, we delete the molecule that has more similarity to the entire file, decided simply by summing the Tanimoto values for each molecule.

Well, the question is: how do I do this filtering in KNIME?

Any ideas?


Thanks a lot :)
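
To make the procedure described above concrete, here is a minimal Python sketch of one possible reading of it. It is not the original script, which is not shown in this thread. The assumptions are: fingerprints are sets of on-bit indices, the summed similarities are computed once on the full set and not updated after removals, pairs are visited from most to least similar, and ties remove the second molecule of the pair. The name filter_by_cutoff and the toy data are made up.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def filter_by_cutoff(fps, cutoff=0.7):
    """Return the indices kept after resolving every pair above the cutoff."""
    n = len(fps)
    sim = {(i, j): tanimoto(fps[i], fps[j]) for i, j in combinations(range(n), 2)}

    # Summed similarity of each molecule to all others (assumption: computed once).
    total = [0.0] * n
    for (i, j), s in sim.items():
        total[i] += s
        total[j] += s

    removed = set()
    # Visit the most similar pairs first.
    for (i, j), s in sorted(sim.items(), key=lambda kv: kv[1], reverse=True):
        if s <= cutoff:
            break  # all remaining pairs are at or below the cutoff
        if i in removed or j in removed:
            continue  # pair already resolved by an earlier removal
        # Drop the molecule with the larger summed similarity (ties drop j).
        removed.add(i if total[i] > total[j] else j)
    return [k for k in range(n) if k not in removed]


# Toy fingerprints (made-up data): molecule 1 is very similar to both 0 and 2.
fps = [{1, 2, 3, 4, 5}, {1, 2, 3, 4, 5, 6}, {1, 2, 3, 4, 6}, {10, 11, 12}]
print(filter_by_cutoff(fps, cutoff=0.7))
```

In the toy data only molecule 1 is dropped, because it takes part in both pairs above 0.7 and has the largest summed similarity.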



Well, if you want to pick a diverse set of cpds, why not generate the fingerprints using RDKit, CDK, or whatever your favourite fingerprint is, and then use either:

The RDKit Diversity Picker node, which uses the MaxMin method to pick the most diverse set. I have used this and compared it to other techniques. It works extremely well.

Or calculate a distance matrix with the distance matrix calculator node, then use the Score Erosion node. Again this works well, and it allows you to weight the molecules, for example in favour of druggable cpds if you have another column with a druggability score.
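
For readers unfamiliar with MaxMin, the idea can be sketched in a few lines of Python. This is only an illustration of the greedy principle, not the RDKit node's implementation. It assumes fingerprints are sets of on-bit indices and that the first pick is fixed at index 0 (a real picker may seed this differently); maxmin_pick and the toy data are made-up names.

```python
def tanimoto_distance(a, b):
    """Tanimoto distance (1 - similarity) for fingerprints stored as sets."""
    union = len(a | b)
    return 1.0 - (len(a & b) / union if union else 0.0)


def maxmin_pick(fps, n_pick, first=0):
    """Greedy MaxMin: repeatedly add the candidate farthest from the picked set."""
    if not fps:
        return []
    picked = [first]
    # Distance from every molecule to its nearest already-picked molecule.
    min_dist = [tanimoto_distance(fp, fps[first]) for fp in fps]
    while len(picked) < min(n_pick, len(fps)):
        # Candidate whose nearest picked neighbour is farthest away.
        nxt = max((i for i in range(len(fps)) if i not in picked),
                  key=lambda i: min_dist[i])
        picked.append(nxt)
        # Update nearest-picked distances with the new pick.
        for i, fp in enumerate(fps):
            min_dist[i] = min(min_dist[i], tanimoto_distance(fp, fps[nxt]))
    return picked


# Toy fingerprints (made-up data, for illustration only).
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {7, 8, 9}, {7, 8, 10, 11}]
print(maxmin_pick(fps, n_pick=3))
```

Note that the number of picks is an input here: asking for as many picks as there are molecules simply returns every index.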



In both options you choose the size of the set to be picked; it isn't like a Tanimoto cutoff! If I have a set of 100 molecules and in the "Diversity Picker" I choose 100 to pick, it gives me all 100, so what have I achieved?

If I don't want to define a number, and just want it to calculate the differences and pick the most diverse molecules, is there a way to do that?

** I want to choose my training set here, so I can't define a specific number for the set size!
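
To illustrate what a cutoff-driven pick (no preset size) could look like, here is a minimal Python sketch of a simple leader-style selection: a molecule is kept only if it is not more similar than the cutoff to anything already kept. This is an assumption about one way to do it, not a description of any KNIME node mentioned in this thread, and the result depends on the input order. Fingerprints are assumed to be sets of on-bit indices; leader_pick and the toy data are made up.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def leader_pick(fps, cutoff=0.7):
    """Keep a molecule only if its similarity to every kept one is <= cutoff."""
    kept = []
    for i, fp in enumerate(fps):
        if all(tanimoto(fp, fps[k]) <= cutoff for k in kept):
            kept.append(i)
    return kept


# Toy fingerprints (made-up data, for illustration only).
fps = [{1, 2, 3, 4, 5}, {1, 2, 3, 4, 5, 6}, {1, 2, 3, 4, 6}, {10, 11, 12}]
print(leader_pick(fps, cutoff=0.7))
```

The size of the returned set is decided by the cutoff and the data alone, so a stricter cutoff yields a smaller set.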

Another issue: in the distance matrix calculator node, how can I keep the molecule names when creating the matrix?