RDKit Fingerprint Issues

Hi,

I don't know if I am misusing the RDKit Fingerprint node, but when I use the Morgan, FeatMorgan, AtomPair, or Torsion options and then compare the resulting fingerprints to those of other, very different molecules, I get very high Tanimoto similarities, which should not be the case. Only the Layered and RDKit options seem to give sensible, low similarity values.

Thanks,

Simon.

Hi Simon,

Can you please give an example of two molecules where you see this behavior? Either SMILES for the molecules or a KNIME workflow with the data saved (you can add attachments to forum posts now) would help.

Best Regards,

-greg

So, for example, take the reagents from the File Reader in the Reaction Enumeration example workflow that you posted on the KNIME Public Server and calculate their Morgan fingerprints with the default settings. Then take just an indole ring and calculate its Morgan fingerprint. If you then use Tanimoto similarity (using a node from Mike Bodkin - Erlwood) to see how the reagents compare to the indole ring, most of them return a similarity above 0.9. This cannot be right, as many of these reagents are not even aromatic. For example, ethanol has a similarity of 0.95, and ethanol is nothing like indole!

In this example, FeatMorgan shows a similarity of 0.96 for ethanol.

AtomPair shows 0.97, Torsion 0.98!!!

Only RDKit (0.75) and Layered (0.71) show more sensible similarity scores.

Hope this helps,

Simon.

Hi Simon,

I can't reproduce this. I put together a small workflow that builds a few molecules from SMILES and then calculates the distance matrix. Mike's Tanimoto Similarity node doesn't seem to be in the open source Erl Wood nodes yet, so I used KNIME's built-in distance matrix node.

Using default settings for either the Morgan or FeatMorgan fingerprinters (i.e. a radius of 2 and 1024 bits), I get a distance of 1.0 (i.e. a similarity of zero) between ethanol and indole.
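To make the distance/similarity relationship concrete, here is a minimal plain-Python sketch. It assumes the distance is defined as 1 minus the Tanimoto similarity, which matches the numbers above but is not taken from the node's source. The bit indices are made up for illustration and are not real Morgan fingerprint output, and the function names are hypothetical.

```python
def tanimoto_similarity(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty fingerprints count as identical.
        return 1.0
    # Shared on-bits divided by all on-bits; off-bits are never counted.
    return len(a & b) / len(union)


def tanimoto_distance(bits_a, bits_b):
    """Distance assumed here to be 1 - similarity."""
    return 1.0 - tanimoto_similarity(bits_a, bits_b)


# Hypothetical on-bit indices in a 1024-bit fingerprint (illustrative only).
fp_x = {3, 80, 222, 807}
fp_y = {1, 41, 356, 650, 849, 1019}
fp_z = {3, 80, 222, 900}

# No shared on-bits: similarity 0.0, distance 1.0.
print(tanimoto_similarity(fp_x, fp_y), tanimoto_distance(fp_x, fp_y))
# Three shared on-bits out of five distinct on-bits: similarity 0.6.
print(tanimoto_similarity(fp_x, fp_z))
```

Two fingerprints with no on-bits in common give a similarity of zero no matter how many bits are off in both, so small, unrelated molecules should not score near 1.0 under this definition.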

The workflow is attached to this post. If it doesn't help you discover what the problem is in your workflow, please add the SMILES for the molecules that you're using to the table and attach it to your reply and I'll take a look.

-greg

Hi Greg,

Thanks for the workflow. As you say, using the Distance Matrix does give more sensible results, but when I use the same set with Mike's FingerPrint Similarity node I still get high similarity scores. I will follow up with Mike on this. The FingerPrint Similarity node works well with other types of fingerprints, so maybe it's misreading these ones.

Thanks,

Simon.

Thanks for the help, Greg. It has identified a bug in our own FingerPrint Similarity node, which was misidentifying certain types of fingerprints. I am assured the bug is easy to fix, so thankfully the RDKit Fingerprint node is not the issue!

Simon.

I'm glad to hear that, at least this time, my code is not to blame. :-)

-greg