Similarity search, bins and Murcko scaffolds

Roxcapo · September 10, 2021, 11:47pm

Hi there,

I am new to KNIME and need some help.
I have two groups of molecules and want to find their similarity, select the ones that are most similar/dissimilar, and obtain the Murcko scaffolds of the most dissimilar ones.

I converted my molecules to CDK format, used the Fingerprint node to obtain their ECFP4 fingerprints (for both sets of molecules), and then used the Fingerprint Similarity node to compare my molecules. I set the node to output a matrix of values. How can I bin values >0.8 and obtain the Murcko scaffolds of the compounds?

Thank you

elsamuel · September 11, 2021, 5:24pm

Welcome to the forum, @Roxcapo.

Here’s an example workflow that (I think) does what you want. At the very least it’s a starting point for you to play around with.

There are actually 2 workflows that do subtly different things:

For every molecule in Group 1, return the single most similar molecule in Group 2, where the Tanimoto similarity is greater than some specified value
For every molecule in Group 1, return all pairs with Group 2 molecules where the Tanimoto similarity is greater than some specified value
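
Outside KNIME, the difference between these two modes can be sketched in plain Python. This is only an illustration under stated assumptions, not the workflow itself: fingerprints are represented as sets of on-bit indices, and the function names (`tanimoto`, `best_match_per_query`, `all_pairs_above`) and the toy data are hypothetical, made up for this example.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def best_match_per_query(group1, group2, threshold):
    """Mode 1: for each Group 1 molecule, keep only its single best Group 2 hit.

    Assumes group2 is not empty.
    """
    hits = {}
    for name1, fp1 in group1.items():
        scored = [(name2, tanimoto(fp1, fp2)) for name2, fp2 in group2.items()]
        best_name, best_sim = max(scored, key=lambda pair: pair[1])
        if best_sim > threshold:
            hits[name1] = (best_name, best_sim)
    return hits


def all_pairs_above(group1, group2, threshold):
    """Mode 2: keep every Group 1 / Group 2 pair above the threshold."""
    pairs = []
    for name1, fp1 in group1.items():
        for name2, fp2 in group2.items():
            sim = tanimoto(fp1, fp2)
            if sim > threshold:
                pairs.append((name1, name2, sim))
    return pairs


# Toy fingerprints (hypothetical data, not real ECFP4 bits).
group1 = {"A1": {1, 2, 3, 4, 5}, "A2": {10, 11, 12}}
group2 = {"B1": {1, 2, 3, 4, 5}, "B2": {1, 2, 3, 4, 5, 6}, "B3": {20, 21}}

best = best_match_per_query(group1, group2, threshold=0.8)
pairs = all_pairs_above(group1, group2, threshold=0.8)
print(best)
print(pairs)
```

With this toy data, A1 has two Group 2 neighbours above 0.8, so the first mode reports only the closer one while the second mode reports both. A2 has no neighbour above the threshold and drops out of both results.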

Roxcapo · September 12, 2021, 8:15pm

Thank you, that really helps!

aworker · September 13, 2021, 9:43am

Hi @Roxcapo and welcome to the KNIME forum.

To complement @elsamuel's solution based on CDK nodes, which is perfectly valid, I would like to add an alternative.

It uses the -RDKit Fingerprint- node to calculate the fingerprints and KNIME's own -Similarity Search- node to calculate the similarities. The two approaches are more or less equivalent.

That said, in my experience this second approach, based on RDKit and the KNIME similarity node, is faster than the first one for both fingerprint generation and similarity calculation. In addition, the -RDKit Fingerprint- node gives you more options and freedom when generating the fingerprints. For instance, the CDK -Fingerprints- node does not let you define the number of bits used to store the fingerprints, which may eventually lead to hash collisions if you have to generate fingerprints for a large number of molecules:

[Screenshot: CDK -Fingerprints- node options]

[Screenshot: RDKit -Fingerprint- node options]
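
On the hash-collision point above, here is a minimal Python sketch of why the number of bits matters. It assumes a simplified folding scheme (feature identifier modulo the fingerprint length); the CDK and RDKit implementations may fold differently. The feature identifiers and the names `fold` and `count_collisions` are hypothetical and chosen only for illustration.

```python
def fold(feature_ids, n_bits):
    """Fold integer feature identifiers into a fixed-length bit set.

    Simplifying assumption: bit index = feature id modulo n_bits.
    """
    return {fid % n_bits for fid in feature_ids}


def count_collisions(feature_ids, n_bits):
    """Count distinct features that are lost because they share a bit."""
    return len(set(feature_ids)) - len(fold(feature_ids, n_bits))


# Hypothetical feature identifiers for one molecule (not real ECFP4 hashes).
features = [3, 259, 515, 1027, 5000]

# Compare how many features collide at different fingerprint lengths.
collisions = {n_bits: count_collisions(features, n_bits) for n_bits in (256, 1024, 4096)}
print(collisions)
```

In this toy case the shortest fingerprint merges several distinct features into one bit, while the longest one keeps them all apart. Merged bits make different substructures indistinguishable, which can distort the computed Tanimoto similarities.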

Below is @elsamuel's example, extended with this alternative option:

Similarity Search Between 2 Groups of Molecules II.knwf (633.9 KB)

I have added a -Timer Info- node to measure the time each node takes to execute, so that you can compare the two solutions, or others in the future. In this example you will not see much difference, but the difference in execution time may become significant if you have large sets to compare.

Hope this helps.

Best

Ael

Roxcapo · September 13, 2021, 8:39pm

Hi Ael,

Thank you very much for the extensive explanation. I will try both solutions.

Cheers,

Rox