3D RMSD of substructure vs larger molecule

Dear all,

I would like to calculate RMSD values of 3D coordinates of a small molecule / fragment versus larger compounds containing the fragment as a substructure.

In more detail:
Let’s say I have a table of molecules with SMILES in one column and 3D coordinates (SDF) in another. These are the results of a docking workflow, so the SDF column contains many poses of different molecules, including several poses of the same molecule. All of these molecules weigh about 300-450 Dalton and share a common substructure the size of a small fragment (120-200 Dalton). For this fragment I also have 3D coordinates as SDF, determined experimentally by X-ray crystallographic screening.
What I want to know:
How large is the 3D RMSD between the fragment and the same atoms inside the larger molecule?
I want to use this for further filtering, i.e. to find the docking poses that moved far away from the input fragment coordinates.
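
To make the quantity concrete, it can be written as follows. This is only a sketch under two assumptions: the N heavy atoms of the fragment have been matched one-to-one to atoms of the docked pose, and the two coordinate sets are compared in place (no superposition), because the poses and the X-ray fragment are taken to share the same protein frame. The symbols are introduced here for illustration.

```latex
\mathrm{RMSD} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}
  \left\lVert \mathbf{r}_i^{\mathrm{pose}} - \mathbf{r}_i^{\mathrm{frag}} \right\rVert^{2}}
```

A pose would then pass the filter if its RMSD stays below a chosen cutoff.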

My unsuccessful attempts so far:

As far as I can tell, the RDKit RMSD filter cannot do it, because
a) I cannot supply a reference molecule
b) it computes the RMSD over the entire molecule, as far as I tested
(so I would have to cut the large molecule back to the fragment (= MCS), but how would I do that in KNIME?)

Secondly, I tried the 3D RMSD node of CDK in KNIME.
It has similar problems to the other one, plus
c) it does not seem to give pair-wise RMSDs, just an overall RMSD for the whole column

I have a machine with 64 CPU cores (128 threads), and it should handle about 1 million poses in a reasonable time (e.g. 2 h).
So as long as the solution can be parallelized, it should be fine.
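
Because each pose is scored independently against the same reference, the work should split cleanly across processes. A minimal sketch in plain Python, assuming the matched fragment coordinates of every pose have already been extracted as lists of (x, y, z) tuples in the same atom order as the reference. The function names `in_place_rmsd` and `rmsd_for_all` are made up for illustration and are not part of KNIME or RDKit.

```python
import math
from concurrent.futures import ProcessPoolExecutor
from functools import partial


def in_place_rmsd(pose_xyz, ref_xyz):
    """RMSD between two equally ordered lists of (x, y, z) tuples, no superposition."""
    if len(pose_xyz) != len(ref_xyz) or not ref_xyz:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = sum(math.dist(p, r) ** 2 for p, r in zip(pose_xyz, ref_xyz))
    return math.sqrt(total / len(ref_xyz))


def rmsd_for_all(poses, ref_xyz, workers=1, chunksize=1000):
    """Return one RMSD per pose, in input order.

    Assumption: `poses` is a list of coordinate lists already restricted to the
    fragment atoms. With workers > 1 the poses are spread over worker processes.
    """
    score = partial(in_place_rmsd, ref_xyz=ref_xyz)
    if workers <= 1:
        return [score(p) for p in poses]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # chunksize batches the poses so that inter-process overhead stays small
        return list(pool.map(score, poses, chunksize=chunksize))
```

It would be called as, for example, `rmsd_for_all(poses, ref, workers=64)`. When run as a standalone script on Windows or macOS, that call belongs under an `if __name__ == "__main__":` guard.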

Has someone here maybe already dealt with a similar problem before and can help me, or point me in some directions?
I know it is a quite detailed problem (at least for us), but any tip is very much appreciated.

best regards,

I’m not aware of a way of doing this directly at the moment. However, it ought to be possible to achieve this in one of 3 ways:

  • Using the RDKit Python API within a python snippet
  • Using the RDKit Java API within a Java snippet (there is some discussion of how to access the RDKit API in this manner here and here)
  • Writing a custom node based on the RDKit API

(These are at least vaguely in order of increasing deviation from most people’s experience of using RDKit and/or KNIME!)

Most of the bits required are performed one way or another by our Templated Conformer Generation node (see https://kni.me/n/wK3RJiystQYq5M9w), but not in a way that you can access directly. The source code might, however, give some clues as to how to implement this using the Java API.

I think the key steps you would need are:

  • Find the MCS
  • ‘Paste’ the coordinates of the MCS match from each half of the pair onto 2 copies of the MCS structure (or delete all non-matching atoms from each structure)
  • Calculate the RMSD
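
The three steps above could look roughly like this with the RDKit Python API. This is a sketch only: it assumes both molecules carry a 3D conformer in the same coordinate frame, it skips the 'paste' step by reading the matched coordinates directly, and it takes the first MCS match on each side, which may not be the best mapping for symmetric substructures. The function name `mcs_rmsd` is hypothetical.

```python
import math

from rdkit import Chem
from rdkit.Chem import rdFMCS


def mcs_rmsd(ref_mol, pose_mol):
    """In-place RMSD over the MCS atoms of two molecules with 3D conformers.

    Returns None when no common substructure is found.
    """
    # Step 1: find the maximum common substructure (reported as SMARTS).
    mcs = rdFMCS.FindMCS([ref_mol, pose_mol])
    if mcs.numAtoms == 0:
        return None
    core = Chem.MolFromSmarts(mcs.smartsString)

    # Step 2: map the MCS onto each molecule. Both tuples are ordered by the
    # atom index in `core`, so position k refers to the same MCS atom in each.
    ref_idx = ref_mol.GetSubstructMatch(core)
    pose_idx = pose_mol.GetSubstructMatch(core)
    if not ref_idx or not pose_idx:
        return None

    # Step 3: RMSD over the matched atom pairs, without any alignment.
    ref_conf = ref_mol.GetConformer()
    pose_conf = pose_mol.GetConformer()
    sq_sum = 0.0
    for i, j in zip(ref_idx, pose_idx):
        d = ref_conf.GetAtomPosition(i).Distance(pose_conf.GetAtomPosition(j))
        sq_sum += d * d
    return math.sqrt(sq_sum / len(ref_idx))
```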

I hope that helps


Hi Steve,

thanks a lot for your tips,

I looked a bit more into the MCS node of RDKit, but unfortunately could not find the coordinate information in its output. I have to say that I am not a programmer, so creating my own nodes or writing something myself in Java/Python is simply out of reach for me.
I thought that maybe it could be solved with a complex/creative combination of nodes. Hopefully more people would benefit from a solution to my problem, so that it might be included in the developers’ future plans.

Of the 3 key steps you suggested, maybe one would not even need to find the MCS, as all my molecules definitely contain all the heavy atoms of my reference molecule. So it could be enough to “highlight” the corresponding atoms inside the query molecules and get their coordinates. From that one could create the “sub-molecule” in SDF and then compare it to the reference via the 3D RMSD node that already exists. … But again, this is just a theoretical idea. I don’t know how to do it, either in KNIME or with other tools.
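
For what it is worth, the “highlight the corresponding atoms and read their coordinates” idea maps onto a plain substructure match in the RDKit Python API. The following is a sketch, not a tested KNIME solution. It assumes the fragment and the pose are in the same coordinate frame, that bond orders and aromaticity are assigned consistently in both SDF records, and that hydrogens can be ignored. `fragment_rmsd` and `fragment_rmsd_from_sdf` are hypothetical helper names.

```python
import math

from rdkit import Chem


def fragment_rmsd(frag, pose):
    """Smallest in-place heavy-atom RMSD of `frag` against its matches in `pose`.

    Returns None when the fragment is not a substructure of the pose.
    """
    frag_conf = frag.GetConformer()
    pose_conf = pose.GetConformer()
    best = None
    # uniquify=False also returns symmetry-equivalent mappings (e.g. a flipped
    # phenyl ring), so the lowest RMSD over all of them is reported.
    for match in pose.GetSubstructMatches(frag, uniquify=False):
        sq_sum = 0.0
        # match[k] is the pose atom index that corresponds to fragment atom k
        for frag_idx, pose_idx in enumerate(match):
            d = frag_conf.GetAtomPosition(frag_idx).Distance(
                pose_conf.GetAtomPosition(pose_idx))
            sq_sum += d * d
        rmsd = math.sqrt(sq_sum / len(match))
        if best is None or rmsd < best:
            best = rmsd
    return best


def fragment_rmsd_from_sdf(frag_block, pose_block):
    """Same calculation, starting from two mol-block strings taken from SDF cells."""
    frag = Chem.MolFromMolBlock(frag_block)  # hydrogens are removed by default
    pose = Chem.MolFromMolBlock(pose_block)
    if frag is None or pose is None:
        return None  # at least one record could not be parsed
    return fragment_rmsd(frag, pose)
```

Taking the minimum over all mappings matters for symmetric fragments, where the first match alone could report a large RMSD for a pose that actually overlays well.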

best regards,