Issues with RDKit Substructure Search

beginner · February 28, 2019, 7:49am

There is clearly an issue in the RDKit nodes when substructure searching with H-saturated query structures that have not been converted to SMARTS. See the attached workflow.

RDKit Substructure Issue.knwf (25.6 KB)

RDKit simply ignores explicit Hs at some, if not all, positions when the query is not SMARTS. In comparison, Indigo handles them as expected.

Issue 2:

It seems that how aromatic bonds are drawn affects the search results (also see the workflow). If they are drawn differently in the query and the target, the structures do not match, unless the Aromatizer node is applied to at least the query molecule.

So if the queries come from, say, an SDF reader and are never converted to an RDKit molecule (and/or SMARTS), one gets confusing search results. The first issue leads to too many results, while the second leads to too few.

For me, Indigo does it right here, as it handles all the potential problems inside the search node. It works with H-saturated SDF queries, and it works without explicit aromatization.

I think this is really important, as such searches can miss important molecules.

greglandrum · March 4, 2019, 7:49am

Hi,
The two issues are related, in the sense that the RDKit is generally reasonably literal about what it does instead of trying to guess and “do the right thing” (there are several reasons for this, but the philosophy runs throughout the toolkit).

For issue 1:
By default the RDKit removes explicit H atoms from molecules that aren’t being processed as queries. This means that the H atoms you’ve placed in your mol file to flag substitution patterns are removed before the substructure matching is done. If you want to use explicit Hs in your queries to restrict possible substitutions, you need to use an “RDKit from Molecule” node with your query and set the “Treat as query” option. I’ve attached a workflow demonstrating this.
RDKit Substructure Issue part 1.knwf (13.4 KB)

In that example workflow I also edited your SMARTS query to remove the explicit Hs but still reflect what I think you intended.
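
Outside KNIME, the same effect can be reproduced with the RDKit Python API. The following is a minimal sketch, not what the nodes do internally: the query "[H]OCC" and the two target molecules are made up for illustration, and keeping the Hs followed by Chem.MergeQueryHs is only assumed to be a rough analogue of the "Treat as query" option.

```python
from rdkit import Chem

# Illustrative query: the explicit H on the oxygen is meant to say
# "this oxygen must carry a hydrogen" (a free OH, not an ether).
QUERY_SMILES = "[H]OCC"

ethanol = Chem.MolFromSmiles("CCO")
diethyl_ether = Chem.MolFromSmiles("CCOCC")

# Default parsing: explicit Hs are removed, so the query is just C-C-O.
plain_query = Chem.MolFromSmiles(QUERY_SMILES)

# Query-style handling (assumed analogue of "Treat as query"):
# keep the explicit Hs, then merge them into H-count constraints
# on the heavy atoms they are attached to.
params = Chem.SmilesParserParams()
params.removeHs = False
with_hs = Chem.MolFromSmiles(QUERY_SMILES, params)
h_aware_query = Chem.MergeQueryHs(with_hs)

for name, target in [("ethanol", ethanol), ("diethyl ether", diethyl_ether)]:
    print(
        name,
        "| plain query:", target.HasSubstructMatch(plain_query),
        "| H-aware query:", target.HasSubstructMatch(h_aware_query),
    )
```

With the plain query both molecules should match, which is the "too many results" case; with the H-aware query only ethanol should match.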

For issue 2:
Since you have not set aromaticity in the sketcher before you export SMARTS, you get the following: [#6]-[#7]-1-[#6]=[#6](-[#6])-[#6]-2=[#6]-1-[#6]=[#6]-[#6]=[#6]-2. Note the explicit single and double bonds. The RDKit does not do aromaticity perception on queries read from SMARTS, so this will not match a molecule that’s been processed normally from Mol or SMILES (which do have aromaticity perception run by default). If you export the query as a Mol block, aromaticity will be perceived and you won’t have this problem.
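
The difference can also be checked directly in Python. This is a small sketch under stated assumptions: the target SMILES (1,3-dimethylindole) and the Kekule SMILES transcription of the query are illustrative choices, and a SMILES is used in place of a Mol block because both are parsed as ordinary molecules with aromaticity perception.

```python
from rdkit import Chem

# SMARTS quoted above: explicit single and double bonds, no aromaticity.
KEKULE_SMARTS = (
    "[#6]-[#7]-1-[#6]=[#6](-[#6])-[#6]-2=[#6]-1-[#6]=[#6]-[#6]=[#6]-2"
)
# The same structure transcribed as a Kekule SMILES (illustrative).
KEKULE_SMILES = "CN1C=C(C)C2=C1C=CC=C2"

# Target parsed normally, so its ring bonds are perceived as aromatic.
target = Chem.MolFromSmiles("Cn1cc(C)c2ccccc12")

# Read as SMARTS: bonds stay literal single/double bond queries.
smarts_query = Chem.MolFromSmarts(KEKULE_SMARTS)

# Read as a molecule: aromaticity is perceived during sanitization.
mol_query = Chem.MolFromSmiles(KEKULE_SMILES)

smarts_hit = target.HasSubstructMatch(smarts_query)
mol_hit = target.HasSubstructMatch(mol_query)
print("Kekule SMARTS query matches:", smarts_hit)
print("Molecule query matches:", mol_hit)
```

The SMARTS query should fail to match while the molecule query should succeed, which is the "too few results" case from the original post.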

I hope this helps,
-greg
