RDKit Substructure Filter - treatment of hydrogens

Dear Greg (et al),

I have been having some issues using the RDKit Substructure Filter node - that have raised some questions that I thought I would share!

I am working with the DrugBank small molecules set (6595 mols - filtered to 6453 when converted to RDKit), and want to allow easy substructure searching via sketching in the MarvinSketch node.

Should be easy(!), as the MarvinSketch node will output SMARTS format.  But two problems can raise their heads: aromaticity, and explicit Hs...  The first of these can be fixed using the Indigo nodes (->Indigo; Aromatizer; ->Mol; Marvin MolConverter->SMARTS).  However, any explicitly-drawn Hs at this point seem to give 'undesired' results.  Example for 'phenol' substructure:

  1. Hand-written SMARTS:  [O;H1]c1ccccc1 - gives 613 matches
  2. kekule-drawn Phenol (with explicit H, and the aromatizer process above) - gives SMARTS [H]Oc1ccccc1 and 0 matches

I guess this is down to the way that RDKit defaults to storing molecules (with implicit Hs) and not due to any inherent problem with the SMARTS... In this particular example I happen to be starting with explicit Hs present in an SDF column, but these are removed when converted.

In RDKit (python), I think the following demonstrates the issue:

>>> mol = Chem.MolFromSmiles("Oc1ccccc1")

>>> query = Chem.MolFromSmarts("[H]Oc1ccccc1")

>>> mol.HasSubstructMatch(query)


>>> mol = Chem.AddHs(mol)

>>> mol.HasSubstructMatch(query)


So, unless I have got confused somewhere along the way, three possible solutions are:

  1. Have an output option for SMARTS in Marvin MolConverter to convert all explicitly drawn Hs to be hydrogen-count features of their bonded heavy atom (this would be great!)
  2. Have an option to maintain explicit Hs on incoming molecules when converting to RDKit (but this would need them to be there in the first place, and would not work with SMILES input) - or incorporate AddH / RemoveH functionality into the Molecule->RDKit node
  3. Add an AddHs RDKit node (and a corresponding RemoveHs node) for these situations (much like the Indigo approach)

(1) would be a great addition - but I guess is one for the Marvin thread...  Out of the other two options, I would favour option (3); and would also welcome RDKit versions of Aromatize / Kekulize (I guess this would be 'SanitizeMol' and 'Kekulize' in RDKit terminology?)


Kind regards


Just thought about this a little further, and realised that the inclusion of a 'removeHs' checkbox (default to 1) in the 'To RDKit' node may be the most 'RDKit-like' way of handling both issues(?)  This would mirror the argument that is used in the full toolkit when converting to RDKit molecules from other formats that may have explicit Hs.  Of course, the RDKit sanitisation takes care of the aromaticity! [on that note, I only just noticed the 'partial sanitization' options in the 'To RDKit' node - so I am starting to wonder if I am already able to do what I am asking if I just read the manual a bit more..?!]

Anyway, I guess what I am imagining is actually the RDKit Molecule Substructure Filter, with the upper input going through an AddH node to convert implicit Hs to explicit; and the lower input being an RDKit query molecule(s) with the RemoveHs option unchecked.

Going back to the phenol example - if the OH was explicitly drawn (in the query), then I think the above virtual scheme would eg only match phenol (c1ccccc1O) and not anisole (c1ccccc1OC).  At the moment I seem to always match both.

**apologies for adding even more to what was already a long first post!  : )  **

Kind regards


James: It took longer to read your post than it did to fix the problem. ;-)

This was an easy one, and it's fixed. It should be in tomorrow's nightly build.

For those who care about what was going on: There are multiple options in the RDKit for constructing a molecule from SMARTS, none of which are exposed to Knime. The behavior that was being used was "leave the H in the molecular graph", which meant that an H had to actually be in the input molecule table in order for a match to be found. There's not currently a way to make this happen effectively on the RDKit side, so this is wrong. The correct behavior here is to see that H in the SMARTS and to merge it into the neighboring O in order to make that O query: [O;!H0]. This is what I just checked in and I think it should solve the problem.



Hi Greg,

Thanks for this update. It's made a significant difference in the number of matches that we get for the SMARTS PAINS filters.

But we're now seeing a lot of false positives too.

Could I send you a list of the mismatches so you can diagnose what's going on?

(the other) Simon


I just got back from vacation this weekend and am still getting caught up. Please do send me the mismatches (ideally to the rdkit-discuss mailing list so that other can comment/find the results: https://lists.sourceforge.net/lists/listinfo/rdkit-discuss) and I will take a look at the problems.


Quick update for the forum. RDKit now matches 636 structures, up from 329 reported in our paper, but matches 2 additional structures not matched by the SLN filters.

See discussion on RDKit list: http://goo.gl/NVIgy

(the other) Simon