Degradation of RDKit Molecule Substructure Filter node?

Hi guys,

I'm experiencing some problems with the RDKit Molecule Substructure Filter node. Some SMARTS (one of which is included in the attached example workflow) give me the following error:

 

Execute failed: Could not parse SMARTS '[2H,3H,11C,11c,14C,14c,125I,32P,33P,35S]' in row Row0

 

The odd thing is that while these SMARTS fail with the latest RDKit version, 3.3.0.v201712041142, they worked in the earlier version 3.2.2.v201612161513.

Why does this happen?

Thanks in advance

Sorry for the end-of-year-induced slow reply.

It's almost always a better idea in SMARTS to use #1 instead of H. If you change your SMARTS to:

[2#1,3#1,11C,11c,14C,14c,125I,32P,33P,35S]

things should work fine.

Note that you can also treat the C atoms the same way to make the SMARTS shorter (and faster to match):

[2#1,3#1,11#6,14#6,125I,32P,33P,35S]
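If many patterns need the same H to #1 change, the rewrite can be scripted. The following is only a minimal sketch using the Python standard library: the helper name `hydrogen_to_atomic_number` is made up for illustration, and it assumes the H to be replaced always carries an isotope label and sits at the start of an atom primitive, as in the SMARTS above. It does not merge the C/c isotopes into #6.

```python
import re

# Isotope-labelled H such as "2H" or "3H" at the start of an atom primitive.
# The lookbehind requires "[" or a logical operator before the isotope, so
# "#6H" (atomic number 6 with one attached H) is left untouched. The
# lookahead skips element symbols that start with H, such as Hg or Hf.
_ISOTOPE_H = re.compile(r"(?<=[\[,;&!])(\d+)H(?![a-z])")


def hydrogen_to_atomic_number(smarts: str) -> str:
    """Rewrite isotope-labelled H primitives as <isotope>#1 (hypothetical helper)."""
    return _ISOTOPE_H.sub(r"\g<1>#1", smarts)


print(hydrogen_to_atomic_number("[2H,3H,11C,11c,14C,14c,125I,32P,33P,35S]"))
```

Any rewritten pattern should still be checked by hand, since a regular expression cannot know what the original author intended.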

Best,

-greg

 

Hi Greg,

Thank you for your answer. Could you please explain what exactly the problem is with the failing SMARTS? Isn't it a valid SMARTS?

I'm asking because I used this SMARTS as an example, but I have identified a total of 13, published in the literature and attached here as CSV, that suffer from the same problem. So I would need to modify all of them in the same way.

I would appreciate very much your feedback on this.

Ah, sorry I forgot to include that bit:

I believe (though I still need to confirm) that these are valid SMARTS and will look into finding/fixing the RDKit bug that is causing this.

In the meantime: the form suggested above will work with the current version of the RDKit and is also generally safer.

-greg

 

An update on this one: I was wrong above. Although these are valid SMARTS, and the RDKit should parse them, they don't mean what you think they do.

The reason behind this is technical and not particularly logical (which is why I didn't remember the right answer). It's also not particularly well documented.

The best explanation of this that I have found is, illogically enough, in the Daylight SMIRKS documentation: http://www.daylight.com/dayhtml/doc/theory/theory.smirks.html

The relevant bit is the last line of the table in the attached image. "Unchanged" there can be understood to mean: "corresponds to a query for the number of attached Hs".

I still think that you are much safer *always* using #1 if you want to indicate an H in a SMARTS query.

 

 

 

Hi Greg,

Thank you for the detailed information. As I mentioned above, beyond the use of “#1” to indicate an “H”, I identified a list of 13 published SMARTS (corresponding to significant structural alerts) that give the same problem (SMARTS file attached to my previous message). Some of these SMARTS contain no explicit hydrogen atoms, so the problem is probably also triggered by other causes.

It would be nice to understand what the problem is and/or replace these SMARTS with alternative working versions. Unfortunately I’m not an expert in SMARTS, so my help with this issue is pretty limited.

Ok, here's a quick run-down of the problems I could find with the failing SMARTS:

[CH2,$(CF2);R0][CH2,$(CF2);R0][CH2,$(CF2);R0][CH2,$(CF2);R0][CH2,$(CF2);R0][CH2,$(CF2);R0][CH2,$(CF2);R0][CH2,$(CF2);R0]

[F,Cl,Br,I,$(O(S(=O)(=O)))]-[CH,CH2;!$(CF2)]-[N,n]

[N,n,O,S;!$(S(=O)(=O))]-[CH,CH2;!$(CF2)][F,Cl,Br,I,$(O(S(=O)(=O)))]

These all contain "CF2" and that "2" is not legal. If you want -CF2 but not -CF3, then you need to do: "$(C(F)F);!$(C(F)(F)F)"

 

[[CH;!R];!$(C-N)]=C([$(S(=O)(=O)),$(C(F)(F)(F)),$(C#N),$(N(=O)(=O)),$([N+](=O)[O-]),$(C(=O))])([$(S(=O)(=O)),$(C(F)(F)(F)),$(C#N),$(N(=O)(=O)),$([N+](=O)[O-]),$(C(=O))])

This has a syntax error. I guess you mean:

[CH;!R;!$(C-N)]=C([$(S(=O)(=O)),$(C(F)(F)(F)),$(C#N),$(N(=O)(=O)),$([N+](=O)[O-]),$(C(=O))])([$(S(=O)(=O)),$(C(F)(F)(F)),$(C#N),$(N(=O)(=O)),$([N+](=O)[O-]),$(C(=O))])

 

[N,C,S,O]-&!@[N,C,S,O]&!@[N,C,S,O]&!@[N,C,S,O]&!@[N,C,S,O]&!@[N,C,S,O]&!@[N,C,S,O]

Not sure what you're trying to express with this one, but you need to have some kind of bond query before "&!@". If they should all be single bonds, this is:

[N,C,S,O]-&!@[N,C,S,O]-&!@[N,C,S,O]-&!@[N,C,S,O]-&!@[N,C,S,O]-&!@[N,C,S,O]-&!@[N,C,S,O]

 

ac-*=&!@*-&!@C(=O)&!@ca

Another missing bond query. I'm going to guess you meant:

[a]c-[*]=&!@[*]-&!@C(=O)-&!@c[a]

(I put the "a" and "*" in square brackets too, because I think it's clearer. It's not necessary)

 

c12cccc(C(=O)N(&!@C)C(=O)3)c2c3ccc1

Has a missing bond query.

 

([#6]OP(=O)(*)O[#6].[#6]OP(=O)(*)O[#6].[#6]OP(=O)(*)O[#6])

Too many parens. I guess you meant:

[#6]OP(=O)(*)O[#6].[#6]OP(=O)(*)O[#6].[#6]OP(=O)(*)O[#6]

 

*-C(=O)-&!@[NH]-C&!@C(=O)-&!@[NH]-*

missing bond query.

 

[#6,#7]&!@[#6](=&!@[CH])&!@C(=O)-&!@[C,N,O,S]

missing bond query
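The missing bond queries noted in this run-down share one pattern: "&!@" appears without a bond symbol in front of it. A rough way to flag such spots in a longer list is sketched here with the standard library only. The function name `dangling_bond_operators` is hypothetical, and the check is a plain text heuristic, not a SMARTS parser: it assumes "&!@" is only ever used as part of a bond expression.

```python
import re
from typing import List

# "&!@" (and not a ring bond) should follow a bond primitive such as "-" or "=".
# Flag every "&!@" whose preceding character is not one of the bond symbols
# - = # : ~ @ / \ (this includes a "&!@" at the very start of the string).
_DANGLING = re.compile(r"(?<![-=#:~@/\\])&!@")


def dangling_bond_operators(smarts: str) -> List[int]:
    """Return the string offsets of each "&!@" that lacks a leading bond symbol."""
    return [m.start() for m in _DANGLING.finditer(smarts)]


print(dangling_bond_operators("ac-*=&!@*-&!@C(=O)&!@ca"))
```

The offsets only show where a bond symbol is missing. Which bond was intended (single, double, any) still has to be decided by whoever maintains the pattern.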

 

Si~O

"Si" needs to be in square brackets

 

For what it's worth, you should be getting error messages (not always easy to interpret, but there) in the console in KNIME for these problems.
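Outside KNIME, a list of patterns can also be screened with the RDKit Python API, where `Chem.MolFromSmarts` returns `None` when a pattern cannot be parsed. The sketch below assumes a Python environment with RDKit installed. The function name `unparseable_smarts` and its `parse` parameter are illustrative only: `parse` exists so that a different parser can be swapped in.

```python
from typing import Callable, Iterable, List, Optional


def unparseable_smarts(
    patterns: Iterable[str],
    parse: Optional[Callable[[str], object]] = None,
) -> List[str]:
    """Return the patterns for which the parser yields None (hypothetical helper)."""
    if parse is None:
        # Imported lazily so the function can also be used with another parser.
        from rdkit import Chem

        parse = Chem.MolFromSmarts
    return [p for p in patterns if parse(p) is None]
```

With RDKit available, calling `unparseable_smarts(list_of_patterns)` returns the subset that needs attention. RDKit also writes its own parse error messages to the error stream for each failing pattern.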

 

-greg

 

Dear Greg,

Thank you very much for the SMARTS analysis you did. I appreciate it very much.

As I mentioned above, these SMARTS represent interesting structural alerts, and people may want to filter out molecules matching them. They were collected from different sources and are published in the ChEMBL database.

I will try to write to them and warn them that some of their structural alert SMARTS may be incorrect.

Thanks again for your effort.

Gio

Hi,

I just wanted to follow up on this. I have a workflow using the published BMS SMARTS filters and I'm seeing a similar issue. In this case it seems to be filtering out aromatic sulphur compounds (thiophene, thiazole, thiadiazole). For example, the SMILES CN(Cc1ccsc1)C(=O)C is flagged as metal_containing. The SMARTS it matches is pasted below and was taken directly from the paper (which could of course be wrong!). I've tried this set in another application (a Vortex script) and there this compound is not flagged as metal_containing.

[$([Ru]),$([Rh]),$([Se]),$([se]),$([Pd]),$([Sc]),$([Bi]),$([Sb]),$([Ag]),$([Ti]),$([Al]),$([Cd]),$([V]),$([In]),$([Cr]),$([Sn]),$([Mn]),$([La]),$([Fe]),$([Er]),$([Tm]),$([Yb]),$([Lu]),$([Hf]),$([Ta]),$([W]),$([Re]),$([Co]),$([Os]),$([Ni]),$([Ir]),$([Cu]),$([Zn]),$([Ga]),$([Ge]),$([As]),$([as]),$([Y]),$([Zr]),$([Nb]),$([Ce]),$([Pr]),$([Nd]),$([Sm]),$([Eu]),$([Gd]),$([Tb]),$([Dy]),$([Ho]),$([Pt]),$([Au]),$([Hg]),$([Tl]),$([Pb]),$([Ac]),$([Th]),$([Pa]),$([Mo]),$([U]),$([Tc]),$([Te]),$([Po]),$([At])]

Any help would be appreciated.

Hi Angus,

Thank you for reporting this. You are right: in principle this structural alert published by BMS should match only metal-containing molecules. Nevertheless, with the RDKit Molecule Substructure Filter node it also seems to match molecules containing aromatic sulphur.

As you can see in the attached example workflow (i.e. RDKit_metal_containing_matching_smarts_problem.knwf), the same SMARTS is handled correctly by the CDK SMARTS Query node. Maybe greglandrum can say whether this is a bug in the RDKit node.

Cheers

Hi,

I just came across this post, but I cannot reproduce the problem. I used the BMS ‘contains_metal’ SMARTS straight from the paper, and the thiophene example passes the filter. This is in KNIME 4.1.0 with RDKit KNIME integration 4.0.0.v201912021105.

Regards/Evert