SMARTS Query / Substructure Search - Outputs differ?

Hi,

Entering "c1cnc[nH]1" into both the CDK "Substructure Search" (SS) and "SMARTS Query" (SQ) nodes gives very different outputs: SS gives 95 hits, SQ gives 32 hits (out of 2500 molecules).

If I simply draw imidazole in the SS node, I get the same 95 hits.

In the SS node, Tools -> Create Smiles gives "C=1NC=CN1". If I put this string into the SQ node I get zero hits?!

Is there any way to determine which SMILES/SMARTS string the SS node is actually using? I assume the SS and SQ nodes use the same CDK search function internally.

Cheers,

Steve.

Hi Steve,

My apologies for the delayed reply. Your input SMARTS c1cnc[nH]1 is not equivalent to the imidazole substructure SMILES. If I understand SMARTS correctly:

The SS node matches the defined substructure, i.e. the imidazole ring (hydrogens and double bonds can move). In contrast, the SQ node matches only an imidazole ring in which one nitrogen carries a hydrogen ([nH]). Instead, you could try c1cncn1, which is less specific.

I have found some cases where the SS node gives fewer results than the SQ node using c1cncn1. Specifying the connectivity resolves this clash: c1cnc[nX3]1 (imidazole where one nitrogen must have three connections). I have to double-check whether that's a bug in the SS node.
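
To make the difference between the three nitrogen patterns above (n, [nH], [nX3]) concrete, here is a toy Python sketch. It is not how CDK or either KNIME node evaluates SMARTS; it only hand-codes three ring-nitrogen environments and checks the atom primitives against them. The names RingNitrogen, PRIMITIVES, ENVIRONMENTS and match_table are made up for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RingNitrogen:
    """Hypothetical, hand-coded description of one aromatic ring nitrogen."""
    label: str
    heavy_neighbors: int  # bonded non-hydrogen atoms
    h_count: int          # attached hydrogens

    @property
    def connections(self) -> int:
        # SMARTS X<n> counts all connections, attached hydrogens included.
        return self.heavy_neighbors + self.h_count


# Toy versions of the three atom primitives discussed in this thread.
PRIMITIVES = {
    "n": lambda atom: True,                       # any aromatic nitrogen
    "[nH]": lambda atom: atom.h_count == 1,       # exactly one attached hydrogen
    "[nX3]": lambda atom: atom.connections == 3,  # exactly three connections
}

# Assumed environments, chosen for illustration only.
ENVIRONMENTS = [
    RingNitrogen("pyridine-like N (X2)", heavy_neighbors=2, h_count=0),
    RingNitrogen("unsubstituted N-H", heavy_neighbors=2, h_count=1),
    RingNitrogen("N-substituted", heavy_neighbors=3, h_count=0),
]


def match_table():
    """Return, for each primitive, the labels of the environments it accepts."""
    return {
        pattern: [atom.label for atom in ENVIRONMENTS if test(atom)]
        for pattern, test in PRIMITIVES.items()
    }


if __name__ == "__main__":
    for pattern, labels in match_table().items():
        print(pattern, "->", labels)
```

In this toy model, [nH] accepts only the unsubstituted N-H, [nX3] additionally accepts an N-substituted ring nitrogen, and plain n accepts all three, including a nitrogen with only two connections. That ordering is consistent with getting the fewest hits for c1cnc[nH]1 and the most for c1cncn1.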

Cheers,

Stephan

If I put c1cncn1 into the SS node, I get "Errors loading flow variables into node : Unable to read fragment".

c1cncn1 in the SQ node gives 95 hits - same as a drawn imidazole structure in the SS node.

c1cnc[nX3]1 gave 93 hits in SQ, and no hits in SS. The SQ node misses 'Dimenhydrinate' and 'Oxytriphyline' both of which include theophylline derivatives.

I believe that c1cnc[nH]1 is imidazole with at least one of the ring nitrogens unsubstituted.

Confusion over the meaning of the SMARTS/SMILES strings notwithstanding, the differences between the SS and SQ nodes seem a little odd.

BTW, I've set KNIME to update from the nightly repositories, so I should have the latest version.

Cheers,

Steve.

 

Hi Steve,

SMILES/SMARTS are definitely a bit tricky. For the meaning of either, it's good to consult Daylight's theory pages on SMILES and SMARTS.

The Substructure Search node only accepts valid SMILES strings. c1cncn1 is not a valid SMILES string. Daylight's depict server identifies it as SMARTS/SMIRKS and won't draw a figure.

It's not a valid SMILES string because of issues related to hydrogen placement. For more information, please see this blog post by my colleague

Unfortunately, the JChemPaint application we use in KNIME-CDK does not support SMARTS entries. Thus, neither c1cncn1 nor c1cnc[nX3]1 will work.

One would expect the SMARTS Query node to match Dimenhydrinate and Oxytriphyline for c1cnc[nX3]1. SMILES is tautomer-specific, though. If I take the PubChem Compound entry for Cholinophyllin, I don't get a match because both nitrogens have only two sigma bonds (X2 rather than X3).

"I believe that c1cnc[nH]1 is imidazole with at least one of the ring nitrogens unsubstituted." For my understanding, that means that one nitrogen must have a hydrogen?!

Thanks,

Stephan

OK, I see why c1cncn1 might be ambiguous as far as SMILES is concerned.

Cheers,

Steve.