RDKit Substructure Counter node

In the RDKit Substructure Counter node there is a new option "Use Names in the Following Column as Queries". But this is not mentioned in the node dialog.

It appears to allow substructure counting of IUPAC names, which is nice, but the node will not work if this is the only column you connect to the second input port. The node requires an RDKit column in the second input port even though it doesn't appear to be needed when using this new option.

Simon.

Also, with the "Unique Matches Only" option, I would assume that a substructure match can use each atom of the input molecule only once. But I am getting a substructure counted twice, with an atom used more than once.

For instance, if the input molecule is 2-methylmorpholine and the query molecule is "methanol" (using the "Use Names" option), I get a count of 2, which is fine. But if I select "Unique Matches Only" I still get a count of 2. Should this not be 1? There is only one oxygen atom in the input molecule, so the oxygen is used in both matches and the second match isn't unique.

Simon.

Simon,

Uniqueness is determined by all atoms in the query. So if you use C-O as a query and check for matches against C-O-C you get 2. The first is atoms 1 and 2, the second atoms 3 and 2.
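A rough illustration of that rule in plain Python (not the node's actual code; the helper name `unique_matches` and the hand-written match tuples are assumptions for illustration). Two matches are treated as duplicates only when they cover exactly the same set of atoms:

```python
def unique_matches(matches):
    """Keep one match per distinct set of atom indices (hypothetical helper)."""
    seen = set()
    unique = []
    for match in matches:
        key = frozenset(match)  # atom order within a match is ignored
        if key not in seen:
            seen.add(key)
            unique.append(match)
    return unique


# Query C-O against C-O-C, atoms numbered 1, 2, 3 as in the post.
# Each tuple lists the molecule atoms hit by (query C, query O).
c_o_matches = [(1, 2), (3, 2)]
c_o_unique = unique_matches(c_o_matches)

# A symmetric query such as C-C against a two-carbon molecule can map
# onto the same two atoms in both directions.
c_c_matches = [(1, 2), (2, 1)]
c_c_unique = unique_matches(c_c_matches)

print(len(c_o_unique), len(c_c_unique))
```

Both C-O matches survive because their atom sets {1, 2} and {3, 2} differ, even though they share the oxygen. Only the mirrored C-C match is dropped.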

You seem to be looking for non-overlapping matches. This is not currently supported, but it could be added if there were a strong case for it.
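If someone did need non-overlapping counts, one possible post-processing step is sketched below in plain Python. This is only an assumption about how it could be done, not something the node offers; the helper name `non_overlapping` is made up, and the greedy pass depends on match order, so it is not guaranteed to find the largest possible set.

```python
def non_overlapping(matches):
    """Greedily keep matches that share no atom with an earlier kept match."""
    used = set()   # atom indices already claimed by a kept match
    kept = []
    for match in matches:
        if used.isdisjoint(match):
            kept.append(match)
            used.update(match)
    return kept


# The two C-O matches in C-O-C from the example above share atom 2 (the oxygen).
matches = [(1, 2), (3, 2)]
kept = non_overlapping(matches)

print(len(kept))
```

With this filter the C-O query would count once in C-O-C, which is the behaviour Simon was expecting from "Unique Matches Only".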

Best,

-greg

Thanks for the clarification on "Uniqueness"; I wasn't entirely sure what was meant by this. At the moment I have no use case for non-overlapping matches; I was just mistaken about what "Uniqueness" meant.

And referring to the first post, do you see the same confusion in the node configuration when selecting IUPAC names for the substructure search?

Simon.

Hi Simon,

Sorry, I forgot to explain the new option in the documentation. The new column that you can specify is only used when the flag is switched on. It is not meant to act as a new type of query, as you thought; its purpose is to supply the header titles for the count columns. So the idea is to have, in the table that defines the queries as RDKit Molecules (which can be based on SMILES, SDF or SMARTS), an additional column that names the queries. Especially when you deal with SMARTS, it is nice to use a chemical term as the name of the count column for a SMARTS query.

I will add the missing documentation in the near future.

-Manuel

Following an earlier suggestion by Greg in this post: http://tech.knime.org/forum/rdkit/rdkit-descriptor-node-request#comment-27921, I tried annotating compounds using a1aaaaa1 and A1AAAAA1, but the two SMARTS patterns seem to be treated as the same, both aromatic.

Thanks,

Natasja

 
 

Hi Natasja,

I can't reproduce that. How did you see this behavior?

The attached workflow uses a substructure counter and the two different patterns. When I run it with the nightly build I see the correct results.

-greg

 

Hmm, it does seem to be working. I'll have to double-check in my own workflow as well. I was typing the column in the self-created table as SMILES, but that doesn't appear to make a difference in your workflow, so maybe I was just imagining things.

Why is the query for the aliphatic one drawn with double bonds though? The aromatic one appears to be missing a dotted bond as well.

Thanks,

Natasja

 

Hi Simon,

Based on your first feedback I made some changes to the RDKit Substructure Counter node, which are available in the latest nightly build. I hope it now explains better what you can do with it and what it is doing. Thanks for helping us to improve the RDKit Nodes!

Kind regards,

Manuel

Thanks for the changes. It is now much clearer what the column name is used for, and it is very useful!

Simon.