R-Group Decomposition

richards99 · May 6, 2011, 8:02am

Hi,

I am having quite a few issues with the R-Group Decomposition node, which is the node I am most excited about for doing Free-Wilson SAR analysis. If I put a query molecule into the second port and even a single molecule does not fit the query, the node fails with the error "R-Group Decomposer Execute failed: R-Group deconvolution: no embeddings obtained". Would it be possible to have a third out port that shows all the molecules which did not fit the query molecule, so that the node does not fail but continues to process the remaining molecules?

Also, how do you specify more than one scaffold? For example, I have a 5,6 aromatic core as the scaffold (i.e. an indole), but also have some cores with an extra nitrogen present in the core. I labelled these as an A atom, but the node still failed with the same error message as above. I also tried putting more than one query molecule into the second port (i.e. indole, indazole etc.), but I'm unsure whether these extra query molecules are taken into account. Ideally I would like to be able to supply multiple query molecules to the second port.

Is this possible?

Thanks,

Simon.

dpavlov · May 6, 2011, 3:32pm

Simon,

Sure, we can add the "unmatched" output to the RGroup Decomposer node. It will be there tomorrow.

As for the error with the "A" atom: we checked that on our side, and found out that there is a bug! Thank you very much. We just fixed the bug and the RGroup Decomposer node will do better with "A" atoms tomorrow.

Regarding your suggestion about allowing multiple cores into the node, I would say that it makes things very complicated (selecting which scaffold to match, which R-group numbers to assign, etc.). I hope that the "unmatched" output and proper handling of "A" atoms will solve the problem for you. In any case, you can use multiple RGroup Decomposer nodes if you have multiple scaffolds (or use a single node and loop over it, just as in the indigo-tpsa-example).
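
The looping idea can be sketched in a few lines of Python. This is only an illustration of the control flow, not Indigo or KNIME code: the function name assign_to_scaffolds and the toy_matches predicate are made up here, and the "molecules" are plain labels rather than real structures. In practice the predicate would be a real substructure match.

```python
def assign_to_scaffolds(molecules, scaffolds, matches):
    """Give each molecule to the first scaffold it matches.

    molecules: iterable of molecule objects (here: plain strings).
    scaffolds: dict of scaffold name -> query object, tried in order.
    matches:   callable (molecule, query) -> bool; ASSUMED to be supplied
               by the caller, e.g. a real substructure matcher.
    Returns (assigned, unmatched) so nothing is lost and nothing fails.
    """
    assigned = {name: [] for name in scaffolds}
    unmatched = []
    for mol in molecules:
        for name, query in scaffolds.items():
            if matches(mol, query):
                assigned[name].append(mol)
                break  # first matching scaffold wins
        else:
            # no scaffold matched: keep the molecule instead of raising
            unmatched.append(mol)
    return assigned, unmatched


def toy_matches(mol, query):
    # Placeholder only: a string prefix test, NOT a chemical match.
    return mol.startswith(query)


scaffolds = {"indole": "indole", "indazole": "indazole"}
molecules = ["indole-a", "indazole-b", "benzene-c", "indole-d"]
assigned, unmatched = assign_to_scaffolds(molecules, scaffolds, toy_matches)
```

Each list in "assigned" could then be sent through its own decomposition, while "unmatched" plays the role of the extra output requested above.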

Best regards,

Dmitry

dpavlov · May 6, 2011, 3:38pm

Simon,

I would like to revoke my promise about the "unmatched" port being added to the RGroup Decomposer node :) I just realised that you can easily get the same result by passing your table through the Substructure Matcher node, which already has that second port. There is no need to reimplement that functionality in the RGroup Decomposer node then. What are your thoughts?

Best regards,

Dmitry

richards99 · May 6, 2011, 4:08pm

I would still like it in the R Group Decomposer node, so long as it is not a lot of effort. It's just that if you are using the RGroup Decomposer node, you are unlikely to be using the Substructure Matcher node at the same time, and it helps the user see how encompassing their query scaffold is, so they can modify the scaffold if they notice a lot going to the non-match port. They will be able to view the non-matches and see where they have gone wrong in not getting their query scaffold to match. (If all that makes sense!!)

Simon.

dpavlov · May 7, 2011, 9:15am

Simon,

I started to implement this second port, but I could not force myself to proceed. This just does not seem the right thing to do. I perfectly understand your point, but the problem is that Indigo's R-Group Decomposer internals do not list the "unmatched" structures by themselves. I would need to perform the substructure filtering separately, which would (1) be redundant when the scaffold comes from the Scaffold Finder node, and (2) require duplicating the Substructure Matcher options in the R-Group Decomposer node as soon as any specific options appear.

So I suggest that you use the Substructure Matcher filtering for your task. Good news: it is possible to create a so-called "Meta node" to have the Substructure Matcher and R-Group Decomposer nodes merged together. This is a KNIME feature, not specific to Indigo.

As for the bug with the "A" atom in the scaffold, we have fixed it, and the fix is available in today's build 923. You can take a look at the example workflow that I have attached to this message to see how it works.

Best regards,

Dmitry

indigo-rgroup-decomposition-example-2.zip

richards99 · May 7, 2011, 4:05pm

Thanks very much for the information. I accept your point on not adding the extra out port for non-matched structures; I wouldn't want the node to become less efficient by adding the extra programming. The provided workflow is most helpful.

The R-Group Decomposition node is going to be very useful for analysis of individual functional groups, by using the GroupBy node to collate all the same functionalities together and looking at mean, max, min activities etc., but first they will need to be converted to canonicalised SMILES!

However, when I used the Indigo to Query Molecule node on the outport of the RGroup Decomposition, there was no option for canonicalised SMILES; only SMILES is available. Is it possible to add this? Otherwise the same functional groups may have different SMILES strings and the GroupBy node will not work as desired. I notice the Indigo to Molecule node has a canonicalised option, but it cannot deal with the RGroup Decomposition node output.

Also I noticed that the connection point of the functional groups is lost when using Indigo to Query Molecule; it is replaced with a hydrogen atom. This is a real shame, as you lose, for example, the position a pyridine was connected to. Would it be possible for the Indigo to Query Molecule node to have an option for handling non-standard atoms (i.e. the connection point), where you can type in a default atom to replace any non-standard atom, such as H (as is currently the default), Br, I, or A? This would retain the connection point, and being able to select Br, for example, is also useful as this could represent a starting material reagent to prepare this substitution pattern. Selecting "A" is useful to represent a connection point.

And finally, in the RGroup Decomposition output, besides the "R-Group x" columns generated for the functional groups it has removed, could there also be an extra column called "scaffold" which just has the central core it detected, as it is, with no substituents on it and without the Rx groups or A atoms which are shown in the second outport of the RGroup Decomposition node? (So in the example you sent me it would just be an indole or indazole.) That way the cores can also be grouped together with the GroupBy node.

If these changes could be made, it will be a very powerful tool for complex analysis of functional groups and looking at One Point Changes in SAR.

Thanks

Simon.

dpavlov · May 9, 2011, 6:28pm

Simon,

Thank you for such a detailed comment.

Yes, we have an inconsistency: R-groups are saved as "query" molecules, which rules out canonicalization. We are going to fix that and save them as non-query molecules, so you will be able to save them as canonical SMILES.

You said that the attachment points are lost: this is true for the "SMILES" format, but if you change the format to "Mol" (Molfile) in the node options, then the attachment points will be properly saved. I agree that we should include the attachment points in the SMILES format too, though. I hope that we will be able to do that in the next Indigo update.

I have also understood your point about putting the "matched" scaffold part of the molecule near its R-groups; we will try to implement that too.

Best regards,

Dmitry

richards99 · May 9, 2011, 6:38pm

Many thanks for your quick response.

You mention that if it's converted to a "Mol" file the attachment points are kept. Unfortunately, as a "Mol" file, I then wouldn't be able to use the "GroupBy" node to do this complex SAR analysis per functional group. This is really why I would like to be able to convert these R Group Decompositions into canonicalised SMILES to do this job, so I can guarantee that the same functional group has exactly the same cell contents.

I hope you don't mind all these suggestions and comments. I am very excited about the Indigo nodes, and am trying to give suggestions to make them even more functional than they already are, to really empower the chemists out there!!
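
To illustrate why identical cell contents matter, here is a small Python sketch of what a GroupBy on an R-group column does. The data are invented for the example, and summarize_by_group is just an illustrative name, not a KNIME or Indigo function. Note that "Oc1ccccc1" and "c1ccccc1O" are two valid spellings of the same group (phenol), yet a plain string grouping treats them as different keys.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_group(rows):
    """Group (r_group_string, activity) pairs by the exact string value.

    Returns {string: {"count", "mean", "min", "max"}}. The grouping key
    is the raw text, so two spellings of one structure form two groups.
    """
    buckets = defaultdict(list)
    for r_group, activity in rows:
        buckets[r_group].append(activity)
    return {
        key: {
            "count": len(values),
            "mean": mean(values),
            "min": min(values),
            "max": max(values),
        }
        for key, values in buckets.items()
    }


# Hypothetical activities; the third row is the same group written differently.
rows = [
    ("Oc1ccccc1", 6.0),
    ("Oc1ccccc1", 7.0),
    ("c1ccccc1O", 8.0),
    ("F", 5.5),
]
summary = summarize_by_group(rows)
```

Here the phenol rows end up split across two groups, which is exactly the problem that canonical SMILES avoids: once every R-group is written in one canonical form, all three phenol rows share a single key.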

Thanks

Simon.

dpavlov · May 19, 2011, 9:45am

Simon,

The updated R-Group Decomposer node is available in today's nightly build. It saves the R-groups as non-query molecules, which makes it possible to save them as canonical SMILES. More good news: attachment points are now kept in the (canonical) SMILES. Also, we have added the "Scaffold" column that you requested.

If you would like to discard R-sites or attachment points from structures, you can use the Feature Remover node for that.

Best regards,

Dmitry

richards99 · May 21, 2011, 10:42am

Thanks Dmitry,

This node is now really exciting; I can do so much SAR analysis with it. Combining this node with the Indigo to Molecule node (as canonicalised SMILES) and KNIME's GroupBy node gives very powerful trends and analysis of functional groups around a core.

I really appreciate the changes you have made; they have drastically increased the functionality and applicability of this node.

One question I do have: when I used the Indigo to Molecule node (as canonicalised SMILES) on the R Group columns, the resultant SMILES string has a strange set of characters at the end, such as " $;;;;;;_AP1$ ". Is this normal?

Thanks,

Simon.

richards99 · May 21, 2011, 10:47am

Hi Dmitry,

Another question: in the Decomposer settings there is the "Maximum R Groups" option. Is it possible for this to be determined automatically? At the moment you either risk not having enough columns and losing data, or you specify too many and have empty columns. I am unsure how this Maximum R Groups setting is useful; I am sure everyone would prefer this to be automatic.

Thanks,

Simon.

dpavlov · May 21, 2011, 11:27am

Simon,

This "_AP1" thing is ChemAxon's notation for attachment points, which we have adopted in Indigo too. So it is perfectly normal :)
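
For anyone who wants to read that suffix programmatically, here is a minimal Python sketch. It assumes the block between the "$" signs holds one semicolon-separated entry per atom, in atom order, with "_AP<n>" marking attachment point n, as in ChemAxon's extended SMILES. The function name attachment_points is made up for this illustration and is not part of Indigo.

```python
import re


def attachment_points(labels_block):
    """Return {zero_based_atom_index: attachment_point_number}.

    labels_block: the '$...$' part of an extended SMILES string, with or
    without the surrounding '|' characters and whitespace.
    ASSUMPTION: one ';'-separated entry per atom, in atom order.
    """
    inner = labels_block.strip().strip("|").strip("$")
    found = {}
    for index, label in enumerate(inner.split(";")):
        match = re.fullmatch(r"_AP(\d+)", label)
        if match:
            found[index] = int(match.group(1))
    return found


# The suffix quoted in the question above: six empty entries, then _AP1.
points = attachment_points(" $;;;;;;_AP1$ ")
```

For the string quoted in the question, the only labelled entry is the seventh one, so the attachment point sits on the seventh atom of the SMILES (index 6 when counting from zero).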

If you would like to get rid of the attachment points notation, you can remove the attachment points using the "Feature Remover" node.

Best regards,

Dmitry

dpavlov · May 21, 2011, 11:32am

Simon,

I agree that this had better be automatic, and the core Indigo library is entirely capable of doing this. But unfortunately, KNIME requires its nodes to declare the exact number of output columns at the "configuration" stage, before the actual processing of the input data is done. As the number of R-group columns clearly depends on the input data, we cannot do it, sorry. Take it as a platform limitation...

Best regards,

Dmitry

richards99 · May 22, 2011, 10:13pm

Unfortunately I'm not a programmer, so I am unable to answer properly, but are you sure this is the case? There is an RDKit Decomposition node with variable column outputs that the user does not have to specify, the MOE Free Wilson node also has variable column outputs, and so does the KNIME Transpose node. Are you certain it cannot be done?

Simon.

dpavlov · May 22, 2011, 10:22pm

Simon,

Thank you for the information, it makes sense. I will take a look soon.

Best regards,

Dmitry

richards99 · May 28, 2011, 7:25pm

Thanks for the improvement to the R Group Decomposition node to automatically detect the number of required output columns; this is much more user friendly.

Thanks for making the same improvement to the Component Separator node too.

The node descriptions for both "R Group Decomposition" and "Component Separator" now need the "Maximum Columns" section taking out, now that it's automatic.

Thanks

Simon.

dpavlov · May 28, 2011, 8:42pm

Simon,

Thanks very much, you were absolutely right. It is possible to pass a "null" DataTableSpec to the KNIME engine at the configuration stage and create an arbitrary number of columns later. The updated version (1.0.0.0000948) is available. It now creates the needed number of columns automatically.
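
The underlying idea (decide the column count only after seeing the data) can be sketched in plain Python. This is not KNIME API code, and build_table is a made-up name for the illustration; it simply sizes the table to the widest row and pads the shorter ones, using the "R-Group x" column naming mentioned earlier in the thread.

```python
def build_table(decompositions):
    """Build a header and padded rows from per-molecule R-group lists.

    decompositions: list of lists, one inner list of R-group strings
    per molecule. The column count is taken from the data itself.
    """
    width = max((len(groups) for groups in decompositions), default=0)
    header = ["R-Group %d" % i for i in range(1, width + 1)]
    # Pad short rows with None so every row has exactly `width` cells.
    rows = [list(groups) + [None] * (width - len(groups))
            for groups in decompositions]
    return header, rows


# Hypothetical decomposition results with 1, 3 and 0 R-groups.
header, rows = build_table([["C"], ["F", "Cl", "O"], []])
```

No column is lost because of a limit that was set too low, and no empty trailing columns are created beyond what the widest molecule needs.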

Best regards,

Dmitry

BJFR · June 19, 2011, 11:49pm

Dmitry,

Now that the R group node works with technically no limit as far as the number of R groups is concerned, would it be possible to have one that also allows enforcing the number and location of the R variables? I mean that I’d like to impose where the Ri groups are on the scaffold I use to match my compound table. Unlike scaffold hopping or diversity analysis, focused SAR analysis is more often based on either common scaffolds or spatially related scaffolds that may only have the Ri groups in common. Such an ability would allow the alignment of R groups whatever scaffolds are studied, and then the computation of all kinds of parameters.
The current R group decomposition is great, but it requires examining and renaming each R group prior to concatenating the data of scaffold 1 and the data of scaffold 2, which can prove difficult because R group ids can change.

Thank you,
Bruno

richards99 · June 20, 2011, 6:55am

I agree that is something that would be powerful, as all the different R group decomposition type nodes suffer from the same issue. It means there always needs to be a manual intervention in the workflow using the Rename node. It would be great for this to be automated.

Note, however, that if all the scaffolds have the same connection of atoms (i.e. all 6,5-ring systems), then you can use "A" atoms in the scaffold query molecule that you input into the R Group Decomp node; that guarantees the naming will be uniform across the scaffolds that are covered under that query.

Simon.

BJFR · June 20, 2011, 10:41pm

Simon,

You're right, R-decomposing two scaffolds that are not structurally related requires manual intervention, either at the beginning (I decide where the Ri to Rj groups are in all my queries) or at the end (rename Ri to Rj wherever they appear -> Ra to Rb), thus unifying the outputs from the R-group angle, where they all differed between scaffolds.... Experience shows that it's more convenient and efficient to do it upfront, hence my request ;-))
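
The "rename at the end" route can be sketched in Python as follows. The column names, the data and the mapping are invented for the example, and align_r_groups is an illustrative name rather than an existing node or API. The mapping itself still has to be worked out by hand for each scaffold, which is the manual step discussed above.

```python
def align_r_groups(rows, mapping):
    """Rename R-group columns in each row according to `mapping`.

    rows:    list of dicts (column name -> cell value).
    mapping: dict old column name -> new column name; columns that are
             not mentioned keep their name.
    """
    return [{mapping.get(name, name): value for name, value in row.items()}
            for row in rows]


# Hypothetical outputs of two separate decompositions.
scaffold_1 = [{"Scaffold": "indole", "R-Group 1": "F", "R-Group 2": "C"}]
scaffold_2 = [{"Scaffold": "indazole", "R-Group 1": "C", "R-Group 2": "F"}]

# Manually determined: on scaffold 2 the two positions are numbered
# the other way round compared with scaffold 1.
swap = {"R-Group 1": "R-Group 2", "R-Group 2": "R-Group 1"}

combined = scaffold_1 + align_r_groups(scaffold_2, swap)
```

After the rename, equivalent positions share a column name in the concatenated table, so they can be grouped and compared across both scaffolds.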

I'll try the pseudo atom trick and let you know.

regards,

bruno