RDKit: R Group Decomposition

Thanks for adding the new "R Group Decomposition" node. It runs really fast and is potentially very powerful, but I hope there is room for some improvements. Firstly, would it be possible to supply a list of core scaffolds as SMARTS rather than a single one? Most chemical series have multiple core scaffolds, so being restricted to one core SMARTS is quite limiting.

Additionally, would it be possible to define the connection points on the core (in the SMARTS string)? At the moment, each substitution position appears to be assigned an arbitrary Rx number for its column name. It would be useful to be able to draw R1, R2, R3, etc. on the scaffold SMARTS and have each output column name correspond to that R number, so there is no need to rename the Rx columns by hand afterwards to match how the chemist wants them. This becomes more important if multiple scaffolds can be supplied (which I hope will be possible), because each scaffold needs to use the same Rx number at the equivalent SAR position so that the R groups line up across scaffolds. Also, besides the Rx columns in the output, could you please add a column called "Core" or "Scaffold" that shows the core structure on its own (without any R groups around it)? Again, this will be important if multiple cores can be used.


Thanks, Simon.

I've also noticed in the output that Ax is used to label the connection point of each group: the R1 column uses A, the R2 column uses A1, and the R3 column uses A2. This is a little confusing because the Ax numbers are out of sync with the Rx columns. Could this be unified, please?


Sorry, one more addition that would be useful: passing unrecognised molecules to a second output port. That would make it easy to see which molecules were not captured by the scaffold SMARTS.
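
To illustrate what I mean by a second output, here is a rough sketch using the RDKit Python API rather than the KNIME node itself. The function name split_by_core and the example molecules are made up for illustration; only MolFromSmarts, MolFromSmiles and HasSubstructMatch are real RDKit calls.

```python
from rdkit import Chem


def split_by_core(smiles_list, core_smarts):
    """Partition SMILES into (matched, unmatched) against one core SMARTS.

    Hypothetical helper: it mimics a node with two output ports.
    """
    core = Chem.MolFromSmarts(core_smarts)
    if core is None:
        raise ValueError(f"invalid core SMARTS: {core_smarts!r}")

    matched, unmatched = [], []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        # Structures that fail to parse are also sent to the second output.
        if mol is not None and mol.HasSubstructMatch(core):
            matched.append(smi)
        else:
            unmatched.append(smi)
    return matched, unmatched


# Benzene core: phenol matches; ethanol and 4-methylpyridine do not.
matched, unmatched = split_by_core(["Oc1ccccc1", "CCO", "Cc1ccncc1"], "c1ccccc1")
print("first output: ", matched)
print("second output:", unmatched)
```

The second list is what I would expect to see on the extra port.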


Hi Simon,

Thanks for the feedback.

As you’ve noticed, the R-group decomposition node is currently a rough “first draft” of what it should be. I’ll include your suggestions in the list of changes that ought to be made and fix the bug with connection points.

With regard to allowing the user to specify R group labels: how would you suggest handling the situation where a molecule has a substituent at an unmarked position? For example, the user has marked R1, R2, and R3, but a molecule has a side chain at another point on the core.


If there is a group at a position that wasn't specified, I would have that molecule ignored, as it wouldn't fit the "substructure" of the drawn core, in which the Rx groups can be anything. Essentially, a molecule must match the core as drawn, with further substitution allowed only where an Rx group is drawn. I'm really excited about this node; it could be really powerful for looking at matched pairs (i.e. where only one substituent changes).
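
As a rough sketch of that behaviour outside KNIME, the RDKit Python API has an R-group decomposition module where attachment points can be labelled with atom-map numbers and an onlyMatchAtRGroups option restricts substitution to those points. The core and molecules below are made-up examples, and I am not suggesting this is how the node works internally.

```python
from rdkit import Chem
from rdkit.Chem import rdRGroupDecomposition as rgd

# Core with two user-labelled attachment points (atom maps 1 and 2).
core = Chem.MolFromSmarts("[*:1]c1ccc([*:2])cc1")

smiles = [
    "Cc1ccc(O)cc1",     # substituted only at the two labelled positions
    "Cc1ccc(O)c(F)c1",  # extra F at an unlabelled ring position
]
mols = [Chem.MolFromSmiles(s) for s in smiles]

params = rgd.RGroupDecompositionParameters()
# Reject molecules that carry substituents where no R label was drawn.
params.onlyMatchAtRGroups = True

rows, unmatched = rgd.RGroupDecompose([core], mols, asSmiles=True, options=params)

for row in rows:
    print(row)  # dict with a "Core" entry plus one entry per R label
print("ignored molecule indices:", list(unmatched))
```

Here the fluorinated molecule ends up in the ignored list because the fluorine sits at a position with no Rx drawn, which is the behaviour I would want from the node.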

With regard to allowing multiple cores, I was thinking the best way to implement this would be a second input that takes a table of SMARTS strings, one core per row.
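
For what it's worth, here is a minimal sketch of the multiple-core idea in the RDKit Python API, which accepts a list of cores. The two cores and three molecules are invented for illustration, with R1 drawn at what I would treat as the equivalent position on each scaffold; the KNIME node may of course need to do this differently.

```python
from rdkit import Chem
from rdkit.Chem import rdRGroupDecomposition as rgd

# One core per "row" of the second input; R1 marks the shared SAR position.
core_smarts = ["[*:1]c1ccccc1", "[*:1]c1ccncc1"]
cores = [Chem.MolFromSmarts(s) for s in core_smarts]

smiles = ["Cc1ccccc1", "Cc1ccncc1", "CCO"]
mols = [Chem.MolFromSmiles(s) for s in smiles]

# Each molecule is decomposed against whichever core it matches.
rows, unmatched = rgd.RGroupDecompose(cores, mols, asSmiles=True)

for row in rows:
    # "Core" shows which scaffold was used; "R1" holds the substituent.
    print(row["Core"], row["R1"])
print("matched no core:", [smiles[i] for i in unmatched])
```

The "Core" entry in each row is the kind of extra column I was asking for, and the last line corresponds to the second output for unrecognised molecules.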