RDKit Diversity Picker and Score Biasing


The feature added to the RDKit Diversity Picker a while back, which lets it select against an existing cpd set, is really helpful.

Would it be possible to extend its versatility further by allowing the diversity selection to be biased towards compounds that maximise or minimise a column property? This could be activity, "Chemical Attractiveness", or "Lipinski Violations".

So rather than picking the most diverse compound, it would pick one of the most diverse compounds that also has a favourable property.

It would be useful to have the option of whether the property value should be minimised or maximised, and what level of bias to place on the property over diversity. Is that possible?


Finally, when the node is running, it rapidly gets to 50% progress and then says "Picking Compounds". Would some kind of numerical progress reporting be possible here, to help gauge how long the operation has left to run?




This again sounds like the Maximum Score Diversity Selection problem that I pointed you to recently :-) Just think of "Chemical Attractiveness" as the score and you are already there. This can be solved with the Multiobjective Row Selection node. The first objective is some diversity measure on a distance column; the second objective would be e.g. the sum (or negative sum) of the "score" column. You can also use the Score Erosion node, using any suitable column for the score.

Hi Thor,

I have been using the Score Erosion node for this, which works great, but I haven't been getting the same level of diversity (using MinSum Distance or p-Centre as the measure) as the RDKit Diversity Picker provides, even when comparing like-with-like in the sense of not using Score Erosion to bias towards any score. This is why I felt it would be useful to incorporate this score biasing into this node as well, to give some extra options for diversity identification.

I haven't tried out the Multiobjective Row Selection node yet, but from what I understood from the paper, it generally fared less well than the Score Erosion algorithm. Is that fair?


Hi Simon,

As I'm sure you know, multi-objective optimization is *hard*. I think that writing a node to do what you want is probably more of a research project than a KNIME node. An interesting research project, but still not a short one. :-)

I would guess you could get some distance towards where you want to go by iteratively combining property filters with the diversity picker, but getting the details right is going to be tricky and very problem specific.




With no bias towards any score, do you mean all score values are identical? Or did you set the factor to 0? The former would be more suitable. Then it boils down to a very greedy selection: the first molecule is selected randomly, the second has the biggest distance to the first. The third has the biggest summed distance to the first two, and so on. It will end up in N*M comparisons, so it's definitely different from what the RDKit Picker does. It may very well be worse in terms of diversity, yes. Maybe Greg can give some more details on how the node does the picking? Maybe the score can be integrated similarly to how Score Erosion does it.
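The greedy selection described above can be sketched in a few lines of Python. This is only an illustration of the description in this post, not the node's actual code: the function name `greedy_max_sum_pick`, the `first` and `seed` parameters, and the toy 1-D "molecules" are all made up for the example.

```python
import random


def greedy_max_sum_pick(points, n_pick, dist, first=None, seed=0):
    """Greedy pick: each new item maximises the summed distance to all earlier picks."""
    if first is None:
        # The first item is chosen at random, as in the description above.
        first = random.Random(seed).randrange(len(points))
    picked = [first]
    # summed[i] is the summed distance from item i to everything picked so far.
    summed = [dist(p, points[first]) for p in points]
    while len(picked) < min(n_pick, len(points)):
        remaining = [i for i in range(len(points)) if i not in picked]
        best = max(remaining, key=lambda i: summed[i])
        picked.append(best)
        # One pass over all N items per pick, hence roughly N*M distance calls in total.
        for i in range(len(points)):
            summed[i] += dist(points[i], points[best])
    return picked


# Toy data: numbers on a line, with absolute difference as the distance.
toy = [0.0, 1.0, 5.0, 11.0]
print(greedy_max_sum_pick(toy, 3, lambda a, b: abs(a - b), first=0))
```

On this toy line, once both end points are picked every remaining point has the same summed distance to them, so the summed-distance criterion alone cannot tell a central point from one sitting next to an earlier pick.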

Yeah, finally some science again :-)

Hi Thor,

I used a constant score column of 1, and left the bias default of 0.90 erosion.

As I understand it (and that is little, as I have no expertise in all these NP-hard problems; it's all self-taught, so likely not a good grasp!), the Score Erosion is eroding the score by distance to the last cpd picked ONLY, but of course the score would have already been eroded in previous cpd selection iterations. So the remaining datapoints have all had their score eroded in an iterative fashion by each of the selected datapoints, but it's non-greedy and much quicker.
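To make that reading concrete, here is a rough Python sketch of the idea. It is a guess at the general shape only: the erosion formula (multiplying a score by `1 - erosion * closeness`), the meaning given to the `erosion` parameter, and the assumption that distances lie in [0, 1] are all invented for illustration and may well differ from what the Score Erosion node actually does.

```python
def score_erosion_pick(points, scores, n_pick, dist, erosion=0.9):
    """Pick the best current score, then erode the scores of items near that pick."""
    current = list(scores)  # working copy; eroded a little more after every pick
    picked = []
    while len(picked) < min(n_pick, len(points)):
        remaining = [i for i in range(len(points)) if i not in picked]
        best = max(remaining, key=lambda i: current[i])
        picked.append(best)
        for i in remaining:
            if i == best:
                continue
            # Only the distance to the LAST pick is computed here; the effect of
            # earlier picks is already baked into current[i].
            closeness = 1.0 - min(1.0, dist(points[i], points[best]))
            current[i] *= 1.0 - erosion * closeness  # assumed erosion formula
    return picked


# Constant score of 1 for every item, so only the erosion drives the selection.
toy = [0.0, 0.1, 0.6, 1.0]
print(score_erosion_pick(toy, [1.0] * 4, 3, lambda a, b: abs(a - b)))  # → [0, 3, 2]
```

Each iteration makes a single pass over the remaining items, which is where the linear cost in the number of picks comes from under this reading.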

The RDKit diversity node uses the MaxMin method, where the non-selected compounds have their distances calculated to ALL selected cpds on every cpd selection iteration, and the one with the largest minimum distance (i.e. MaxMin) is the selected one. This repeats until the desired number of cpds has been picked. Hence it takes longer, with squared complexity in the number of cpds to be picked (as opposed to the linear complexity in the number of cpds to be picked with Score Erosion).
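A naive MaxMin loop matching that description might look like the sketch below. It is written from the description in this post rather than from the RDKit source, so the name `maxmin_pick`, the `first` parameter, and the toy data are assumptions for illustration only.

```python
def maxmin_pick(points, n_pick, dist, first=0):
    """Naive MaxMin: pick the candidate whose nearest already-picked item is farthest away."""
    picked = [first]
    while len(picked) < min(n_pick, len(points)):
        remaining = [i for i in range(len(points)) if i not in picked]

        def nearest_picked(i):
            # Distance from candidate i to the closest already-picked item.
            # Recomputed against ALL picks on every iteration, which is what
            # makes this naive version quadratic in the number of picks.
            return min(dist(points[i], points[j]) for j in picked)

        picked.append(max(remaining, key=nearest_picked))
    return picked


# Toy data: numbers on a line, with absolute difference as the distance.
toy = [0.0, 1.0, 5.0, 11.0]
print(maxmin_pick(toy, 3, lambda a, b: abs(a - b)))  # → [0, 3, 2]
```

Keeping a running minimum distance per candidate, updated only against the newest pick, would avoid the repeated comparisons; whether the node does this is not something this sketch can say.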

If it was possible for the RDKit algorithm to incorporate the scoring bias used in the Score Erosion algorithm, that would be awesome.
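One very simple way this could look is a MaxMin loop whose selection criterion is a weighted mix of the MaxMin distance and a normalised property value. The sketch below is purely hypothetical: the linear weighting, the `bias` and `maximise` parameters, the choice of the best-scoring item as the starting point, and the assumption that distances lie in [0, 1] are all made up here, and nothing like this exists in the node as far as this thread indicates.

```python
def biased_maxmin_pick(points, scores, n_pick, dist, bias=0.5, maximise=True):
    """MaxMin-style picking where each candidate's distance is blended with its score."""
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0  # avoid dividing by zero when all scores are equal
    # Rescale so that 1.0 is always the most favourable property value.
    norm = [(s - lo) / span for s in scores]
    if not maximise:
        norm = [1.0 - v for v in norm]
    # Assumed starting point: the item with the most favourable property.
    first = max(range(len(points)), key=lambda i: norm[i])
    picked = [first]
    # min_dist[i]: distance from item i to its nearest already-picked item.
    min_dist = [dist(p, points[first]) for p in points]
    while len(picked) < min(n_pick, len(points)):
        remaining = [i for i in range(len(points)) if i not in picked]
        # bias=0 gives plain MaxMin; bias=1 ranks purely by the property.
        best = max(remaining, key=lambda i: (1.0 - bias) * min_dist[i] + bias * norm[i])
        picked.append(best)
        for i in remaining:
            min_dist[i] = min(min_dist[i], dist(points[i], points[best]))
    return picked


toy = [0.0, 0.1, 0.6, 1.0]
prop = [1.0, 0.2, 0.9, 0.1]
d = lambda a, b: abs(a - b)
print(biased_maxmin_pick(toy, prop, 3, d, bias=0.0))  # → [0, 3, 2]
print(biased_maxmin_pick(toy, prop, 3, d, bias=0.5))  # → [0, 2, 3]
```

With `bias=0.5` the well-scoring item at 0.6 is picked ahead of the more distant but poorly scoring item at 1.0, which is the kind of trade-off being asked for; how to weight the two terms sensibly for real data is exactly the hard, problem-specific part.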