Diversity Picker


I do like the Diversity Picker but do have a few comments for possible improvement ;-)

1. An SAR dataset may already exist while you are looking for diversity in a set of reagents for building a diverse combinatorial library. It would therefore be beneficial to have an optional second input of RDKit structures that the diversity selection is biased away from.


2. Instead of specifying the number of molecules to pick for a Diverse Set, it would be useful to have an option to set a Distance Value, so that the node picks a selection that covers the whole set, with no unpicked molecule further than the chosen distance from a picked one. Choosing a small distance would therefore give a larger output of molecules, as it gives better coverage across the whole set. You may have better ideas on how to implement this, but essentially the aim is to be able to choose how well the output covers the landscape.


3. Thirdly, it would be useful to have a Diversity Picker which can utilise the Distance Matrix table in KNIME. This way, you would have the option of picking a Diverse Set of rows not just from a Fingerprint but also from a set of calculated properties (e.g. Mw, cLogP, PSA etc., which were used to build a Distance Matrix).


Any possibilities?


Hi Simon,

1) and 3) should be relatively straightforward. I'm not quite sure what you mean by 2) though; could you explain a bit more?

We're currently doing some major refactoring of the RDKit nodes to make them more supportable and faster. When that's done, I will take a look at making these changes (as well as the "show cluster membership" suggestion from James).



Thanks for the information, and I look forward to seeing some of these changes.


For point 2, I'm finding it difficult to describe. Basically, when you choose a number of molecules to return from the diversity picker, you don't really know how much of the chemical space of the input molecules you have covered. Of course, if the input molecules had a uniform diversity spread between them, then choosing 10 from 100 would give you 10% coverage of the chemical landscape. But if a large portion of those 100 were chemically very similar, choosing 10 would give a much higher chemical coverage than 10%. So what I am asking is whether there is a way to specify how much chemical coverage you want, so that the node decides how many molecules need to be selected to get that coverage. I hope this makes some sense. I am not sure how one would define chemical coverage, however, i.e. whether to relate it to fingerprint distances, such as the maximum distance between an unchosen molecule and its nearest chosen molecule.
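One way to make the "coverage" idea above concrete is to keep picking until every unpicked row lies within a chosen distance of at least one picked row. The following is only a minimal plain-Python sketch of that idea, not how the RDKit node works; the function name `pick_until_covered`, its parameters, and the toy matrix are all made up for illustration, and it assumes a square, symmetric distance matrix with zeros on the diagonal.

```python
def pick_until_covered(dist, max_distance, first=0):
    """Greedy MaxMin-style picking that stops once every row is within
    max_distance of some picked row (hypothetical helper, illustration only).

    dist         -- square, symmetric matrix with dist[i][i] == 0
    max_distance -- coverage radius; smaller values lead to more picks
    first        -- index of the starting row (an arbitrary choice here)
    """
    if max_distance < 0:
        raise ValueError("max_distance must be non-negative")
    n = len(dist)
    picked = [first]
    # Distance from every row to its nearest picked row so far.
    nearest = list(dist[first])
    while True:
        # The row that is currently covered worst.
        farthest = max(range(n), key=lambda i: nearest[i])
        if nearest[farthest] <= max_distance:
            return picked  # everything is covered within the radius
        picked.append(farthest)
        nearest = [min(nearest[i], dist[farthest][i]) for i in range(n)]


# Toy matrix: rows 0 and 1 are near-duplicates, row 3 is an outlier.
toy = [
    [0.0, 0.1, 0.6, 0.9],
    [0.1, 0.0, 0.5, 0.8],
    [0.6, 0.5, 0.0, 0.7],
    [0.9, 0.8, 0.7, 0.0],
]
for radius in (0.95, 0.5, 0.05):
    print(radius, pick_until_covered(toy, radius))
```

With this definition the number of picks is an output rather than an input: a tighter radius forces more rows to be selected, which matches the behaviour described in point 2.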



Hi Simon,


the RDKit Diversity Picker has been improved a little and now accepts an optional second table of molecules that the diversity selection is biased away from. Please have a look and try it out when you get a chance. It is now available in the nightly build for testing.
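For readers wondering what "biasing away" from a second table can mean in practice, here is a minimal plain-Python sketch. It is an assumption about the general idea, not the node's actual implementation: the existing compounds are treated as if they had already been picked, so each new pick maximises its distance to both the existing set and the earlier picks. The name `pick_away_from`, its parameters, and the toy matrix are hypothetical.

```python
def pick_away_from(dist, candidates, existing, n_picks):
    """Pick n_picks candidate rows that are far from the existing rows and
    from each other (hypothetical helper, illustration only).

    dist       -- square, symmetric distance matrix over all rows
    candidates -- row indices that may be picked
    existing   -- non-empty list of row indices to bias away from
    """
    if not existing:
        raise ValueError("existing must contain at least one row")
    if n_picks > len(candidates):
        raise ValueError("cannot pick more rows than there are candidates")
    # Distance from each candidate to its nearest existing row.
    nearest = {c: min(dist[c][e] for e in existing) for c in candidates}
    picked = []
    while len(picked) < n_picks:
        # Candidate farthest from the existing set and all earlier picks.
        best = max(nearest, key=nearest.get)
        picked.append(best)
        del nearest[best]
        for c in nearest:
            nearest[c] = min(nearest[c], dist[best][c])
    return picked


# Toy matrix: row 0 is the existing SAR compound, rows 1-4 are candidates.
toy = [
    [0.0, 0.1, 0.8, 0.4, 0.9],
    [0.1, 0.0, 0.7, 0.3, 0.8],
    [0.8, 0.7, 0.0, 0.5, 0.2],
    [0.4, 0.3, 0.5, 0.0, 0.6],
    [0.9, 0.8, 0.2, 0.6, 0.0],
]
print(pick_away_from(toy, candidates=[1, 2, 3, 4], existing=[0], n_picks=2))
```

In the toy data, row 1 is nearly identical to the existing compound and row 2 sits close to row 4, so neither is chosen when two picks are requested.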


Kind regards,


Hey Manuel, sorry for the delay in replying. I have just tried out the improved diversity picker; it is really great and useful. It is going to get regular use when developing reaction enumeration libraries, to pick out examples for synthesis that are diverse from existing SAR.

And the added feature of automatically generating fingerprints if only a structure column is selected is great, even when it's just SDF format. The RDKit nodes are becoming extremely versatile and friendly to all user levels.


Many thanks, Simon.

Hi Simon,

thanks for your positive feedback. I am happy that the nodes are useful and are being used :-).



Hi all,

I am wondering whether the functionality from Simon’s point 3 in his initial post can be added to the Diversity Picker node? It would be great to have the capability to run the MaxMin algorithm on a user-defined distance matrix, because I am using a 3-D similarity score to measure diversity, not Tanimoto. I happen to already have the distance matrix pre-calculated, for (separate) clustering purposes.


Hi all,

I don’t mean to be a bother :slight_smile: but thought I’d follow up on this request. I’m working on a project where I will need to present data in 2-3 weeks, and the inability to run a MaxMin algorithm on my distance matrix instead of Tanimoto has been identified as a key drawback to my workflow. Unfortunately, I don’t have much scripting experience, so I’m not sure where I’d start with a potential workaround!

Perhaps the folks from RDKit could implement this capability, since @greglandrum suggested it would be relatively straightforward in his first reply to this thread? Or does anyone know of another way to run the MaxMin algorithm that doesn’t rely on designing a custom node?


I think your best bet here is to call the MaxMin picker using a Python scripting node. The RDKit’s python interface does allow you to use a pre-specified distance matrix.
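To show what the MaxMin algorithm does with a pre-computed distance matrix, here is a minimal plain-Python sketch that could be adapted inside a Python scripting node. It is not the RDKit implementation (RDKit's own picker is `MaxMinPicker` in `rdkit.SimDivFilters`); the function name `maxmin_pick`, its parameters, and the toy matrix are made up for illustration, and the sketch assumes a square, symmetric matrix held as a list of lists.

```python
def maxmin_pick(dist, n_picks, first=0):
    """Greedy MaxMin selection on a square, symmetric distance matrix
    (hypothetical helper, illustration only).

    dist    -- dist[i][j] is the distance between rows i and j
    n_picks -- number of rows to select
    first   -- index of the starting row (an arbitrary choice here)
    """
    n = len(dist)
    if not 0 < n_picks <= n:
        raise ValueError("n_picks must be between 1 and the number of rows")
    picked = [first]
    # Distance from every row to its nearest picked row so far.
    nearest = list(dist[first])
    while len(picked) < n_picks:
        # Choose the unpicked row that is farthest from everything picked.
        candidate = max(
            (i for i in range(n) if i not in picked),
            key=lambda i: nearest[i],
        )
        picked.append(candidate)
        nearest = [min(nearest[i], dist[candidate][i]) for i in range(n)]
    return picked


# Toy matrix: rows 0 and 1 are near-duplicates, row 3 is an outlier.
toy = [
    [0.0, 0.1, 0.6, 0.9],
    [0.1, 0.0, 0.5, 0.8],
    [0.6, 0.5, 0.0, 0.7],
    [0.9, 0.8, 0.7, 0.0],
]
print(maxmin_pick(toy, 3))
```

Because the function only reads distances, it does not matter whether they come from Tanimoto, a 3-D similarity score, or calculated properties. The result depends on the starting row, so a different `first` can give a different (equally valid) subset.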


Thanks for the advice, Greg. If scripting is the only way to go, I’ll try to find a crash course in Python online, and hopefully can piece enough together to make it work before my deadline. I’m a synthetic organic chemist by trade, so I’m starting at the ground floor when it comes to coding, or working with a python interface :sweat_smile:


Alright, so after some legwork, I managed to install Anaconda, created an environment with all of the pandas packages and RDKit, and set my KNIME preferences to use that environment for the Python nodes. I think this part worked fine, because the rdkit package shows up if I search in the Anaconda environment, and in KNIME I get the message “Successfully loaded input data into Python” in the Python Scripting node. I’m a little unsure, though, because if I type “from rdkit import Chem” and try to execute the script, I get a kernel error (in this text file): Python Scripting Knime Node Kernel Error.txt (3.4 KB)

The one thing that no reasonable amount of digging through the internet or youtube videos can help with is how to actually implement the python script that Greg is referring to. I’ve attached my workflow with an example distance matrix. Perhaps the path forward will become obvious if there’s a way around that kernel error, but as a newbie I’m not sure how I’d go about calling the MaxMin picker to accomplish the analysis I am interested in. And is the initialization below appropriate (taken from this Youtube video at 11:46 in)? If someone with a little more experience could take a look and provide some insight on how to accomplish this, it would be tremendously helpful!


MaxMin on User Defined Distance Matrix.knwf (2.3 MB)


Can you clarify what steps you actually took to install the Conda environment? Can you confirm what version of Python you’re using? That log file shows a lot of Python 2 errors, when Python 3 is what’s recommended as the default.

I’d recommend following the instructions for automatic environment creation from here: KNIME Python Integration Guide

In the terminal in my base Conda environment, I ran the following two lines of code, taken from here

conda create -c conda-forge -n my-rdkit-env rdkit
conda activate my-rdkit-env

That is strange about the python 2 errors, because KNIME seems to indicate that I was using Python 3. This is what my Preferences window looked like while I was getting that kernel error:

On the bright side, I uninstalled Conda and reinstalled following the guide you recommended, and it seemed to solve that issue. I made an automatic environment with all the dependencies required by the KNIME python integration, then installed the rdkit package in that environment using:

conda install -c rdkit rdkit

and it worked like a charm! I can now run the initialization shown in my previous post without errors.

Now it’s just a matter of how to call the MaxMin picker to accomplish the selection of a maximally diverse subset?

Figured it out! This resource, written by Greg Landrum, was very useful in adapting the Python script. I’m posting my configured workflow here in case it may help others in the same boat as I was in.

MaxMin on User Defined Distance Matrix Configured.knwf (2.2 MB)

Best regards,
