Comparing screening libraries

I am new to KNIME and RDKit. For an experimental assay I need to decide which screening library to buy, so I want to compare a set of libraries in terms of their diversity. Do you have any suggestions on how to proceed?
We actually did this in KNIME. I obviously can’t share the workflow, but I can give some hints on what I did. Everything was displayed in a report built with KNIME Reporting:

  • Molecule count and duplicate count (yes, some libraries contain duplicates)

  • Number of problematic molecules (undesired functional groups or fragments, defined in-house)

  • Number of scaffolds and a histogram of scaffolds (top 1000 only), plus the structures of the top 5 scaffolds. This gives some indication of diversity.

  • Histograms of properties/descriptors such as MW, logP, TPSA, ring count, heteroatom count, rotatable bonds… It depends on what’s important for your use case.

  • Histogram of “self-similarity”
    For each molecule, the fingerprint (FP) similarity to its nearest neighbor in the library was determined, and a histogram was made of these values. If the molecules are all very similar, you see high counts at high similarity. IMHO you want mostly low similarity here, BUT some high values as well (same scaffold), so that a good scaffold can be detected.

  • A list of 50 diverse compounds from the library (using the Diversity Picking node). This helps to see whether the library contains many strange compounds. Some contain trivial things like ethanol or very esoteric structures.
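
If you would rather script it than build nodes, the counting, scaffold, self-similarity and diverse-picking steps can be roughly sketched in Python with RDKit. This is not our workflow, just a minimal sketch under assumptions: the library is a plain list of SMILES, `library_summary` is a hypothetical helper name, duplicates are detected by canonical SMILES, scaffolds are Murcko scaffolds, the fingerprints are Morgan fingerprints (radius 2, 2048 bits, both arbitrary choices), and MaxMin picking stands in for the Diversity Picking node without claiming to reproduce what the node does internally.

```python
from collections import Counter

from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator
from rdkit.Chem.Scaffolds import MurckoScaffold
from rdkit.SimDivFilters.rdSimDivPickers import MaxMinPicker


def library_summary(smiles_list, n_diverse=50, radius=2, fp_size=2048, seed=42):
    """Hypothetical helper: rough diversity numbers for one library."""
    # Parse the input and skip anything RDKit cannot read.
    mols = [m for m in (Chem.MolFromSmiles(s) for s in smiles_list) if m is not None]

    # Duplicates: molecules with the same canonical SMILES count as one.
    unique = {}
    for mol in mols:
        unique.setdefault(Chem.MolToSmiles(mol), mol)
    unique_smiles = list(unique)
    unique_mols = list(unique.values())

    # Murcko scaffolds; acyclic molecules end up under the empty string.
    scaffolds = Counter(
        MurckoScaffold.MurckoScaffoldSmiles(mol=m) for m in unique_mols
    )

    # Morgan fingerprints (radius and size are assumptions, see above).
    generator = rdFingerprintGenerator.GetMorganGenerator(radius=radius, fpSize=fp_size)
    fps = [generator.GetFingerprint(m) for m in unique_mols]

    # Self-similarity: Tanimoto similarity to the nearest neighbor in the
    # same library. This is O(n^2), so fine for a sketch but slow for
    # very large libraries.
    nn_sims = []
    for i, fp in enumerate(fps):
        sims = DataStructs.BulkTanimotoSimilarity(fp, fps)
        sims[i] = 0.0  # do not count the molecule itself
        nn_sims.append(max(sims))

    # Ten bins of width 0.1; a similarity of exactly 1.0 goes into the last bin.
    histogram = [0] * 10
    for sim in nn_sims:
        histogram[min(int(sim * 10), 9)] += 1

    # MaxMin picking of up to n_diverse compounds for visual inspection.
    n_pick = min(n_diverse, len(fps))
    picks = []
    if n_pick > 0:
        picker = MaxMinPicker()
        picks = list(picker.LazyBitVectorPick(fps, len(fps), n_pick, seed=seed))

    return {
        "n_parsed": len(mols),
        "n_unique": len(unique_mols),
        "n_duplicates": len(mols) - len(unique_mols),
        "scaffolds": scaffolds,
        "nn_similarities": nn_sims,
        "nn_histogram": histogram,
        "diverse_smiles": [unique_smiles[i] for i in picks],
    }


# Tiny made-up example library; "OCC" is ethanol written a second time.
example = ["CCO", "OCC", "OC(=O)c1ccccc1", "COC(=O)c1ccccc1", "c1ccc2ccccc2c1"]
summary = library_summary(example, n_diverse=2)
print("unique:", summary["n_unique"], "duplicates:", summary["n_duplicates"])
print("top scaffolds:", summary["scaffolds"].most_common(5))
print("nearest-neighbor histogram:", summary["nn_histogram"])
print("diverse picks:", summary["diverse_smiles"])
```

Running this once per library and putting the histograms side by side gives roughly the comparison described above. The property histograms and the in-house substructure filters are left out because they depend on your own definitions.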

This report helped a lot in making the decision, and for our use case the best libraries were pretty clear (they were from well-known providers).

Thanks a lot for your advice! I’ve already started with a similar approach. Your input regarding scaffolds was especially helpful!

