I would like to cluster a set of molecules (100s - 1000s, or even more) based on chemical fingerprints (something like hierarchical or k-means clustering for instance), but I'm not sure which nodes would be the most appropriate. It doesn't need to be exact.
The output should be a cluster ID for each molecule, with an indication if a particular molecule is the cluster center or not.
Try the k-Medoids node. This should work pretty well.
Use the RDKit Fingerprint node to generate the fingerprints (Morgan, for instance), then use the Distance Matrix Calculate node to generate a distance matrix. Connect this to the k-Medoids node and specify how many clusters you would like. The cluster centre (medoid) of each cluster is also reported.
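For anyone curious what this workflow does under the hood, here is a minimal Python sketch of the same idea: fingerprints, then a distance matrix, then k-medoids. It is only an illustration and makes several assumptions that are not from the posts above. The fingerprints are toy sets of on-bit indices rather than real Morgan fingerprints, the distance is Tanimoto (the thread does not say which distance to pick), and the initialisation is a simple deterministic scheme that is not necessarily what the KNIME k-Medoids node uses. All function names are made up for this example.

```python
from typing import List, Set, Tuple


def tanimoto_distance(a: Set[int], b: Set[int]) -> float:
    """Return 1 - |a & b| / |a | b|; two empty fingerprints count as identical."""
    union = len(a | b)
    if union == 0:
        return 0.0
    return 1.0 - len(a & b) / union


def distance_matrix(fps: List[Set[int]]) -> List[List[float]]:
    """Build the full symmetric pairwise distance matrix."""
    n = len(fps)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = tanimoto_distance(fps[i], fps[j])
    return dist


def k_medoids(dist: List[List[float]], k: int,
              max_iter: int = 100) -> Tuple[List[int], List[int]]:
    """Alternate assignment and medoid update; return (labels, medoids)."""
    n = len(dist)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of molecules")

    def assign(meds: List[int]) -> List[int]:
        # Each molecule joins the cluster of its nearest medoid.
        return [min(range(k), key=lambda c: dist[i][meds[c]]) for i in range(n)]

    # Assumed initialisation: start from the most central molecule, then
    # repeatedly add the molecule farthest from the medoids chosen so far.
    medoids = [min(range(n), key=lambda i: sum(dist[i]))]
    while len(medoids) < k:
        rest = [i for i in range(n) if i not in medoids]
        medoids.append(max(rest, key=lambda i: min(dist[i][m] for m in medoids)))

    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                # Empty cluster (e.g. duplicate fingerprints): keep old medoid.
                new_medoids.append(medoids[c])
                continue
            # New medoid: the member with the smallest total distance to the
            # other members of its cluster.
            new_medoids.append(
                min(members, key=lambda i: sum(dist[i][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return assign(medoids), medoids


def cluster_table(labels: List[int],
                  medoids: List[int]) -> List[Tuple[int, int, bool]]:
    """One row per molecule: (molecule index, cluster ID, is cluster centre)."""
    centres = set(medoids)
    return [(i, labels[i], i in centres) for i in range(len(labels))]


# Toy fingerprints (sets of on-bit indices), purely illustrative.
toy_fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 4, 5},
           {10, 11, 12}, {10, 11, 13}, {10, 11, 12, 13}]
toy_labels, toy_medoids = k_medoids(distance_matrix(toy_fps), k=2)
for row in cluster_table(toy_labels, toy_medoids):
    print(row)
```

Each printed row gives the cluster ID for a molecule plus a flag saying whether it is the cluster centre, which is the output format asked for in the original question. Note that the full distance matrix grows quadratically with the number of molecules, so this approach is comfortable for hundreds to a few thousand structures but gets heavy well beyond that.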
I've got the same question. I'm new to KNIME and want to cluster molecules. I'm reading in SMILES from a text file using the Line Reader node, and that worked. But when I tried running the RDKit From Molecule node, it says 'No column in spec compatible to "SmilesValue" "SmartsValue" or "SdfValue".' How can I solve it? Thanks!
You need to make sure that the column coming from the File Reader has the SMILES type, not a plain string type.
There are two easy ways to do this:
Change the type directly in the File Reader node by clicking on the column header and setting the type to SMILES.
Use the Molecule Type Cast node to convert the column to type SMILES after the table has been read in.
After you do this, the RDKit nodes should work without problems.
Note that you don't need an RDKit From Molecule node in order to generate molecular fingerprints for clustering. The RDKit Fingerprint node can also process SMILES (or SDF) columns directly.