I am currently working on a small cheminformatics workflow that clusters a range of chemical substances by calculating the Murcko scaffold (RDKit) for each one and then using the SMILES of these scaffolds to cluster my list of compounds with GroupBy.
Doing this, I came across a problem: apparently, GroupBy does not differentiate between
c1ccccc1 and
C1CCCCC1
In chemical terms, these structures are very distinct: the first one represents an aromatic ring (benzene), the second one is cyclohexane.
Did anyone else come across this and is there a way in the GroupBy node to get a case-sensitive clustering done?
Hi Joachim,
you are right, case-sensitive grouping would be a useful feature to have. I don’t think that is currently possible, but I have a workaround you may be able to use. Using a String Manipulation node I prepend a § to each capital letter (I assume that this character will never occur in a SMILES string). Then I do the GroupBy and after that I can remove the character again using another String Manipulation node. I hope the attached workflow helps a bit!
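For anyone who wants to see the idea outside of KNIME, here is a minimal Python sketch of the same marker trick. It is not the String Manipulation node configuration from the attached workflow; the helper names `mark_capitals` and `unmark` are made up for illustration, and it relies on the same assumption that § never occurs in a SMILES string.

```python
import re

# Assumption carried over from the workaround: this character never
# appears in a SMILES string, so it is safe to use as a marker.
MARKER = "§"


def mark_capitals(smiles: str) -> str:
    """Prepend the marker to every ASCII capital letter."""
    return re.sub(r"([A-Z])", MARKER + r"\1", smiles)


def unmark(smiles: str) -> str:
    """Remove the marker again after grouping."""
    return smiles.replace(MARKER, "")


aromatic = mark_capitals("c1ccccc1")   # no capitals, so unchanged
aliphatic = mark_capitals("C1CCCCC1")  # every C gets a marker

# Even a grouping step that ignored case would now see different keys.
still_distinct = aromatic.lower() != aliphatic.lower()
```

After grouping on the marked strings, `unmark` restores the original SMILES.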
Kind regards
Alexander
Hi,
I thought about it again and found it really strange that GroupBy is case-insensitive and so I tried it myself again. For me the GroupBy creates two groups for the example strings you gave above, so the workaround should not even be necessary. Can you share your workflow? Now I am curious how it is case-insensitive for you.
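To illustrate what I would expect, here is a small Python sketch (not KNIME code) that groups the two example strings once by their exact value and once with the case folded away. The function `group_by_key` and the compound names are hypothetical and only serve the illustration.

```python
from collections import defaultdict


def group_by_key(rows, key_func=lambda smiles: smiles):
    """Group compound names by a key derived from their scaffold SMILES."""
    groups = defaultdict(list)
    for name, smiles in rows:
        groups[key_func(smiles)].append(name)
    return dict(groups)


# Hypothetical input: (compound name, scaffold SMILES)
rows = [
    ("compound A", "c1ccccc1"),  # aromatic ring
    ("compound B", "C1CCCCC1"),  # cyclohexane
]

# Grouping on the exact string keeps the two scaffolds apart.
case_sensitive = group_by_key(rows)

# Folding the case first merges them, which is the behaviour described
# in the original question.
case_folded = group_by_key(rows, key_func=str.lower)
```

With exact string keys the two scaffolds end up in separate groups; only the case-folded variant merges them.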
Kind regards
Alexander
Hi Alexander,
I can confirm your result. Strangely enough, when I tried it this morning (before writing the post in the forum), I could not separate the two cases and needed your workaround. Now, while building a shareable workflow for you, I checked again, and GroupBy nicely differentiates between the two cases even without your workaround.
Has KNIME been known to be affected by Monday morning blues or so?
Thank you for your help and support. I think we can close this case.