RDKit Fragmenter node query

Good evening

We've noticed some unexpected behaviour in the RDKit Molecule Fragmenter node. Running the node multiple times on the same input gives the same results; however, if we shuffle the order of the input rows, we get 'different' fragments when comparing the Fragment SMILES column.

It appears this is due to different SMILES strings being produced for the same fragment depending on the row order, and these strings are what was being used for the comparison.

Is it possible to get a canonical fragment identifier? A colleague was attempting to identify novel fragments between two datasets. I tried the RDKit Canon SMILES node using the Fragment as input, which significantly reduces the number of apparently novel fragments that depend on the sort order but does not completely resolve the issue.

Looking at the Python documentation, the MolFragmentToSmiles(...) method does have a parameter for canonicalisation.
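To make the comparison we are after concrete, here is a rough sketch in Python of matching fragments on a canonical key rather than on the raw SMILES string. It is not what the node does internally. The helper names (rdkit_canonical_smiles, novel_fragments) are made up for illustration, and it assumes the fragment SMILES can be parsed on their own by Chem.MolFromSmiles, which may not hold for every fragment; anything that fails to parse falls back to its raw string.

```python
def rdkit_canonical_smiles(smiles):
    """Return RDKit's canonical SMILES for a SMILES string, or None if it cannot be parsed."""
    # Imported here so the comparison helper below can be used with another canonicaliser.
    from rdkit import Chem

    mol = Chem.MolFromSmiles(smiles)
    # MolToSmiles writes canonical SMILES by default.
    return Chem.MolToSmiles(mol) if mol is not None else None


def novel_fragments(query, reference, canonicalize=rdkit_canonical_smiles):
    """Map canonical key -> original SMILES for query fragments absent from reference."""

    def key(smiles):
        canon = canonicalize(smiles)
        # Assumption: unparsable fragments are compared by their raw string
        # so they stay visible instead of being silently dropped.
        return canon if canon is not None else smiles

    reference_keys = {key(s) for s in reference}

    novel = {}
    for s in query:
        k = key(s)
        if k not in reference_keys:
            # Keep the first raw SMILES seen for each canonical key.
            novel.setdefault(k, s)
    return novel
```

With RDKit installed, novel_fragments(["OCC"], ["CCO"]) would be expected to come back empty, since both strings describe the same structure. Fragments that only differ in how the SMILES was written should then stop showing up as novel, provided they parse.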

Cheers

Sam

Hi Sam,

Thanks for reporting this issue. Sorry for the late answer ... it is vacation time of the year.

Would you be able to attach a simple workflow that demonstrates this problem, so that I can understand it better? Ideally it would have the positive and the negative case in one workflow. Thanks for your help here. I hope to find some time in the next 2 months to look into it and to provide a fix, if possible.

-Manuel

I created a test workflow showing this problem back before my vacation (unfortunately when the forum was still in read-only mode).

It's attached.

If things are working properly, port 2 on the two Reference Row Splitter nodes should be empty.

 

Thanks Greg. 

If it's solvable from the Java code, maybe I could have a go during next Friday's hackathon.