We've noticed some unexpected behaviour in the RDKit Molecule Fragmenter node. Running the node multiple times on the same input gives the same results; however, if we shuffle the input order, we get 'different' fragments when comparing the Fragment SMILES column.
It appears this is because the SMILES string produced for a fragment depends on the sort order, and the SMILES string is what was being used for the comparison.
Is it possible to get a canonical fragment identifier? A colleague was attempting to identify novel fragments between two datasets. I tried the RDKit Canon SMILES node with the Fragment column as input, which significantly reduces the number of sort-dependent 'novel' fragments but does not completely resolve the issue.
Looking at the Python documentation, the MolFragmentToSmiles(...) method does have a parameter for canonicalisation.
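
To illustrate what we are after, here is a minimal Python sketch of the comparison done outside the node. It assumes the fragments are available as SMILES strings that RDKit can parse. The helper names (`canonical_fragment_smiles`, `fragment_smiles`, `novel_fragments`) are made up for this example and are not part of RDKit or the KNIME nodes. I don't know whether the node calls MolFragmentToSmiles internally, so treat this as an approximation of the idea rather than a description of the node.

```python
from rdkit import Chem


def canonical_fragment_smiles(smiles):
    """Re-parse a SMILES string and write it back out as RDKit canonical SMILES.

    Returns None when RDKit cannot parse the input (hypothetical helper).
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return Chem.MolToSmiles(mol)


def fragment_smiles(mol, atom_ids):
    """Write the sub-structure spanned by atom_ids using the canonical flag
    that MolFragmentToSmiles exposes (hypothetical helper)."""
    return Chem.MolFragmentToSmiles(mol, atomsToUse=list(atom_ids), canonical=True)


def novel_fragments(reference, candidates):
    """Return canonical SMILES that occur in candidates but not in reference."""
    ref = {canonical_fragment_smiles(s) for s in reference}
    cand = {canonical_fragment_smiles(s) for s in candidates}
    # Drop None so unparseable inputs are not reported as novel fragments.
    return (cand - ref) - {None}


if __name__ == "__main__":
    reference = ["OCC", "c1ccccc1"]
    candidates = ["CCO", "C1=CC=CC=C1", "CC(=O)O"]
    # Different spellings of the same fragment should not count as novel.
    print(novel_fragments(reference, candidates))
```

Comparing on the canonicalised strings rather than the raw Fragment SMILES column is the behaviour we would ideally like from the node itself.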