I have a list of SMILES which I have converted to InChi code via RDKit. I have then used the GroupBy node to Group By InChi Code and all other columns are just manually aggregated as Unique Concatenated. This is so I have 1 row for each compound.
I am finding that some structures are changing (double bond oxygens are being converted to single bond hydroxy groups for example) during this process.
I have tested the workflow and identified that it’s the GroupBy step which is causing this. If I use the GroupBy function with other data types, e.g. group by catalogue number, this does not happen.
Does anyone know why this might be happening and how I can prevent it?
I think the behavior you are seeing is probably due to the nature of Group By. If you are familiar with databases, the way it works is by creating rows for unique values on your selected characteristic for ‘grouping’. So, I would imagine that is why you see it not occuring say using RDKIT molecular descriptors as they would be pretty unique from each other.
Now, I think it is because of the way you aggregate the data on the groups which can cause it to change.
How are you aggregating your data? This can affect how the grouped data shows up which is what I suspect is causing those structures to change. Maybe try just picking the first occurance for the aggregation as maybe there are a couple different variations of the structure which is causing that difference. (I would try to pick the one that makes the most sense for your data)
I have worked it out. By converting the SMILES to RDkit and then using GroupBy, it’s merging compounds with the same structure, but different SMILES (e.g. tautomers). When I convert those RDkit back to SMILES it’s not necessarily an exact match on the original SMILES (i.e. it may be a different tautomer from the original).