Using GroupBy with InChI Code data

Hello,

I have a list of SMILES which I have converted to InChI codes via RDKit. I then used the GroupBy node to group by InChI code, with all other columns manually set to aggregate as Unique concatenate, so that I end up with one row per compound.

I am finding that some structures change during this process (for example, double-bonded oxygens are converted to single-bonded hydroxy groups).

I have tested the workflow and identified that it’s the GroupBy step which is causing this. If I use the GroupBy function with other data types, e.g. group by catalogue number, this does not happen.

Does anyone know why this might be happening and how I can prevent it?

Thanks,
Sian

Further to this, I have found that this doesn’t occur when I group by RDKit molecular descriptors instead of the InChI codes.

Hello @Sian_Evans1,

I think the behavior you are seeing is probably due to the way GroupBy works. If you are familiar with databases, it creates one row for each unique value in the column you selected for grouping. I would imagine that is why you don't see it occurring when you group on RDKit molecular descriptors, as they would be pretty unique from each other.
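To make the grouping idea above concrete, here is a minimal sketch in plain Python (not KNIME itself). The row values are made up: `KEY_A` and `KEY_B` are placeholders standing in for real InChI strings, the catalogue numbers are hypothetical, and the two SMILES under `KEY_A` are a keto/hydroxy pair picked purely for illustration. Whether your conversion really maps such a pair to the same code is an assumption you would need to check on your own data.

```python
from collections import defaultdict

# Hypothetical rows: (grouping key, SMILES, catalogue number).
# "KEY_A" / "KEY_B" are placeholders, not real InChI strings.
rows = [
    ("KEY_A", "O=c1cccc[nH]1", "CAT-001"),
    ("KEY_A", "Oc1ccccn1", "CAT-002"),
    ("KEY_B", "CCO", "CAT-003"),
]


def group_rows(rows):
    """Collect rows under their grouping key, preserving input order."""
    groups = defaultdict(list)
    for key, smiles, catalogue in rows:
        groups[key].append((smiles, catalogue))
    return dict(groups)


groups = group_rows(rows)

# One output row per unique key; the group size shows how many
# input rows were collapsed into it.
for key, members in groups.items():
    print(key, len(members))
```

Three input rows become two groups here, so anything that differs between the two `KEY_A` rows has to be resolved by whatever aggregation you pick for the other columns.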

Now, I think the change comes from the way you aggregate the data within each group.

How are you aggregating your data? This affects how the grouped data shows up, which is what I suspect is causing those structures to change. Maybe try just picking the first occurrence for the aggregation, as there may be a couple of different variations of the structure in the same group causing that difference. (I would try to pick the one that makes the most sense for your data.)
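As a rough illustration of why the aggregation choice matters, here is a small Python sketch, again not the GroupBy node itself. The helpers `first` and `unique_concatenate` are hypothetical stand-ins for the corresponding aggregation methods, the `", "` separator is an assumption, and the keys and SMILES are made-up example data.

```python
def first(values):
    """Keep only the first value seen in the group."""
    return values[0]


def unique_concatenate(values, sep=", "):
    """Join the distinct values of a group in order of first appearance."""
    seen = []
    for value in values:
        if value not in seen:
            seen.append(value)
    return sep.join(seen)


# Hypothetical structure column, already collected per grouping key.
grouped_smiles = {
    "KEY_A": ["O=c1cccc[nH]1", "Oc1ccccn1"],  # two different structures
    "KEY_B": ["CCO", "CCO"],                  # same structure twice
}

first_per_key = {k: first(v) for k, v in grouped_smiles.items()}
concat_per_key = {k: unique_concatenate(v) for k, v in grouped_smiles.items()}

# Keys whose group holds more than one distinct structure are the ones
# where the result depends on the aggregation method.
ambiguous = [k for k, v in grouped_smiles.items() if len(set(v)) > 1]

print(first_per_key)
print(concat_per_key)
print(ambiguous)
```

If a check like `ambiguous` comes back non-empty on your data, those are the groups worth inspecting by hand before deciding which structure to keep.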

Hope this helps,
TL
