GroupBy node and SD files

Hi all.

I've found that using the GroupBy node to group molecules by cluster IDs messes with the structure files. It seems to flatten them out into single strings of text, which can't then be reinterpreted as structures. I've tried using SDF, CDK, RDKit formats, but in all cases the underlying formatting is broken.

Does anyone know how to get around this?

Using SMILES is an option, but then I have to convert multiple (unknown number) of columns of SMILES into structures. Similarly, I tried grouping mol IDs, rather than the structures themselves, to be replaced later, but again I don't know how to iterate over an unknown number of columns.



Hopefully this is an appropriate place for this query.

Didn't test your problem, but possible workaround - after grouping, delete the messed up structure column, and add it back from original table using Joiner.

Hi Rich,

could you please be a bit more specific about the grouping column and aggregation methods you used. I asume that you used the cluster id which is a string or numerical column as group column but what aggregation method do you apply on the structure column? As long as you use aggregation methods that do not convert the cell content (e.g. first, last, mode, max, min, list/set) the structure shouldn't change.




OK, so I have a node that turns my molecules into fingerprints, then a distance matrix is built and k-mediods calculated. This gives me a series of molecules (with IDs, fingerprints etc.) that have been assigned to a cluster and given a cluster ID (Row1, Row2... Row15 etc.).

What I tried to do was group all the molecules that belong to each cluster into rows. One row per cluster. So I've used the GroupBy node to group by the cluster IDs. This is where the SD files get flattened. I can separate the SD strings using Cell Splitter, but the text is still messed up. As you say, first, max, min etc. don't mess up the structure but all other options, including list and set, flatten the text out and comma separate it. I've just found that Transpose node does the same thing. 

I've tried grouping and then separating molecule IDs instead, but then I have to replace an unknown number of columns of mol IDs with their corresponding structure. I could do this by having lots and lots of Cell Replacer nodes, but that’s a horrible solution. If I’m going to work around this problem I need to be able to loop across columns. Is there any way to do this?

Kitsune, just tried that. Same problem of doing operations on multiple columns.

I see your problem now, aggregation methods in GroupBy node are really for strings only and kill SDF formatting.

Ok, I kludged up a workflow that does the trick, but it's a little Frankenstein of about 20 nodes.

If you are still interested, I can make a writeup how it works.

ETA: Transpose works for me with SDF, if there is only molecule column in table.

Yes please. That should be very helpful.

Is it possible to change the type of a column?

After my manipulation, the molecules have been typed as strings. If I can change their type, maybe I can use something like RDKit molecule format, which is more like SMILES strings. 

Hi Rich,

I usually solve that by doing a GroupBy with a List-aggregation. That will give me a collection column containing multiple SMILES / SDfiles that can be split into individual columns using a "Split Collection Column" node. The column types and formats should not be affected by this operation.

Best regards, Nils

Picture of flow in attachment.

Read file, some corrections of missing values, GroupBy by your clusterID, for aggregation take some molecule name column or something, and as method - value count, so you get column with count of how many molecules you have in each cluster.

nodes 10-12-11  - find what is the largest mol count in clusters and put this value in some flow variable, lets say named X.

Then we need to make empty table with no rows, but X columns of type SDF molecule. That is kludged by nodes 17 to 21 - filter out from starting table X rows (use the flow variable X), filter out all colums except Molecule,   make fresh RowIDs, Transpose (it works  ok with sdf, if there are no other colums to confuse types) and filter out the only row from it.

Now we can do looping. TableRow to Variable loop uses grouped table and passes your clusterID from each row as flow variable to loop body. Here we filter from starting table rows by passed clusterId varaible, again leave only molecule column, make new RowIDs, transpose and Concatenate with previously prepared empty table of SDF.

That is passed to loop end, which collects summary table. 

If you need to pass additional data from starting table, add those by usual methods with Joiner, collection colums and like.





ETA: Of course, if aggregating as List do preserve sdf format, then all this little mess is not needed

Nils' solution has done the trick.

I'll still have a go with your workflow Kitsune, as an exercise in loops.