GroupBy node and SD files

rsherhod · April 13, 2011, 12:50pm

Hi all.

I've found that using the GroupBy node to group molecules by cluster IDs messes with the structure files. It seems to flatten them out into single strings of text, which can't then be reinterpreted as structures. I've tried using SDF, CDK, RDKit formats, but in all cases the underlying formatting is broken.

Does anyone know how to get around this?

Using SMILES is an option, but then I have to convert multiple (unknown number) of columns of SMILES into structures. Similarly, I tried grouping mol IDs, rather than the structures themselves, to be replaced later, but again I don't know how to iterate over an unknown number of columns.

Thanks

Rich

Hopefully this is an appropriate place for this query.

Kitsune · April 13, 2011, 5:05pm

Didn't test your problem, but possible workaround - after grouping, delete the messed up structure column, and add it back from original table using Joiner.

tobias.koetter · April 13, 2011, 6:06pm

Hi Rich,

could you please be a bit more specific about the grouping column and aggregation methods you used. I asume that you used the cluster id which is a string or numerical column as group column but what aggregation method do you apply on the structure column? As long as you use aggregation methods that do not convert the cell content (e.g. first, last, mode, max, min, list/set) the structure shouldn't change.

Thanks,

Tobias

rsherhod · April 14, 2011, 10:55am

OK, so I have a node that turns my molecules into fingerprints, then a distance matrix is built and k-mediods calculated. This gives me a series of molecules (with IDs, fingerprints etc.) that have been assigned to a cluster and given a cluster ID (Row1, Row2... Row15 etc.).

What I tried to do was group all the molecules that belong to each cluster into rows. One row per cluster. So I've used the GroupBy node to group by the cluster IDs. This is where the SD files get flattened. I can separate the SD strings using Cell Splitter, but the text is still messed up. As you say, first, max, min etc. don't mess up the structure but all other options, including list and set, flatten the text out and comma separate it. I've just found that Transpose node does the same thing.

I've tried grouping and then separating molecule IDs instead, but then I have to replace an unknown number of columns of mol IDs with their corresponding structure. I could do this by having lots and lots of Cell Replacer nodes, but that’s a horrible solution. If I’m going to work around this problem I need to be able to loop across columns. Is there any way to do this?

Kitsune, just tried that. Same problem of doing operations on multiple columns.

Kitsune · April 14, 2011, 4:01pm

I see your problem now, aggregation methods in GroupBy node are really for strings only and kill SDF formatting.

Ok, I kludged up a workflow that does the trick, but it's a little Frankenstein of about 20 nodes.

If you are still interested, I can make a writeup how it works.

ETA: Transpose works for me with SDF, if there is only molecule column in table.

rsherhod · April 14, 2011, 4:36pm

Yes please. That should be very helpful.

rsherhod · April 14, 2011, 4:54pm

Is it possible to change the type of a column?

After my manipulation, the molecules have been typed as strings. If I can change their type, maybe I can use something like RDKit molecule format, which is more like SMILES strings.

weskamp · April 14, 2011, 5:33pm

Hi Rich,

I usually solve that by doing a GroupBy with a List-aggregation. That will give me a collection column containing multiple SMILES / SDfiles that can be split into individual columns using a "Split Collection Column" node. The column types and formats should not be affected by this operation.

Best regards, Nils

Kitsune · April 14, 2011, 5:47pm

Picture of flow in attachment.

Read file, some corrections of missing values, GroupBy by your clusterID, for aggregation take some molecule name column or something, and as method - value count, so you get column with count of how many molecules you have in each cluster.

nodes 10-12-11 - find what is the largest mol count in clusters and put this value in some flow variable, lets say named X.

Then we need to make empty table with no rows, but X columns of type SDF molecule. That is kludged by nodes 17 to 21 - filter out from starting table X rows (use the flow variable X), filter out all colums except Molecule, make fresh RowIDs, Transpose (it works ok with sdf, if there are no other colums to confuse types) and filter out the only row from it.

Now we can do looping. TableRow to Variable loop uses grouped table and passes your clusterID from each row as flow variable to loop body. Here we filter from starting table rows by passed clusterId varaible, again leave only molecule column, make new RowIDs, transpose and Concatenate with previously prepared empty table of SDF.

That is passed to loop end, which collects summary table.

If you need to pass additional data from starting table, add those by usual methods with Joiner, collection colums and like.

WBR,

Kitsune

ETA: Of course, if aggregating as List do preserve sdf format, then all this little mess is not needed

knime.gif

rsherhod · April 15, 2011, 11:12am

Nils' solution has done the trick.

I'll still have a go with your workflow Kitsune, as an exercise in loops.

Thanks

Rich