Compare chemical structures contained in different sd files

Nico1990 · April 18, 2012, 11:18am

Hello everyone,

I can't find a solution to my problem; in fact, I don't even know whether one exists.

My SD files contain chemical structures. Each structure has only one attribute associated with it: its frequency.

I'd like to build a single file that contains the structures in one column and, in the other columns, the frequencies extracted from each original file. If a structure is not in a file, the corresponding cell would be blank.

At the moment, I have this: List files --> Table row to variable loop start --> read sdf --> loop end (iteration column created) --> create collection column (to put structure and frequency together, which creates a column named AggregatedValues) --> group by (group by iteration; the aggregated values are collected into a list) --> split collection column (splits the list(AggregatedValues)) --> column rename (regex) (to add a prefix to the column names) --> and then...

I've tried adding another "split collection column", but even though it works for the first collection, I have to rename each new column before splitting the second one. That would be fine with a dozen columns, but in my case there are more than 1,000 columns...

Furthermore, if I do it this way, the structures won't be matched by identity...

So I don't know what to do. Do you think what I have in mind is possible? If so, do you know how to do it?

Thank you.

Nicolas

gkirsten · April 18, 2012, 2:09pm

Hi Nicolas,

I'm not sure I understood your problem correctly, but to me it sounds like a typical use case for the pivot node. Have you tried it?

Guido
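
For readers outside KNIME, the pivot idea suggested above can be sketched in plain Python. This is only an illustration, not the node's implementation: the function name `pivot_frequencies`, the file names and the sample rows are made up, and it assumes each structure has already been reduced to a canonical string (for example a canonical SMILES) so that identical molecules compare equal.

```python
from collections import defaultdict


def pivot_frequencies(rows):
    """Pivot (file, structure, frequency) rows into one row per structure.

    Assumption: `structure` is a canonical string, so equal molecules
    have equal strings. If the same (file, structure) pair appears twice,
    the later frequency overwrites the earlier one.
    """
    # One output column per input file, in a stable (sorted) order.
    files = sorted({file for file, _, _ in rows})

    # structure -> {file -> frequency}
    by_structure = defaultdict(dict)
    for file, structure, freq in rows:
        by_structure[structure][file] = freq

    header = ["structure"] + files
    table = [
        # A blank cell ("") marks a structure that is absent from a file.
        [structure] + [freqs.get(f, "") for f in files]
        for structure, freqs in sorted(by_structure.items())
    ]
    return header, table


# Hypothetical data: what reading three SD files might yield.
rows = [
    ("a.sdf", "CCO", 3),
    ("a.sdf", "c1ccccc1", 5),
    ("b.sdf", "CCO", 2),
    ("c.sdf", "CC(=O)O", 7),
]
header, table = pivot_frequencies(rows)
```

With this layout the number of files only changes the number of columns; no per-column renaming is needed.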

Nico1990 · April 18, 2012, 4:19pm

Hi Guido,

I found a solution before reading your answer. Having read the description of the "pivoting" node, I'm sure it would be very helpful!

However, here is roughly what I did: List files --> Table row to variable loop start --> read sdf --> loop end (iteration column created) --> create collection column (between iteration and frequency) --> groupby (on structure, with the sum of the frequencies and a list of the aggregated column / NB: the SMILES format is more appropriate for grouping molecules)

So now I have the distribution of the molecules across the files, as well as the total frequency for each of them.

Hope that helps someone!

Thanks

Nicolas
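
As a rough illustration of the grouping step described above, here is a minimal plain-Python sketch. It is not the KNIME workflow itself: the function name `group_by_structure` and the sample rows are hypothetical, and it assumes each row carries the loop iteration (one per file), a SMILES string and a frequency.

```python
def group_by_structure(rows):
    """Group (iteration, smiles, frequency) rows by structure.

    For each SMILES string, keep the summed frequency and the list of
    (iteration, frequency) pairs, mirroring a group-by with a sum and a
    list aggregation. Assumption: equal molecules have equal SMILES.
    """
    groups = {}
    for iteration, smiles, freq in rows:
        entry = groups.setdefault(smiles, {"total": 0, "per_file": []})
        entry["total"] += freq
        entry["per_file"].append((iteration, freq))
    return groups


# Hypothetical data: iteration 0, 1, 2 stand for three input files.
rows = [
    (0, "CCO", 3),
    (0, "c1ccccc1", 5),
    (1, "CCO", 2),
    (2, "CC(=O)O", 7),
]
summary = group_by_structure(rows)
```

Each entry then shows in which files a molecule occurs and its total frequency across all of them.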