removing duplicate rows

sfalah · December 28, 2014, 6:44pm

Dear All,

I am trying to use the GroupBy node to remove duplicates from a file, but I have not succeeded.

The file contains more than 50 columns. I want to remove duplicate rows (compounds) based on their ChEMBL name column. For each set of duplicate compounds, I want to compare the IC50 values and keep the row with the highest IC50. I have read some posts suggesting that this can be done using the "Sorter" and "GroupBy" nodes, but I have not managed to figure it out. I am confused by the different options when configuring the node. I watched some videos on YouTube, but I can see that the node settings differ between versions.

I have installed the latest version of KNIME.

Any advice will be highly appreciated.

Regards

Sherin

unknown_user · December 28, 2014, 8:02pm

Hi, can you give some example data showing what you have as input and what you want as output?

fab · December 29, 2014, 9:27am

Hi Sherin,

You need to have the compounds in SMILES format, or better, canonical SMILES. Then, in the GroupBy node, select the group column (e.g. the name, or whichever column you want to deduplicate on) and choose "First" as the aggregation method, so that the first row of each group is kept.

Hope this helps.

nbrooijmans · January 5, 2015, 4:03pm

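You want to do a multi-column sort, on ChEMBL name first and IC50 second (from low to high). Then you can do a group-by on ChEMBL name where you choose to keep only the first entry in each group, which will be your lowest IC50. If you want to keep the highest IC50 instead, as in the original question, sort IC50 from high to low.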
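For anyone who wants to check the logic outside KNIME, here is a minimal Python sketch of the same sort-then-keep-first idea. It is only an illustration, not what the KNIME nodes do internally. The column names `chembl_name` and `ic50`, the helper `dedupe_by_key`, and the sample rows are all made up for the example.

```python
from itertools import groupby
from operator import itemgetter


def dedupe_by_key(rows, key="chembl_name", value="ic50", keep_highest=False):
    """Sort rows by (key, value), then keep the first row of each key group.

    `key` and `value` are hypothetical column names used for illustration.
    """
    # Step 1: multi-column sort. Python's sort is stable, so sorting by the
    # secondary column first and the primary column second gives rows ordered
    # by name, and by IC50 within each name.
    ordered = sorted(rows, key=itemgetter(value), reverse=keep_highest)
    ordered = sorted(ordered, key=itemgetter(key))

    # Step 2: group consecutive rows that share the same name and keep only
    # the first row of each group (the "First" aggregation).
    return [next(group) for _, group in groupby(ordered, key=itemgetter(key))]


# Illustrative data only.
rows = [
    {"chembl_name": "CHEMBL_A", "ic50": 12.0},
    {"chembl_name": "CHEMBL_B", "ic50": 3.5},
    {"chembl_name": "CHEMBL_A", "ic50": 0.8},
    {"chembl_name": "CHEMBL_B", "ic50": 40.0},
]

lowest = dedupe_by_key(rows)                      # low-to-high sort, keep first
highest = dedupe_by_key(rows, keep_highest=True)  # high-to-low sort, keep first
```

With the sample rows, `lowest` keeps the 0.8 and 3.5 entries and `highest` keeps the 12.0 and 40.0 entries. The only difference between the two cases is the sort direction on the IC50 column; the group-by step is unchanged.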

Docminus · January 7, 2015, 10:03am

I found that for sorting and comparing, InChI does a better job than (canonical) SMILES; you don't have to worry as much about tautomers, among other things. Personally, I don't like working with names, since they are not always (easily) available and tend to get very long for complex molecules.