Delete rows almost duplicate


My datatable contains dat about molecules (structures, id, ic50, date...). In some rows, structures and ids are identical but not the other properties. I'd like to keep onyl the row with the lowest ic50, so with the "GroupBy" node, I select the column structure and the column id. In option, I use the minimum agregation method for column ic50, but for the other columns, I don't know which method I should use. I want to be sure that I keep the data conresponding with the righ ic50. I don't want to mixe up the rows.

Do you see a solution?


Thanks also for this interesting question about the GroupBy node. Correct, you can’t have depending column aggregations, but you still can do the group-by and aggregation (minimum) and then join the original table based on those two columns. Note, this can lead to more than one row per group, since the minimum can appear multiple times. In order to remove those rows, you need to do a GroupBy again.

You can do this easily with the Groupby node. Simply sort the data in ascending ic50 with the sort node first. Then use the groupby node to groupby id and then aggregate on all the columns by taking the first value. I.e. the first value will be the lowest ic50, and all the other values from the other columns will be on the same row as the lowest ic50 value.


Both methods work but the second is more efficient.

Anyway, thank you.


You are right Simon. The groupby node uses a stable sorting algorithm, which means that the implementation preserves the input order of equal elements in the sorted output. So you can safely sort the input table first using the sorter node and use the first aggregation method afterwards in the groupby node.

Thanks for the hint Simon.