I have a query about the behaviour of the GroupBy node.

I have a sorted list of molecules and want to keep only the highest-scoring docking result for each one (I generate multiple poses per molecule), so I perform a GroupBy operation and keep only the first result.

The molecules are uniquely named by their RowIDs, but after the GroupBy operation (grouped by canonical SMILES) the new RowIDs do not correspond to the molecule names, as I would have expected. For example, if the molecule named Row2790 was the best-scoring pose, the GroupBy operation returns that molecule, but with a new RowID, Row122.

So it seems that the GroupBy operation gives the output table completely new RowIDs. Is this the case?


Yes, this is the intended behavior for the node, because the situation where each output row corresponds one-to-one with an input row is only a subset of the node's functionality. For example, if you take the mean of a group instead of the max, the output rows can't be associated with one and only one input row.
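
To illustrate the point outside KNIME, here is a minimal pandas sketch. It is only an analogy, not how the GroupBy node is implemented; the table, the column names (`smiles`, `score`) and the values are made up for illustration, with the DataFrame index standing in for the RowID.

```python
import pandas as pd

# Hypothetical docking table; the index stands in for the KNIME RowID.
poses = pd.DataFrame(
    {
        "smiles": ["CCO", "CCO", "c1ccccc1", "c1ccccc1", "CCN"],
        "score": [7.1, 9.4, 5.0, 4.2, 6.3],
    },
    index=["Row0", "Row1", "Row2", "Row3", "Row4"],
)

# Mean score per molecule: the result is indexed by the group key,
# because no single input row "owns" an aggregated value.
mean_scores = poses.groupby("smiles")["score"].mean()

# Look for an input row of the CCO group whose score equals the group mean.
is_cco = poses["smiles"] == "CCO"
matches = poses[is_cco & (poses["score"] == mean_scores["CCO"])]

print(mean_scores)
print("Input rows matching the CCO mean:", len(matches))
```

The mean for the two-pose groups matches neither input pose, so there is no original RowID that could sensibly be carried over.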

If you need a more complicated aggregation method than the GroupBy node provides (e.g., the RowID of the row with the highest value in another column), then I suggest taking a look at the Group Loop Start node. This node can be used to do custom aggregations. In your case it would be something like: Group Loop Start > Sorter (Value: descending) > Row Filter (1st row) > Loop End.
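
For readers who think in code, the logic of that loop (sort each group by value descending, keep the first row, collect the results) can be sketched in pandas. This is an analogy under assumptions, not KNIME code: the function name `best_pose_per_molecule`, the column names and the data are hypothetical, and the DataFrame index plays the role of the RowID.

```python
import pandas as pd

# Hypothetical docking table; the index stands in for the KNIME RowID.
poses = pd.DataFrame(
    {
        "smiles": ["CCO", "CCO", "c1ccccc1", "c1ccccc1", "CCN"],
        "score": [7.1, 9.4, 5.0, 4.2, 6.3],
    },
    index=["Row0", "Row1", "Row2", "Row3", "Row4"],
)


def best_pose_per_molecule(table, group_col="smiles", value_col="score"):
    """Keep the highest-scoring row of each group, with its original index."""
    # Sort descending by score (the "Sorter" step).
    ranked = table.sort_values(value_col, ascending=False, kind="stable")
    # Keep the first row of every group (the "Row Filter" step).
    # head(1) returns the original rows, so the index is preserved.
    return ranked.groupby(group_col, sort=False).head(1)


best = best_pose_per_molecule(poses)
print(best)
```

Because whole rows are selected rather than aggregated, each surviving row keeps the identifier it had in the input.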

Alternatively, if you would like full control over the treatment of RowIDs, you can copy the RowIDs into a new column before the GroupBy node (for instance with the String Manipulation node and the expression string($$ROWID$$)), and then put the RowIDs back after the GroupBy node with the RowID node.
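
The same idea as a pandas sketch, again only as an analogy: copy the identifier into an ordinary column, aggregate, then promote that column back to the identifier. The column name `orig_rowid` and the data are hypothetical, and the sketch assumes the input is sorted so that the first row of each group is the best pose.

```python
import pandas as pd

# Hypothetical docking table; the index stands in for the KNIME RowID.
poses = pd.DataFrame(
    {
        "smiles": ["CCO", "CCO", "c1ccccc1", "c1ccccc1", "CCN"],
        "score": [7.1, 9.4, 5.0, 4.2, 6.3],
    },
    index=["Row0", "Row1", "Row2", "Row3", "Row4"],
)

# Step 1: copy the row identifier into a normal column
# (the role played by string($$ROWID$$) in the thread).
with_id = poses.assign(orig_rowid=poses.index)

# Step 2: sort so the best pose comes first, then take the first value of
# every column per group. The result gets a fresh 0..n-1 index, much like
# the new RowIDs described in the question.
ranked = with_id.sort_values("score", ascending=False, kind="stable")
grouped = ranked.groupby("smiles", as_index=False, sort=False).first()

# Step 3: put the saved identifier back as the row identifier.
restored = grouped.set_index("orig_rowid")

print(grouped)
print(restored)
```

Carrying the identifier as data means it survives any aggregation that can pick a value from it, such as "first".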

OK, that's good to know. Thanks, both!