What governs the ordering in GroupBy?

BPerry · February 23, 2017, 4:34pm

Hi,

I'm trying to understand how the GroupBy node decides which entry is first and which is last when aggregating grouped rows.

I have a situation in a workflow where I'm bringing 10 or so SDF files together (around 10k rows total after concatenation), grouping by an InChiKey column, and aggregating about 4 columns with the "First" selection and another column with the "List" selection.

I've used this approach for the past two years without issue. I've always assumed that the row order going into the GroupBy node dictated what is first, what the list order will be when aggregating, etc. However, I now have a repeatable example where the grouped table reports as the *First* entry of an aggregated group the row that is actually second in the relative row order of the table before the GroupBy node. All other groups are behaving "normally", in that the first entry before grouping is reported as the first entry.

I've made many attempts to re-assign the row order etc., to no avail. The only way I could get the GroupBy node to give me the desired list order and correctly identify the "First" entry in the table was to delete the input SDF reader with the offending row, create a new SDF reader, and point it at the same file. I couldn't believe this worked, so I tried it a second time, and this time I got the same problem.

What am I not seeing? Thanks for any help :-)

tl;dr What dictates the entry order in the GroupBy node for assigning First, Last, and collating lists? I have a repeatable workflow issue that suggests it has nothing to do with the row order of the input table.

tobias.koetter · March 1, 2017, 10:26am

Hello,

Unless you have enabled the "Process in memory" option, the GroupBy node sorts the input table by the specified group columns. The sorting algorithm is stable, which means that the order of rows within a group does not change. Thus the first element returned by the aggregator should be the value of the first row per group. This behaviour has been the same for the last several years. Can you provide us with an example flow, or the offending row and the "first" row? Maybe it is a problem with the comparator implementation of the SDF cell, which the GroupBy node uses if you group on an SDF cell.
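
To illustrate what a stable sort by the group column implies for "First" and "List", here is a minimal Python sketch. It is not the KNIME implementation; the row IDs, keys, values, and the function name `group_first_and_list` are made up for the example. It relies only on the fact that Python's `sorted` is stable, so rows with equal keys keep their input order.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical input table: (row ID, group key, value), in input row order.
rows = [
    ("Row0", "KEY-B", "b1"),
    ("Row1", "KEY-A", "a1"),
    ("Row2", "KEY-B", "b2"),
    ("Row3", "KEY-A", "a2"),
    ("Row4", "KEY-B", "b3"),
]


def group_first_and_list(table):
    """Sort by the group key with a stable sort, then aggregate each group."""
    # sorted() is stable: rows sharing a key stay in their original order.
    ordered = sorted(table, key=itemgetter(1))
    result = {}
    for key, members in groupby(ordered, key=itemgetter(1)):
        values = [value for _, _, value in members]
        # "First" is the value of the earliest input row in the group;
        # "List" keeps the values in input order.
        result[key] = {"first": values[0], "list": values}
    return result


result = group_first_and_list(rows)
for key, aggregated in result.items():
    print(key, aggregated)
```

If the sort were not stable, or if the comparison treated two keys as equal or unequal inconsistently, "first" could end up being a row other than the earliest one in the input, which is the kind of symptom described in the question.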

Thanks

Tobias