Feature request: "GroupBy" using Row ID

Ergonomist · April 3, 2016, 9:34pm

Dear KNIMErs,

EDIT: Clarified misleading title.

It's all in the titile: "Feature request: GroupBy using Row ID" - pretty please!

It annoys me to no end having to apply the RowID node beforehand in order to simply count records (frequent issue) or to mass-apply aggregation functions to each row (still much less frequent, but potentially useful for e.g. logging all columns with "Sum of Logs").

Better yet, a Math node with GroupBy-like column filters & abstractions! I think I posted this a few months back already... hrm. :-)

Thanks
E

Ergonomist · April 3, 2016, 9:57pm

Posted just to bring my feature requests temporarily together...

Geo · April 4, 2016, 1:05am

RowIDs, once defined, are unique for each row, so applying GroupBy by RowID does not group any rows. To count the number of rows, Math Node has ROWCOUNT.

Maybe you could post an example to show your (potential or actual) use case ? From your description, it appears to me that you would actually want to perform groupby operations within Math node - that's a use case I can understand, on the other hand, we already have Group Loop for such tasks.

Iris · April 4, 2016, 9:29am

Hi E.

Sorry I also don't see the usecase.... If you group by RowId, you have one row per group? Could you make us an example, where you use this?

Cheers, Iris

Ergonomist · April 5, 2016, 10:55am

Dear both,

I'll try to illustrate.

My frequent issue is as follows:

Input table:

Row ID  Categorical variable 1  Categorical variable 2  Categorical variable 3

Objective: E.g. count (unique) combinations of all categorcial variables in data table

Approach needed right now: "RowID to Column" --> "GroupBy" :-(

Approach with having RowIDs in GroupBy: "GroupBy" :-)

My less frequent issue is as follows:

Input table:

Row ID  Numerical variable 1  Numerical variable 2  Numerical variable 3

Objective: E.g. log all numerical variables in data table

Approach needed right now: "RowID to Column" --> "GroupBy RowId column with Sum of Logs aggregation on numeric cols" :-(

Approach with having RowIDs in GroupBy: "GroupBy RowId with Sum of Logs on numeric cols" :-)

I can always loop or (often) use other nodes, yes, but that is slow, I/O-intense and needs a lot of nodes. GroupBy is really a Swiss Army Knife made in Konstanz. :-P Also, my frequent issue can typically be counteracted by just counting an arbitrary column, but it'll often enough also have aribrary labels in need of renaming - interpretable "counts of rows" or "counts of rowIDs" would make for more convenience.

Cheers
E

swebb · April 5, 2016, 6:53pm

Being able to select the RowID from the Available Columns in the aggregation settings would be useful. It would let you: count the number of unique instances in a group (as requested, and something I do), have a list of unique identifiers for the rows matchng that group (something else I do, which may later be used as a reference row filter to look at a subset of the input table).

So if I've understood tha last post correctly this would save having to generate a new unique value column. So an aggragation such as sum, and list would be useful for the RowID. Potentially also the String aggregations?

Cheers

Sam

Geo · April 5, 2016, 8:48pm

Now, I see what you mean: you want to groupby by the categorical variables and count the occurrences of RowID, right? The same workflow could be as follows:

Create Collection Column using all your categorical variables -> GroupyBy (CollectionColumn) and count any one variable (tick missing option)

I would agree that for simple counting, it would be convenient if one could create an aggregation column rather than having to use an existing one. SPSS (AGGREGATE) works like this, however, Stata (collapse) does not.

However, I believe this would further complicate the user interface and make GroupBy less accessible to non power users. Considering this, I do not see the harm done (not even on performance) if power users have to add a single node such as RowID (append) or Constant Value Column before using their swiss knife ;-)

tobias.koetter · April 6, 2016, 11:08am

Hi E,

several people already requested this feature and we had a look into this but had to postpone it. Unfortunately, it is more difficult then expected to simply include the rowid as yet another aggregation column. I will raise the severity of the request and see if we can alocate some resources to add this feature to the GroupBy node.

I also searched for a post from you about the Math node with GroupBy-like column filters & abstractions but could find one. What do you mean with that?

Thanks for all your valuable feedback.

Bye

Tobias

Ergonomist · April 6, 2016, 3:56pm

Sam,

Indeed I'd be happy to have it for everything. Why settle for less? :-)

Geo,

To each their own - but I'm also in favour of usability & accesability, my handle is to be understood literally! :-) I wouldn't mind ticking an "advanced features" box, either in each node or (preferably) somewhere as default in the settings. This way we keep node counts low, "power nodes" accessible, and all user types happy.

Tobias,

Re math formula node, I am referring to the now-standard column selection interface with column manual/regex/type selection.

Having arbitrary mathematical expressions (as provided by the Math formula node) applied to a set of columns selected by the aforementioned column selector interface would be of great value - especially for log-transforming all columns of type X, ranking all columns of type Y, etc. All frequently required massaging of inputs, in one word.

Cheers
E