Groupby node

richards99 · May 15, 2012, 10:28pm

The GroupBy node is incredibly powerful and useful, but it can be difficult to use in an automated workflow with changing columns or many columns.
For instance, it’s powerful for removing duplicate sets of data. But when you have many columns, it’s laborious to add every column and choose the first value or mean value, and if the columns change it all needs reconfiguring.
Would it be possible to have another GroupBy node where you can specify default options for how to aggregate each column type, i.e. String, Double, Integer, others, etc.?

Simon.

tobias.koetter · May 16, 2012, 9:01am

Hi Simon,
so what you want is a GroupBy node where you can specify the grouping columns in one tab and an aggregation operation per data type in the second tab. The whole table should be aggregated to one row if no grouping column is specified in the first tab (as it is now). The second tab will allow the selection of an aggregation method for a specific data type (only one per data type). When executed, the node will apply the appropriate aggregation method to the input columns, ignoring those for which no aggregation method has been selected. The name of each result column is either the original name or the original name with additional information about the aggregation method used.
Do you have an idea for a name of the new node?
Bye,
Tobias

richards99 · May 16, 2012, 9:46am

Hi Tobias, that is exactly the type of node I am after. It will make grouping much easier for datasets with many columns, and make it easier to automate workflows in anticipation of additional columns in the future. And as you say, if no grouping column is specified then it aggregates to one row like the current node. Generally speaking, it works exactly like the current GroupBy node, but rather than hand-picking the aggregation type per column, you just need to specify it per column data type.

An example use case would be to GroupBy a MoleculeIDNumber, and then Aggregate on the remaining columns, by choosing to take First Value for String columns, First Value for Smiles columns, and take the Mean for Double columns and Integer columns. This would be very useful in the type of workflows and datasets we use in Pharma.
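
To make the use case above concrete, here is a minimal sketch in plain Python of aggregating by data type rather than by column. It is not KNIME code: the function name `group_by_type`, the `DEFAULT_AGGREGATION` mapping, and the sample column names other than MoleculeIDNumber are hypothetical, and SMILES values are simply treated as strings here.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-type defaults (illustrative only, not a KNIME API).
DEFAULT_AGGREGATION = {
    str: lambda values: values[0],  # First Value for String (and SMILES) columns
    float: mean,                    # Mean for Double columns
    int: mean,                      # Mean for Integer columns
}


def group_by_type(rows, group_columns, defaults=DEFAULT_AGGREGATION):
    """Group rows (a list of dicts) and aggregate the other columns by value type."""
    groups = defaultdict(list)
    for row in rows:
        # With no grouping columns the key is always (), so all rows form one group.
        key = tuple(row[column] for column in group_columns)
        groups[key].append(row)

    result = []
    for key, members in groups.items():
        out = dict(zip(group_columns, key))
        for column in members[0]:
            if column in group_columns:
                continue
            values = [member[column] for member in members]
            aggregate = defaults.get(type(values[0]))
            if aggregate is None:
                continue  # no default chosen for this type: the column is ignored
            out[column] = aggregate(values)
        result.append(out)
    return result


rows = [
    {"MoleculeIDNumber": "M1", "Smiles": "CCO", "Name": "ethanol", "IC50": 1.0, "Count": 2},
    {"MoleculeIDNumber": "M1", "Smiles": "CCO", "Name": "ethanol", "IC50": 3.0, "Count": 4},
    {"MoleculeIDNumber": "M2", "Smiles": "C", "Name": "methane", "IC50": 5.0, "Count": 1},
]
aggregated = group_by_type(rows, ["MoleculeIDNumber"])
```

Because the aggregation is looked up by type, a new Double or String column added to the input later is picked up without any reconfiguration, and calling `group_by_type(rows, [])` collapses the whole table to a single row.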

The node could be called something like "Group By Column Datatype".

Thanks,

Simon.

tobias.koetter · May 23, 2012, 12:17pm

Hi Simon,

I have talked to other people about the new node idea, but we think it would be better to add a "default tab" to the existing GroupBy node. That way the new option tab would also be available to other nodes that use the aggregation framework, such as the Pivoting node.
The new tab would look similar to the "Options" tab of the existing GroupBy node. On the left-hand side, it would display a list of the available data types instead of the list of available columns. On the right-hand side, the table would contain a data type column instead of the "Column" column. This would allow the user to select an aggregation method for each available data type.

What do you think about this solution?

Tobias

richards99 · May 23, 2012, 6:54pm

Okay, that makes sense. A bit like how the Missing Values node works, is that right?

Simon.

tobias.koetter · May 24, 2012, 4:33pm

That's right. The node will have a default tab that allows the selection of a DataType and default aggregation method.

Tobias