GroupBy - avoid having to specify each output column?

Hello,

 

This request relates to this thread:

http://tech.knime.org/forum/knime-general/column-limit-on-groupby-node

 

Basically is there a way of simply having columns from 2 "merged" rows (by "merged" I mean from GroupBy or Joiner nodes etc) output automatically without having to specify which columns are output (like in Pipeline Pilot)?
 

It's rather annoying having to go through all my groupBy nodes in a workflow whenever I make adjustments just to change parameters needlessly for one new column.  Whilst I understand that many users would want the column-specific functions of the GroupBy node (particularly the statistics) it'd be great to have something which just automatically appends the columns...  Unless this already exists?

 

Thanks in advance,

Ed.

 

Hi Ed,

Maybe a "column list loop node" defaulting to "all columns" will do the trick? It's sequential and may slow down processing, but if there are not too many columns it might be a useful workaround. However, bear in mind that you need to handle the column in question by feeding its variable name into the GroupBy node, or by stripping off the header prior to processing (and optionally weaving it back in afterwards).

Cheers
E

What if you were to unpivot first, groupby on your orignial grouping key + the column name, and then pivot back out to reshape the table?

Ergonomist & Aaron

Thanks for the tips, I'll try this out at some point.  It just seems strange that there isn't automatic support for this in KNIME.

 

Ed.

Yes, it is much needed and something I also requested some time back. The fact it is getting mentioned by multiple users now will hopefully mean it will arrive soon in a future version.

other posts on this; 

http://tech.knime.org/forum/knime-general/groupby-node-request

http://tech.knime.org/forum/knime-general/groupby-node

And there was another one a few weeks ago I can't find the link to.

 

simon.

Simon - yeah I was sort of hinting at a feature request too as this'd be awesome.

 

Ed.

I also recommend such a feature....this week I had to manually reconfigure 130 metanodes due to a GroupBy node not being able to do this kind of auto-merging.

Hello everyone,

the good news is that we are currently working on this feature. The bad news is that it will not make it into the next release which will be this week. But I'm very confident that it will be finished for the next release at the end of this year. Please bear with us.

We are very sorry for the long delay of this frequently requested feature. The major reason for this delay is a missing central data type repository in KNIME which would allow us to provide the new functionality for any data type even custom developed types which we do not know. Since the development of the central data type repository takes more time than expected we have decided to implement the default column handling for the GroupBy node at least for the basic data types e.g. int, double, string, etc. similar to the new data type based filter option of the column filter panel.

In the following I will describe the new functionality. If you have any comments or ideas please write a comment to this post.

The GroupBy node will provide an additional Default tab which will be similar to the existing Option tab. Instead of selecting a specific column you can pick a data type from a predefined list of supported data types such as int, long, double, string, etc. and specify any number of "default" aggregation methods for it. The "default" aggregation methods will be only used for columns of the selected type which are not selected in the Option tab e.g. manually added. For example, if you have an input table with double col1, col2, col3 and specified to use "average" for col1 in the Option tab and "mean" as the aggregation method for double columns in the new Default tab the result table will have the columns average(col1), mean(col2) mean(col3). In order to use the same aggregation methods for a specific type you would only define "default" aggregation methods in the Default tab. This way you only need to use one node to define aggregation methods for specific columns or to handle any number of columns by type.

Bye,

Tobias

I agree with Eds comments exactly.

But the initial proposal from Tobias would be a very good starting point.

Simon.

Tobias:

 

That sounds pretty good.  I would like to see 2 additional things:

- Regular expression support for column aggregation, rather than simply by data type

- That this functionality is also copied over to the "joiner" node, because we still have to specify which columns are output as well.

 

Ed.

Hello,

the GroupBy node has now three tabs to specify the columns to aggregate and the aggregation function to use:

  1. directly (as before),
  2. based on a regular expression (column name),
  3. by data type.

(See attached files for a screen shot of the two new tabs)

Input columns are handled in this order and only considered once e.g. columns that are added directly on the "Column Based Aggregation" tab are ignored even if their name matches defined regular expression on the "Regex Based Aggregation" tab or their type matches a defined type on the "Type Based Aggregation" tab. The same holds for columns that are added based on a defined regular expression. They are ignored even if they match a criterion that has been defined in the "Type Based Aggregation" tab.

We hope you will like this new functionality which will be available with the next KNIME release in winter and accept our apology for the delay of this frequently requested feature.

Let us know if you have further comments or ideas.

I have also opened a feature request for the joiner node but cannot tell you any timeline.

Bye,

Tobias

 

Brilliant this is absolutely what's needed. And the behaviour you describe is as I would want it.  Someone new coming to the node may find the "column based aggregation" tab name a little confusing, would "manual based aggregation" be better. 

Also, Ihave just thought about one possible problem with the regex tab.

What happens if you choose Mean as aggregation type and the RegEx happens to find a string column besides the intended Double columns the user was intending. Will the node fail? Or can it work out the regex matches in advance in the dialog so Mean no longer becomes an option.

if that is the case great, but what about a second scenario where the regex only finds Double cells and you select Mean, but then at a later point on rerunning the workflow you now have a String column which matches the RegEx, then what happens.

sorry for looking for worst case scenarios to break the node.

simon.

I like it, but I also agree with Simon about the naming issue (see Column Filter node) and risks of aggregation failure.

Also, this tabbed view suggests that you can mix & match aggregations in one node execution (unlike in the Column Filter node, where options are mutually exclusive). This means that you also may need to prefix or suffix not only the aggregation mode, but also the rule (manual/regex/type) having created the aggregation - or handle duplicates some other way. I guess mutually exclusive aggregations might work out better for the UI...

Cheers
E

Hi Simon,

thanks a lot for the comment. I'm happy that you think about the worst case scenarios. This way I hopefully can prevent them from happen :-)

Regarding the regex type problem, when you specify a regular expression you can select the method to apply from all available methods e.g. there is no filtering of the methods based on a data type. However the node checks the type before execution and will ignore all columns that match the regular expression but not the required data type. So if you create a RegEx with the aggregation method Mean all columns that are compatible to Double are aggregated whereas all other columns that match the RegEx but which are not compatible to Double e.g. String are silently ignored. The node is quite failure tolerant since it will run as long as at least one group or aggregation column matches one of your specified criteria.

If have also renamed the “Column Based Aggregation” tab to “Manual Aggregation”.

 Bye,

Tobias

 

Hi,

I agree that the GUI gets a bit more complicated by allowing the three aggregation column selection methods simultaneously. However it makes the node much more powerful. You can specify the columns that really need to be present for aggregation in the manual selection column while at the same time handle additional columns dynamically using the RegEx or data type based selection methods.

The node ignores columns that match a criterion of a previous tab in the order they are displayed in the dialog e.g. group columns are ignored in all aggregation tabs, manual selected aggregation columns are ignored in the RegEx and data type tab and columns that match a RegEx criterion are ignored in the data type tab.

However if the same aggregation method is applied to the same column multiple times e.g. by two RegEx that match the same column and with the same aggregation method the output table will contain the column name with an index e.g. Mean(col1) and Mean(col1)_1.

Bye,

Tobias

Hello,

 

This is good news.  I assume a similar functionality will be applied to other nodes (which select columns), like the joiner node, pivot node and the loop-by-column node?

 

In-fact, I see that something like described already exists in the "Column Filter" node.  Basically, I'd like to see that functionality extended to ANYTHING that involves choosing columns :)  Heck, maybe even make it a standard GUI that folks can shove into their node GUIs for future reference as this is a very common problem.

 

In addition, I've noticed that in some of the nodes (for instance, the pivot node) - whenever a new column appears from the input, it immediately finds itself used in either the "group" or "pivot" tabs.  This strikes me as unintuitive as the inclusion of just one new column in the data can potentially impact the rest of the workflow without the user knowing.  Is this intended?

 

Ed.

 

Ed.

Hi Ed,

the component from the "Column Filter" node is available to all developers (see DialogComponentColumnFilter2 class). However not all nodes can simply use this component as a replacement that is why not all nodes use it yet. However sooner or later all column selection nodes e.g. pivot, joiner, etc. will support dynamic column selection.

We will also have a look at the problem with the default usage of unknown columns as pivot/group columns in the Pivoting node. This is not intended. Thanks for reporting this issue.

Bye,

Tobias

Tobias,

 

Thank you for letting me know :) 

I think I've also noticed the same issue with the GroupBy node but I don't know how to describe the problem as I've not seen it occur as consistently as with the Pivot - basically new columns sometimes get used as "grouping columns" which again can fettle results in a workflow.

 

Ed.

Hi Ed,

the Pivot node had the problem that it included unknown columns as pivoting columns whenever the node dialog was opened. This problem and some other smaller issues are fixed in the next release 2.10.1 which will be available this week. In order to update KNIME 2.10 to KNIME 2.10.1 open KNIME and go to Help->Check for Updates. The GroupBy node did not had this problem.

To summarize with KNIME 2.10.1 the GroupBy and Pivot node will not use unknown columns as pivot/group columns but simply ignore them whether the node dialog has been opened or not.

Bye,

Tobias

Tobias,

 

Thank you for letting me know.  I look forward to the new release :)

 

Ed.