KNIME 3.1: GroupBy Node - Weird Performance Issue

Some time ago, in KNIME 2, I created a workflow meant to tax the system in terms of RAM and CPU resources, basically to compare systems with more or less powerful hardware. Note that it doesn't make any sense except to tax the system. It uses the Data Generation node.

I have now imported this workflow into KNIME 3.1, and on the exact same PC it runs around 15% slower than in KNIME 2.11. The different steps of the workflow are timed, so it is easy to see that the GroupBy node is the culprit.

However, the actual issue is more complex. The workflow generates data and then splits into two branches. One branch uses a fraction of this data to build a model and predict with it. The other branch does a GroupBy on all rows. Hence the model building/predicting and the GroupBy run at the same time. If the whole model-building branch is removed, so that GroupBy is the only thing running, the issue disappears.
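The comparison described above (GroupBy alone versus GroupBy running next to a second busy branch) can be sketched outside KNIME. This is only a rough illustration of the measurement method, not KNIME code: `group_by_mean`, `busy_work`, and `timed` are hypothetical stand-ins, the row count is arbitrary, and Python threads contend for the interpreter lock, so the absolute numbers say nothing about KNIME's behaviour.

```python
import random
import time
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def group_by_mean(rows):
    """Mean of the value column per cluster key (stand-in for the GroupBy step)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for key, value in rows:
        sums[key] += value
        counts[key] += 1
    return {k: sums[k] / counts[k] for k in sums}


def busy_work(n):
    """CPU-bound placeholder for the model-building branch (hypothetical)."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Synthetic data: 10 cluster keys, one random value per row.
rng = random.Random(0)
rows = [(rng.randrange(10), rng.random()) for _ in range(200_000)]

# Run 1: the GroupBy stand-in on its own.
means_alone, t_alone = timed(group_by_mean, rows)

# Run 2: the same GroupBy stand-in while a second branch is busy.
with ThreadPoolExecutor(max_workers=2) as pool:
    other = pool.submit(busy_work, 2_000_000)
    timed_future = pool.submit(timed, group_by_mean, rows)
    means_parallel, t_parallel = timed_future.result()
    other.result()

print(f"alone: {t_alone:.3f}s, with competing branch: {t_parallel:.3f}s")
```

Timing the same step in both configurations is what separates "the step itself got slower" from "the step is slower only when something else competes for resources".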

So it's unclear whether the GroupBy node itself is causing the issue or some change in the workflow manager (KNIME core). However, it was 100% repeatable on my machine, with KNIME 3 always being slower and the GroupBy node always being the slow step.

Please find attached both workflows, the full one and the one with GroupBy only, so you can reproduce the issue (maybe it only happens to me?).

Hello,

thanks a lot for reporting this issue. We will have a look at it to see what causes the significant performance differences and get back to you as soon as possible.

Bye,

Tobias

Hello beginner_,

After executing your Workflow with both 2.12 and 3.1 I got the following results:
 

Node             2.12      3.1
Data Generation  8.964     8.754
GroupBy          67.156    61.876
Build Model      67.542    48.497
Predict          1.752     2.117
Sum              145.415   121.245

You are absolutely right that in 2.12 the GroupBy branch is faster than the learning branch, and that this is no longer the case in 3.1. But the reason is different: in 3.1 both the GroupBy node and the Tree Ensemble Learner got faster, the latter a lot more than the former. This makes the GroupBy node the most time-consuming part of the workflow, even though it got faster.
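The argument can be checked with a little arithmetic on the timings reported above. The sketch below assumes, as described earlier in the thread, that the GroupBy branch and the Build Model + Predict branch run side by side after data generation. The dictionary layout and the helper names `branch_times` and `slower_branch` are made up for illustration, and the unit is assumed to be seconds since the post does not state it.

```python
# Timings copied from the table above (unit assumed to be seconds).
timings = {
    "2.12": {"Data Generation": 8.964, "GroupBy": 67.156,
             "Build Model": 67.542, "Predict": 1.752},
    "3.1": {"Data Generation": 8.754, "GroupBy": 61.876,
            "Build Model": 48.497, "Predict": 2.117},
}


def branch_times(t):
    """Return (groupby_branch, model_branch) totals for one KNIME version."""
    groupby_branch = t["GroupBy"]
    model_branch = t["Build Model"] + t["Predict"]
    return groupby_branch, model_branch


def slower_branch(t):
    """Name the branch that takes longer for one KNIME version."""
    groupby_branch, model_branch = branch_times(t)
    return "GroupBy" if groupby_branch > model_branch else "Model"


for version, t in timings.items():
    g, m = branch_times(t)
    print(f"{version}: GroupBy branch {g:.3f}, model branch {m:.3f}, "
          f"slower: {slower_branch(t)}")
```

With these numbers the model branch is the longer one in 2.12 (69.294 versus 67.156), while in 3.1 the GroupBy branch is (61.876 versus 50.614), even though GroupBy itself dropped from 67.156 to 61.876.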

Other tests support this explanation: GroupBy became faster, not slower.

Best,
Ferry

Node             2.11.3    3.1.1     2.11.3 (32-bit, 1 GB RAM)
Data Generation  9.924     10.121    17.185
GroupBy          31.087    73.06     48.702
Build Model      61.252    60.6      255.576
Predict          1.87      2.487     4.732
Sum              104.133   146.268   326.195

I can 100% reproduce the issue. The last column is a slower PC (a laptop) running 32-bit with only 1 GB RAM for KNIME. There the workflow runs slower overall, but the GroupBy node is still a lot faster than in 3.1.1. Your results suggest the issue was introduced in KNIME 2.12 and not KNIME 3. The GroupBy calculates mean values for all clusters, so maybe that code changed for the worse?
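To put the first two columns of the table above into relative terms, here is a small sketch that computes the per-node change from 2.11.3 to 3.1.1. The variable names and the helper `percent_change` are illustrative only, and the unit is assumed to be seconds.

```python
# Timings copied from the table above (unit assumed to be seconds).
old = {"Data Generation": 9.924, "GroupBy": 31.087,
       "Build Model": 61.252, "Predict": 1.87}    # KNIME 2.11.3
new = {"Data Generation": 10.121, "GroupBy": 73.06,
       "Build Model": 60.6, "Predict": 2.487}     # KNIME 3.1.1


def percent_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


changes = {node: percent_change(old[node], new[node]) for node in old}
for node, pct in changes.items():
    print(f"{node:16s} {pct:+7.1f} %")

# Node with the largest absolute increase in run time.
worst = max(changes, key=lambda node: new[node] - old[node])
print("largest absolute slowdown:", worst)
```

On these figures GroupBy takes roughly 135% longer in 3.1.1, while Build Model is essentially unchanged, so almost all of the difference between the two sums comes from the GroupBy step.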

EDIT: Nope. I get similar results in KNIME 2.12 as in 2.11, so the change really does seem to happen with KNIME 3.