Pivot node is slow - Feature enhancement

Not much to say. For 2 million rows, serializing the data to Python, pivoting in pandas, and serializing back is about 2.5 times faster than the native KNIME Pivot node. And that is with serialization taking longer than the actual pivot, so the pivot itself is probably more like 5 times faster.

So if you guys have time to spare, it is probably worth a look to see whether there is a quick performance win here, as the node could likely be a lot faster.


Would it be possible to share an example workflow? That would go a long way toward helping people use your idea.

Hm, I would need some example data to pivot, and there are many ways to pivot, but it is a one-liner in Python:

output_table = input_table.pivot_table(index=['group_columns'], columns='Name', values='Value', aggfunc='first').reset_index()

Where index is a list of columns to group on, columns is the column that contains the new column headers, and values is the column with the, well, values (both can also be lists of columns). aggfunc is the aggregation function, so if there are duplicates in Name, in this case take the first one (pandas' first ignores None/NaN values). reset_index() gives nice row IDs in the KNIME output.

But one needs to look at the pandas API and apply it to your own use case. At least I do, because I am far from a pivot expert.
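To make the one-liner concrete, here is a minimal, self-contained sketch with made-up data (the column names Group, Name, and Value are just placeholders for whatever your table uses):

```python
import pandas as pd

# Hypothetical input: 'Group' identifies rows, 'Name' holds the future
# column headers, 'Value' holds the cell values. Note the duplicate
# ('b', 'y') pair, which is where aggfunc matters.
input_table = pd.DataFrame({
    'Group': ['a', 'a', 'b', 'b', 'b'],
    'Name':  ['x', 'y', 'x', 'y', 'y'],
    'Value': [1, 2, 3, 4, 5],
})

# Pivot: one row per group, one column per distinct Name.
# aggfunc='first' keeps the first non-null value when a (Group, Name)
# pair occurs more than once.
output_table = (
    input_table
    .pivot_table(index=['Group'], columns='Name', values='Value', aggfunc='first')
    .reset_index()  # turn the group index back into a plain column
)
print(output_table)
# Group 'b' ends up with x=3 and y=4 (the first of the two 'y' values).
```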


Hi @beginner -

Thanks for the heads up about this. I’ll see if I can get a developer’s attention so we can investigate further.


Hi beginner,
have you enabled the “Process in memory” option in the Pivot node dialog?


This should speed up the execution dramatically. It is disabled by default so that the node also works with data sets that cannot be processed in memory, but with your 2 million rows and First as the aggregation method this shouldn't be a problem.
If the option is not enabled, the node first sorts the input table by the group and pivot columns in order to process each group one after the other.
Bye
Tobias


Indeed I hadn't activated that feature. I have now tried with it activated, but the effect is surprisingly small (25 s vs. 22 s; Python takes 10 s including serialization). I suspect the issue is my data: in many cases (the regular case) a group consists of only one row, but grouping is still needed because in the less common cases there are multiple rows.
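For anyone who wants to reproduce a rough timing of the pandas side alone, here is a sketch with synthetic data shaped like the case described above (mostly one row per group, occasional duplicates); the row count, column names, and value distribution are all made up for illustration:

```python
import time
import numpy as np
import pandas as pd

# Synthetic data: two rows per group, alternating 'x'/'y' names,
# so every group pivots into exactly one output row.
n = 200_000
df = pd.DataFrame({
    'Group': np.arange(n) // 2,
    'Name':  np.where(np.arange(n) % 2 == 0, 'x', 'y'),
    'Value': np.random.rand(n),
})

start = time.perf_counter()
out = (
    df.pivot_table(index=['Group'], columns='Name',
                   values='Value', aggfunc='first')
    .reset_index()
)
elapsed = time.perf_counter() - start
print(f"pivot of {n} rows took {elapsed:.2f} s, {len(out)} rows out")
```

This only measures the pivot itself, not the serialization between KNIME and Python, so it is the optimistic end of the comparison.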


This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.