Processing time takes too long

I have a simple workflow that reads a file with 20 million rows and 18 columns and then groups by a single column.
In Alteryx this process takes 3 minutes.
In KNIME it takes almost 20 minutes.

Why? Is there any way to improve this?

Hello @silasbatista,

and welcome to KNIME Community!

Hopefully we can, as 20 minutes does sound like a lot! What KNIME version are you on? What kind of file are you reading, and what nodes are you using?

Br,
Ivan


Hi @ipazin, thank you for your reply.
I’m using version 4.4.0.
The file type is txt and the workflow is very simple: a File Reader node and a GroupBy node.


Hello @silasbatista,

Are both nodes slow, or does one take significantly longer than the other? There is a Timer Info node for this kind of measurement.

And how big is the file on disk? How much memory have you assigned to KNIME? (See here how to find that and increase it for better performance.)

Br,
Ivan


Hi @silasbatista, and welcome to the KNIME Community.

Adding to the questions that @ipazin is asking, what kind of operations are you doing in the GroupBy? Aggregation? Count? On multiple columns?

This is the time:

[image: time]

And this is the aggregation:

I have 8 GB of RAM, and in knime.ini I specified -Xmx6144m.
The file size is 4.3 GB.
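
To put these numbers side by side, here is a small Python sketch. The helper `parse_xmx_mib` is hypothetical (it is not part of KNIME) and assumes the `-Xmx` line uses the usual JVM size suffixes k, m, or g. The sample knime.ini text is illustrative; only the `-Xmx6144m` value, the 8 GB of RAM, and the 4.3 GB file size come from the post above, and GB is treated as GiB for simplicity.

```python
import re

def parse_xmx_mib(ini_text):
    """Return the JVM max heap from a knime.ini-style text in MiB, or None."""
    match = re.search(r"^-Xmx(\d+)([kKmMgG]?)\s*$", ini_text, flags=re.MULTILINE)
    if match is None:
        return None
    value, unit = int(match.group(1)), match.group(2).lower()
    # JVM sizes are bytes by default; k, m, g are binary multiples.
    factor = {"": 1 / (1024 * 1024), "k": 1 / 1024, "m": 1, "g": 1024}[unit]
    return value * factor

# Illustrative knime.ini excerpt; only the -Xmx value comes from the post above.
sample_ini = "-vmargs\n-Xmx6144m\n"

heap_gib = parse_xmx_mib(sample_ini) / 1024
ram_gib = 8.0   # physical RAM reported above (assumed to mean GiB)
file_gb = 4.3   # size of the input file on disk

print(f"heap: {heap_gib:.1f} GiB, left for OS and other programs: {ram_gib - heap_gib:.1f} GiB")
print(f"file size as a share of heap: {file_gb / heap_gib:.0%}")
```

With 6 GB of the 8 GB given to the JVM, only about 2 GB remain for the operating system and everything else, and the raw file alone is roughly 70% of the heap.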

The File Reader is streamable. Did you try wrapping it in a component and streaming it?

@izaychik63 This is the time with streaming:

[image: stream]

It looks like a memory issue. Please look at the KNIME optimization topics.

Just to make sure:
the File Reader is now the former “Simple File Reader”, which is faster, and it is also faster than the CSV Reader node, correct?

It is the former Simple File Reader.


You could try the “Process in Memory” option of the GroupBy node. On top of that, it might help to apply a Column Filter before the GroupBy, followed by a Cache node, to possibly limit the amount of memory used.
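
The idea behind filtering columns before grouping can be illustrated outside KNIME. The following is a minimal Python sketch, not how the GroupBy node is implemented: it reads a delimited text file row by row, uses only the group column and one numeric column, and keeps just one running total per group in memory. The column names `region` and `amount`, the tab delimiter, and the sum aggregation are assumptions for illustration, since the actual columns and aggregation are not shown in this thread.

```python
import csv
import io
from collections import defaultdict

def grouped_sum(lines, group_col, value_col, delimiter="\t"):
    """Sum value_col per group_col while streaming rows one at a time."""
    totals = defaultdict(float)
    reader = csv.DictReader(lines, delimiter=delimiter)
    for row in reader:
        # Only two of the columns are used; the rest are discarded per row.
        totals[row[group_col]] += float(row[value_col])
    return dict(totals)

# Tiny in-memory stand-in for the large text file (hypothetical data).
sample = io.StringIO(
    "region\tamount\tcomment\n"
    "north\t10.5\ta\n"
    "south\t4.0\tb\n"
    "north\t1.5\tc\n"
)
print(grouped_sum(sample, "region", "amount"))
```

For a real file, the `lines` argument could be a file object opened with `open(path, newline="")`. Memory use then grows with the number of distinct groups rather than with the number of rows or columns.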


Hello @silasbatista and others,

A couple of comments/notes:

  • increasing memory should help (6 GB is not much for such a wide and big table)
  • streaming helps only when multiple chained nodes support streaming (not the case here)
  • there is a new table backend with better performance that you can try out
  • some performance issues in v4.4 are being investigated and might be behind this as well

So, to answer your original question: there is a way to improve performance, and there is more to come! Additionally, since you are an Alteryx user, here is a topic that might help you:

Br,
Ivan


Ivan, in my experience streaming helps specifically for a standalone File Reader (I am on a v4.5 nightly build).
For 3m rows on an 8 GB PC, the load takes less than 2 min in a streamed component and 3 min without streaming.
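
To compare these timings with the ones in the original question, the figures can be converted to rows per second. The sketch below uses only numbers quoted in this thread, with "almost 20 minutes" rounded to 20, "less than 2 min" rounded to 2, and "3m rows" read as 3 million (an assumption). The files and machines differ, so the results are rough and not directly comparable.

```python
def rows_per_second(rows, minutes):
    """Average throughput in rows per second."""
    return rows / (minutes * 60)

timings = {
    "KNIME, 20M rows, read + GroupBy (~20 min)": (20_000_000, 20),
    "Alteryx, 20M rows, read + grouping (3 min)": (20_000_000, 3),
    "KNIME File Reader, 3M rows, streamed (~2 min)": (3_000_000, 2),
    "KNIME File Reader, 3M rows, not streamed (3 min)": (3_000_000, 3),
}

for label, (rows, minutes) in timings.items():
    print(f"{label}: {rows_per_second(rows, minutes):,.0f} rows/s")
```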


Seems you are right, @izaychik63 👍
Ivan

We have identified the cause of the slowdowns in 4.4. You can find a quick fix here:

best,
Gabriel

