I’m doing a simple workflow of reading a file with 20 million rows and 18 columns, and a single-column grouping.
On Alteryx this process takes 3 minutes.
On Knime almost 20 minutes.
Why? Is there any way to improve this?
Hello @silasbatista,
and welcome to KNIME Community!
Hopefully we can, as 20 minutes does sound like a lot! What KNIME version are you on? What kind of file are you reading, and which nodes are you using?
Br,
Ivan
Hi @ipazin, thank you for your reply.
I’m using version 4.4.0.
The file type is txt and the workflow is very simple: a File Reader node and a GroupBy node.
Hello @silasbatista,
Are both nodes slow, or does one take significantly longer than the other? There is the Timer Info node for this kind of measurement.
And how big is the file on disk? How much memory have you assigned to KNIME? (See here for how to find that setting and increase it for better performance.)
Br,
Ivan
Hi @silasbatista and welcome to the Knime Community.
Adding to the questions that @ipazin is asking, what kind of operations are you doing in the GroupBy? Aggregation? Count? On multiple columns?
This is the time:
And this is the aggregation:
I have 8GB of RAM, and in knime.ini I specified -Xmx6144m.
The file size is 4.3GB.
The File Reader is streamable. Did you try wrapping it in a component and running it with streaming execution?
@izaychik63 This is the time with streaming:
It looks like a memory issue. Please look at the KNIME optimization topics.
Just to make sure:
The File Reader is now the former “Simple File Reader”, which is faster, and it is also faster than the CSV Reader node, correct?
It is the former Simple File Reader.
You could try the “Process in Memory” option of the GroupBy node. On top of that, it might help to apply a Column Filter before the GroupBy, followed by a Cache node, to limit the amount of memory used.
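To illustrate why filtering columns and streaming can reduce memory pressure, here is a minimal Python sketch using only the standard library. It is not how KNIME implements the GroupBy node; the function name, column names, and sample data are all made up for illustration. It reads rows one at a time, touches only the two columns it needs, and keeps just one running total and count per group in memory.

```python
import csv
import io
from collections import defaultdict


def streamed_group_sum(lines, group_col, value_col, delimiter=","):
    """Sum and count one column per group while reading rows one at a time.

    Hypothetical helper for illustration only; not part of KNIME.
    """
    reader = csv.DictReader(lines, delimiter=delimiter)
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in reader:
        # Only the grouping column and the aggregated column are used.
        # The rest of the row is discarded, similar in spirit to a
        # column filter placed before the group-by.
        key = row[group_col]
        totals[key] += float(row[value_col])
        counts[key] += 1
    return {key: (totals[key], counts[key]) for key in totals}


# Tiny in-memory stand-in for the real text file (assumed layout).
sample = io.StringIO(
    "region,product,amount\n"
    "north,a,10.5\n"
    "south,b,4.0\n"
    "north,c,1.5\n"
)
result = streamed_group_sum(sample, group_col="region", value_col="amount")
print(result)
```

With this approach, memory use grows with the number of distinct groups rather than with the number of rows, which is the general idea behind combining a column filter with streamed or in-memory aggregation.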
Hello @silasbatista and others,
A couple of comments/notes:
So, to answer your original question: there is a way to improve performance, and there is more to come! Additionally, since you are an Alteryx user, I am sharing a topic which might help you:
Br,
Ivan
Ivan, from my experience streaming helps specifically for a standalone File Reader (my v4.5 nightly).
For 3m rows on an 8GB PC, the load takes less than 2 minutes in a streamed component and 3 minutes without streaming.
Seems you are right, @izaychik63.
Ivan
We have identified the cause of the slowdowns in 4.4. Take a look here for a quick fix:
Best,
Gabriel