We’ve been doing some tests to compare the execution times of nodes in KNIME (such as Joiner, Row Aggregator, and Row Filter) to equivalent functions in Python (using pandas). We’re using a large dataset with 7 columns and around 20 million rows. We’re finding that the execution times of the KNIME nodes are typically around 100 times longer than those of the equivalent Python functions.
We were expecting KNIME to be slower than Python, but have been surprised by how much. Can anyone explain why KNIME seems to be so much less efficient than Python? And are there any settings that can be changed to significantly improve its speed?
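For context, the pandas calls we’re treating as equivalents of those nodes look roughly like this (column names and values are invented for illustration; the real dataset has 7 columns and ~20M rows):

```python
import pandas as pd

# Toy stand-in for the real dataset
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "value": [10.0, 20.0, 30.0, 40.0],
    "group": ["a", "a", "b", "b"],
})
lookup = pd.DataFrame({"id": [1, 2, 3, 4], "label": ["w", "x", "y", "z"]})

joined = df.merge(lookup, on="id", how="inner")       # ~ Joiner node
aggregated = df.groupby("group")["value"].sum()       # ~ Row Aggregator node
filtered = df[df["value"] > 15.0]                     # ~ Row Filter node
```

We time each call with `time.perf_counter()` around it and compare against the node execution times KNIME reports.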
And to give a bit of background on why pure Python can outperform KNIME in certain scenarios: pandas relies heavily on numpy and other operations optimized in C, which work extremely efficiently for a restricted set of data types (mostly numbers). KNIME, on the other hand, offers extensibility for data types (that’s why there are also molecules and images and …), so all kinds of operations (e.g. sorting, grouping, …) need to account for this extensibility. All of that happens in Java, which should be fast, but is probably not as fast as C. So for us there’s often a tradeoff between efficiency and extensibility, and while we strive to achieve both, you may run into situations where the efficiency is not optimal.
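A rough way to feel that tradeoff from the Python side: sum a fixed-width numpy array (a tight C loop over a contiguous buffer) versus the same values boxed as generic Python objects (per-element dispatch, closer in spirit to handling arbitrary, extensible types). This is only an analogy, not a description of KNIME internals:

```python
import time
import numpy as np

n = 1_000_000
arr = np.arange(n, dtype=np.int64)   # fixed-width C buffer
objs = arr.astype(object)            # each element boxed as a Python object

t0 = time.perf_counter()
fast_sum = arr.sum()                 # vectorized C loop
t_fast = time.perf_counter() - t0

t0 = time.perf_counter()
slow_sum = objs.sum()                # generic per-element dispatch
t_slow = time.perf_counter() - t0
```

On typical hardware the object-dtype sum is one to two orders of magnitude slower, even though both compute the same result.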
Thanks, that’s useful information. Are there other fundamental reasons why KNIME is likely to be slower? The post at Knime is slowing down - #2 by marc-bux suggests a key difference is that every KNIME node writes data to disk, and this adds execution time that isn’t incurred by most Python commands.
It’s good to hear you’re trying out KNIME. Unfortunately speed is not its strong point. Are you running your Python scripts and KNIME workflows locally or in the cloud? What are your hardware specs? It helps to have context on how you’re running it.
I run KNIME on my work laptop and generally do analysis and processing on data up to 5 million rows with 10 to 30 columns. Under 1 million rows it runs fast. 1-5 million it gets progressively slower, and above 5 million it becomes too slow for my puny laptop to handle. On a recent project, we had 10 million lines of data to process. Locally running it was painful. It took 5-10 minutes to run each node, and I had ~30 nodes. For 20 million lines I couldn’t do it.
We looked into running the workflow on the KNIME Hub cloud to speed up processing time. It worked quite well! Using the moderate computational speed option, our 10 million rows got processed in 11 minutes instead of 60+, and it only cost us $5-$10 AUD.
I’m still trying to figure out what makes KNIME slow and how to speed it up. One thing that works great for me when setting up a workflow is to run only a fraction of the data (100k rows) to make for speedy development. Then if I need to rerun my data, I’m not rerunning the full amount. Once I’m confident in the robustness of the workflow I’ll run the full amount.
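In pandas terms, the same trick is just slicing off the first chunk of rows while developing (the helper name here is my own, not a library function):

```python
import pandas as pd

def dev_subset(df: pd.DataFrame, n: int = 100_000) -> pd.DataFrame:
    """Take the first n rows for fast iteration while developing;
    swap back to the full frame once the logic is trusted."""
    return df.head(n)

full = pd.DataFrame({"x": range(500_000)})
small = dev_subset(full)   # develop and debug against this
```

In KNIME the analogous move is putting a sampling/partitioning node right after your reader and disabling it for the final run.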
I rarely use Python these days, so I can’t compare its speed to KNIME’s for 10 million rows. But for me the benefit of the drag-and-drop tools and highly visual interface outweighs the performance cost (most of the time). Hearing that the KNIME nodes are taking 100 times longer than Python doesn’t sound right. It shouldn’t be more than 2-3 times slower in my experience. There might be a bottleneck somewhere.
Yes, KNIME writes the output of each node to disk (I believe), which is fine when you have small amounts of data but is excruciating for large amounts. That post you link is from 2018 and KNIME has changed a hell of a lot since then, so I’m not sure how relevant it still is. I’m not sure how Python handles this in comparison.
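To get a feel for what persisting every intermediate result costs, here’s a crude pandas comparison: the same filter kept in RAM versus round-tripped through a file on disk. This only simulates the idea; KNIME’s columnar table storage is far more sophisticated than a CSV write:

```python
import os
import tempfile
import time

import pandas as pd

df = pd.DataFrame({"x": range(200_000)})

# In pandas, intermediate results normally stay in RAM:
t0 = time.perf_counter()
step1 = df[df["x"] % 2 == 0]
in_memory = time.perf_counter() - t0

# Crudely simulating a per-step write to disk between operations:
t0 = time.perf_counter()
path = os.path.join(tempfile.mkdtemp(), "step1.csv")
step1.to_csv(path, index=False)
step1_back = pd.read_csv(path)
with_disk = time.perf_counter() - t0
```

The round-trip version is dramatically slower per step, which compounds quickly across a 30-node workflow.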
There are chunk and parallel chunk loop nodes which you can take advantage of to speed up processing and to minimise how much data gets written to disk. I tried the chunk loop node when running my recent 10 million data script and it helped a little bit. Haven’t yet tried the parallel chunk node.
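For comparison, the chunked pattern looks something like this in pandas (the function name and per-chunk step are placeholders for whatever your loop body actually does):

```python
import pandas as pd

def process_in_chunks(df: pd.DataFrame, chunk_size: int = 1_000_000) -> pd.DataFrame:
    """Run a per-chunk step over slices of the frame, analogous to
    wrapping nodes in a chunk loop in KNIME."""
    parts = []
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size]
        parts.append(chunk[chunk["value"] > 0])  # replace with the real per-chunk work
    return pd.concat(parts, ignore_index=True)
```

Only one chunk’s worth of data is being worked on at a time, which is what keeps memory pressure (and in KNIME’s case, per-iteration table size) down.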