Automate Away Spreadsheet Tasks with Trusted KNIME Community Contributor: KNIME optimization

The event: “Automate Away Spreadsheet Tasks with Trusted KNIME Community Contributor” was very informative.

I am new to KNIME and thinking about use cases. Apologies if this topic has been covered elsewhere.

I have a question about KNIME's internal optimization: if a workflow has multiple steps (splitting, regular expressions, etc.), does KNIME read the entire input file for each step and process it separately, or does each input line go through the entire workflow as it is read?

My concern is the processing time for large input files. If a workflow has 10+ steps and the input file is 1 million records, reading the file 10 times could be very slow compared to a script that reads a CSV file and processes it in one pass.

Hi @exceluser, and welcome to the KNIME community,

Firstly I’m pleased to hear that you found the webinar useful. Thank you for watching!

Other people are better placed than me to discuss the internal workings of KNIME, but if you read in a 1,000,000-row file, that file is read only once, which is when the File Reader/Excel Reader/CSV Reader/whatever node executes.

After this, the rows from that file are held in memory as a data table, and all subsequent processing acts on the data held in memory. Of course, if you are short on memory, some paging to disk may be required, but in general, provided you have sufficient memory, this is where the data resides.

Each node in the workflow acts on that data in turn, supplying results (also held in memory) to the nodes downstream. By default, these actions are sequential: one node does its thing and then control passes to the next node. Some nodes also support "streaming" execution, which can result in improved performance; I have tried it out, but I'm by no means experienced with it. There is a limitation, though: any node that requires the entire data set before it can complete its actions will not be able to stream to subsequent nodes.

So in general, "out of the box", KNIME neither reads the file once per step, nor does it process each line through the entire workflow. Over-simplified: each node receives the entire data set from the previous node, all in memory, processes it, and passes the output onwards to the next node.
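If it helps to picture the difference, here is a rough Python analogy (purely illustrative; this is not how KNIME is implemented). The first half mimics the default node-by-node execution, where each step materialises a complete table before the next step starts; the second half mimics streaming, where each row flows through the whole pipeline one at a time:

```python
import re

# Stand-in for the input file: one string per row
rows = [f"{i},{i * 0.5}" for i in range(1_000_000)]

# "Default" style: each step consumes and produces a complete table in
# memory, like one node finishing before the next one starts.
step1 = [r.split(",") for r in rows]
step2 = [[re.sub(r"\.0$", "", c) for c in r] for r in step1]
result_batch = [";".join(r) for r in step2]

# "Streaming" style: generators are chained, so each row passes through
# the whole pipeline as it is read; nothing intermediate is materialised.
split = (r.split(",") for r in rows)
cleaned = ([re.sub(r"\.0$", "", c) for c in r] for r in split)
result_stream = [";".join(r) for r in cleaned]

assert result_batch == result_stream
```

Either way the input is only traversed once; the difference is just how much intermediate data exists at any moment.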

The amount of memory made available to KNIME has a big impact on performance, and can be configured in the knime.ini file (in the KNIME installation folder). Adding a line such as -Xmx20G to this file will set the maximum available memory to 20 GB, which is what mine is set to on my PC with 32 GB of overall system memory. With larger systems and higher data volumes, I know there are others who set theirs much higher than this! Make sure not to set it so high that KNIME gets allocated too much of your system's memory (and thus affects overall system performance), but set it sufficiently high that KNIME has room to work.
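For reference, the end of knime.ini would then look something like this (other JVM options may already be present in your installation; the important thing is that memory settings such as -Xmx go after the -vmargs line):

```
-vmargs
-Xmx20G
```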

In comparison to writing a script in, say, Python, I would say that a dedicated native script written for a specific processing task is likely to outperform KNIME in a basic speed test, though obviously it would depend on how well the script is written. However, it really depends on your needs. I can put together a KNIME workflow that is pretty much guaranteed to work for some quite complex file handling and data manipulation much faster than I could write and test a piece of Python or Java. And the great thing is that if I need to tweak it, I can do so without recoding and lots of re-testing.

So on to practical numbers. I didn't have a 1-million-row file to hand, so I simulated it. I could have written it out to a file and read it back in, but the point is how fast KNIME will process the data…

I used a Data Generator node to generate 2 million records consisting of 4 random numbers in the range 0.0 to 1.0. I then converted these all to strings, sorted them, performed some regex replacements, sorted them again in a different order, and combined them into an additional single column.
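For anyone who prefers to see that in script form, here is roughly what the same steps might look like in Python with pandas (my own approximation of the workflow; the node-to-code mapping and the regex are just examples, not what the actual demo used):

```python
import numpy as np
import pandas as pd

# "Data Generator": 2 million rows of 4 random numbers in [0.0, 1.0)
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.random((2_000_000, 4)), columns=["c1", "c2", "c3", "c4"])

# "Number to String": convert every column to strings
df = df.astype(str)

# First "Sorter": sort by the first column
df = df.sort_values("c1")

# Regex step (example only): truncate each value to 4 decimal places
df = df.replace(to_replace=r"(\d\.\d{4})\d*", value=r"\1", regex=True)

# Second "Sorter": sort again in a different order
df = df.sort_values("c4", ascending=False)

# "Column Combiner": merge all four columns into one additional column
df["combined"] = df["c1"] + "," + df["c2"] + "," + df["c3"] + "," + df["c4"]
```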

Reading in a file would add a short amount of time, depending on the file system and the number of records, but once the data is in memory the rest is the same. The timings for the node executions, in milliseconds, are shown. The greatest processing times were the two sorters, at around 7.5 and 9 seconds. I just knocked that workflow together to use as an example here; I cannot imagine how long it would have taken me to do all of those actions in a script for demo purposes :wink:

I hope that helps, but what I’d suggest most of all is give it a try, and see if it suits your needs. I haven’t looked back since I started using KNIME, and even as a former java developer, it has totally revolutionised how I work with data and databases.


@exceluser I think @takbb has already pointed out the most important aspects. A graphical interface will always need some resources in exchange for the benefits of ease of use, (automated) documentation, and accessibility.

Concerning CSV files, I think there are some improvements coming to the reader so that it can handle even gigantic files.

Then, in all systems, you might need some sort of strategy for dealing with very large files (hence the rise of Big Data systems, which KNIME can also work with). If you want to keep them locally, compression and partitioning might be the way to go. I have also used the strategy of splitting up KNIME workflows, where one initial workflow just does the loading and preparation, in order to save on RAM.
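As a sketch of the partitioning idea in script form (file and column names here are hypothetical), chunked reading avoids ever holding the full file in memory:

```python
import pandas as pd

# Process a very large CSV in chunks instead of loading it all at once;
# "huge_input.csv" and the "value" column are purely illustrative.
chunks = pd.read_csv("huge_input.csv", chunksize=100_000)
for i, chunk in enumerate(chunks):
    filtered = chunk[chunk["value"] > 0.5]
    # Write each processed partition to its own compressed file
    filtered.to_csv(f"partition_{i:04d}.csv.gz", index=False, compression="gzip")
```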

But maybe it's best to give it a try. If your machine is capable, KNIME is a very powerful tool.

Here are a few notes on what you could explore.
