Hi @exceluser, and welcome to the KNIME community,
Firstly, I’m pleased to hear that you found the webinar useful. Thank you for watching!
Other people are better placed than me to discuss the internal workings of KNIME, but if you read in a 1,000,000-row file, that file is read only once: when the File Reader/Excel Reader/CSV Reader (or whatever reader you use) executes.
After this, the rows from that file are held in memory as a data table, and all subsequent processing acts on the data held in memory. Of course, if you are short on memory, some paging to disk may be required, but in general, provided you have sufficient memory, this is where the data resides.
Each node in the workflow acts on that data in turn, supplying results (also held in memory) to the nodes downstream. By default, these actions are sequential: one node does its thing, then control is passed to the next node. Some nodes can also do “stream processing”; although I have tried it out, I’m by no means experienced with it, but it can result in improved performance. However, there is a limitation: any node that requires the entire data set in order to complete its work will not be able to stream to subsequent nodes.
So in general, “out of the box”, KNIME neither reads each file for each step, nor does it process each line through the entire workflow. Over-simplified, you can imagine that each node receives the entire data supplied to it from the previous node, and this is all in memory. It then processes it and passes the output onwards to the next node.
The amount of memory made available to KNIME has a big impact on performance, and it can be configured in the knime.ini file (in the KNIME installation folder). Adding a line such as -Xmx20G will set the maximum available memory to 20 GB, which is what mine is set to on my PC with 32 GB of system memory. With larger systems and higher data volumes, I know there are others who set theirs much higher than this! Just make sure not to set it so high that KNIME is allocated too much of your system’s memory (and thus affects overall system performance), but set it sufficiently high that KNIME has room to work.
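As a quick illustration (the exact contents of knime.ini vary by KNIME version and platform, so treat this as a sketch), the JVM options such as -Xmx go after the -vmargs line; anything above that line belongs to the launcher itself:

```
...existing launcher options...
-vmargs
-Xmx20G
```

If you add -Xmx above -vmargs, it will be ignored, so position matters here.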
In comparison to writing a script in, say, Python, I would say that a dedicated native script written for a specific piece of processing is likely to outperform KNIME in a basic speed test, though obviously that depends on how well the script is written. However, it really depends on your needs. I can put together a KNIME workflow that is pretty much guaranteed to work, doing some quite complex file handling and data manipulation, much faster than I could write and test a piece of Python or Java. And the great thing is that if I need to tweak it, I can do so without recoding and lots of re-testing.
So, on to practical numbers. I didn’t have a 1,000,000-row file to hand, so I simulated one. I could have written it out to a file and read it back in, but the point is how fast KNIME will process the data…
I used a Data Generator node to generate 2 million records, each consisting of 4 random numbers in the range 0.0 to 1.0. I then converted these all to Strings, sorted them, applied some regex replacements, sorted them again in a different order, and combined them into an additional single column.
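For comparison, the same sequence of steps can be sketched in plain Python. This is just an illustrative stand-in, not the actual script I benchmarked against: the row count is scaled down to 100,000, and the particular regex (stripping the leading “0.”) is my own invented example of the kind of replacement a String Replacer node might do.

```python
import random
import re

random.seed(42)
n_rows = 100_000  # scaled down from the 2 million rows used in the KNIME test

# Generate rows of 4 random numbers in the range 0.0 to 1.0
# (mirrors the Data Generator node)
rows = [[random.random() for _ in range(4)] for _ in range(n_rows)]

# Convert every value to a String (mirrors Number To String)
str_rows = [[f"{v:.6f}" for v in row] for row in rows]

# Sort by the first column (mirrors the first Sorter)
str_rows.sort(key=lambda r: r[0])

# A sample regex step: strip the leading "0." prefix
# (a hypothetical stand-in for the regex replacement in the workflow)
pattern = re.compile(r"^0\.")
str_rows = [[pattern.sub("", v) for v in row] for row in str_rows]

# Sort again by a different column (mirrors the second Sorter)
str_rows.sort(key=lambda r: r[3])

# Combine the four columns into a single additional column
# (mirrors a Column Combiner)
combined = ["_".join(row) for row in str_rows]

print(len(combined))
```

Even this toy version takes a fair amount of code to write and test, which is exactly the trade-off described above: the script may run faster, but the workflow is quicker to build and change.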
Reading in a file would take a short time, depending on the file system and record count, but once the data is in memory the rest is the same. The timings for the node executions, in milliseconds, are shown. The longest-running nodes were the two Sorters, at around 7.5 and 9 seconds. I just knocked that workflow together to use as an example here; I cannot imagine how long it would have taken me to do all of those actions in a script for demo purposes.
I hope that helps, but what I’d suggest most of all is to give it a try and see if it suits your needs. I haven’t looked back since I started using KNIME, and even as a former Java developer, it has totally revolutionised how I work with data and databases.