I am building a model that does some processing on tables of 2-4 million records. It is great that KNIME executes nodes in parallel, but I do not have much memory allocated for KNIME (only 1 GB, which is all the free memory I have). So at times KNIME cannot free up enough memory in time and everything just crashes. For now I resort to clicking run on subsets of nodes that I know can run in parallel without memory issues.
My question is: Is there any way I can configure KNIME to run only one node at a time for the whole process? I hope that this way it will have enough time to clear some memory before the next execution.
I don't know if that is possible, but for large tables with many records, try setting the node's 'Memory Policy'. To do that, go to the 'Memory Policy' tab when you configure the particular node. There are 3 options:
- keep all in memory - all of the data will be kept in RAM;
- keep only small tables in memory - only portions of the data will be kept in RAM;
- write tables to disc - all data will be kept on the hard drive;
Choosing the third option should decrease memory usage, but it will also slow the workflow down.
I have switched all the large nodes to write to the hard disk, but it is still the same, because in my setup there are 10 nodes that are supposed to process the large tables. And because these 10 nodes have no prior operations, if I click Execute All, all 10 nodes start simultaneously and 20+ million records get processed at once. hahaha...
Well, if that is the case, then I will need to click them one by one. Thanks.
You can induce an execution order on these nodes by connecting them via the flow variable ports (right click any node, then select "Show Flow Variable Ports").
Can you elaborate on which nodes cause the memory problems? It shouldn't matter much how many nodes are executed in parallel (unless, of course, these are ten memory-intensive nodes such as the Weka or certain other learner nodes).
...and you can set the number of threads used by KNIME in the preferences. The default is twice the number of CPU cores. Since each node runs in its own thread, reducing this number also reduces the number of concurrently running nodes.
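For completeness: the 1 GB heap ceiling mentioned above is controlled by the standard JVM -Xmx option in the knime.ini file next to the KNIME executable (it follows the -vmargs marker, as in any Eclipse-based application). A minimal sketch — the value shown is just an example, and any other lines already in your knime.ini should be left as they are:

```ini
-vmargs
-Xmx1024m
```

Raising this value (if the machine allows it) is the most direct fix for a java.lang.OutOfMemoryError, independent of how many nodes run in parallel.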
I have right-clicked several nodes and there is no option for "Show Flow Variable Ports"; I just see the regular Execute, Reset, Edit Name, etc. I am running KNIME 2.3.4.
My setup is like this: I am doing some matching of database records, so I have one database reader (a) fed into a Joiner node together with another database reader (b). (b) is just a small table with about 5k records, while (a) is the large one with 2-4 million records. (This could be done through database joins directly, but there are other regular operations as well, like GroupBy and Filter; I did not want to write a complex SQL statement, so I used the nodes instead.)
So (a) joins with (b); that is no issue. It can run to the end of the regular process, no learner nodes or Weka nodes, just plain simple nodes. hahaha.
But the issue is, I have 10 different (a) tables to compare with 10 different (b)s. So instead of running once, changing the table name, and running again, I resorted to duplicating the same process 10 times. Now if I click "Execute All", it starts loading the 10 (a) tables together.
It runs for a while, and then the whole KNIME just closes itself. When I check the logs, it states:
!MESSAGE An unexpected runtime error has occurred. The application will terminate.
java.lang.OutOfMemoryError: Java heap space
So I was wondering if there is a way to limit the execution so that it runs the 10 processes one by one.
Apart from just treating the symptoms...
The OoM problem seems to be caused by the concurrently running Joiner nodes (they partially sort the data, and we are currently investigating a memory problem when several of them run simultaneously).
As for the missing "Show Flow Variable Ports" option: you have to enable the expert mode in KNIME v2.3.4 (but not in v2.4, where it will be enabled by default).
Correct... limiting the threads has the effect of queueing the concurrent nodes... works for me now... hahaha...
This piece of information helped me a lot!
I have a flow in which a CSV Writer node is attached to a SharePoint login node. The data input for the CSV node takes about 3 hours to compute, by which time the SPO login has expired again. The SPO node is executed right away, though, since it has no predecessors. Now, with the flow variable connection, it has!