I was just wondering if anyone could provide some hardware/KNIME settings advice.
I have a very large workflow that begins with a dataset of ~40k rows, to which I apply a recursive loop that builds a dataset of ~9m rows.
In the next stage, it takes the 9m row dataset and applies numerous calculations (growing the dataset to a total of 70 separate columns), one after the other. Usually these are done by Math nodes, rule engines, time differences or time shifts. Each node takes at least 1-2 minutes to complete, often more.
At the end of the process the data is summarized using GroupBy nodes into various reports. One of these GroupBy nodes alone takes ~ 36 minutes to complete.
On my PC, with a 3.6GHz processor and 12GB of RAM, this whole workflow takes ~4hrs or more to run. Tweaking the model and getting new outputs has therefore become a nightmare.
In an effort to get the model to run much faster, I installed KNIME desktop directly on a decent server that we intend to use for a database that is not yet up and running. The server has masses more RAM (72GB) available, so I expanded the heap space available to around 46GB.
However, the workflow actually runs slower and has failed to finish twice - whereas it always completed on my PC without issue (albeit slowly).
I am a huge fan of KNIME and am trying to get the business to consider the server version of this software. But if I can't get a 9m row model to run in a reasonably short amount of time (<30 minutes), then I fear I'll fail with this.
Would anyone be able to offer any advice with this predicament? I am an analyst (not an IT person) but I need to communicate something to my IT resources that they will be able to implement.
I guess you have increased the heap size (KNIME is based on Java and Java requires the memory assigned to the process defined up-front)?
Loops might be slow (in particular recursive loops). It might help if you run some of the trivial work that you do within the loop in streaming mode to avoid extra I/O (so optimize the part that takes 1-2 minutes per column).
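To illustrate why streaming can avoid extra I/O, here is a rough Python analogy. It is only a conceptual sketch, not how KNIME implements streaming: the column names (`a`, `b`, `ratio`, `flag`) and both helper functions are made up for the example. The materialized version builds a complete intermediate table after every step, whereas the streamed version passes each row through all steps before touching the next row.

```python
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, object]


def add_ratio(rows: Iterable[Row]) -> Iterator[Row]:
    """Row-wise calculation (think: a Math node). Hypothetical columns a, b."""
    for row in rows:
        yield {**row, "ratio": row["a"] / row["b"]}


def add_flag(rows: Iterable[Row]) -> Iterator[Row]:
    """Row-wise rule (think: a Rule Engine) that depends on the previous step."""
    for row in rows:
        yield {**row, "flag": "high" if row["ratio"] > 1 else "low"}


def materialized(rows: List[Row]) -> List[Row]:
    # Each step produces a full intermediate table before the next one starts.
    step1 = list(add_ratio(rows))
    step2 = list(add_flag(step1))
    return step2


def streamed(rows: List[Row]) -> List[Row]:
    # Steps are chained; no intermediate table is ever built.
    return list(add_flag(add_ratio(rows)))


if __name__ == "__main__":
    sample = [{"a": 4, "b": 2}, {"a": 1, "b": 2}]
    print(streamed(sample))
```

Both functions return the same rows; the difference is only in how much intermediate data exists at any one time, which is where the saving would come from on a 9m row table.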
I used the streaming technique after updating KNIME and it dropped the time taken on the long string of consecutive node calculations from 86 minutes to 50 minutes. So this was a good saving, thanks!
One thing I noticed was that a number of the nodes I am using (mainly time difference and time shift nodes) were not covered by the streaming technique. If they were, I think the time saving would have been much greater, since these nodes create bottlenecks as a result.
I was just wondering why, as there are no inter-row dependencies in these nodes (unlike a GroupBy node, for example). Presumably these will be optimised for wrapped streaming in due course?
Many thanks again for your assistance with this, really like this streaming technique.
I guess you have increased the heap size (KNIME is based on Java and Java requires the memory assigned to the process defined up-front)?
- Yes, I increased the heap space on my own PC to ~9GB and on the server to ~46GB. The total available RAM on the server is 125GB.
- I also notice that the server has a slower processor (1.6GHz on the server vs. 3.6GHz on the PC) but more cores (6 on the server vs. 4 on the PC).
- I guess it depends on how well the nodes utilize the resources on the server?
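As a back-of-the-envelope check on the clock speed vs. core count question, here is a small Python sketch using the figures above (3.6GHz with 4 cores on the PC, 1.6GHz with 6 cores on the server). It rests on two loud assumptions that are mine, not measurements: per-core throughput scales linearly with clock speed, and some fraction of the work parallelises perfectly across cores while the rest is serial (an Amdahl's-law style model). The function names are hypothetical.

```python
def relative_runtime(parallel_fraction: float, clock_ghz: float, cores: int) -> float:
    """Runtime in arbitrary units under a crude model (assumption):
    serial work scales with 1/clock, parallel work with 1/(clock * cores)."""
    serial = (1.0 - parallel_fraction) / clock_ghz
    parallel = parallel_fraction / (clock_ghz * cores)
    return serial + parallel


def server_vs_pc(parallel_fraction: float) -> float:
    """Ratio of server runtime to PC runtime; values above 1 mean the server is slower."""
    pc = relative_runtime(parallel_fraction, clock_ghz=3.6, cores=4)
    server = relative_runtime(parallel_fraction, clock_ghz=1.6, cores=6)
    return server / pc


if __name__ == "__main__":
    for p in (0.0, 0.5, 0.9, 1.0):
        print(f"parallel fraction {p:.1f}: server takes {server_vs_pc(p):.2f}x the PC time")
```

Under these assumptions the server comes out between 1.5x and 2.25x slower than the PC, however well the nodes parallelise, because 6 cores at 1.6GHz offer less total clock than 4 cores at 3.6GHz. Real behaviour will also depend on CPU generation, disk speed and memory, so treat this as a rough sanity check only.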
It might help if you run some of the trivial work that you do within the loop in streaming mode to avoid extra I/O
- Thanks, I had not heard of streaming mode, so this could help! Much appreciated.
- If this does not improve performance significantly, I think in the end I may have to write the data to a DB and do the calculations, group-bys, pivots, etc. in there, then suck the results back out of the DB into the workflow. A bit inelegant, but it will probably work as the DB is very fast.
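For anyone curious what the "do the group-by in the DB" fallback looks like in principle, here is a minimal Python sketch using the standard-library `sqlite3` module with an in-memory database. The table name `events` and the columns `site` and `duration` are invented for illustration; the real workflow would point at the actual database and use its own schema.

```python
import sqlite3
from typing import Iterable, List, Tuple


def summarize_in_db(rows: Iterable[Tuple[str, float]]) -> List[Tuple[str, int, float]]:
    """Load rows into a (hypothetical) table and let the database do the aggregation."""
    conn = sqlite3.connect(":memory:")  # in-memory DB, for the sketch only
    try:
        conn.execute("CREATE TABLE events (site TEXT, duration REAL)")
        conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
        # The GROUP BY runs inside the database; only the summary comes back.
        cursor = conn.execute(
            "SELECT site, COUNT(*), SUM(duration) "
            "FROM events GROUP BY site ORDER BY site"
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    data = [("A", 1.0), ("B", 2.0), ("A", 3.0)]
    print(summarize_in_db(data))
```

The point is that only the small summarized result travels back to the workflow, rather than all 9m rows being aggregated client-side.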