PC/KNIME Specifications and Slow Workflow

mrman101 · November 11, 2016, 4:47pm

Hi All,

I was just wondering if anyone could provide some hardware/KNIME settings advice.

I have a very large workflow that begins with a dataset of ~40k rows, to which I apply a recursive loop that builds a dataset of ~9m rows.

In the next stage, it takes the 9m row dataset and applies numerous calculations (growing the dataset to a total of 70 separate columns), one after the other. Usually these are done by Math nodes, rule engines, time differences or time shifts. Each node takes at least 1-2 minutes to complete, often more.

At the end of the process the data is summarized using GroupBy nodes into various reports. One of these GroupBy nodes alone takes ~ 36 minutes to complete.

On my PC, with a 3.6GHz processor and 12GB of RAM, this whole workflow takes ~4 hours or more to run. Tweaking the model and getting new outputs has therefore become a nightmare.

In an effort to get the model to run much faster, I installed KNIME desktop directly on a decent server that we intend to use for a database that is not yet up and running. The server has far more RAM available (72GB), so I expanded the heap space to around 46GB.

However, the workflow actually runs slower on the server and has failed to finish twice, whereas it always completed on my PC without issue (albeit slowly).

I am a huge fan of KNIME and am trying to get the business to consider the server version of this software. But if I can't get a 9m row model to run in a reasonably short amount of time (<30 minutes), then I fear I'll fail with this.

Would anyone be able to offer any advice with this predicament? I am an analyst (not an IT person) but I need to communicate something to my IT resources that they will be able to implement.

Any help would be greatly appreciated!

wiswedel · November 14, 2016, 10:58am

Thanks for all the details.

I guess you have increased the heap size (KNIME is based on Java, and Java requires the memory assigned to the process to be defined up front)?

Loops might be slow (in particular recursive loops). It might help if you run some of the trivial work that you do within the loop in streaming mode to avoid extra I/O (so optimize the part that takes 1-2 minutes per column).
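
To illustrate the idea behind streaming, here is a minimal Python sketch. It is not KNIME code and does not reflect how KNIME implements streaming internally; the column names (`price`, `qty`, `total`, `large`) and the function names are made up for the example. It contrasts building a full intermediate table after every step with passing each row through all steps in one go.

```python
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, object]

def add_total(rows: Iterable[Row]) -> Iterator[Row]:
    """Row-wise calculation: needs only the current row."""
    for row in rows:
        yield {**row, "total": row["price"] * row["qty"]}

def add_flag(rows: Iterable[Row]) -> Iterator[Row]:
    """Second row-wise calculation, depends on the column added above."""
    for row in rows:
        yield {**row, "large": row["total"] > 100}

def run_materialized(rows: Iterable[Row]) -> List[Row]:
    """Node-by-node style: a complete intermediate table after each step."""
    step1 = list(add_total(rows))   # whole table held (or written out) here
    step2 = list(add_flag(step1))   # and again here
    return step2

def run_streamed(rows: Iterable[Row]) -> Iterator[Row]:
    """Streaming style: each row flows through both steps, no intermediate table."""
    return add_flag(add_total(rows))

# Hypothetical input data, 20 rows.
source = [{"price": 10.0, "qty": q} for q in range(1, 21)]

materialized = run_materialized(source)
streamed = list(run_streamed(source))
```

Both variants produce the same rows; the streamed one simply never stores the table between the two calculations, which is where the I/O saving would come from on a 9m row dataset.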

Bernd

wiswedel · November 14, 2016, 11:00am

Btw, this does not explain why the workflow would fail to run on the server unless you have set different values for heap size etc. (Have you?)

mrman101 · November 16, 2016, 10:15am

Hi

I used the streaming technique after updating KNIME, and it cut the time taken by the long string of consecutive node calculations from 86 minutes to 50 minutes. So this was a good saving, thanks!

One thing I noticed was that a number of the nodes I am using (mainly time difference and time shift nodes) are not covered by the streaming technique. If they were, I think the time saving would have been much greater, since these nodes currently create bottlenecks.

I was just wondering why that is, as there are no inter-row dependencies in these nodes (unlike a GroupBy node, for example). Presumably they will be optimised for wrapped streaming in due course?

Many thanks again for your assistance with this. I really like this streaming technique.

mrman101 · November 18, 2016, 9:56am

Hi,

Many thanks for your response to this!

To answer your questions:

I guess you have increased the heap size (KNIME is based on Java, and Java requires the memory assigned to the process to be defined up front)?

- Yes, I increased the heap space on my own PC to ~9GB and on the server to ~46GB. The total available RAM on the server is 125GB.

- I also notice that the server has a slower processor (1.6GHz on the server vs. 3.6GHz on the PC) but more cores (6 on the server vs. 4 on the PC).

- I guess it depends on how well the nodes utilize the resources on the server?
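
As a rough back-of-the-envelope check on the clock speed difference, the sketch below estimates how long a step might take on the server. It rests on assumptions that are not confirmed anywhere in this thread: that the step runs on a single core, is limited by CPU rather than disk or memory, and that both processors do the same amount of work per clock cycle. The function name is made up for the example.

```python
def estimated_runtime(minutes_on_pc: float, pc_ghz: float, server_ghz: float) -> float:
    """Scale a runtime by the clock ratio (assumes a single-threaded, CPU-bound step)."""
    return minutes_on_pc * pc_ghz / server_ghz

# The GroupBy node that takes ~36 minutes on the 3.6GHz PC,
# projected onto the 1.6GHz server under the assumptions above.
group_by_on_server = estimated_runtime(36, pc_ghz=3.6, server_ghz=1.6)
print(f"Estimated GroupBy runtime on the server: {group_by_on_server:.0f} minutes")
```

Under those assumptions the extra cores and RAM would not help such a step at all, which would be consistent with the workflow running slower on the server.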

It might help if you run some of the trivial work that you do within the loop in streaming mode to avoid extra I/O

- Thanks, I had not heard of streaming mode, so this could help! Much appreciated.

- If this does not improve performance significantly, I think in the end I may have to write the data to a DB and do the calculations, group-bys, pivots, etc. in there, then pull the results back out of the DB into the workflow. A bit inelegant, but it will probably work as the DB is very fast.
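
For what it's worth, the DB fallback could look roughly like the sketch below. It uses Python's built-in `sqlite3` module with an in-memory database purely as a stand-in; the real database, the table name `events`, and the columns `customer` and `amount` are all hypothetical, and in the workflow itself this would be done through KNIME's database nodes rather than a script.

```python
import sqlite3

# In-memory database as a stand-in for the real (not yet available) DB server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (customer TEXT, amount REAL)")

# Step 1: write the detailed rows to the database (hypothetical sample data).
rows = [("a", 10.0), ("a", 5.0), ("b", 7.5)]
conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
conn.commit()

# Step 2: let the database do the aggregation instead of a GroupBy node.
summary = conn.execute(
    "SELECT customer, COUNT(*), SUM(amount) "
    "FROM events GROUP BY customer ORDER BY customer"
).fetchall()

# Step 3: only the small summary table comes back into the workflow.
conn.close()
```

The point of the design is that the 9m detailed rows stay in the database and only the summarized report rows are transferred back.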

Iris · January 30, 2017, 2:29pm

Hi Mrman,

the nodes need to be updated. I opened a feature request for this for you and will let you know as soon as there is an update.

Best regards, Iris

Iris · January 30, 2017, 2:52pm

Hi Mrman,

actually our developers just told me that this will be part of our rewrite of the time nodes, which they are currently working on.

Best, Iris