Knime on Linux - Performance and Stability

DemandEngineer · December 9, 2020, 7:29am

Hi All,

Looking for some advice. My PC is usually a Windows box with 16 GB of RAM. Some of my workflows have used up the heap, and processing slows down. I figured that if I ran Lubuntu or Arch Linux, the OS would have a smaller footprint and I would have more RAM available.

Is KNIME as stable on Linux, or more so? Will I gain performance from the lower OS overhead (RAM and CPU)?

Is it worth setting up a dual boot? Anything to be aware of?

gab1one · December 9, 2020, 9:01am

Hi @DemandEngineer,

KNIME AP runs well on both Linux and Windows. If you have a setup that allows you to give it more memory, this can improve processing times, depending on what kind of nodes you are using and where the performance bottleneck is. Before you set up the dual boot, though, have you tried out the columnar table backend? That should give you a significant performance increase, and we would love to hear your feedback on it. https://www.knime.com/blog/improved-performance-with-new-table-backend

best,
Gabriel

DemandEngineer · December 9, 2020, 8:45pm

Perhaps my workflow is not complex enough, or I am bottlenecked by something else. I had 4.3M rows of data but never more than 20 columns anywhere in the process. The regular table backend, the new table backend (auto), and manual settings to prioritize performance and increase cache size all yielded virtually the same time: 11 min for every run, except the one where I allocated more cache than available RAM.

I think my bottleneck is in the Decompose Signal Component inside a Loop… not sure if the backend should help with this.
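
As a rough sanity check on the numbers above, here is a back-of-envelope estimate of the raw table size. The 8 bytes per cell is purely an assumption for illustration (every cell treated as a 64-bit number with no overhead); real KNIME tables hold other types and carry per-cell overhead, so actual memory use will differ.

```python
# Back-of-envelope size of the table described above.
rows = 4_300_000        # row count from the post
columns = 20            # upper bound on columns from the post
bytes_per_cell = 8      # ASSUMPTION: every cell is a 64-bit number, no overhead

cells = rows * columns
raw_bytes = cells * bytes_per_cell
raw_gib = raw_bytes / 2**30

print(f"{cells:,} cells, about {raw_gib:.2f} GiB of raw numeric data")
```

Under that assumption the raw data comes to well under 1 GiB, which would fit with the run time barely changing when the backend or cache size changes.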

Here is a temporary link to the workflow (incomplete)… I’ll delete in a week.

Daniel_Weikert · December 10, 2020, 6:18pm

I don’t know your flow in detail and I am not an expert here, but loops in general are slow.

gab1one · December 11, 2020, 10:22am

You can experiment with using a Parallel Chunk Start to parallelize the execution. This can lead to a speedup, depending on the amount of compute and memory you have available.

DemandEngineer · December 22, 2020, 4:39pm

Hi @gab1one, Parallel Chunk Start seems very interesting, but I’m not sure how to use it in my scenario… Currently, I am using a Group Loop Start, as I need to process all rows with a certain ID together to build the time series by store. Ideally, I would want to process each group in parallel, but the Parallel Chunk Start doesn’t have a group function and can only split by number of rows. Is there a node that does this? If not, I’d like to add this as a feature request.

ipazin · January 11, 2021, 10:48am

Hello @DemandEngineer,

There is already a feature request for it, and I will add a +1 for you (internal reference: AP-4817). In the meantime, check for a workaround here:

Br,
Ivan

ana_ved · January 11, 2021, 10:57am

How I achieve something similar: I make a list of groups, run the parallel chunk loop over that list, and then use a Reference Row Filter at the beginning of the chunk to get only the records specific to a group. This is basically the same as it would be with a parallel group chunking loop.
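
The same idea, sketched in plain Python rather than KNIME nodes, in case it helps to see the steps spelled out. The sample rows, the `chunk` and `process_chunk` helpers, and the mean calculation are all hypothetical stand-ins; a thread pool is used only to keep the sketch short, and CPU-bound work in Python would normally use a process pool instead.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sample data: (store_id, value) rows standing in for the KNIME table.
rows = [
    ("A", 10.0), ("B", 5.0), ("A", 20.0),
    ("C", 7.0), ("B", 15.0), ("C", 3.0),
]

def chunk(items, n_chunks):
    """Split a non-empty list into at most n_chunks contiguous parts."""
    size = -(-len(items) // n_chunks)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]

def process_chunk(group_ids):
    """Handle every group assigned to this chunk."""
    wanted = set(group_ids)
    # Equivalent of the reference row filter: keep only this chunk's groups.
    subset = [r for r in rows if r[0] in wanted]
    # Placeholder per-group calculation (mean); the real workflow would run
    # its time-series steps here.
    result = {}
    for gid in group_ids:
        values = [v for g, v in subset if g == gid]
        result[gid] = sum(values) / len(values)
    return result

# Step 1: list of distinct groups (sorted so the chunks are reproducible).
groups = sorted({g for g, _ in rows})

# Step 2: chunk the group list, not the data rows, and run the chunks in parallel.
with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(process_chunk, chunk(groups, 2)))

# Step 3: merge the per-chunk results, like the loop end collecting its branches.
per_store = {gid: val for part in partials for gid, val in part.items()}
```

Because the chunks are cut from the list of group IDs, no store is ever split across two parallel branches.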

DemandEngineer · January 12, 2021, 4:15am

Thanks so much for the suggestion. If I understand correctly, this may not work for me, as one of the last steps of the loop is a “group by” node to sum and aggregate the calculated values. Having said that, maybe I should have a loop inside a loop, so that I move the group by to the outer loop and the inner loop has the parallel loop start… hmm… will have to try it.
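
A small Python sketch of the question being weighed here, under the assumption that the final step is a per-store sum and that each parallel chunk holds complete stores. The data and the `sum_by_store` helper are hypothetical stand-ins for the real workflow.

```python
from collections import defaultdict

# Hypothetical calculated rows: (store_id, value), as produced inside the loop.
calculated = [("A", 1.0), ("A", 2.0), ("B", 4.0), ("C", 8.0), ("C", 16.0)]

def sum_by_store(rows):
    """Stand-in for a GroupBy node: sum the value column per store."""
    totals = defaultdict(float)
    for store, value in rows:
        totals[store] += value
    return dict(totals)

# Two parallel chunks, each holding complete stores (never a split store).
chunk_1 = [r for r in calculated if r[0] in {"A", "B"}]
chunk_2 = [r for r in calculated if r[0] in {"C"}]

# Option 1: aggregate inside each chunk, then combine the small results.
inside = {**sum_by_store(chunk_1), **sum_by_store(chunk_2)}

# Option 2: collect all rows first, then aggregate once after the loop.
after = sum_by_store(chunk_1 + chunk_2)

same_result = inside == after
```

With whole stores per chunk, both placements of the aggregation give the same per-store totals in this sketch. If the aggregation runs across stores instead, it would have to happen after the chunks are collected.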

kienerj · January 12, 2021, 4:40am

Loops in KNIME are especially slow. Very often it pays off to use a Python snippet and do the looping inside Python, even with the serialization penalty.
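
A minimal sketch of what "looping inside Python" can look like for a per-store calculation. The rows, column meanings, and the `first_differences` placeholder are hypothetical, and the KNIME-specific input and output handling of the Python node is left out because it depends on the node version.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical input: (store_id, period, sales) rows, as they might arrive from KNIME.
rows = [
    ("B", 1, 5.0), ("A", 2, 12.0), ("A", 1, 10.0),
    ("B", 2, 9.0), ("A", 3, 11.0), ("B", 3, 8.0),
]

def first_differences(values):
    """Placeholder per-group calculation: change from one period to the next."""
    return [b - a for a, b in zip(values, values[1:])]

# Sort by store, then period, so each store's rows are contiguous and ordered.
rows.sort(key=itemgetter(0, 1))

# One pass over the groups in Python replaces one KNIME loop iteration per store.
diffs_by_store = {}
for store, group in groupby(rows, key=itemgetter(0)):
    sales = [r[2] for r in group]
    diffs_by_store[store] = first_differences(sales)
```

The data crosses the KNIME/Python boundary once in each direction, instead of the workflow paying loop overhead for every store.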

DemandEngineer · January 13, 2021, 5:15am

Thanks for the suggestion, but I’ve only taken an introductory Python course and am not sure I know how to use it with KNIME. Do you have any suggested resources for a beginner in Python?

kienerj · January 13, 2021, 5:49am

Learning by doing is probably the best approach, although the complications start with setting everything up correctly. Be prepared to spend a lot of time just on that. It all depends on whether the looping performance negatively affects your work; if you can just run it in the background, it is not much of an issue.

Also, it only makes sense if whatever you are calculating is just basic math, or if a suitable library is available in Python.

mlauber71 · January 15, 2021, 4:52am

@DemandEngineer I do not know of a particular course, but I have set up a collection of links and examples of how to use KNIME and Python.

They might give you an idea how to set up and use the tools together once you have tasks at hand that you want to do.
https://hub.knime.com/mlauber71/spaces/Public/latest/_knime_and_python_meta_collection

There are myriad Python courses out there. One interesting approach is what the actual NSA has put together:
