Hundreds of Rserve.dbg processes issue

Hello,

I am running KNIME 3.1 on an Ubuntu box. I have it set up to run GUI-less KNIME jobs every hour as a cron task. My KNIME workflow runs several R Snippet / R View (Table) nodes from within a Parallel loop.

I notice that when running the workflow, several new Rserve.dbg processes are spawned (as expected). However, at the successful conclusion of the KNIME script, several of these Rserve.dbg processes are still around. After the cron job has been invoked over several days, hundreds of these Rserve.dbg processes accumulate. Eventually so many Rserve processes exist that they consume all memory and no more can be spawned -- causing the KNIME script to fail.

I can manually kill these processes every couple of days, but I am hoping someone might recognize the situation and be able to provide some advice on how to prevent this from happening in the first place.

Thank you.

Well, one way to deal with this issue would be to run KNIME and R together inside a Docker container, so that your instance would be (mostly) self-contained. You could then, for example, run a cron job that first kills all old Docker containers and then boots up a new one. This would keep things nice and clean.

Although what I am suggesting may well be overkill for what you want, Dockerizing KNIME could have extra benefits for you as well. It would take some work to get everything sorted out, but it is doable. If you know what Docker is, you can check out the thread below for more information, and I am happy to help you further as well. If you have no idea what Docker is but are still interested, you will want to do some reading about the concept first.

https://tech.knime.org/forum/knime-general/knime-in-docker

-Brian Muchmoore

Hi joshuahoran, 

This is not supposed to happen; we paid special attention to having all the R processes terminate together with the KNIME Analytics Platform processes. This can only work if the process terminates naturally, though.

Unused Rserve processes are terminated after 60 seconds, so a workaround for now could be to pause the workflow for about that long after it completes, before the Analytics Platform terminates. This would result in all R processes being cleaned up beforehand. (TimeDelay should work well.)

Could you please verify that the KNIME AP processes terminate successfully (i.e. don't crash)? It would be great if you could also try running the workflow in the GUI and see whether the R processes still stay up after it closes.

Regards, Jonathan.

Thanks for the responses, everyone. I am going to dig in a little and check to see if anything is exiting abnormally -- which might explain the orphan processes. For clarification, is "JVM Terminated. Exit code = 4" indicative of a successful "non-crash" KNIME termination?

Thanks.
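
For anyone wanting to check the exit status from the cron job itself, here is a minimal sketch of a wrapper in Python. It is not part of KNIME: the function names and the way the command is passed in are my own assumptions, and the actual KNIME batch command line is deliberately left to the caller. By the usual convention a non-zero exit code means the program did not finish cleanly, but what a specific code such as 4 means for KNIME is not something this sketch can tell you.

```python
import subprocess
import sys
import time


def describe(code):
    """Label an exit code using the usual convention that 0 means success."""
    return "clean exit" if code == 0 else f"abnormal exit (code {code})"


def run_and_report(cmd, grace_seconds=0):
    """Run cmd (a list of arguments), optionally wait, and return its exit code.

    grace_seconds is a hypothetical knob: a pause after the command ends,
    e.g. before a later cleanup step inspects leftover processes.
    """
    result = subprocess.run(cmd)
    if grace_seconds:
        time.sleep(grace_seconds)
    return result.returncode


# Only acts when a command is given, e.g.:
#   python3 run_knime.py <your KNIME batch command and arguments>
if __name__ == "__main__" and len(sys.argv) > 1:
    exit_code = run_and_report(sys.argv[1:])
    print(describe(exit_code), file=sys.stderr)
    sys.exit(exit_code)
```

Because the wrapper passes the exit code through, cron (or anything else calling it) still sees the original status, and the message on stderr ends up in whatever log the cron job already writes.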

A quick update: I'm still trying to track down the source of this issue, but I might have a lead.

If I run the workflow through the GUI, I notice that I often get the following error from the R nodes that are being run within a parallel loop:

Execute failed: Exception occured during R Initialization

I think that when this occurs, an Rserve.dbg process becomes orphaned and this is what causes the accumulation of unattached processes over time.

I don't know why the R nodes periodically fail. It doesn't seem to correlate with anything else. For example, when I receive the failure error I can often immediately re-execute the node and it will work fine.

I'll keep digging.

Final Update: I was never able to figure out how to curtail the orphaned Rserve processes. Since the huge number of orphaned processes would routinely crash my computer every third day, I hacked my way to a resolution by setting up a cron job that runs every morning and pkills all Rserve processes. This is a little dangerous, since I could end up killing an active Rserve. One way to get around this would be to selectively kill only those processes named Rserve.dbg that are also in a sleep state. The implementation of this is beyond my skillset at present, but if anyone has any ideas I would love to hear them.
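
One possible sketch of the selective kill described above, in Python and assuming a Linux /proc filesystem. A caveat on the heuristic: a sleep state alone may not be enough, because an Rserve that is still attached but idle is normally reported as sleeping too. So this sketch additionally requires that the process has been re-parented to PID 1, which is my assumption about what "orphaned" looks like on this machine (on systems that use a different reaper process it would need adjusting). The function names are hypothetical, and by default the script only prints what it would kill.

```python
import os
import signal
import sys

TARGET_NAME = "Rserve.dbg"  # process name as reported in this thread


def parse_stat(stat_text):
    """Return (pid, comm, state, ppid) from the text of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the first '(' and the last ')'.
    open_idx = stat_text.index("(")
    close_idx = stat_text.rindex(")")
    pid = int(stat_text[:open_idx])
    comm = stat_text[open_idx + 1:close_idx]
    rest = stat_text[close_idx + 1:].split()
    return pid, comm, rest[0], int(rest[1])


def is_stale(comm, state, ppid):
    """Assumed heuristic: right name, sleeping ('S'), and parent is PID 1."""
    return comm == TARGET_NAME and state == "S" and ppid == 1


def find_stale(proc_root="/proc"):
    """Scan proc_root and return the PIDs that match the heuristic."""
    stale = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return stale  # no /proc here (e.g. not Linux)
    for entry in entries:
        if not entry.isdigit():
            continue
        stat_path = os.path.join(proc_root, entry, "stat")
        try:
            with open(stat_path, errors="replace") as fh:
                pid, comm, state, ppid = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited while we were scanning
        if is_stale(comm, state, ppid):
            stale.append(pid)
    return stale


if __name__ == "__main__":
    do_kill = "--kill" in sys.argv  # dry run unless --kill is given
    for pid in find_stale():
        print("killing" if do_kill else "would kill", pid)
        if do_kill:
            try:
                os.kill(pid, signal.SIGTERM)
            except ProcessLookupError:
                pass  # already gone
```

Running it without arguments first and comparing the printed PIDs against a known-active workflow is a cheap way to check whether the heuristic is safe on a given machine before adding `--kill` to the cron entry.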