We have increased the number of jobs running on our KNIME Server (currently up to 4 RMI instances), and we are getting the following notification at what should be the completion of the job:
KNIME Server job has disappeared from the flow
The jobs are set up with email notification, and this message is received by the mail recipients. The job does not complete; it seems to have stalled somehow.
Has anyone come across this or have any idea what may be the issue?
First note that there is only one RMI instance that executes (new) workflows. All other instances are only waiting until their jobs are discarded. So I'm not sure what you mean by "running up to 4 RMI instances".
You get the error message if the server cannot find a job in its RMI instance any more. This can happen if the instance crashes, e.g. due to low memory. You should have a corresponding message in the server's log file in such cases.
Can you point us to where the server log file should be so we can check, and what error message should we be looking out for? Is it an out-of-memory message, or would it be an "RMI instance not found"?
With the RMI instances, we set a limit in the knime-server.config file, as we saw up to 9 runtime-knime-rmi-5010x folders and thought maybe there were too many instances running. However, from what you say this probably wasn't the cause.
As for memory, our settings in the knime-rmi.ini file are as shown; we are running on a VM with 20 GB of memory. As described, we run some short jobs that repeat every 10 minutes, alongside longer jobs that take more than 10 minutes to execute.
We tried to look through the logs to find out what may have happened. Can you advise whether running out of memory means:
1. a running job that, say, reads in a lot of data runs out of memory, or
2. a new job tries to start but there is not sufficient memory for it to get started?
If it is the first, is there a way, with a KNIME node, to monitor the available memory and thereby avoid a crash? It seems surprising that the executor would allow itself to run out of memory rather than simply swap to disk.
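One idea we had for such a check: a Java Snippet node could query the JVM's Runtime API before a memory-intensive step. A minimal sketch (plain Java so it runs standalone; the class wrapper, method name, and the 512 MB threshold are just illustrative, not any KNIME API):

```java
// Sketch: estimate JVM heap headroom via the standard Runtime API.
// In KNIME this logic could live inside a Java Snippet node.
public class MemoryCheck {

    // Free heap = what is already free in the committed heap,
    // plus what the JVM may still grow the heap by (max - total).
    public static long freeHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        long unallocated = rt.maxMemory() - rt.totalMemory();
        return rt.freeMemory() + unallocated;
    }

    public static void main(String[] args) {
        long freeMb = freeHeapBytes() / (1024 * 1024);
        System.out.println("Approx. free heap: " + freeMb + " MB");

        // Example threshold a workflow could branch on before
        // starting a memory-hungry node (value is arbitrary):
        boolean enoughHeadroom = freeHeapBytes() > 512L * 1024 * 1024;
        System.out.println("Headroom > 512 MB: " + enoughHeadroom);
    }
}
```

Note this only sees the executor's own JVM heap, not memory used by other processes on the VM, so it is a rough guard at best.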
There is absolutely no way of telling how much memory a workflow will need. It depends on the nodes used, their settings, the data, etc. Therefore it's simply not possible to decide in advance whether a workflow will run without problems. Even if memory is low when a job is started, another workflow may free its memory the next moment. Also, if memory gets low, we swap most data tables to disk (not models, though).
Having said that, you should see messages in both the server's log and the executor's log (in the corresponding runtime folder) if the executor ran out of memory. Note that this doesn't mean it crashed; it may only be unresponsive until memory is freed. In such cases the server thinks the job no longer exists and issues the message you are seeing.
We noticed on the Analytics Platform that a KNIME process does not garbage collect even if the memory is no longer used, and reading some older posts it seems it only does so when the memory is needed again (https://tech.knime.org/node/20624). What happens on the server if, say, two scheduled tasks overlap: the first initially requires 15 GB of a 20 GB machine but only at the start, then settles to 5 GB, and the second requires 10 GB. Would the memory from the first be freed for the second task automatically? If not, is there a node to prompt the first task to free memory after a memory-intensive step?
If the workflows run inside the same Java Virtual Machine then memory is automatically garbage collected when necessary. This usually works quite well.
If you have two independent Java processes (e.g. an outdated executor that only holds not-yet-discarded jobs and no longer accepts new jobs, plus an active executor), then things are different. Java usually doesn't return memory it has claimed from the operating system, and you may indeed run into memory issues (at the OS level).
Is it sensible to set com.knime.server.executor.max=1 in the knime-server.config file so that all workflows run inside the same Java VM to avoid having outdated executors? Or would this create other problems?
If you always want to use the same executor (for as long as the server runs), I suggest setting com.knime.server.executor.max_lifetime=-1. Then no second executor will be started and all workflows will be executed in the first executor. However, in case there are memory leaks in workflows or nodes, this single executor may run out of memory at some point (although there are no known issues in this respect with our nodes). On the other hand, you can give it more memory, since only one process runs on the system.
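For reference, the single-executor setup described above might look roughly like this; a sketch only, with the heap value purely an example (the -Xmx line assumes knime-rmi.ini follows the usual Eclipse-style .ini convention of one JVM flag per line):

```
# knime-server.config -- keep one long-lived executor
com.knime.server.executor.max_lifetime=-1

# knime-rmi.ini -- example heap cap for that single executor
# (example value; leave headroom for the OS on a 20 GB VM)
-Xmx16g
```

With a single executor, sizing -Xmx close to (but below) the VM's physical memory avoids the OS-level fragmentation issue of multiple JVMs each holding on to claimed memory.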