Job fails to execute when the executor restarts

Hello everyone

We have a job that runs every day at 8:30 am. It has worked well until now.
But today, the job failed to execute.
After checking the log, I found that the executor seems to have been restarted at 8:34 am.
The log messages are as follows:

08:34  KNIME shutdown hooks - com.knime.enterprise.executor.GracefulShutdownHook : JVM terminate event received, canceling jobs and swapping them to the server.
08:34  ERROR: RMI TCP Connection: FileWorkflowPersistor: Unable to load node with ID suffix 1 into workflow, skipping it: null
08:34  ERROR: RMI TCP Connection: FileNativeNodeContainerPersistor: Could not load repository manager
The flow '/ooo' of user ooo could not be started: java.lang.NoClassDefFoundError: Could not initialize class org.knime.node2012.KnimeNodeDocument.

In the KNIME Server config,
com.knime.server.executor.max_lifetime is set to 24h.

My questions:
Did the job fail because the executor was restarted?
How can we avoid this job failure?

KNIME Server Version: 4.8.2

Thanks in advance.

Ryu

Hi Ryu,

Restarting the executor in this case should not lead to losing a job, since the executor was shut down gracefully. This means that any jobs still on that executor are swapped before it shuts down.

However, there is a chance that the shutdown was not triggered by the KNIME Server, but instead manually. Based on your log message, it looks like this is the case. Can you please confirm that?

Cheers,
Roland

Hi Roland

Thanks for your response.

We start work at 10 am,
so the job that ran at 8:30 am cannot have been triggered manually. It is triggered by the KNIME Server at 8:30 am every day.

I also checked the log messages from the past.
None of the earlier executor restarts happened around 8:30 am. Is that why the job worked well until now?

Thanks.
Ryu

Hi Ryu,

Based on your logs, the executor was restarted at 8:34 am. Prior to restarting, it swapped all its jobs back to the server. Can you confirm whether the executor was shut down manually?

Cheers,
Roland

Hi @RolandBurger

Our company starts work at 10 am, so at 8:30 am no one should be in the office, and no one can access the production environment without applying for permission first.
Do you know how to confirm whether the executor was shut down manually?

By the way, the KNIME job is executed by the KNIME Server every day from 8:30 am (start time) to about 8:35 am (end time), and the executor was restarted at 8:34 am.

The job’s workflow is simple, like the following:
␣␣␣␣␣␣␣␣␣␣␣KNIME Server Connection
␣␣␣␣␣␣␣␣␣␣␣␣␣␣␣↓
Parallel Chunk Start -> Call Workflow (Table Based) -> Parallel Chunk End -> External Tool
※ Sorry, I cannot upload the workflow without the customer’s permission.
※ Call Workflow (Table Based) calls another workflow that exports data to a CSV file.
※ External Tool runs a Windows batch file that copies the CSV file to another server.

Thanks.
Ryu

Hi Ryu,

It’s not easy to confirm where the shutdown command came from. While the server does automatically shut down executors as part of executor rotation, it will not shut down an executor that is still running jobs; this should never happen. Still, the message shown at 8:34 indicates that jobs were cancelled, which is why I suggested that the shutdown was triggered manually.
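
If you want to check older logs for shutdowns that overlap the job window, a small script can do the search. This is only a rough sketch: it assumes each log line starts with an HH:MM timestamp as in the excerpt quoted above (real executor logs usually carry a fuller timestamp, so the pattern would need adjusting), and the function names `shutdown_times` and `overlaps_job` are made up for illustration.

```python
import re
from datetime import time

# Assumption: lines start with "HH:MM" followed by whitespace, as in the
# excerpt quoted in this thread. Adjust the pattern for your real log format.
LINE_RE = re.compile(r"^(\d{2}):(\d{2})\s+(.*)$")

# Text taken from the shutdown message quoted in this thread.
SHUTDOWN_MARKER = "GracefulShutdownHook"


def shutdown_times(lines):
    """Return the time of every line that mentions the graceful shutdown hook."""
    found = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and SHUTDOWN_MARKER in match.group(3):
            found.append(time(int(match.group(1)), int(match.group(2))))
    return found


def overlaps_job(shutdowns, start, end):
    """Return the shutdown times that fall inside the job window [start, end]."""
    return [t for t in shutdowns if start <= t <= end]


sample = [
    "08:34  KNIME shutdown hooks - com.knime.enterprise.executor.GracefulShutdownHook : "
    "JVM terminate event received, canceling jobs and swapping them to the server.",
    "08:34  ERROR: RMI TCP Connection: FileNativeNodeContainerPersistor: "
    "Could not load repository manager",
]

# Job window from this thread: roughly 8:30 am to 8:35 am.
hits = overlaps_job(shutdown_times(sample), time(8, 30), time(8, 35))
print(hits)
```

This only tells you when a graceful shutdown was logged, not who or what triggered it.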

To try to prevent this from happening, I suggest first turning off executor rotation. You do this by setting com.knime.server.executor.max_lifetime to -1. This is actually the new default as of KNIME Server 4.10, so I wouldn’t expect any unwanted side effects.
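
For reference, the entry would look roughly like the line below. This assumes the option lives in the server configuration file in the usual key=value form; where that file sits and whether the server needs a restart to pick up the change depend on your installation.

```
com.knime.server.executor.max_lifetime=-1
```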

Apart from that, have there been any other occasions where a job didn’t start since you first reported this?

Cheers,
Roland

Hi @RolandBurger

After turning off executor rotation, will OutOfMemory exceptions occur more frequently when dealing with big data?

This is the first time this issue has happened. The project was released in October this year.

Thanks
Ryu

No, there should not be an increased risk of this happening. If anything, the risk should be even lower, since you won’t run into a situation where you have two executors running in parallel, competing for available memory. (Which is one of the reasons why we switched the default to not do rotation)

Cheers,
Roland
