Ideas to limit CPU usage of executors? Specific jobs are shutting down executors

misterhd · October 26, 2022, 1:17pm

Hi everyone,

We have rare jobs that can actually provoke a shutdown of an executor due to high CPU load.

After running for ~30 minutes, the executor is automatically shut down. This is what the log looks like before the shutdown:

WARN : Consumer reconnect poller : : WorkMessageDispatcher : : : CPU usage (1451.375) above threshold (90.0)
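
To catch these warnings before an executor goes down, the log could be scanned with something like the sketch below. The pattern is inferred only from the single line quoted above, and `parse_cpu_warning` is a hypothetical helper name, not part of KNIME.

```python
import re

# Pattern inferred from the one warning line quoted above (an assumption,
# not a documented KNIME log format).
CPU_WARNING = re.compile(
    r"CPU usage \((?P<usage>[\d.]+)\) above threshold \((?P<threshold>[\d.]+)\)"
)

def parse_cpu_warning(line):
    """Return (usage, threshold) as floats, or None if the line is not a CPU warning."""
    match = CPU_WARNING.search(line)
    if match is None:
        return None
    return float(match.group("usage")), float(match.group("threshold"))

sample = ("WARN : Consumer reconnect poller : : WorkMessageDispatcher : : : "
          "CPU usage (1451.375) above threshold (90.0)")
print(parse_cpu_warning(sample))
```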

I have read other topics and there is no direct way to control the CPU usage.

This is quite important for us because if a “killer” job is scheduled every hour, this job can shut down all the executors we have in a few hours.

We have distributed executors of 32GB and 8 cores each.

Is there any recommendation on how we can avoid this issue in the first place?

Thank you.

NDekay · November 1, 2022, 1:55pm

Hello @misterhd ,

In the KNIME Admin Guide [1] , there are parameters that can be added to the executor to control max CPU and heap utilization thresholds.

-Dcom.knime.enterprise.executor.heapUsagePercentLimit=<value-in-percent e.g. 90>

The average heap space usage of the executor JVM over one minute. Default 90 percent.

Env: KNIME_EXECUTOR_HEAP_USAGE_PERCENT_LIMIT=<value-in-percent e.g. 90>

-Dcom.knime.enterprise.executor.cpuUsagePercentLimit=<value-in-percent e.g. 90>

The average CPU usage of the executor JVM over one minute. Default 90 percent.

Env: KNIME_EXECUTOR_CPU_USAGE_PERCENT_LIMIT=<value-in-percent e.g. 90>

These settings only control whether the executor accepts new jobs. “By default an Executor will not accept new jobs any more if its memory usage is above 90% (Java heap memory) or the average system load is above 90% (averaged over 1-minute).”
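
As a rough illustration of that rule, here is a minimal sketch of the acceptance check. It is only a model of the quoted sentence, not KNIME's actual implementation, and the function and parameter names are made up for the example.

```python
def accepts_new_jobs(heap_percent, cpu_percent, heap_limit=90.0, cpu_limit=90.0):
    """Illustrative model only (not KNIME code).

    A new job is refused when either one-minute average is above its limit.
    Jobs that are already running are not affected by this check.
    """
    return heap_percent <= heap_limit and cpu_percent <= cpu_limit

# With the CPU value from the log line in this thread, new jobs are refused,
# but nothing here stops the job that is causing the load.
print(accepts_new_jobs(heap_percent=50.0, cpu_percent=1451.375))
```

This is why changing the limits alone would not stop a job that is already driving the load up.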

Once the executor has accepted a job, these limits no longer apply to it. If that job is driving the CPU utilization up to 1451%, then I’d say you need to look at the workflow specifically and see what it is doing. From there, figure out how to either break it down into smaller chunks or split it into separate callable workflows that can take advantage of your DE setup.

Regards,
Nickolaus

[1] KNIME Server Administration Guide

misterhd · November 2, 2022, 2:39pm

Hi NDekay.

Thanks for your feedback. We will have a look at the parameters mentioned. However, as you mentioned, this will “only” prevent new jobs from being accepted by the executor, so I think the 90% default parameters are already good enough.

It would be great (as a wish list feature) to have a way to control the CPU / memory usage so that the executor could actually cancel jobs automatically. For example, a setting that allows the executor to automatically cancel jobs that have been running for “X” time while using more than 90% of CPU / memory.

The reason is that, at scale with hundreds / thousands of users, it is definitely difficult to control all the workflows being executed on the server. A “malicious / clueless” user could for example schedule a high-CPU job every hour during the weekend, which would basically shut down all the executors and affect all the other hundreds of users. As far as I know, this scenario cannot currently be controlled by the admin.

In the meantime, a potential solution is to automatically check the status of the executors and restart them if they are inactive. But it would be great to avoid the issue happening in the first place.
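
A minimal sketch of that watchdog idea is below. The names `restart_inactive`, `is_active`, and `restart` are hypothetical; how an executor is actually probed and restarted depends on the installation, so both are passed in as functions.

```python
def restart_inactive(executors, is_active, restart):
    """Restart every executor that is_active() reports as down.

    executors: iterable of executor names (hypothetical identifiers)
    is_active: callable(name) -> bool, supplied by the admin
    restart:   callable(name) -> None, supplied by the admin
    Returns the list of names that were restarted.
    """
    restarted = []
    for name in executors:
        if not is_active(name):
            restart(name)
            restarted.append(name)
    return restarted

# Dry run with fake status data instead of real executors.
status = {"executor-1": True, "executor-2": False}
print(restart_inactive(status, lambda n: status[n], lambda n: None))
```

Run on a schedule (for example from cron), this would only bring executors back after the fact. It does not prevent the shutdown.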

Anyway, thanks for your feedback!

NDekay · November 2, 2022, 8:20pm

Hello @misterhd ,

There are a lot of configurable options (see [1]). For canceling jobs that have been running for longer than X time, please see:

com.knime.server.job.max_execution_time=<duration with unit, e.g. 60m, 36h, or 2d> [RT]
Sets a maximum execution time for jobs. If a job is executing longer than this value it will be canceled and eventually discarded (see com.knime.server.job.discard_after_timeout option). The default is unlimited job execution time. Note that for this setting to work, com.knime.server.job.swap_check_interval needs to be set to a value lower than com.knime.server.job.max_execution_time.

com.knime.server.job.swap_check_interval=<duration with unit, e.g. 30s, 1m, or 1h> [RT]
Specifies the interval at which the server will check for inactive jobs that can be swapped to disk. Default is every 1m.

com.knime.server.job.discard_after_timeout=<true|false> [RT]
Specifies whether jobs that exceeded the maximum execution time should be canceled and discarded (true) or only canceled (false). May be used in conjunction with com.knime.server.job.max_execution_time option. The default (true) is to discard those jobs.
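
To sanity-check a set of values before applying them, a small sketch like the following could be used. It is not part of KNIME: `to_seconds` and `check_timeout_settings` are made-up helper names, and the parser assumes only the units that appear in the examples above (s, m, h, d).

```python
import re

# Units taken from the examples in the option descriptions (assumed complete).
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

def to_seconds(duration):
    """Convert a value such as '60m', '36h' or '2d' to seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", duration.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {duration!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]

def check_timeout_settings(settings):
    """Return True if swap_check_interval is lower than max_execution_time.

    A missing max_execution_time means unlimited execution time, so there is
    nothing to check. A missing swap_check_interval falls back to the
    documented default of 1m.
    """
    max_time = settings.get("com.knime.server.job.max_execution_time")
    if max_time is None:
        return True
    interval = settings.get("com.knime.server.job.swap_check_interval", "1m")
    return to_seconds(interval) < to_seconds(max_time)

example = {
    "com.knime.server.job.max_execution_time": "2h",
    "com.knime.server.job.swap_check_interval": "1m",
    "com.knime.server.job.discard_after_timeout": "true",
}
print(check_timeout_settings(example))
```

The 2h limit in the example is an arbitrary illustration, not a recommendation.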

Regards,
Nickolaus

[1] KNIME Server Administration Guide
