In the KNIME Admin Guide [1], there are parameters that can be added to the executor to control the maximum CPU and heap utilization thresholds.
-Dcom.knime.enterprise.executor.heapUsagePercentLimit=<value-in-percent, e.g. 90>
The average heap space usage of the executor JVM over one minute. Default: 90 percent.
Env: KNIME_EXECUTOR_HEAP_USAGE_PERCENT_LIMIT=<value-in-percent, e.g. 90>
-Dcom.knime.enterprise.executor.cpuUsagePercentLimit=<value-in-percent, e.g. 90>
The average CPU usage of the executor JVM over one minute. Default: 90 percent.
Env: KNIME_EXECUTOR_CPU_USAGE_PERCENT_LIMIT=<value-in-percent, e.g. 90>
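For reference, the environment-variable form is convenient for containerized executors, where you can't easily edit JVM arguments. A minimal sketch; the 85/80 values below are purely illustrative, not recommendations:

```shell
# Lower the thresholds at which this executor stops accepting new jobs.
# 85% heap / 80% CPU are example values -- tune for your hardware.
export KNIME_EXECUTOR_HEAP_USAGE_PERCENT_LIMIT=85
export KNIME_EXECUTOR_CPU_USAGE_PERCENT_LIMIT=80
```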
This only controls if it accepts new jobs. “By default an Executor will not accept new jobs any more if its memory usage is above 90% (Java heap memory) or the average system load is above 90% (averaged over 1-minute).”
Once the executor accepts a job, if that job drives CPU utilization up to 1451%, then I'd say you need to look at the workflow specifically: see what it is doing and figure out how to either break it down into smaller chunks or split it into separate callable workflows that can take advantage of your DE setup.
Thanks for your feedback. We will have a look at the parameters mentioned. However, as you said, these will "only" prevent new jobs from being accepted by the executor, so I think the 90% defaults are already good enough.
It would be great (as a wish-list feature) to have a way to control CPU/memory usage so that the executor could actually cancel jobs automatically. For example, a setting that lets the executor cancel any job that has been running for "X" time while using more than 90% of CPU/memory.
The reason is that, at scale with hundreds or thousands of users, it is definitely difficult to control all the workflows being executed on the server. A malicious or clueless user could, for example, schedule a CPU-heavy workflow to run every hour over the weekend, which would effectively shut down all the executors and affect hundreds of other users. As far as I know, this scenario cannot currently be controlled by the admin.
In the meantime, a potential workaround is to automatically check the status of the executors and restart them if they are inactive. But it would be great to avoid the issue happening in the first place.
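As a stopgap along those lines, a cron-driven health check could restart an unresponsive executor. The sketch below is an assumption-heavy illustration: the health URL and the systemd unit name `knime-executor` are placeholders, since deployments differ, so both must be adapted to your setup.

```shell
#!/usr/bin/env sh
# Hypothetical executor watchdog: poll a health endpoint and restart the
# service when it stops answering. URL and unit name are placeholders.

EXECUTOR_URL="${EXECUTOR_URL:-http://localhost:8080/health}"  # placeholder endpoint

is_healthy() {
  # Treat any HTTP 2xx as healthy; everything else (including 000,
  # which curl maps to "no response") counts as down.
  case "$1" in
    2??) return 0 ;;
    *)   return 1 ;;
  esac
}

check_and_restart() {
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$EXECUTOR_URL" || echo 000)
  if ! is_healthy "$code"; then
    echo "executor unresponsive (HTTP $code), restarting"
    systemctl restart knime-executor  # assumes a systemd unit of this name
  fi
}
```

Run `check_and_restart` from cron every few minutes; it is a blunt instrument compared with the server canceling runaway jobs itself, which is why the feature request above still stands.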
There are a lot of configurable options (see [1]). Relevant to your wish to cancel jobs running longer than X time, please see:
com.knime.server.job.max_execution_time=<duration with unit, e.g. 60m, 36h, or 2d> [RT]
Allows setting a maximum execution time for jobs. If a job executes longer than this value, it will be canceled and eventually discarded (see the com.knime.server.job.discard_after_timeout option). The default is unlimited job execution time. Note that for this setting to work, com.knime.server.job.swap_check_interval needs to be set to a value lower than com.knime.server.job.max_execution_time.
com.knime.server.job.swap_check_interval=<duration with unit, e.g. 30s, 1m, or 1h> [RT]
Specifies the interval at which the server will check for inactive jobs that can be swapped to disk. Default is every 1m.
com.knime.server.job.discard_after_timeout=<true|false> [RT]
Specifies whether jobs that exceeded the maximum execution time should be canceled and discarded (true) or only canceled (false). May be used in conjunction with com.knime.server.job.max_execution_time option. The default (true) is to discard those jobs.
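Putting the three options together, a knime-server.config fragment that cancels jobs after 12 hours but keeps them around for inspection might look like this (the 12h limit is just an example value, not a recommendation):

```
# Cancel any job running longer than 12 hours; the check interval must
# be shorter than the limit for the timeout to take effect.
com.knime.server.job.max_execution_time=12h
com.knime.server.job.swap_check_interval=1m
# false = cancel only, so admins can inspect the job afterwards
com.knime.server.job.discard_after_timeout=false
```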