Stop the execution of a workflow after a certain period of time

Hi,

We run multiple workflows on KNIME Server at 30-minute intervals. Sometimes, when the server is suffering from bad performance because too many workflows are running at once, these workflows can take 3 hours instead of 10-20 minutes. However, it is essential for our company that some workflows are executed every 30 minutes, so we cannot really rely on the KNIME Server option of waiting until the previous job is complete before executing the next one. I read that there is a global option on the server to stop the execution of every workflow after a certain period of time. This option is not feasible for us, because some of our workflows are supposed to run for hours.

What I tried was a Wait node plus a Breakpoint node. After 20 minutes the Breakpoint throws an error in its branch, but the workflow still runs normally in the other branch, so the server still considers it a valid execution.

Is there a way to build workflows so that the whole workflow fails if it runs for more than, say, 20 minutes, so that KNIME Server stops the execution of the workflow altogether?

Thanks in advance!

Perhaps you could measure execution time differences and then trigger a Breakpoint or Fail in Execution node at a strategic point via an IF Switch or CASE Switch if the execution time is greater than your target? It may take multiple failure switches depending on your workflow structure. I don’t know of any node that can halt execution workflow-wide…

Execution_time_breakpoint.knwf (42.5 KB)

I put together an example. I set it to minutes to fit your needs, but that means you need to wait a minute during execution to test it yourself. Currently there is a wait time of 1 minute and a test of >= 1 minute on the Breakpoint. You can adjust the Breakpoint’s minute trigger in the Variable Expressions node.
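For anyone who prefers scripting, the same elapsed-time check can also be expressed in a Python Script node. This is only a minimal sketch: it assumes KNIME’s knime.scripting.io API and a hypothetical flow variable workflow_start that holds the workflow’s start time as an ISO timestamp, captured at the beginning of the workflow.

import datetime
import knime.scripting.io as knio  # Python Script node API

# "workflow_start" is a hypothetical flow variable set at the start of the
# workflow, holding an ISO timestamp such as "2023-01-01T12:00:00"
start = datetime.datetime.fromisoformat(knio.flow_variables["workflow_start"])
elapsed = datetime.datetime.now() - start

# Fail this branch once the workflow has been running longer than 20 minutes,
# just like the Breakpoint node does
if elapsed > datetime.timedelta(minutes=20):
    raise RuntimeError(f"Workflow exceeded the 20-minute limit (ran {elapsed})")

Like the Breakpoint approach, this only fails its own branch, so it shares the limitation discussed in this thread.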

I think that this may be more what you are looking for.

Hi @iCFO

thank you for your suggestions, first and foremost. The setting that you mentioned in your second post only works globally, right? Sadly, that is not practical for our use case, because we have some workflows that are supposed to run for about 5 hours. Setting the server timeout to e.g. 1 hour would also affect these 5-hour workflows, right? Or is it possible to configure each job separately?

Your second idea is good. I think I will give it a try; however, it is still not perfect, because if a single node runs for hours, the breakpoint cannot be triggered.

I’m very much looking for more possibilities. Honestly, a timeout for each job should be a standard feature to ensure server performance. @knime

1 Like

I believe that the above setting is global for all jobs, and peppering the workflow with breakpoints definitely isn’t going to help if you are stuck executing a single node…

I would reach out to your account manager on this one. Perhaps they have an unpublished workaround for a per-workflow max execution timeout. At the very least it will get the issue on their radar.

I would personally love to see a node that can “Cancel All Execution” instead of just failing and cancelling downstream nodes… I had high hopes that “Breakpoint” would cancel all execution of a workflow when it came out, but it turned out to work similarly to “Fail in Execution”.

com.knime.server.job.max_execution_time=<duration with unit, e.g. 60m, 36h, or 2d> [RT]

Allows to set a maximum execution time for jobs. If a job is executing longer than this value it will be canceled and eventually discarded (see com.knime.server.job.discard_after_timeout option). The default is unlimited job execution time. Note that for this setting to work, com.knime.server.job.swap_check_interval needs to be set to a value lower than com.knime.server.job.max_execution_time.

The above server config is a thing that exists, but OP already mentioned this won’t work for them, as they have other workflows which do need to execute longer than that.

It’s just the one workflow that has time constraints.

com.knime.server.job.discard_after_timeout=<true|false> [RT]

Specifies whether jobs that exceeded the maximum execution time should be canceled and discarded (true) or only canceled (false). May be used in conjunction with com.knime.server.job.max_execution_time option. The default (true) is to discard those jobs.

This won’t help here; this only controls whether the execution files get discarded immediately or left in swap for the 7d default duration (com.knime.server.job.max_lifetime).
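To make the relationship between these properties concrete, here is a minimal sketch of how they might be combined in knime-server.config. The values are illustrative only, and as noted above, this timeout applies globally to all jobs:

# Cancel any job that executes for longer than 1 hour (affects ALL jobs)
com.knime.server.job.max_execution_time=1h
# Must be lower than max_execution_time for the timeout check to run
com.knime.server.job.swap_check_interval=10m
# Only cancel timed-out jobs instead of also discarding them (default: true)
com.knime.server.job.discard_after_timeout=false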

What I would probably do is this:

Get an idea of how long the actual execution of the Special Job workflow takes. You could use the Execution_time metanode from above for this. If it is under 30 minutes, great.
This also lets you keep tabs on how long it is actually running, in case growing data makes it increase, so you can take corrective measures before it exceeds the 30-minute threshold.

KNIME Server feeds jobs to its executor(s) on a first-come, first-served, up-to-capacity basis, so there is no individual “kill this job if it runs longer than X minutes, but only this one job” option.
If you have jobs running for hours, then the first thing I’d do is take a hard look at how many resources (core tokens and heap) the executor is getting: heap to ensure it has the memory to handle all the jobs and data it is being asked to, and cores/tokens to ensure it can crunch things fast enough that jobs don’t take forever to finish.
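For reference, the executor’s maximum Java heap is set via the -Xmx line in its knime.ini; a sketch (32g is purely an illustrative value, not a sizing recommendation):

-Xmx32g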

But, assuming that just buffing your existing executor won’t quite cut it, another thing you can do is set up Distributed Executors [1] so that you have an executor basically dedicated to time-sensitive jobs like your Special Job.

Then, you use Workflow Pinning [2] to route those special jobs to your super-fast executor so that it’s not sitting in queue with slower traffic.

Then, as long as its data doesn’t grow too much and it stays under the 30m, you’ll be golden.

[1] KNIME Server Installation Guide
[2] KNIME Server Administration Guide

3 Likes

The max execution time setting Nick provided would be the one to globally cancel all workflows after 20 minutes.

Another idea might be to run a scheduled job that periodically polls what is running on the server and calculates its run time. You could then be a bit more precise about which specific workflows you want to cancel or keep alive.
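A minimal sketch of such a watchdog against the KNIME Server REST API, assuming the v4 jobs endpoints. The JSON field names, workflow name, and credentials below are assumptions for illustration, so verify them against your server’s actual responses:

import datetime
import requests

SERVER = "https://knime.example.com/knime"  # hypothetical server URL
AUTH = ("watchdog_user", "password")        # hypothetical credentials
LIMIT = datetime.timedelta(minutes=20)
TARGETS = {"Special Job"}                   # only police these workflows

resp = requests.get(f"{SERVER}/rest/v4/jobs", auth=AUTH)
resp.raise_for_status()
now = datetime.datetime.now(datetime.timezone.utc)

# The response structure and field names ("jobs", "name", "state",
# "startedExecutionAt", "id") are assumptions; check your server version.
for job in resp.json().get("jobs", []):
    if job.get("name") not in TARGETS or job.get("state") != "EXECUTING":
        continue
    started = datetime.datetime.fromisoformat(
        job["startedExecutionAt"].replace("Z", "+00:00"))
    if now - started > LIMIT:
        # DELETE discards the job, which stops its execution
        requests.delete(f"{SERVER}/rest/v4/jobs/{job['id']}", auth=AUTH)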

Ultimately, and depending on your circumstances, I don’t think it’s a good idea to cancel anything due to resource constraints, and I would recommend you reach out to your KNIME contact about figuring out a solution together. You could look at segregating and prioritizing workflows with Distributed Execution, or maybe just partner with them to find some more CPU/memory as a longer-term solution.

It sounds like these are important workflows, and it’s not too expensive to increase compute sizes so that the business doesn’t need to be affected long-term.

1 Like