Disabling schedules after execution failure

Dear community,

I’m trying to create a “schedule_disabler” workflow that, when invoked, disables a given schedule based on the job that called it. Some of our schedules have very short execution intervals, and when there are unforeseen issues with the data that the workflow was not designed to handle, these schedules keep attempting execution many times. For most of these schedules, I would prefer that the schedule is disabled immediately on the first failure, until someone has time to look at it.

My approach so far has been to call the disabler on workflow failure (inside the schedule configuration options) and hand it a string that I set manually in the schedule configuration, so that the disabler knows which schedule it should disable.

I wonder if there is a way to do this automatically. Is there a way for the disabler job to know which other job called it in this scenario? Container Input won’t work here, I assume, since I’m technically not “calling” another workflow in the classic sense with one of the “Call Workflow” nodes. The job is spun up because I asked for it inside the schedule configuration. I’m trying to avoid having to set anything manually during configuration.

I hope I am making sense.

Best regards,
Smnkrs

There is a knime-server.config option which, I think, covers your case well.

com.knime.server.job.max_schedule_failures=<number>
Specifies the maximum number of consecutive failures to start a scheduled job before the schedule gets disabled. The default value is three consecutive failures. If a negative value is provided (e.g. -1) scheduled jobs will never get disabled due to failures.

Wouldn’t setting this value to 1 work for you?
One obvious caveat is that this will affect all schedules globally.
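
To make the quoted description concrete, here is a toy Python model of a consecutive-failure counter. It is not KNIME code: the class name `ScheduleFailureCounter` is made up for illustration, and the assumption that the schedule is disabled as soon as the count reaches the configured limit is my reading of the description, not confirmed behavior.

```python
class ScheduleFailureCounter:
    """Toy model of the described max_schedule_failures semantics (not KNIME code)."""

    def __init__(self, max_failures=3):
        # A negative limit means "never disable because of failures".
        self.max_failures = max_failures
        self.consecutive_failures = 0
        self.enabled = True

    def record(self, started_ok):
        """Record one scheduled start attempt and return whether the schedule is still enabled."""
        if started_ok:
            # Any successful start resets the streak of consecutive failures.
            self.consecutive_failures = 0
            return self.enabled
        self.consecutive_failures += 1
        # Assumption: disable once the streak reaches the limit.
        if self.max_failures >= 0 and self.consecutive_failures >= self.max_failures:
            self.enabled = False
        return self.enabled
```

With a limit of 1, the first failed start in this model already disables the schedule, which is the behavior the suggestion above is aiming for.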

Best,
Temesgen

Thanks Temesgen for your reply.

Unfortunately, this setting does something slightly different, at least with the 4.13.4 server version that I am testing with.
This setting disables schedules when the job fails to load, meaning that the job never even starts to execute. But I’m dealing with jobs that load just fine and then cannot finish executing because of unexpected inputs (data with different table specs, empty tables, etc.).

The config option you mention is set to 3 for us, and we have actually seen it work properly once before, but unfortunately it does not help in my case. Maybe a server update will change this behavior? I notice that the wording you quote is slightly different from the wording on the configuration page of my server installation (mine mentions the word “load”, for example).

But, additionally, there is another piece of functionality I would like to bring in with my described solution:

Once I know which schedule has just failed, I can send custom e-mail notifications instead of the simple “workflow failure mails” that I can set up in the schedule configuration. I would like to set up a system to manage these failures. I could imagine sending a link in that e-mail that allows the user to restart the schedule after they have figured out the problem with the input data that caused the issue. That way the workflow can continue to run until the user is able to update it to handle these new edge cases.
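
As a rough sketch of what such a notification could look like, the Python snippet below builds the message with the standard library's `email.message.EmailMessage`. The function name `build_failure_mail`, the sender address, and the restart URL are all placeholders I made up for illustration; nothing here is a KNIME API, and actually sending the mail is left out.

```python
from email.message import EmailMessage


def build_failure_mail(schedule_name, recipient, restart_url,
                       sender="knime-server@example.com"):
    """Build a notification for a schedule that was disabled after a failed run.

    All arguments are hypothetical: restart_url stands for whatever link
    would re-enable the schedule in a given setup.
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"Schedule disabled after failure: {schedule_name}"
    msg.set_content(
        f"The schedule '{schedule_name}' failed during execution and was disabled.\n"
        "Once the input data has been checked, it can be re-enabled here:\n"
        f"{restart_url}\n"
    )
    return msg
```

The resulting message could then be handed to `smtplib.SMTP.send_message` or to whatever mail mechanism is already in use.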

Best,
smnkrs

Hi smnkrs,

Thanks for the clarification. I understand now, and it makes perfect sense to me too.

The option I mentioned only works if the job fails to start for any reason. It is blind to the state of a started job. Your original approach is the only thing I can come up with.

Hi

I think I have a solution for you.
From the workflow being called (disable_schedule) you can get its job ID. With that job ID you can fetch the complete information about the job, including the parent job ID, which is the key to refer back to the calling workflow and the schedules under it.

Then you can disable the schedules under that workflow.

For this to work, jobs shouldn’t be discarded automatically.
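
The lookup chain described above can be sketched in Python roughly as follows. This is only an outline under assumptions: `get_job` and `list_schedules` are hypothetical callables standing in for whatever server calls return job and schedule information, and the field names `parentJobId`, `workflowPath`, `id`, and `enabled` are placeholders that would have to be matched to the real responses.

```python
def find_schedules_to_disable(disabler_job_id, get_job, list_schedules):
    """Return the IDs of enabled schedules on the workflow whose job called the disabler.

    get_job(job_id) -> dict and list_schedules(workflow_path) -> list of dicts
    are injected so that no particular server API is assumed here.
    """
    disabler_job = get_job(disabler_job_id)
    parent_id = disabler_job.get("parentJobId")  # hypothetical field name
    if parent_id is None:
        raise LookupError("disabler job has no parent job ID")

    # This lookup only succeeds if the parent job still exists,
    # which is why it must not have been discarded automatically.
    parent_job = get_job(parent_id)
    workflow_path = parent_job["workflowPath"]  # hypothetical field name

    return [
        schedule["id"]
        for schedule in list_schedules(workflow_path)
        if schedule.get("enabled", True)
    ]
```

Passing the two lookups in as functions keeps the sketch independent of how the server is actually queried, and makes it easy to try out with plain dictionaries before wiring it to real calls.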

I have created a workflow that does what I just explained. Feel free to give it a try and adjust it to your needs.

Let me know if this helps.

Best,
Temesgen

Hi Temesgen,

Excellent pointer. That is exactly what I was looking for! I’ll try this out soon and let you know if I find anything else of note.

Thanks a lot!

Best,
Simon
