Jobs get discarded automatically after a few minutes

Hi,

When we start a job (using the workflow APIs), it runs properly at first but is suddenly discarded after a few minutes, without any errors in the workflow.

What could be the cause of this issue? In what circumstances do jobs get discarded automatically? How can we fix this?

:blush:

Hi @armingrudd,

Inactive jobs are automatically discarded after a period that can be defined in the KNIME Server configuration:

com.knime.server.job.max_lifetime=

Usually, it is set to 7 days.

Can you check whether it was decreased?

Best,
Julian

Hi @julian.bunzel,

No, it’s not the job max lifetime. The issue is that jobs get discarded while they are active and running.
Anyway, the configuration you mentioned is set to 7 days.

KNIME Executor version: 4.2.4
KNIME Server version: 4.11.4.0153-b0c4cc5ef

:blush:

Hey @armingrudd,

Sorry, I misread that one.

Has the maximum execution time been adjusted at some point?
com.knime.server.job.max_execution_time=
com.knime.server.job.discard_after_timeout=

In combination, these two settings could cause running workflows to be discarded if they exceed the maximum execution time.
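To check these settings quickly, a small script along these lines can report what the server configuration file currently contains. It is only a sketch: the three setting names come from this thread, while the function name, the sample text, and the value formats in the sample (`7d`, `true`) are illustrative assumptions.

```python
# Setting names are taken from this thread; everything else is illustrative.
SETTINGS = (
    "com.knime.server.job.max_lifetime",
    "com.knime.server.job.max_execution_time",
    "com.knime.server.job.discard_after_timeout",
)


def read_job_settings(config_text):
    """Return each job setting's value, or None if it is absent or commented out."""
    found = {name: None for name in SETTINGS}
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines that are not key=value pairs.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        if key in found:
            found[key] = value.strip()
    return found


# Hypothetical excerpt; the value formats are placeholders, not documented syntax.
sample = """
# com.knime.server.job.max_execution_time=
com.knime.server.job.max_lifetime=7d
com.knime.server.job.discard_after_timeout=true
"""
settings = read_job_settings(sample)
```

A `None` entry means the setting is not active in the file, so whatever default the server applies is in effect.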

Does it happen with every job or just with specific ones?

Best,
Julian

The max_execution_time is not set. discard_after_timeout is true.

This happens to different workflows. The one I’m sure about typically takes more than 10 minutes to run, and the issue occurs after that point, at a varying time between 10 and 15 minutes.

Also it seems the issue happens when we run jobs by using workflow APIs.

Hi @armingrudd

I see we have recently had some similar questions about this topic, so perhaps I can help with some clarification.

When using the :execution endpoint, as documented in the Swagger page, KNIME Server will create a job, execute this job, and discard it. However, there is a very important aspect to this: if the job has not finished executing within a certain timeout (the default is ten minutes), the job will be cancelled and discarded. The timeout is a call timeout, not a loading timeout. It is not possible to have an infinite timeout, either.

So there are two things you can do:
a) increase your timeout
b) perform separate calls to create a job, and to execute a job, like I do in this example: https://kni.me/w/WK4ocXripq5o9quY

Best wishes
Ana

Hi @ana_ved ,
The workflow gets loaded and executed. It gets discarded in the middle of execution. I can track the progress by opening the job in KNIME AP but suddenly (after about 10 minutes) it gets discarded automatically.
I suspect this is related to our last issue, where we updated KNIME Server to 4.11.4 and the workflows then stopped loading. That issue was fixed by chance, but now I guess the executor and KNIME Server are not communicating well, so although the workflows get loaded, the Server doesn’t know and discards the jobs.

:blush:

Hi @armingrudd

Are you using the :execution endpoint to execute with the standard timeout? If so, the job is expected to be discarded after 10 minutes even if the workflow execution has not finished. The timeout is a call timeout, not a loading timeout.

If the above is not the case, we should investigate your hypothesis. Let me know :smiley:

Best wishes
Ana

We are using the :execution endpoint. We increased the timeout, but it didn’t work.

:blush:

Ah, ok!

Can you try with the approach I give as an example here: https://kni.me/w/WK4ocXripq5o9quY

This way there would be no timeout and automatic discard. If the job gets discarded anyways after ten minutes, that would be suspicious.

I don’t think I can send workflow parameters using “jobs” instead of “execution”, since I’m using container input nodes. The request body contains the workflow variables, not the parameters I have created in the JSON Container.

I think I talked about this in one of the Summit sessions and it was agreed that I need to use “execution” to be able to send parameters. Am I missing something?

:blush:

@ana_ved ,

http://…/knime/rest/v4/repository/Test_Workflow_API_execution:execution?timeout=1800000

Is this the correct format to send the request with the “timeout” parameter? When I use Swagger, the timeout works, but a request like the one above from an external app doesn’t.

Hi @armingrudd,

Yes, this is the correct format. Can you provide the logs for the recent attempts via mail again?
Sorry about the inconvenience again.
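For anyone building this request from an external app, here is a minimal sketch of assembling a URL in the format confirmed above. The host and port (`localhost:8080`) and the function name are hypothetical placeholders, and reading 1800000 as 30 minutes in milliseconds is an assumption based on the values used in this thread.

```python
from urllib.parse import quote, urlencode


def execution_url(base_url, workflow_path, timeout_ms):
    """Build a ':execution' URL with an explicit timeout query parameter."""
    # Percent-encode the repository path but keep '/' between folder names.
    path = quote(workflow_path.strip("/"))
    query = urlencode({"timeout": timeout_ms})
    return f"{base_url.rstrip('/')}/repository/{path}:execution?{query}"


# Hypothetical host and port; the workflow name is the one from this thread.
url = execution_url(
    "http://localhost:8080/knime/rest/v4",
    "Test_Workflow_API_execution",
    1800000,
)
```

Building the query string with `urlencode` rather than by hand helps rule out quoting mistakes on the client side when the same request works in Swagger.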

Best,
Julian

Hi @julian.bunzel ,

I sent you the log files via email.

Thanks.

Hi again,

I can see the request returns a 500 (Internal Server Error) response code. Whenever there is a code other than 2xx, there should be a response body containing the specific error message. Does the external tool you are using to call the workflow allow you to inspect the response?
If not, you could try to call this workflow by using Postman and see what kind of error you see there.
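If the calling tool is a script, the error body can also be captured directly. The following is a rough sketch using only Python's standard library; the function name `call_and_report` and the injectable `opener` argument are illustrative choices, not part of any KNIME API.

```python
import urllib.error
import urllib.request


def call_and_report(request, opener=urllib.request.urlopen):
    """Return (status_code, body_text) for successful and failed calls alike."""
    try:
        with opener(request) as response:
            return response.status, response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # HTTPError doubles as a response object, so the error body is readable.
        return err.code, err.read().decode("utf-8", errors="replace")
```

Passing a `urllib.request.Request` (or a plain URL string) returns the status code together with whatever message the server put in the body, which is the part that explains a 500.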

There is also an issue that a defined parameter (db-schema) cannot be used since there is no node that makes use of it. Can you remove that one if not needed and try again?

Best,
Julian

Hi @armingrudd,

We found a solution to execute the workflow in two steps (create the job, then execute it).
First you need to create the job by using

:jobs?timeout=-1

instead of

:execution

Now for executing and parameterization we can use:

http://hostname:port/knime/rest/v4/jobs/{job-id}

For this call you can specify the parameters in the request body like this:

{
  "json-input": {"param": "value"}
}

“json-input” refers to the parameter name of the Container Input (JSON) node and might be different in your case.

Ana’s workflow shows how to do this two-step approach with a KNIME workflow:
create_job_and_execute.knwf (24.4 KB)

If you want to do it with an external tool, you need to take the same steps as described. You might need to wait a little bit between each call because you can’t execute the workflow if it has not been successfully loaded.
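For an external tool written in Python, the two steps could look roughly like the sketch below. Treat it as an outline under assumptions rather than a reference client: the use of POST for both calls, the `id` field in the job-creation reply, the fixed pause, and all function names are assumptions for illustration, and authentication is left out entirely.

```python
import json
import time
import urllib.request


def http_send(method, url, body=None):
    """Send one JSON request and return the parsed JSON reply (no authentication)."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(
        url, data=data, method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        text = response.read().decode("utf-8")
    return json.loads(text) if text else {}


def run_in_two_steps(base_url, workflow_path, parameters,
                     send=http_send, wait_seconds=5):
    """Create a job, pause briefly, then execute it with the given parameters."""
    base = base_url.rstrip("/")
    # Step 1: create the job with ':jobs?timeout=-1' instead of ':execution'.
    create_url = f"{base}/repository/{workflow_path.strip('/')}:jobs?timeout=-1"
    job = send("POST", create_url)
    job_id = job["id"]  # assumption: the reply carries the job id under "id"
    # Pause so the workflow can finish loading before it is executed.
    time.sleep(wait_seconds)
    # Step 2: execute the job; the body holds the Container Input values.
    return send("POST", f"{base}/jobs/{job_id}", parameters)
```

A call such as `run_in_two_steps("http://hostname:port/knime/rest/v4", "Test_Workflow_API_execution", {"json-input": {"param": "value"}})` mirrors the steps described above. A fixed pause is the simplest option; checking the job's state before executing would be more robust, but that is not covered in this thread.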

Best,

Julian

@ana_ved @julian.bunzel,

It seems that the timeout parameter has a max value of less than 30 minutes. I tested in Swagger and when I set the timeout to 1800000 the job gets discarded before it is finished. The test workflow runs a wait node and I set the wait time to 16 minutes. The job gets discarded before it is successfully completed.

So it seems that we have to use the “jobs” endpoint instead of “execution”. I will try your solution and will get back to you.

Thank you for taking the time to solve this issue.

:blush:

@ana_ved @julian.bunzel ,

Thank you so much, guys. The two-step method to run the jobs works perfectly. We could send parameters to the jobs in the second request, and the workflow did not get discarded.

:blush:
