KNIME webportal timeout issues

Hi all,

We frequently see disconnections when we launch KNIME workflows from our WebPortal. Here are extracts of the various errors we find in the log files:

In localhost.x.log:

20-Jan-2021 10:31:14.797 SEVERE [http-nio-8080-exec-9] com.knime.enterprise.webportal.WebportalUI.error class java.io.IOException
java.io.IOException: Connection reset by peer

20-Jan-2021 10:31:14.789 SEVERE [http-nio-8080-exec-9] com.knime.enterprise.webportal.WebportalUI.displayCustomError Load error
java.util.concurrent.TimeoutException: Job wasn’t loaded within 1 minute. This usually happens if no executor is available. Please check back with your server administrator.

20-Jan-2021 10:31:14.789 SEVERE [http-nio-8080-exec-9] com.knime.enterprise.webportal.components.wkfpanels.WorkflowJobInputFirstPagePanel.attach Failed to load quick form input: Job wasn’t loaded within 1 minute. This usually happens if no executor is available. Please check back with your server administrator.

20-Jan-2021 13:06:19.023 SEVERE [http-nio-8080-exec-2] com.knime.enterprise.webportal.components.wkfpanels.WorkflowJobInputSubsequentPagePanel.attach Failed to load quick form input: Unable to get current page: Failed to restore job ‘/Chemistry[…]7ae)’ from swap within timeout: Waiting time has elapsed before state change occured

In catalina.x.log:

20-Jan-2021 10:31:14.797 SEVERE [http-nio-8080-exec-9] com.vaadin.server.DefaultErrorHandler.doDefault
org.apache.catalina.connector.ClientAbortException: java.io.IOException: Connection reset by peer

20-Jan-2021 10:23:28.281 SEVERE [http-nio-8080-exec-9] org.apache.cxf.jaxrs.utils.JAXRSUtils.logMessageHandlerProblem Problem with writing the data, class java.lang.String, ContentType: application/vnd.knime.workflow+zip

In knime-server.config we have com.knime.server.gateway.timeout=2m, yet the error “Job wasn’t loaded within 1 minute” does not seem to use this 2m value. Which configuration parameter controls that timeout?
We are using default values for com.knime.server.job.default_report_timeout, com.knime.server.job.discard_after_timeout and com.knime.server.job.max_execution_time.
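
One way to see which timeouts are set explicitly is to filter the config file for keys containing "timeout". Anything not listed there falls back to its built-in default, which may be why a 1-minute limit applies even though the gateway timeout is 2m. A minimal sketch, assuming the file uses plain key=value lines with # comments and durations written as a number plus a unit letter (s, m, h, d); the sample text and the helper names are hypothetical, and only the gateway timeout value comes from this post:

```python
import re


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def duration_to_seconds(value):
    """Convert a duration such as '2m' or '36h' to seconds (unit letters are an assumption)."""
    match = re.fullmatch(r"(\d+)\s*([smhd])", value)
    if match is None:
        raise ValueError(f"unrecognised duration: {value!r}")
    amount, unit = int(match.group(1)), match.group(2)
    return amount * {"s": 1, "m": 60, "h": 3600, "d": 86400}[unit]


def timeout_settings(props):
    """Keep only the properties whose key mentions a timeout."""
    return {k: v for k, v in props.items() if "timeout" in k.lower()}


# Hypothetical excerpt of knime-server.config; replace with the real file content.
sample = """
# only this value is taken from the post above
com.knime.server.gateway.timeout=2m
"""

explicit = timeout_settings(parse_properties(sample))
for key, value in sorted(explicit.items()):
    print(f"{key} = {value} ({duration_to_seconds(value)} s)")
```

To run it against the real file, read the file content into a string and pass that string to parse_properties in place of the sample.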

Do you have any recommendations for avoiding these types of issues?

The error “JAXRSUtils.logMessageHandlerProblem Problem with writing the data” may not be linked to the others. Do you know what could cause it?
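
To check which entries in the two logs belong to the same event, the lines can be matched on thread name and timestamp. A rough sketch, assuming the Tomcat log line layout seen in the extracts above (date, level, thread in brackets, source) and an English locale for the month abbreviation; the function names and the 1-second window are arbitrary choices for illustration:

```python
import re
from datetime import datetime, timedelta

# Layout assumed from the extracts: "20-Jan-2021 10:31:14.797 SEVERE [thread] source ..."
LINE = re.compile(
    r"^(?P<ts>\d{2}-[A-Za-z]{3}-\d{4} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<level>[A-Z]+) \[(?P<thread>[^\]]+)\] (?P<source>\S+)"
)


def parse_entries(log_name, text):
    """Return (timestamp, thread, log name, source) for every line that matches LINE."""
    entries = []
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            # %b expects English month abbreviations under the default locale.
            ts = datetime.strptime(m.group("ts"), "%d-%b-%Y %H:%M:%S.%f")
            entries.append((ts, m.group("thread"), log_name, m.group("source")))
    return entries


def correlated(entries, window=timedelta(seconds=1)):
    """Pair entries from different logs that share a thread and lie within `window`."""
    pairs = []
    for i, a in enumerate(entries):
        for b in entries[i + 1:]:
            if a[2] != b[2] and a[1] == b[1] and abs(a[0] - b[0]) <= window:
                pairs.append((a, b))
    return pairs


localhost_log = """\
20-Jan-2021 10:31:14.797 SEVERE [http-nio-8080-exec-9] com.knime.enterprise.webportal.WebportalUI.error class java.io.IOException
20-Jan-2021 10:31:14.789 SEVERE [http-nio-8080-exec-9] com.knime.enterprise.webportal.WebportalUI.displayCustomError Load error
"""
catalina_log = """\
20-Jan-2021 10:31:14.797 SEVERE [http-nio-8080-exec-9] com.vaadin.server.DefaultErrorHandler.doDefault
20-Jan-2021 10:23:28.281 SEVERE [http-nio-8080-exec-9] org.apache.cxf.jaxrs.utils.JAXRSUtils.logMessageHandlerProblem Problem with writing the data, class java.lang.String, ContentType: application/vnd.knime.workflow+zip
"""

entries = parse_entries("localhost", localhost_log) + parse_entries("catalina", catalina_log)
pairs = correlated(entries)
for a, b in pairs:
    print(f"{a[0].time()} [{a[1]}] {a[2]}: {a[3]}  <->  {b[2]}: {b[3]}")
```

With these extracts, the ClientAbortException in the catalina log pairs with the two 10:31:14 WebPortal entries, while the JAXRSUtils entry (10:23:28, nearly eight minutes earlier) pairs with nothing. That suggests, but does not prove, that it is a separate event.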

We are using KNIME client v4.2.2 for the WebPortal and KNIME Server v4.11.3.

Best,
Cyrille

Hi Cyrille,

Do you happen to know what your -Xmx settings are in knime.ini and for the Apache Tomcat service? A lack of dedicated memory may be the cause if relatively normal workflows are failing to load or to be swapped. Please let us know those settings and we can continue from there.

Thanks,
Zack

Hi Zack,

Yes, and we are trying to tune the settings to match our memory consumption (or to find out whether more memory is needed).

server.xml:

[…] <Resource auth="Container" driverClassName="org.h2.Driver" maxIdle="30" maxTotal="100" maxWaitMillis="10000" name="H2UserDatabase" type="javax.sql.DataSource" url="jdbc:h2:${catalina.home}/conf/userconf"/>
[…]
<Connector compressibleMimeType="text/html,text/xml,text/plain,text/css,text/javascript,application/javascript,application/json,application/vnd.mason+json" compression="on" connectionTimeout="20000" port="" protocol="HTTP/1.1" redirectPort="" server="Apache Tomcat"/>
<Connector SSLEnabled="true" compressibleMimeType="text/html,text/xml,text/plain,text/css,text/javascript,application/javascript,application/json,application/vnd.mason+json" compression="on" maxThreads="150" port="*" protocol="org.apache.coyote.http11.Http11Nio2Protocol" scheme="https" secure="true" server="Apache Tomcat">
  <SSLHostConfig protocols="all,-TLSv1,-SSLv3,-SSLv2Hello">
    <Certificate certificateKeystoreFile="conf/knime-server.jks" certificateKeystorePassword="knimeknime" type="RSA"/>
  </SSLHostConfig>
</Connector>
[…]

knime.ini:

-startup
plugins/org.eclipse.equinox.launcher_1.5.700.v20200207-2156.jar
--launcher.library
plugins/org.eclipse.equinox.launcher.gtk.linux.x86_64_1.1.1100.v20190907-0426
-vm
plugins/org.knime.binary.jre.linux.x86_64_1.8.0.252-b09/jre/bin
-profileLocation
http://127.0.0.1:8080/knime/rest/v4/profiles/contents
-profileList
executor
-vmargs
-Dcom.knime.enterprise.executor.msgq=amqp://knime:20knime16@127.0.0.1/
-Dchemaxon.license.url=/site/tl/app/x86_64/discovery/knime/desktop/4.2.0_wp/licenses/license.cxl
-server
-Dsun.java2d.d3d=false
-Dosgi.classloader.lock=classname
-XX:+UnlockDiagnosticVMOptions
-XX:+UnsyncloadClass
-XX:+UseG1GC
-Dsun.net.client.defaultReadTimeout=0
-XX:CompileCommand=exclude,javax/swing/text/GlyphView,getBreakSpot
-Xmx16G
-Dorg.eclipse.swt.internal.gtk.disablePrinting
-Duser.language=en

Yesterday we changed -Xmx8G to -Xmx16G in knime.ini because we had the error “The web application [knime] appears to have started a thread named [Thread-24] but has failed to stop it. This is very likely to create a memory leak.” in the catalina log and, at the same time, “java.util.concurrent.TimeoutException: Waiting time has elapsed before state change occured” in the localhost log. Are there any other actions we could take to avoid these issues?

Best regards,
Cyrille

PS: Only KNIME Server runs on this machine. A few minutes after rebooting the server we have:
$ free -m
              total        used        free      shared  buff/cache   available
Mem:          23948        4108       15259         203        4579       19274
Swap:          2047         964        1083

With some protocols running, for example, we had:
$ free -m
              total        used        free      shared  buff/cache   available
Mem:          23948       20769         453         192        2725        2624
Swap:          2047        1412         635
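
For a rough idea of how much room the new heap setting leaves, the free -m figures can be compared with the -Xmx value. A small sketch, assuming the figures above and the -Xmx16G from knime.ini; the helper names are made up for illustration. Note that -Xmx only caps the Java heap, so the executor process can use more than that (metaspace, thread stacks, GC structures), and the Tomcat heap size is not given in this thread, so the result is an upper bound on what is left for everything else.

```python
import re


def parse_free_m(output):
    """Parse `free -m` output (header line plus rows) into {row: {column: MiB}}."""
    lines = [line for line in output.strip().splitlines() if line.strip()]
    columns = lines[0].split()
    table = {}
    for line in lines[1:]:
        name, *values = line.split()
        # The Swap row has fewer columns; zip stops at the shorter sequence.
        table[name.rstrip(":")] = dict(zip(columns, map(int, values)))
    return table


def xmx_to_mib(arg):
    """Convert a JVM argument such as -Xmx16G or -Xmx8192m to MiB."""
    m = re.fullmatch(r"-Xmx(\d+)([mMgG])", arg)
    if m is None:
        raise ValueError(f"unsupported -Xmx format: {arg!r}")
    return int(m.group(1)) * (1024 if m.group(2).lower() == "g" else 1)


# Figures copied from the second `free -m` output above (with protocols running).
under_load = """
total used free shared buff/cache available
Mem: 23948 20769 453 192 2725 2624
Swap: 2047 1412 635
"""

mem = parse_free_m(under_load)
heap = xmx_to_mib("-Xmx16G")
outside_heap = mem["Mem"]["total"] - heap

print(f"executor max heap:        {heap} MiB")
print(f"RAM left outside heap:    {outside_heap} MiB")
print(f"available under load:     {mem['Mem']['available']} MiB")
print(f"swap in use under load:   {mem['Swap']['used']} MiB")
```

If the amount left outside the heap has to cover Tomcat, the message queue and the OS, swap usage under load is a sign that the machine may be short on memory rather than the heap being too small.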