How to analyse issues with KNIME server?

Hello,

Unfortunately, we have lately had more and more issues with the server no longer working, meaning the WebPortal is no longer shown and/or all scheduled workflows suddenly stop being executed. As the latter is business critical, I would like to find and fix the root cause (hard drive space is not the issue).

Additionally, if I try to stop the server's domain for a restart (asadmin stop-domain), it just hangs and does nothing. Only after cancelling and running the stop command again does it work.

Now my question: how can I find out what's wrong (e.g. the next time I see that everything is hanging)? I saw some logs in the GlassFish folder, but I didn't really know what to look for, and there didn't seem to be any errors indicated.

PS: We are still using GlassFish v2.1.1 and KNIME Server 3.10.


 

One issue I had today was a KNIME process using up all the CPU:

 PID USER  PR NI  VIRT  RES SHR S   %CPU %MEM   TIME+ COMMAND
5387 knime 20  0 21.2g 8.2g 22m S 1584.1 17.5 8906:59 java

5387     /opt/knime/knime_2.11.2/jre/bin/java -XX:MaxPermSize=512m -server -Dsun.java2d.d3d=false -Dosgi.classloader.lock=classname -XX:+UnlockDiagnosticVMOptions -XX:+UnsyncloadClass -Dknime.enable.fastload=true -Xmx8G -Duser.home=/opt/knime/workflows/runtime/runtime_knime-rmi-1100 -Dcom.knime.server.rmi_port=1100 -Dknime.disable.vmfilelock=true -Djava.awt.headless=true -jar /opt/knime/knime_2.11.2//plugins/org.eclipse.equinox.launcher_1.2.0.v20110502.jar -os linux -ws gtk -arch x86_64 -launcher /opt/knime/knime_2.11.2/knime -name Knime --launcher.library /opt/knime/knime_2.11.2//plugins/org.eclipse.equinox.launcher.gtk.linux.x86_64_1.1.100.v20110505/eclipse_1407.so -startup /opt/knime/knime_2.11.2//plugins/org.eclipse.equinox.launcher_1.2.0.v20110502.jar --launcher.overrideVmargs -exitdata 3c7c8007 -user @none -data /opt/knime/workflows/runtime/runtime_knime-rmi-1100 -preferences=/opt/knime/workflows/.preferences.epf -application com.knime.enterprise.slave.KNIME_REMOTE_APPLICATION -vm /opt/knime/knime_2.11.2/jre/bin/java -vmargs -XX:MaxPermSize=512m -server -Dsun.java2d.d3d=false -Dosgi.classloader.lock=classname -XX:+UnlockDiagnosticVMOptions -XX:+UnsyncloadClass -Dknime.enable.fastload=true -Xmx8G -Duser.home=/opt/knime/workflows/runtime/runtime_knime-rmi-1100 -Dcom.knime.server.rmi_port=1100 -Dknime.disable.vmfilelock=true -Djava.awt.headless=true -jar /opt/knime/knime_2.11.2//plugins/org.eclipse.equinox.launcher_1.2.0.v20110502.jar

How can such a situation be analysed (root cause?) or avoided? Checking the server manually from time to time is not a solution.

You can check what the KNIME process is doing using jstack. This gives you a stack trace of all running threads. If a Java process is using all CPUs, it's often the garbage collector that tries to free memory but can't. In such cases increasing the heap size may help.
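As a rough sketch of how a jstack dump could be narrowed down to the busy threads: on Linux, `top -H -p <pid>` lists per-thread IDs in decimal, while the jstack dump labels each thread with a hexadecimal `nid=0x...`. The helper below matches the two. The function names and the sample dump (including the thread names in it) are made up for illustration; a real dump would come from running `jstack <pid>` as the same user that owns the process.

```python
import re

# Header line of a thread entry in a HotSpot jstack dump, e.g.
# "GC task thread#0 (ParallelGC)" prio=10 tid=0x... nid=0x150c runnable
THREAD_HEADER = re.compile(r'^"(?P<name>[^"]+)".*\bnid=0x(?P<nid>[0-9a-fA-F]+)')


def threads_by_nid(dump_text):
    """Map native thread id (decimal) -> thread name for a jstack dump."""
    result = {}
    for line in dump_text.splitlines():
        match = THREAD_HEADER.match(line)
        if match:
            result[int(match.group("nid"), 16)] = match.group("name")
    return result


def name_hot_threads(dump_text, hot_tids):
    """Look up the decimal thread IDs reported by `top -H` in the dump."""
    names = threads_by_nid(dump_text)
    return {tid: names.get(tid, "<not in dump>") for tid in hot_tids}


# Hypothetical, shortened sample dump (not taken from the server above).
SAMPLE_DUMP = '''\
"KNIME-Worker-3" #41 prio=5 tid=0x00007f1a2c0d1000 nid=0x1523 runnable [0x00007f19f0a0b000]
   java.lang.Thread.State: RUNNABLE
"GC task thread#0 (ParallelGC)" prio=10 tid=0x00007f1a2c01f800 nid=0x150c runnable
'''

if __name__ == "__main__":
    # 5388 is 0x150c and 5411 is 0x1523 in the sample dump.
    for tid, name in name_hot_threads(SAMPLE_DUMP, [5388, 5411]).items():
        print(tid, name)
```

If the threads at the top of `top -H` turn out to be mostly GC threads, that would support the heap-pressure explanation; if they are worker threads, their stack traces in the dump show what they are busy with.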

Okay, thank you... I will check it the next time it happens.

But back to my original issue: currently no one can log in via the WebPortal (it seems it cannot connect), and I cannot log in via the KNIME client directly either. Logging in to the GlassFish server directly works, though.

I checked the jvm.log, knime-webportal.log and server.log, and they all seem fine. How can I proceed to find the root cause?

If such support/help is not covered by our server license, then please contact me directly so we can find a way.

Thank you.

What is the exact error message that you are getting when trying to log in? Is there anything written in the server's log file?

What is the exact error message that you are getting when trying to log in?

It's a timeout.

Is there anything written in the server's log file?

No, nothing. That's why I was wondering whether I'm even looking at the relevant logs (as written above).

A timeout indicates that either there is a network/firewall problem and packets get dropped or that the server is too busy. In the latter case you should see a very high load or lots of messages in the server's log file (domains/knime/logs/server.log). If the server isn't running at all you usually don't get a timeout but a "connection refused".
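The difference between a timeout and a "connection refused" can be checked from a client machine with a small probe. This is only a sketch: the function name `probe` is made up, and the host and port in the usage comment are placeholders to be replaced with the server's actual address.

```python
import socket


def probe(host, port, timeout=5.0):
    """Classify one TCP connection attempt as 'open', 'refused' or 'timeout'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on this port.
        return "refused"
    except socket.timeout:
        # No answer at all: packets dropped, or the host is unreachable.
        return "timeout"
    except OSError as exc:
        # Anything else, e.g. name resolution failure or no route to host.
        return "error: %s" % exc


# Example call (hypothetical host and port):
#   probe("knime-server.example.com", 8080)
```

Note that "open" only means the TCP handshake succeeded. A server that accepts the connection but never answers the request would still show up as "open" here, and the timeout would then occur at the application level.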

This morning we had the same problem:

  • No scheduled workflows are executed anymore
  • Login from the KNIME client is "Waiting" forever, and the same happens via the WebPortal
  • There are no recent entries in the server.log, the last one is many days old already:
    • INFO: WEB0712: Starting Sun GlassFish Enterprise Server v2.1.1 HTTP/1.1 on 4848
      Aug 31, 2015 11:41:00 AM com.sun.enterprise.management.selfmanagement.SelfManagementService onReady
      INFO: SMGT0007: Self Management Rules service is enabled
      Aug 31, 2015 11:41:00 AM com.sun.enterprise.server.PEMain main
      INFO: Application server startup complete.
  • There is no load on the server (as none of the scheduled jobs run anymore), but it's still running
  • GlassFish is working without any issues and I can log in via the portal

What can I check from here on? Are there any other logs, commands or similar to check the state?

Can you send me the complete server.log to thorsten.meinl@knime.com?