I am tasked with supporting the server our team is using to do calculations. The server hangs and needs to be rebooted 1-2 hours after the job is kicked off. The server has 56 CPUs and 64 GB of RAM and is running RHEL 7.5. While the job is running, it might have 8-10 CPUs at 85-100% utilization and consume about 8 GB of RAM. I don’t really know much about what the end users are doing with this server. Can anyone provide some idea of where to begin troubleshooting this issue? What logs can I inspect? What questions can I ask the end user to provide more info to this forum? I have worked with the OS vendor and they have not provided much help.
Welcome to the KNIME community!
To troubleshoot, we need to have a look at the log files. You can get them by logging in to the KNIME WebPortal as admin, going to Administration (top right corner), and clicking “Download logs”. From the downloaded logs, please send the ones from a day when the issue occurred. I’ll contact you via PM so you don’t have to share your logs publicly.
In case you don’t have access to the WebPortal, you can grab the logs directly from the server; they are in / apache-tomee-plus-7.0.5/logs. From there, I’d need the catalina.yyyy-MM-dd.log and the localhost.yyyy-MM-dd.log.
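If it saves some clicking, here is a minimal Python sketch for picking out those two files for a given day. It only assumes the catalina.yyyy-MM-dd.log and localhost.yyyy-MM-dd.log naming mentioned above; the function names and the `logs_dir` argument are hypothetical, so point it at wherever the logs folder sits on your server.

```python
from datetime import date
from pathlib import Path
from typing import List


def log_names_for(day: date) -> List[str]:
    """Build the two log file names for one day (yyyy-MM-dd pattern)."""
    stamp = day.strftime("%Y-%m-%d")
    return [f"catalina.{stamp}.log", f"localhost.{stamp}.log"]


def existing_logs(logs_dir: str, day: date) -> List[Path]:
    """Return only those of the two logs that are actually present in logs_dir."""
    base = Path(logs_dir)  # logs_dir is an assumption: your server's logs folder
    return [base / name for name in log_names_for(day) if (base / name).is_file()]
```

For example, `existing_logs(logs_dir, date(2019, 3, 5))` would return the paths of the catalina and localhost logs for that day, or an empty list if neither file exists.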
In addition, please check the knime.ini (located in the server executor directory, next to the executable) to see what value is set for -Xmx. If this is only -Xmx8G, you can safely increase this to a larger value, given that you have 64 GB at your disposal.
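If you would rather not eyeball the file, the -Xmx check can be sketched in a few lines of Python. This is only an illustration: it assumes each JVM option sits on its own line in knime.ini, it handles only the k, m and g size suffixes, and the function names are hypothetical.

```python
import re
from typing import Optional

# Multipliers for the size suffixes handled in this sketch (bytes if no suffix).
_UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def find_xmx(ini_text: str) -> Optional[str]:
    """Return the value of the first -Xmx line, e.g. '8G', or None if absent."""
    for line in ini_text.splitlines():
        match = re.fullmatch(r"-Xmx(\d+[kKmMgG]?)", line.strip())
        if match:
            return match.group(1)
    return None


def xmx_to_gib(value: str) -> float:
    """Convert a value such as '8G' or '2048m' to GiB."""
    number, suffix = re.fullmatch(r"(\d+)([kKmMgG]?)", value).groups()
    return int(number) * _UNITS[suffix.lower()] / 1024 ** 3
```

Reading the file and calling `find_xmx(text)` on its contents gives you the configured value; passing that to `xmx_to_gib` makes it easy to compare against the 64 GB on the machine.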
That should be it for a start!