After upgrading to KNIME 5.1 and turning on the columnar backend, we are experiencing unstable execution of KNIME workflows in batch mode under Ubuntu. The errors are not consistent, but they show up quite frequently. Below is one example:
ERROR KNIME-Worker-7-Loop End 3:696:0:649 LocalNodeExecutionJob Caught “IllegalStateException”: Memory was leaked by query. Memory leaked: (659456)
Allocator(ArrowColumnStore) 0/659456/41308160/2411724800 (res/actual/peak/limit)
java.lang.IllegalStateException: Memory was leaked by query. Memory leaked: (659456)
Allocator(ArrowColumnStore) 0/659456/41308160/2411724800 (res/actual/peak/limit)
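The four numbers after the allocator name are labelled res/actual/peak/limit and are byte counts. The "actual" value (659456) matches the amount reported as leaked. The small Python helper below is hypothetical (not part of KNIME or Arrow) and only converts such a line into MiB. Here, about 0.63 MiB was still held when the allocator was closed, compared with a peak of about 39.4 MiB and a limit of exactly 2300 MiB.

```python
import re

# Matches the allocator summary printed in the log, e.g.
# "Allocator(ArrowColumnStore) 0/659456/41308160/2411724800 (res/actual/peak/limit)"
_ALLOC_RE = re.compile(
    r"Allocator\((?P<name>[^)]+)\)\s+"
    r"(?P<res>\d+)/(?P<actual>\d+)/(?P<peak>\d+)/(?P<limit>\d+)"
)
MIB = 1024 * 1024
FIELDS = ("res", "actual", "peak", "limit")


def parse_allocator_line(line):
    """Return a dict with the byte counts and allocator name, or None if no match."""
    m = _ALLOC_RE.search(line)
    if m is None:
        return None
    stats = {k: int(m.group(k)) for k in FIELDS}
    stats["name"] = m.group("name")
    return stats


def describe(stats):
    """Format the four byte counts in MiB with two decimals."""
    return ", ".join(f"{k}={stats[k] / MIB:.2f} MiB" for k in FIELDS)
```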
We also experienced a memory leak that led to continuous allocation of memory when running the columnar backend with KNIME 5.1 on Ubuntu. This seems to have been solved by installing libjemalloc2, as suggested in an article I found somewhere.
Thank you for the report and sorry to hear that you are experiencing memory issues.
The memory leak is unfortunately a problem with the Java virtual machine (JVM) running on glibc on Linux. This is very unfortunate, but it is nothing we can fix. Switching to jemalloc was a perfect workaround, and I'm glad you found it! We have also suggested a few fixes at the end of the configuration section of our docs, in the KNIME Analytics Platform User Guide.
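The following is a minimal Python sketch of a launcher that applies the jemalloc workaround to a batch run like the one in this thread. It rests on three assumptions. First, the library sits at the Ubuntu x86_64 path where the libjemalloc2 package installs it. Second, it is loaded through `LD_PRELOAD`. Third, the standard KNIME batch application options apply. The function name `batch_command` and the example paths are hypothetical, and the User Guide mentioned above remains the authoritative source for the recommended settings.

```python
import os

# Assumption: path installed by Ubuntu's libjemalloc2 package on x86_64;
# other architectures use a different multiarch directory.
JEMALLOC = "/usr/lib/x86_64-linux-gnu/libjemalloc.so.2"


def batch_command(knime_exe, workflow_dir, xmx="22G", jemalloc=JEMALLOC):
    """Build (argv, env) for a KNIME batch run with jemalloc preloaded.

    Nothing is started here; pass the result to subprocess.run(argv, env=env).
    """
    env = dict(os.environ)
    existing = env.get("LD_PRELOAD", "")
    # Put jemalloc first so the JVM's native malloc calls are served by it
    # instead of glibc malloc; keep any preload that was already configured.
    env["LD_PRELOAD"] = f"{jemalloc}:{existing}" if existing else jemalloc
    argv = [
        knime_exe,
        "-nosplash",
        "-consoleLog",
        "-application", "org.knime.product.KNIME_BATCH_APPLICATION",
        f"-workflowDir={workflow_dir}",
        # -vmargs must come last: everything after it is passed to the JVM.
        "-vmargs", f"-Xmx{xmx}",
    ]
    return argv, env
```

In a Docker image, the same effect is more commonly achieved by setting `LD_PRELOAD` as an environment variable of the container. The sketch only makes the mechanism explicit.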
Regarding the "Memory was leaked by query" issues: could you give us a longer stack trace when this problem occurs? That would be really helpful. The real bug has probably happened earlier and left the allocator in an illegal state, which then surfaces as this error.
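The reply above says the real bug probably happened earlier than the error. A toy model shows why. An allocator that tracks outstanding bytes only checks them when it is closed, so a buffer that some earlier code path forgot to release is reported much later, at close time. That matches the `BaseAllocator.close` frame at the top of the stack trace posted later in this thread. The Python below is a hypothetical, heavily simplified model of that bookkeeping. It is not Arrow's implementation, and `ToyAllocator`, `ToyBuffer` and `IllegalStateError` are made-up names.

```python
class IllegalStateError(RuntimeError):
    """Stand-in for Java's IllegalStateException."""


class ToyBuffer:
    def __init__(self, allocator, size):
        self._allocator, self.size, self._open = allocator, size, True

    def close(self):
        # Releasing twice must not decrement the counter twice.
        if self._open:
            self._open = False
            self._allocator._release(self.size)


class ToyAllocator:
    """Counts bytes handed out; a forgotten release is only noticed in close()."""

    def __init__(self, name, limit):
        self.name, self.limit = name, limit
        self.actual = 0  # bytes currently handed out
        self.peak = 0    # highest value of `actual` seen so far

    def allocate(self, size):
        if self.actual + size > self.limit:
            raise MemoryError(f"{self.name}: {size} bytes would exceed limit {self.limit}")
        self.actual += size
        self.peak = max(self.peak, self.actual)
        return ToyBuffer(self, size)

    def _release(self, size):
        self.actual -= size

    def close(self):
        if self.actual != 0:
            # Reservations are not modelled, so "res" is always 0 here.
            raise IllegalStateError(
                f"Memory was leaked by query. Memory leaked: ({self.actual})\n"
                f"Allocator({self.name}) 0/{self.actual}/{self.peak}/{self.limit} "
                "(res/actual/peak/limit)"
            )
```

In this model, the code that forgot to call `close()` on a buffer never fails itself. Only the later `close()` on the allocator does. For that reason, the lines logged before the error are more informative than the error itself.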
Now I got the following error running another workflow in the same 5.1 environment, but without the columnar backend (the columnar backend cannot run this workflow because of a different 5.1 bug):
WARN KNIME-Worker-57-POST Request 3:1368:1 Node No row key column selected generate a new one
[3334.837s][warning][os,thread] Attempt to protect stack guard pages failed (0x00007fa4db300000-0x00007fa4db304000).
A fatal error has been detected by the Java Runtime Environment:
Native memory allocation (mprotect) failed to protect 16384 bytes for memory to guard stack pages
An error report file with more information is saved as:
/app/hs_err_pid67.log
[3335.137s][warning][os,thread] Attempt to protect stack guard pages failed (0x00007fa4db200000-0x00007fa4db204000).
[thread 101801 also had an error]
If you would like to submit a bug report, please visit:
It's running with -Xmx22G. The system has 64 GB of RAM, and nothing else is running on it at the same time.
However, it is running KNIME in a Docker container (which it also did without problems prior to 5.1).
This workflow loops through 115k records that are sent in batches of just below 1000 to an external REST API using the POST Request node. The crash seems to happen about a third of the way through.
The memory usage reported for the Docker container seems to be way below the system capacity, but it is increasing slowly as it loops through.
It appears that the loop both leaks memory and processes.
Below is the output from docker stats just before it crashed. Memory usage grew from about 20 GB before it started the loop that calls the POST Request node to 37 GB, and there seems to be one active process for each row in the table sent to the POST Request node.
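The PIDS column of `docker stats` counts the processes and threads in the container. Inside the container, JVM threads appear in the `Threads:` line of the Java process's `/proc/<pid>/status`. The following is a hedged, Linux-only Python sketch for sampling that number over time while the loop runs. Finding the KNIME JVM's pid (for example with `pgrep java`) is left to the reader, and the function names are hypothetical.

```python
import os
import time


def thread_count(pid):
    """Number of threads of `pid`, read from /proc/<pid>/status (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("Threads:"):
                return int(line.split()[1])
    raise ValueError(f"no Threads: line for pid {pid}")


def sample_threads(pid, samples, interval_s, sleep=time.sleep):
    """Take `samples` readings, `interval_s` seconds apart; returns a list of ints."""
    counts = []
    for i in range(samples):
        if i:
            sleep(interval_s)
        counts.append(thread_count(pid))
    return counts
```

A steadily rising count that tracks the number of rows sent, rather than levelling off, would be consistent with the one-thread-per-request leak described in the next reply.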
I have a copy of the log file now, but it's too large to upload here uncompressed (8 MB), and it's not possible to upload a compressed version. Are there any other channels I can use to send this file?
Thank you for the detailed report! Since you mention that processes are leaked and you use the POST Request node in a loop, I believe your problem is triggered by a bug in the underlying HTTP library we use (Apache CXF, where each request we make leaks one thread…).
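A one-thread-per-request leak grows with the number of requests sent, not with the heap size. That is why -Xmx does not help. A hypothetical Python simulation makes the difference concrete. `LeakyClient` starts a helper thread per request that never finishes. `PooledClient` reuses a bounded pool. Neither class is CXF code, and both names are made up for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class LeakyClient:
    """Starts one helper thread per request and never stops it (the leak)."""

    def __init__(self):
        # Only used so the demo can release the leaked threads afterwards.
        self._stop = threading.Event()

    def post(self, payload):
        threading.Thread(target=self._stop.wait, daemon=True).start()
        return len(payload)

    def close(self):
        self._stop.set()


class PooledClient:
    """Runs every request on a bounded, reused pool of worker threads."""

    def __init__(self, workers=4):
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def post(self, payload):
        return self._pool.submit(len, payload).result()

    def close(self):
        self._pool.shutdown(wait=True)


def threads_after(client, n_requests):
    """How many extra live threads remain after sending n_requests requests."""
    before = threading.active_count()
    for _ in range(n_requests):
        client.post(b"row")
    grown = threading.active_count() - before
    client.close()
    return grown
```

With the leaky client, the thread count grows by exactly one per request. With the pooled client, it never exceeds the pool size, however many rows are sent.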
We are currently working on getting a fix ready on our end, to be released with 5.1.1 (likely released later this month).
Thanks for the reply. It seems like this may be the problem with the workflow that uses the POST Request node. It's actually a very critical situation for us, as we have trouble downgrading KNIME at this point. On macOS it fails even earlier: after around 5,000 POST requests it just freezes. On Ubuntu it seems able to process a lot more, and it actually returns with an error. Is there any way to increase the maximum number of processes on Ubuntu so that we can live with this until your fix is ready?
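Raising limits only buys time until the patched CXF is in place. As a hedged sketch, the Python snippet below inspects the relevant Linux limits and raises the per-user process soft limit to the hard limit for a launcher process. How these limits relate to the JVM crash above is an assumption, based on four points:

- On Linux, each Java thread is a kernel task and counts against `RLIMIT_NPROC` (`ulimit -u`) and `kernel.threads-max`.
- Each thread stack also adds memory mappings, so `vm.max_map_count` may be what the failing `mprotect` call runs into.
- A container started with Docker's `--pids-limit` has its own cap, visible in `pids.max`.
- The cgroup path used below assumes cgroup v2.

```python
import resource


def nproc_limits():
    """(soft, hard) RLIMIT_NPROC of this process; on Linux it also caps threads."""
    return resource.getrlimit(resource.RLIMIT_NPROC)


def raise_nproc_soft_limit():
    """Raise the soft limit up to the hard limit (allowed without root).

    Affects only this process and children started afterwards, e.g. a KNIME
    batch run launched from here with subprocess.
    """
    soft, hard = nproc_limits()
    if soft != hard:
        resource.setrlimit(resource.RLIMIT_NPROC, (hard, hard))
    return nproc_limits()


def read_int_file(path):
    """First integer in a /proc or /sys file; None if missing or not numeric."""
    try:
        with open(path) as f:
            return int(f.read().split()[0])
    except (OSError, ValueError, IndexError):
        return None  # e.g. not Linux, or the file contains "max"


if __name__ == "__main__":
    print("RLIMIT_NPROC (soft, hard):", raise_nproc_soft_limit())
    print("kernel.threads-max:", read_int_file("/proc/sys/kernel/threads-max"))
    print("vm.max_map_count:", read_int_file("/proc/sys/vm/max_map_count"))
    print("pids.max (cgroup v2):", read_int_file("/sys/fs/cgroup/pids.max"))
```

System-wide values such as `kernel.threads-max` and `vm.max_map_count` are changed with `sysctl` on the host, not from inside an unprivileged container.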
Regarding the original post in this thread: it seems to be unrelated to the POST Request bug, as the "Memory was leaked by query" error happens occasionally in workflows that do not make any API calls. I'll come back with more log information when I've been able to rerun that workflow.
Regarding the POST-related problem, we should have a fix on the 5.1 bugfix nightly update site sometime tomorrow. You could use it (maybe in a fresh installation and workspace, just to be careful) to test your workflow. Otherwise, we also have a patched jar (CXF 4.0.1) that you could swap in for the one in the release build.
(Also, a PR is already open to fix the root cause, CXF bug 8885; see the linked PR.)
Regarding your original problem, I would have to hand back to @carstenhaubold.
If your log does not contain sensitive information, you could upload it to your public Community Hub space or something like Google Drive.
This sounds promising, but how do I get hold of the patched jar, and how do I install it? The only distribution I can find on Apache's pages is 4.0.1. I can't tell whether it is patched, and the files are named differently from the jar files in the KNIME installation.
In order to create the patched jar file, you’d have to (at your own risk) do the steps below. From the screenshot you shared, it looks like your workflow is writing important data. I would test the steps below with a fresh ZIP installation of KNIME Analytics Platform and a fresh workspace, using some sample workflow that does many REST node requests in a loop.
But I really recommend waiting for the official bugfix release (or testing with the bugfix-nightly update site), where we’ll have a fix for the problem distributed through the official channel…
That being said, at your own risk (with Maven installed, using the command line on Linux/macOS):
Thank you, hotzm, for the detailed instructions. I managed to follow them and got it to work on my Mac. I haven't updated production, but I was able to update the customer data, which was urgently needed. Thanks.
It worked, but a bit slowly, and the number of processes grows quite a lot, although they are released after a while. Adding some extra wait time and sending fewer updates in each batch helped get it through on the Mac, which has a lower process limit than our Ubuntu production environment.
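The workaround described above (smaller batches plus a pause between them) can be sketched in Python as follows. It is a minimal illustration, and `send_batch` is a hypothetical callback standing in for whatever actually sends a batch. In the KNIME workflow itself, the same idea corresponds to a smaller chunk size in the loop and a wait step per iteration. The sleep function is injectable so the logic can be tested without waiting.

```python
import time
from typing import Callable, Iterator, Sequence, TypeVar

T = TypeVar("T")


def batches(rows: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` rows."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def send_throttled(
    rows: Sequence[T],
    size: int,
    pause_s: float,
    send_batch: Callable[[Sequence[T]], object],
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send rows in batches of `size`, pausing `pause_s` seconds between batches.

    The pause is meant to give the client time to release its threads, which
    the poster observed happens "after a while". Returns the number of batches sent.
    """
    sent = 0
    for i, batch in enumerate(batches(rows, size)):
        if i:
            sleep(pause_s)
        send_batch(batch)
        sent += 1
    return sent
```

For example, 115,000 rows in batches of 999 give 116 batches, the last one containing 115 rows. Halving the batch size roughly doubles the number of pauses, which trades total run time for a lower peak number of live threads.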
Regarding the memory leak issue, it seems to happen only occasionally, even when running the same job on mostly the same data. It completed OK during tonight's normal production run, but when I tried again just now (with the same data and nothing else running on the system), it failed and produced the log output shown below. Interestingly, it produces a few ordinary-looking warnings after the error (KNIME produces far too many strange warnings in its log files). However, KNIME stops processing anything further and is killed by one of our maintenance processes after 30 minutes of inactivity. It is running in batch mode in a Docker container. The system has 64 GB of RAM, and -Xmx22G is in use. We did not see anything like this on 4.7.
CompileCommand: exclude javax/swing/text/GlyphView.getBreakSpot bool exclude = true
ERROR StatusLogger Log4j2 could not find a logging implementation. Please add log4j-core to the classpath. Using SimpleLogger to log to the console…
WARN KNIME-Worker-2-Table Creator 3:713 Node Node created an empty data table.
WARN KNIME-Worker-0-Parquet Reader 3:1074 Node Not authenticated
WARN KNIME-Worker-1-Table Row To Variable 3:850 Node Not authenticated
WARN KNIME-Worker-0-Microsoft Authenticator 3:1071:705 Node No connection available. Execute the connector node first.
WARN KNIME-Worker-0-Microsoft Authenticator 3:1071:705 Node No connection available. Execute the connector node first.
WARN KNIME-Worker-0-Microsoft Authenticator 3:1071:705 Node No connection available. Execute the connector node first.
WARN KNIME-Worker-0-Microsoft Authenticator 3:1071:705 Node No connection available. Execute the connector node first.
WARN KNIME-Worker-0-Microsoft Authenticator 3:1071:705 Node No connection available. Execute the connector node first.
WARN KNIME-Worker-0-Microsoft Authenticator 3:1071:705 Node No connection available. Execute the connector node first.
WARN KNIME-Worker-0-Microsoft Authenticator 3:1071:705 Node No connection available. Execute the connector node first.
WARN KNIME-Worker-0-Microsoft Authenticator 3:1071:705 Node No connection available. Execute the connector node first.
WARN KNIME-Worker-0-Microsoft Authenticator 3:1071:705 Node No connection available. Execute the connector node first.
WARN KNIME-Worker-0-Microsoft Authenticator 3:1071:705 Node No connection available. Execute the connector node first.
WARN KNIME-Worker-0-Microsoft Authenticator 3:1071:705 Node No connection available. Execute the connector node first.
WARN KNIME-Worker-0-Microsoft Authenticator 3:1071:705 Node No connection available. Execute the connector node first.
WARN KNIME-Worker-1-String Manipulation (Variable) 3:1071:984 Node No connection available. Execute the connector node first.
WARN KNIME-Worker-1-String Manipulation (Variable) 3:1071:984 Node No connection available. Execute the connector node first.
WARN KNIME-Worker-1-String Manipulation (Variable) 3:1071:984 Node No connection available. Execute the connector node first.
WARN KNIME-Worker-1-String Manipulation (Variable) 3:1071:984 Node No connection available. Execute the connector node first.
WARN KNIME-Worker-1-String Manipulation (Variable) 3:1071:984 Node No connection available. Execute the connector node first.
WARN KNIME-Worker-1-String Manipulation (Variable) 3:1071:984 Node No connection available. Execute the connector node first.
WARN KNIME-Worker-1-String Manipulation (Variable) 3:1071:984 Node No connection available. Execute the connector node first.
WARN KNIME-Worker-1-String Manipulation (Variable) 3:1071:984 Node No connection available. Execute the connector node first.
WARN KNIME-Worker-1-String Manipulation (Variable) 3:1071:984 Node No connection available. Execute the connector node first.
WARN KNIME-Worker-1-String Manipulation (Variable) 3:1071:984 Node No connection available. Execute the connector node first.
WARN KNIME-Worker-1-String Manipulation (Variable) 3:1071:984 Node No connection available. Execute the connector node first.
WARN KNIME-Worker-1-String Manipulation (Variable) 3:1071:984 Node No connection available. Execute the connector node first.
WARN KNIME-Worker-7-Parquet Reader 3:666:704 Node The selected columns have different type. Using string representation for comparison.
WARN KNIME-Worker-10-Concatenate 3:712 Node No grouping column included. Aggregate complete table.
WARN KNIME-Worker-8-Date&Time Shift 3:696:0:392:267 Node No grouping column included. Aggregate complete table.
WARN KNIME-Worker-1-Date&Time Shift 3:696:0:392:170 Node No grouping column included. Aggregate complete table.
WARN KNIME-Worker-10-Date&Time Shift 3:696:0:392:392 Node No grouping column included. Aggregate complete table.
WARN KNIME-Worker-12-Date&Time Shift 3:696:0:392:171 Node No grouping column included. Aggregate complete table.
WARN KNIME-Worker-6-Date&Time to String 3:696:0:392:172 Node No grouping column included. Aggregate complete table.
WARN KNIME-Worker-11-Date&Time to String 3:696:0:392:175 Node No grouping column included. Aggregate complete table.
WARN KNIME-Worker-0-Partitioning 3:696:0:647 Node No grouping column included. Aggregate complete table.
WARN KNIME-Worker-3-String Manipulation 3:696:0:630:531 ColumnCalculator Row “Row3” contains missing value in column “Vervsted id” - returning missing (omitting further warnings)
WARN KNIME-Worker-8-String Manipulation 3:696:0:662:531 ColumnCalculator Row “Row5” contains missing value in column “Vervsted id” - returning missing (omitting further warnings)
WARN KNIME-Worker-9-String Manipulation 3:696:0:630:531 ColumnCalculator Row “Row3” contains missing value in column “Vervsted id” - returning missing (omitting further warnings)
WARN KNIME-Worker-0-String Manipulation 3:696:0:662:531 ColumnCalculator Row “Row5” contains missing value in column “Vervsted id” - returning missing (omitting further warnings)
ERROR KNIME-Worker-0-Loop End 3:696:0:650 LocalNodeExecutionJob Caught “IllegalStateException”: Memory was leaked by query. Memory leaked: (659456)
Allocator(ArrowColumnStore) 0/659456/47060992/2411724800 (res/actual/peak/limit)
java.lang.IllegalStateException: Memory was leaked by query. Memory leaked: (659456)
Allocator(ArrowColumnStore) 0/659456/47060992/2411724800 (res/actual/peak/limit)
at org.apache.arrow.memory.BaseAllocator.close(BaseAllocator.java:476)
at org.knime.core.columnar.arrow.AbstractArrowBatchReadable.close(AbstractArrowBatchReadable.java:100)
at org.knime.core.columnar.arrow.ArrowBatchStore.close(ArrowBatchStore.java:113)
at org.knime.core.columnar.cache.data.ReadDataCache.close(ReadDataCache.java:345)
at org.knime.core.columnar.cache.writable.BatchWritableCache.close(BatchWritableCache.java:292)
at org.knime.core.columnar.data.dictencoding.DictEncodedBatchWritableReadable.close(DictEncodedBatchWritableReadable.java:108)
at org.knime.core.columnar.cache.object.ObjectCache.close(ObjectCache.java:197)
at org.knime.core.data.columnar.table.WrappedBatchStore.close(WrappedBatchStore.java:222)
at org.knime.core.data.columnar.table.DefaultColumnarBatchStore.close(DefaultColumnarBatchStore.java:378)
at org.knime.core.data.columnar.table.ColumnarRowReadTable.close(ColumnarRowReadTable.java:207)
at org.knime.core.data.columnar.table.AbstractColumnarContainerTable.clear(AbstractColumnarContainerTable.java:218)
at org.knime.core.node.BufferedDataTable.clearSingle(BufferedDataTable.java:972)
at org.knime.core.node.Node.disposeTables(Node.java:1709)
at org.knime.core.node.Node.cleanOutPorts(Node.java:1673)
at org.knime.core.node.workflow.NativeNodeContainer.cleanOutPorts(NativeNodeContainer.java:624)
at org.knime.core.node.workflow.NativeNodeContainer.performReset(NativeNodeContainer.java:618)
at org.knime.core.node.workflow.SingleNodeContainer.rawReset(SingleNodeContainer.java:501)
at org.knime.core.node.workflow.WorkflowManager.invokeResetOnSingleNodeContainer(WorkflowManager.java:5124)
at org.knime.core.node.workflow.WorkflowManager.resetNodesInWFMConnectedToInPorts(WorkflowManager.java:2776)
at org.knime.core.node.workflow.WorkflowManager.resetNodesInWFMConnectedToInPorts(WorkflowManager.java:2779)
at org.knime.core.node.workflow.WorkflowManager.restartLoop(WorkflowManager.java:3745)
at org.knime.core.node.workflow.WorkflowManager.doAfterExecution(WorkflowManager.java:3616)
at org.knime.core.node.workflow.NodeContainer.notifyParentExecuteFinished(NodeContainer.java:689)
at org.knime.core.node.workflow.NodeExecutionJob.internalRun(NodeExecutionJob.java:238)
at org.knime.core.node.workflow.NodeExecutionJob.run(NodeExecutionJob.java:117)
at org.knime.core.util.ThreadUtils$RunnableWithContextImpl.runWithContext(ThreadUtils.java:367)
at org.knime.core.util.ThreadUtils$RunnableWithContext.run(ThreadUtils.java:221)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
at org.knime.core.util.ThreadPool$MyFuture.run(ThreadPool.java:123)
at org.knime.core.util.ThreadPool$Worker.run(ThreadPool.java:246)
WARN KNIME-Worker-5-String Manipulation 3:696:0:630:531 ColumnCalculator Row “Row3” contains missing value in column “Vervsted id” - returning missing (omitting further warnings)
WARN KNIME-Worker-9-String Manipulation 3:696:0:630:531 ColumnCalculator Row “Row3” contains missing value in column “Vervsted id” - returning missing (omitting further warnings)
WARN KNIME-Worker-12-String Manipulation 3:696:0:630:531 ColumnCalculator Row “Row3” contains missing value in column “Vervsted id” - returning missing (omitting further warnings)
You could try incorporating this node into the loop temporarily until there is a fix. I'm not sure it will help in your case, but I have used it before to reduce the buildup of unreleased memory in loops.
We have made our own Docker image and simple scheduling system to enable us to run our KNIME production workflows in batch mode, isolating them as much as possible from each other while also allowing us to set up VPN connections to our customers’ environments to access data.
I don't know whether any publicly available, up-to-date Docker images with KNIME exist.