Dear KNIMEler,
I’m getting memory errors from a Loop End node. The message is: LocalNodeExecutionJob : Loop End : 7:1908:1903:0:1423:1689:1683:1216 : Caught “IllegalStateException”: Memory was leaked by query. Memory leaked: (1048576) Allocator(ArrowColumnStore) 0/1048576/1179648/9223372036854775807 (res/actual/peak/limit)
See the attached log for the stack trace.
This crash happened on a Linux machine with KNIME 4.3.1 using the ‘Columnar Backend’ (Labs).
My observation is that this crash happens (only?) in the context of ‘wide’ tables. I have a WF in place that loops over each column Ci of a table, computing some statistics and adapting the values of column Ci accordingly. In other words, I’m transforming a table A (m columns, n rows) into a table B (also m columns, n rows) column-wise with a Column List Loop, and using the values of B for an outlier score (it’s an HBOS implementation built as a KNIME component). This component itself is also executed within a loop, and that is where the error occurs.
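For anyone not familiar with HBOS, here is a minimal sketch of the column-wise idea in Python/numpy. It only illustrates the principle; the function name hbos_scores, the n_bins parameter and the static-width histograms are my simplifications, not the actual component’s settings.

```python
import numpy as np

def hbos_scores(table: np.ndarray, n_bins: int = 10) -> np.ndarray:
    """Histogram-Based Outlier Score: one histogram per column C_i,
    summed log-inverse densities per row."""
    n_rows, n_cols = table.shape
    scores = np.zeros(n_rows)
    for i in range(n_cols):                       # loop over each column C_i
        col = table[:, i]
        hist, edges = np.histogram(col, bins=n_bins, density=True)
        hist = np.clip(hist, 1e-12, None)         # avoid log(0) for empty bins
        # map every value to its bin and add the column's contribution
        idx = np.clip(np.digitize(col, edges[1:-1]), 0, n_bins - 1)
        scores += np.log(1.0 / hist[idx])
    return scores

# Example with the same shape as the wide test table: 2200 rows, 1556 columns
X = np.random.rand(2200, 1556)
print(hbos_scores(X)[:5])
```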
I have a test setup of 17 outlier data sets: 16 have fewer than 100 columns, one has 1556! Looping over these data sets to evaluate the performance of the implemented outlier detection (HBOS) gets stuck at the wide table!
Erich, I’m interested in outlier detection methodology. Could you please provide a link to the component? Also, if possible, upgrade to KNIME 4.5.1. It has a number of changes to the columnar store.
Hi izaychik63,
once I’m finished I can share my outlier components. In the meantime I can recommend the Isolation Forest (H2O extension) or blending in external packages/tools such as PyOD or ELKI.
Regarding the version: unfortunately, on my (much stronger) ‘evaluation’ machine, where this error first happened, I don’t have the permissions to do so.
On my ‘private’ machines I have 4.5.1 installed, and I have just started my investigations into this bug! Although it is too early for any conclusions, I see constant growth of the heap size while looping, which cannot even be reduced by activating the garbage collector …
Best
Erich
Use a Cache node right in front of the operation and maybe force KNIME to write the result to disk.
Try to revert to the traditional internal storage and maybe tweak the settings by going for gzip instead of snappy. Arrow was still in Labs before KNIME 4.5, and over time (although it is a great format) the formats (namely Parquet) were not that stable (hopefully, with the new columnar storage out of Labs status, that might change).
If possible, you could think about doing the work in chunks or splitting it into several workflows (not very elegant or popular, I know).
And if it is about transferring data from and to Python (besides what I just said about Parquet :-)), I have used the generic format and Reader and Writer nodes to transfer data between KNIME, R and Python without using the data connection (old version, new version). SQLite is also an option; see the sketch below.
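A rough sketch of the SQLite hand-over idea, assuming pandas on the Python side; the file name exchange.sqlite and the table name outlier_scores are made up for illustration and are not part of any KNIME node configuration:

```python
import sqlite3
import pandas as pd

db_path = "exchange.sqlite"          # a file both KNIME and Python can reach

# Python side: write the intermediate result so KNIME (or R) can pick it up
df = pd.DataFrame({"id": [1, 2, 3], "score": [0.1, 0.7, 0.4]})
con = sqlite3.connect(db_path)
df.to_sql("outlier_scores", con, if_exists="replace", index=False)
con.close()

# Later (or in another process): read it back instead of passing the table
# through the in-memory data connection
con = sqlite3.connect(db_path)
restored = pd.read_sql("SELECT * FROM outlier_scores", con)
con.close()
print(restored)
```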
Thx @mlauber71 for your suggestions, but I think I found the problem: Columnar Backend!
After some tests, I’m quite sure that the problem is caused by, or at least related to, the Columnar Backend. It seems that a memory leak exists, in version 4.3.1 as well as in 4.5.1, on Windows and Linux!
My test setup:
I installed my WF (see above) with the same test data (a table with 1556 columns and 2200 rows) on
1. a Linux machine with KNIME 4.3.1, with 16 GB of memory assigned to KNIME
2. a Windows 10 machine with KNIME 4.5.1, with 24 GB of memory assigned to KNIME
On both machines I performed tests with the ‘Columnar Backend’ configured ON and OFF.
Here are my observations.
Regarding 1, the Linux machine:
Running my WF with Table Backend = Default, the WF succeeded without any problems. Re-running the WF with Table Backend = Columnar Storage (Labs) results in the above-mentioned “Memory was leaked by query” error.
Regarding 2, the Windows 10 machine:
Running my WF on Windows 10 with Table Backend = Default, the WF also succeeded without any problems. The heap status of KNIME (fluctuating slightly) always stayed below 5 GB! Activating the garbage collector reduced the heap.
Re-running the WF with Table Backend = Columnar Backend showed a constantly increasing heap size, and activating the GC had no impact! Starting at a 2 GB heap size, I reached my memory limit of 24 GB after a few hundred loop cycles, resulting in a ‘frozen’ KNIME consuming 100% CPU, so I had to kill KNIME.
However, I have found my workaround: NOT using the Columnar Backend. Too bad, as it could have helped me a lot!
Hi Erich,
Thanks so much for the detailed investigation. That does indeed sound strange. We will have a look at it!
Would you be able to share the workflow with us so that we can investigate the problem with the exact same workflow as you have? If you can’t share it publicly but can share it privately, we can also do this via email.
In general, the Columnar Backend involves multiple caches, which explains why the memory usage over time will differ from what you would see with the default backend. More details can be found here if you’re interested: Inside KNIME Labs: A New Table Backend for Improved Performance | KNIME
However, obviously, this should not cause memory leaks.
Thanks for the workflow. We have reproduced and identified the problem and created a ticket internally, so we will try to fix it for the next bugfix release of KNIME, i.e. 4.5.2.
In a nutshell, we keep track of some objects so that we can close them properly (i.e. free their memory) before they get garbage collected. However, in your nested loops, the number of objects we keep track of accumulates to such an extent that we run out of heap space. We will have to reduce the number of objects created and improve the tracking of closeable objects.
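To illustrate the mechanism, here is a simplified, hypothetical sketch in Python; it is not our actual (Java) implementation, and all names in it are made up:

```python
# Hypothetical illustration only: a tracker that keeps strong references to
# closeable resources so they can be closed explicitly before garbage
# collection. If entries are not released between iterations, nested loops
# make the tracker itself grow without bound.

class Resource:
    def __init__(self, size):
        self.buffer = bytearray(size)   # stands in for an off-heap Arrow buffer
    def close(self):
        self.buffer = None              # "frees" the memory

class CloseableTracker:
    def __init__(self):
        self._open = []                 # strong references accumulate here
    def register(self, resource):
        self._open.append(resource)
        return resource
    def close_all(self):
        for resource in self._open:
            resource.close()
        self._open.clear()

tracker = CloseableTracker()

# Nested loops, as in the reported workflow: every iteration registers a new
# resource, but nothing is released until close_all() at the very end, so the
# peak memory grows with the total number of iterations.
for data_set in range(50):               # outer loop over data sets / components
    for column in range(100):            # inner Column List Loop over columns
        tracker.register(Resource(10_000))

print(len(tracker._open), "resources still tracked")   # 5000 and growing
tracker.close_all()                       # too late: the peak was already reached
```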
Unfortunately that means you will have to use the “default” backend for now, but we’ll let you know as soon as we’ve released a fix for this problem in the “columnar” backend!
And regarding KNIME 4.3.1: At that time the Columnar Backend was still in “Labs”. It has changed and improved significantly in the meantime and thus should work much better. Actually, the tracking of closeable objects mentioned above is in place precisely to prevent the memory leak detected by the Arrow Allocator. Unfortunately, the number of references grew too large in your workflow.