Memory/Heap space issues in version 4+

gpamukov · September 20, 2019, 9:49am

KNIME V4.0.1
Windows 10, 64 bit
16G RAM
-Xmx10240m

Hello folks,

Not sure if this is specific to me or to the particular dataset that I’m processing, but I’m experiencing severe heap/memory-related issues in some nodes, such as Column Appender, Column List Loop and even Round Double. It feels like a memory leak: memory is not released, and after some time or a couple of iterations it results in an out-of-memory exception. I had to revert to 3.7, and it works without issues now.
This is just FYI - would appreciate if you advise on workarounds as well.

Thanks and BR,

wiswedel · September 20, 2019, 12:32pm

Whoa. Sorry to hear that. Can you create a simple workflow to replicate the problem? Do you see the memory usage slowly growing up to the limit (using KNIME’s memory bar) or does it happen spontaneously?

Is the node in question the only node executing at that time, or are there parallel branches executing as well? If you think you can replicate the problem: Does it help to pause the execution before that node for a brief amount of time? (Not that I suggest doing that always, but it might help diagnose the problem further.)

Any further insights appreciated!

Thanks,
Bernd

gpamukov · September 20, 2019, 4:03pm

Hey thank you for the quick response! Will try my best to create reproducible flow that doesn’t require all that data involved. Will send it over to you if I manage.

Cheers!

gpamukov · September 20, 2019, 4:07pm

And on your questions - no other processes running. Slowly building up. Only thing that helped is processing on small pieces and restarting KNIME after each batch. Same thing works just fine in 3.7, same ini settings, and in single chunk.

wiswedel · September 25, 2019, 5:51pm

Any news here?

Thanks,
Bernd

PS: At risk that I keep you from reproducing the problem and thus helping us to solve it: There is a FAQ that describes how to gradually revert KNIME memory policy to what it was in 3.7.x. (Then again, if you can give details on how to reproduce this will be appreciated a lot!!!)

gpamukov · September 26, 2019, 7:23pm

Hey Bernd,

Yes, I saw this, and I will still try to reproduce the issue to help you folks. What you are doing is great and I really want to support it. I will be able to spend some time on this on the weekend. If I don’t manage with a dummy workflow, I will be able to share the original one (that caused the issue) with you after my Kaggle competition finishes (private code sharing is forbidden). That will be next week.

Cheers

gpamukov · October 7, 2019, 8:58am

Hey Bernd,

Hope this finds you well.
Managed to reproduce it - and prepared workflow to simulate it.
Basically what I noticed is that in the initial steps memory is consumed and not released (will send you screenshot) - then I have component (Deeplearning4j learner) that uses more memory - and it crashes there.
If the editor is restarted to release memory, the learner finishes successfully. BUT later, in the prediction phase (a column loop that embeds prediction of word vectors for each column), memory utilization builds up again and results in a memory error after a couple of iterations. The only way to release the memory is to reset the GUI. The same functionality works in one pass without any issues in 3.7. Please let me know the best way to send you the exported workflow, the ini file and the screenshots.

Thanks and BR,
Georgi Pamukov

Mark_Ortmann · October 8, 2019, 9:57am

@gpamukov

great that you were able to reproduce the problem and I hope your Kaggle competition went well!

Either you just post your workflow etc. here, or you send a private message to Bernd or me - whatever suits you best (sent you a PM).

Looking forward to hearing from you
Mark

P.S. Could you share your knime.ini settings with us?

gpamukov · October 9, 2019, 12:47pm

Hey - sent as PM.

Thanks and BR,
G.Pamukov

marc-bux · October 17, 2019, 1:07pm

Hi Georgi,

I think what you are observing is not a memory leak. Instead, it is a consequence of a new table caching strategy introduced in KNIME Analytics Platform 4.0.0. This strategy attempts to keep the k most recently used tables in memory (an LRU cache) until some critical heap space allocation threshold is reached. By default, k is 32 and the critical memory threshold is 90% of the heap space available to KNIME minus 128 MB. Note that tables held in memory that way are asynchronously written to disk in the background such that the memory they block can be released when said threshold is reached.
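
To make the strategy described above a bit more concrete, here is a minimal Python sketch of a bounded table cache that is emptied on a memory alert. It is not KNIME's implementation: the names `TableCache`, `on_memory_alert` and `critical_threshold_mb` are hypothetical, and reading the threshold as "90% of the heap, then minus 128 MB" is an assumption. Only k = 32 and the 90% / 128 MB figures come from the post.

```python
from collections import OrderedDict

K_TABLES = 32          # default k mentioned in the post
HEAP_FRACTION = 0.90   # critical fraction of the heap mentioned in the post
RESERVE_MB = 128       # fixed reserve mentioned in the post


def critical_threshold_mb(max_heap_mb):
    """One possible reading of '90% of the heap minus 128 MB' (assumption)."""
    return HEAP_FRACTION * max_heap_mb - RESERVE_MB


class TableCache:
    """Hypothetical cache that keeps at most k recently used tables in memory."""

    def __init__(self, k=K_TABLES):
        self.k = k
        self._tables = OrderedDict()  # table id -> size in MB, oldest first

    def put(self, table_id, size_mb):
        # (Re)inserting a table marks it as the most recently used one.
        self._tables.pop(table_id, None)
        self._tables[table_id] = size_mb
        # Evict the least recently used tables beyond capacity k.
        while len(self._tables) > self.k:
            self._tables.popitem(last=False)

    def cached_mb(self):
        return sum(self._tables.values())

    def on_memory_alert(self, used_heap_mb, max_heap_mb):
        # Release everything once heap usage crosses the critical threshold.
        # Assumes the tables were already written to disk in the background.
        if used_heap_mb >= critical_threshold_mb(max_heap_mb):
            self._tables.clear()
            return True
        return False
```

Under this reading, a 10240 MB heap like the one in the first post would put the critical threshold at roughly 9088 MB, so the cache can hold on to a lot of memory before anything is released.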

What I think happens when you run out of memory is the following:

The workflow runs smoothly for some time, tables are created and cached in memory. Memory consumption rises and tables are asynchronously written to disk in the background.
Some memory-intensive node (Deeplearning4j learner maybe?) attempts to allocate some large amount of memory, which, generally, KNIME nodes shouldn’t / won’t do without providing some kind of fallback on memory low conditions. If this happens at some point in time where cached tables cannot be released from memory, for instance due to the asynchronous background writers lagging behind, you can, sadly, run into an OutOfMemoryError.

To resolve the issue, you can switch to a less memory-consuming table caching strategy by putting the line -Dknime.table.cache=SMALL into your knime.ini. This way, only very small tables will be held in memory. It will make your average KNIME workflow slower, but it’ll be less memory-consuming.
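
If you prefer to script the change, the following is a small Python sketch that adds the option to the text of a knime.ini file. The function name `add_table_cache_option` and the sample file content are made up for illustration. It assumes the file already contains a `-vmargs` line, since JVM options such as `-D...` have to come after it.

```python
TABLE_CACHE_OPTION = "-Dknime.table.cache=SMALL"


def add_table_cache_option(ini_text):
    """Return knime.ini content with the SMALL table cache option appended."""
    lines = [ln.strip() for ln in ini_text.splitlines()]
    if "-vmargs" not in lines:
        raise ValueError("no -vmargs line found; JVM options must follow it")
    # Drop any existing table cache setting so the option appears only once.
    lines = [ln for ln in lines if not ln.startswith("-Dknime.table.cache=")]
    # Everything after -vmargs is passed to the JVM, so appending is enough.
    lines.append(TABLE_CACHE_OPTION)
    return "\n".join(lines) + "\n"


# Hypothetical, heavily shortened knime.ini content; real files have more entries.
sample_ini = "-vmargs\n-Xmx10240m\n"
print(add_table_cache_option(sample_ini))
```

KNIME needs to be restarted for a changed knime.ini to take effect.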

In an attempt to verify my assumption, I’ve run the workflow you kindly provided. Here’s what I observed:

After starting up KNIME Analytics Platform 4.0.2 and opening the workflow, I ran a full-sweep garbage collection, upon which 122 MB heap space are blocked.
I ran the workflow. It executed until the Word2Vec Learner Node, which crashed with these two not-so-helpful error messages:
Execute failed: java.lang.ExceptionInInitializerError
Execute failed: The Deeplearning4J Library could not be initialized. Maybe there is not enough memory available for DL4J. Please consider increasing the ‘Off Heap Memory Limit’ in the DL4J Prefernce Page.
Unfortunately, the error messages persisted and did not get more verbose even after increasing the off-heap memory and checking the option to “Enable verbose logging”.
Anyways, at this point I’m pretty deep into the workflow and 7.4 GB of my heap space are occupied. I ran another full-sweep garbage collection, upon which 6.8 GB heap space are still blocked. This is due to the most recently used tables being cached in memory and only released upon memory alert, even though they have probably been written to disk in the background already. Obviously, if I save the workflow at this point and re-open KNIME Analytics Platform, I start out fresh with 122 MB heap space consumption.
However, instead of restarting KNIME Analytics Platform, I added 32 Data Generator nodes that generate 5400 rows of data each. I executed and then reset these nodes to flush KNIME’s table cache. I then did another full-sweep garbage collection and, voila, heap space is at 122 MB again, even though the relevant parts of the workflow are still executed up until the Word2Vec Learner Node.
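
The effect of the 32 filler tables can be illustrated with a toy simulation. This is only a sketch under assumptions: a 32-entry cache that evicts its oldest entry, ten workflow tables of 680 MB each (made up so that they add up to 6.8 GB) and filler tables of 1 MB each (the real size of a 5400-row table is not given above). The function `simulate_flush` is hypothetical.

```python
from collections import OrderedDict


def simulate_flush(big_tables_mb, k=32, filler_count=32, filler_mb=1):
    """Return cached MB before the fill, after the fill and after the reset.

    Table sizes are illustrative values, not measurements.
    """
    cache = OrderedDict()  # table name -> size in MB, oldest first

    def put(name, size_mb):
        cache[name] = size_mb
        cache.move_to_end(name)
        while len(cache) > k:
            cache.popitem(last=False)  # evict the least recently used table

    for i, size in enumerate(big_tables_mb):
        put(f"workflow-{i}", size)
    before = sum(cache.values())

    # Executing the filler nodes pushes the large workflow tables out.
    for i in range(filler_count):
        put(f"filler-{i}", filler_mb)
    after_fill = sum(cache.values())

    # Resetting the filler nodes discards their tables as well.
    for i in range(filler_count):
        cache.pop(f"filler-{i}", None)
    after_reset = sum(cache.values())
    return before, after_fill, after_reset


before, after_fill, after_reset = simulate_flush([680] * 10)
print(before, after_fill, after_reset)
```

In this toy model the cached volume drops from 6800 MB to 32 MB once the fillers have run, and to 0 MB after they are reset, which mirrors the drop in heap usage described above.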

I hope this helps to understand what’s happening. I’ll update this post if anything changes with regard to table caching strategies in KNIME Analytics Platform.

Best,

Marc

DiaAzul · October 20, 2019, 9:33pm

Thanks for the information, that is the missing piece in the jigsaw for me.

I’ve been suffering a lot with KNIME 4.0, in particular problems with heap allocations and excessive amounts of garbage collection as the heap fills up resulting in KNIME slowing to a crawl.

This tends to happen when the tenured regions in G1GC exceed the -XX:InitiatingHeapOccupancyPercent (IHOP) which by default is 45% of the total heap size. Once the tenured region reaches this amount then garbage collection becomes continuous and KNIME effectively freezes.

Data ends up in the tenured regions either because it has passed through Eden and Survivor, or quite frequently because it is a humongous allocation which bypasses Eden and Survivor and goes straight to tenured.

As the tenured region fills up to 45% the garbage collector is running continuously and, because Eden/Survivor doesn’t get to fill the remaining 55%, the heap doesn’t hit 90% occupancy and the k most recently cached tables are not released.

I’ve managed to ameliorate some of this issue by setting IHOP to 75%, which results in Eden/Survivor space starvation; however, I am still not hitting the 90% threshold for releasing the cached tables.

If most of the data is cycling through Eden/Survivor and not getting to Tenured then I am guessing that the heap does hit 90% occupancy and flushes the tables.

You may want to consider flushing the tables against the size of the tenured space. As tenured space will start driving garbage collection as it hits IHOP you may want to trigger a cache release when tenured hits 90% of the IHOP target, rather than the total heap utilisation hits 90%. As cached data is long lasting I would expect the cached data would be in tenured heap space anyway.
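
To put rough numbers on this suggestion, here is a short Python calculation. The 10240 MB heap is borrowed from the first post purely as an example, and the function `gc_thresholds_mb` is hypothetical. The 45%, 75% and 90% figures are the ones discussed in this thread.

```python
def gc_thresholds_mb(max_heap_mb, ihop_percent=45, release_fraction=0.90):
    """Compare a heap-based cache release point with an IHOP-based one."""
    ihop_mb = max_heap_mb * ihop_percent / 100
    return {
        # Tenured occupancy at which concurrent marking is initiated.
        "ihop_mb": ihop_mb,
        # Cache release tied to total heap occupancy.
        "release_at_heap_mb": release_fraction * max_heap_mb,
        # Suggested alternative: release at 90% of the IHOP target.
        "release_at_ihop_mb": release_fraction * ihop_mb,
    }


for ihop in (45, 75):
    print(ihop, gc_thresholds_mb(10240, ihop_percent=ihop))
```

For this example heap, a release point tied to IHOP would sit at about 4147 MB (IHOP 45%) or 6912 MB (IHOP 75%), well below the 9216 MB that 90% of the total heap corresponds to.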

I could have misunderstood everything, but the new G1GC garbage collection and caching strategy is not working.

gpamukov · October 21, 2019, 7:42pm

Thanks Marc! Appreciate it!

marc-bux · October 22, 2019, 12:30pm

Hi @DiaAzul,

First off, thanks a lot for your detailed and very (!) helpful feedback.

Secondly, a small correction on my part: KNIME 4.0 makes soft-referenced data tables available for garbage collection when ~90% of maximum tenured region heap space is allocated. Having said that, it should not make a difference for the G1 garbage collector, since as far as I know G1 allows the tenured region to grow as much as needed until eventually it can take up virtually all heap space available to the JVM.

Nonetheless, I agree that allowing up to 90% of available heap space to be blocked by soft-referenced data tables (which likely end up in tenured space), can put a lot of strain on concurrent garbage collection. We therefore plan to downward-adjust this threshold. I agree that 45% being the default Initiating Heap Occupancy Percent (IHOP) is a strong indicator of 45% or slightly less being a reasonable choice. While we therefore might very well end up with that number, I am not quite convinced that it is the end-all choice. As far as I understand, G1 uses some mechanism called adaptive IHOP where the -XX:InitiatingHeapOccupancyPercent merely serves as an initial value, which is then adjusted at runtime.

In the meantime, let me reiterate that if you are having problems with memory allocation in KNIME 4.0, you can revert to the less memory-consuming table caching strategy of KNIME 3.7 and earlier by putting the line -Dknime.table.cache=SMALL into your knime.ini.

Finally, thanks once again for your input. It does help a lot.

Best,

Marc

DiaAzul · October 22, 2019, 12:56pm

@marc-bux
Thanks for your feedback. Always happy to help.

Starting from the end, -Dknime.table.cache=SMALL makes a huge difference and my models are now running in a predictable manner.

I’ve spent a lot of time (too much time!) staring at heap allocation in VisualVM, and I have not seen G1GC increase the size of tenured heap beyond the IHOP setting. Doesn’t mean that it can’t, just that it doesn’t. What tends to happen is that when the tenured regions reach the IHOP setting, it continually triggers a full garbage collection. This causes all resources to be diverted to GC and everything else stalls - unresponsive but not quite crashing.

I’ve got a stable system now, so happy. Though, it would be nice to get the cached tables back. G1GC has many configuration flags; at some point it would be nice to have a guide for optimising it for KNIME. Not just for IHOP, but also the impact of region size and number of regions in the heap on performance, and some of the thresholds for new allocations - probably won’t affect most people, but may be important for some.

Many thanks
DiaAzul

marc-bux · December 9, 2019, 11:17am

Hi @gpamukov, @DiaAzul,

I just wanted to briefly let you know that the recently released KNIME Analytics Platform 4.1 introduced some improvements to garbage collection of cached tables: Tables are now collected at a lower threshold and can be collected while being iterated over. Consequently, you should notice KNIME free up memory earlier and more reliably.

Let me know if you continue to observe heap space exhaustion in spite of these changes.

Best,

Marc

DiaAzul · December 9, 2019, 11:43am

@marc-bux, many thanks. I’m assuming things will work OK, however, it may take a month or two to uncover problems.

Best regards

gpamukov · December 13, 2019, 5:37pm

Already on it (4.1) - and feels pretty stable and fast so far. I’m processing sets as big as (many) tens of gigabytes on the Kaggle ASHRAE competition at the moment - no issues encountered.
Thanks for the amazing job you are doing folks!
Much appreciated!
