Memory problems causing server to shut down KNIME (it's driving me crazy)

Hi everyone. I’m desperately hoping that someone can help me here, because I can’t for the life of me see a solution and I’m running out of options.

I’ve been running KNIME to build classifiers for high-dimensional data for a while now: ~400 features, ~300 observations. The workflows that I’m running are highly iterative, in that there is forward feature selection, hyperparameter optimisation and bootstrapping. Unfortunately, when I try to run these workflows, I very quickly run out of memory. I’m not running them on a laptop or desktop; I quickly run out of memory on a server with 320GB of RAM.

I’ve searched through all the previous posts on memory management: I have switched on the garbage collector button at the bottom of the KNIME screen, I have included the Vernalis Heavy garbage collector node in my workflows, and I have set the -Xmx2048m switch in knime.ini. None of these things works! (Help!) The percentage of server RAM that the Java virtual machine occupies still grows and grows once the workflow has started, until the server has no choice but to kill KNIME (with the error: Memory pressure relief: total: res = 122757712/12275712/0, res+swap=7475200/7475200/0).

If I were coding the analysis, it wouldn’t have any memory problems, as I could simply reuse the same vectors/matrices in each iteration. The only way I can think of to explain the huge, growing memory demand would be if all the tables created on each iteration of every loop were retained in memory. 1000 bootstraps x 50 hyperparameter optimisations x 1000 FFS iterations = ~50,000,000 iterations, so I could see how this would clog up the server memory if the data from all 50,000,000 loop iterations were retained.

This leads me to ask about garbage collection. I know nothing about how the Java VM works, so how does it know which data structures to mark for garbage collection? Is there a way within KNIME to specify, at the end of a loop, that the variables created inside the loop are to be sent to garbage?

Help! If I can’t find a solution, I’ll have to abandon KNIME and start coding the analysis in Python.

Thanks for reading this long post. The contents of my knime.ini are below if it helps.

Thanks for any help you can provide.
Steve.

(base) [steve@stratmed0 knime_4.6.0]$ more knime.ini
-startup
plugins/org.eclipse.equinox.launcher_1.6.100.v20201223-0822.jar
--launcher.library
plugins/org.eclipse.equinox.launcher.gtk.linux.x86_64_1.2.100.v20210209-1541
-vm
plugins/org.knime.binary.jre.linux.x86_64_17.0.3.20220429/jre/bin
-vmargs
-Darrow.enable_unsafe_memory_access=false
-Darrow.memory.debug.allocator=false
-Darrow.enable_null_check_for_get=false
--add-opens=java.security.jgss/sun.security.jgss.krb5=ALL-UNNAMED
--add-exports=java.security.jgss/sun.security.jgss=ALL-UNNAMED
--add-exports=java.security.jgss/sun.security.jgss.spi=ALL-UNNAMED
--add-exports=java.security.jgss/sun.security.krb5.internal=ALL-UNNAMED
--add-exports=java.security.jgss/sun.security.krb5=ALL-UNNAMED
-server
-Dsun.java2d.d3d=false
-Dosgi.classloader.lock=classname
-XX:+UnlockDiagnosticVMOptions
-XX:+UseG1GC
-Dsun.net.client.defaultReadTimeout=0
-XX:CompileCommand=exclude,javax/swing/text/GlyphView,getBreakSpot
-Dknime.xml.disable_external_entities=true
--add-opens=java.base/java.lang=ALL-UNNAMED
--add-opens=java.base/java.lang.invoke=ALL-UNNAMED
--add-opens=java.base/java.net=ALL-UNNAMED
--add-opens=java.base/java.nio=ALL-UNNAMED
--add-opens=java.base/java.nio.channels=ALL-UNNAMED
--add-opens=java.base/java.util=ALL-UNNAMED
--add-opens=java.base/sun.nio.ch=ALL-UNNAMED
--add-opens=java.base/sun.nio=ALL-UNNAMED
--add-opens=java.desktop/javax.swing.plaf.basic=ALL-UNNAMED
--add-opens=java.base/sun.net.www.protocol.http=ALL-UNNAMED
-Xmx2048m
-Dorg.eclipse.swt.internal.gtk.disablePrinting
(base) [steve@stratmed0 knime_4.6.0]$

Hi @stevenwatterson

A question that immediately comes to mind is where you currently have the garbage collection located within your workflow and, more importantly, where it sits in relation to the loops.

Regarding how memory usage works on the server, this post explains that pretty well in case you haven’t seen it yet!

@stevenwatterson

In your knime.ini file you are setting -Xmx to 2GB. This sets the maximum amount of heap space that KNIME can use. Therefore, I suspect you are rapidly exhausting the heap space with your calculation, even though the calculation doesn’t look too onerous.

You should be able to set -Xmx to 200GB (-Xmx200g) on your server, which may fix the problem.

DiaAzul

Hi @stevenwatterson,

You have to set the -Xmx option to a proper amount of heap space. I would recommend setting it to 120GB using the following line within your knime.ini:

-Xmx120g

Although your machine would provide space for a higher amount of configured heap space, I would stop at 120GB, because the runtime of each full garbage collection (GC) depends directly on the amount of heap space. As the JVM freezes all running processes (and so the communication as well) during garbage collection, you will experience around 120 seconds of freezing when a full GC kicks in (as a rule of thumb, a full GC takes around 1 second per GB of heap space).

Please change the -Xmx setting within your knime.ini as recommended and restart the KNIME Analytics Platform afterwards so that the new setting takes effect.
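For reference, that is a one-line edit in the -vmargs section of the knime.ini you posted: replace

-Xmx2048m

with

-Xmx120g

and leave all the surrounding lines untouched.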

Best,
Michael

Hi @MichaelRespondek, @ArjenEX, @DiaAzul

Thank you all for getting back to me. I’ve increased the -Xmx size as you’ve suggested and set this running. It’ll take 5 or 6 hours for the memory footprint of the workflow to grow sufficiently to test the limits.

Sorry if these are silly questions, but in the online forums -Xmx2048m seems to come up regularly as a sensible base configuration (for laptops). What determines whether the Java VM can ignore the -Xmx limit and run riot with the RAM? Naively, I would have thought that the VM would always be constrained to the specified amount, however small (or would crash with an error if it was too small).

Thanks for your help.
Steve.

Sorry if these are silly questions, but in the online forums -Xmx2048m seems to come up regularly as a sensible base configuration (for laptops).

That’s most likely because those users all had similar laptops on which the available RAM wasn’t that great. Common practice is to allocate a certain percentage of your available RAM to KNIME; I’d say 70-80% is usually a good mark, depending on what else you have running. In my case, I have 16GB available and allocate 12GB to KNIME, which gives me room to also run a few other applications on the side while developing and running workflows. On the server we use (dedicated only to KNIME), it’s close to 95%.

@stevenwatterson

The Java VM cannot grow the heap beyond the limit set by the -Xmx parameter. This ensures that sufficient memory is left over for (a) other memory uses by the Java VM, such as the stack, buffers and other tasks consuming memory, and (b) the operating system and other applications. If the Java VM were allowed to consume memory without limit, there would ultimately be a situation where it becomes impossible for the OS to allocate memory for itself, and a deadlock could ensue.

The internet is now old technology :rofl: :rofl: :rofl: and so quite a few comments date back to the days when a powerful 32-bit Windows laptop might have had 4GB of RAM. Given that only 3.5GB of that was usable, a 2GB limit on the Java VM was a sensible maximum allocation. I don’t know how people coped with such limited amounts of memory, but fortunately technology has advanced and we have more scope to really mess things up.
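As an aside, if you want to verify the heap limit that the running JVM has actually picked up, something along these lines should work from a terminal. This is a sketch: it assumes jcmd is on your PATH and that pgrep -f knime matches only the KNIME process.

jcmd $(pgrep -f knime) VM.flags      # prints -XX:MaxHeapSize=... as applied at startup
jcmd $(pgrep -f knime) GC.heap_info  # current heap occupancy versus the configured maximum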

The configured heap space should act as a maximum (apart from a small amount of additional memory needed to start Eclipse), so it seems to be wrongly configured on your machine. Are you trying to use the columnar storage backend? That will use additional memory outside the JVM heap space. External scripting via nodes such as those of the R or Python integrations will also use additional memory.

The -Xmx2048m setting is often cited because it is the default setting of the KNIME Analytics Platform. But it should be set to a higher amount if more memory is available for heap space (within the limits I mentioned in my last post).

Nested loops will build up very large result data structures that are kept in memory, which could cause the JVM to crash if it runs into an “OutOfMemory” error. Maybe it would be better to split the processing into chunks?

Best,
Michael

Thanks @MichaelRespondek, @DiaAzul, @ArjenEX

Unless I’m misunderstanding what I’m reading, growing beyond the -Xmx limit would appear to be exactly what is happening.

I haven’t tried the columnar storage backend and can give that a try. I am using a Python Script node for oversampled class balancing, which may be contributing.

@MichaelRespondek, when you say “Nested loops will build up very large result data structures that are kept in memory, which could cause the JVM to crash if it runs into an ‘OutOfMemory’ error”: this is exactly my problem. Do you mean that the garbage collector will never collect anything inside a loop (in which case my problem is fatal), or just that it is possible to create data structures that will cause the JVM to run into out-of-memory errors?

Hi @MichaelRespondek @DiaAzul @ArjenEX

A quick update. I set the workflow running with -Xmx200g, which represents about 63% of the RAM on the system. However, the JVM is now up to 65.1% and continuing to grow.
I can see how the columnar storage backend might bring some efficiency improvements, but I need to solve the mystery of why the JVM is not stopping at the -Xmx limit. This happens at 2GB and now at 200GB.

Above, you mentioned complications from the Python integration. I have a Python Script node, but it’s at a high level of the nested loops, so it isn’t called many times, and I wouldn’t expect it to contribute heavily to the memory bloat. The data types and structures involved are all fairly trivial: tables/vectors of numbers.

Is there a reason that intermediate data inside inner loops would not be marked for garbage collection? Is there a way to force this?

Thanks for your patience.
Steve.

Currently up to 82% of RAM. :scream: :scream:

The -Xmx option only limits the size of the heap; Java also uses memory for other activities, such as just-in-time-compiled code, per-thread stack storage, file buffers and other uses that do not relate to the storage of objects. Typically this memory usage is small relative to the size of the heap.

It is incredibly difficult to provide any meaningful suggestions without looking at the workflow, the settings and how the system works in operation. It could be a trivial problem or a difficult one, but without diagnostic tools to track what is happening, everything is guesswork.

You may want to download and use VisualVM. This tool monitors program activity and memory usage and may help to track down the root cause of your problem.
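If you prefer the command line, a rough equivalent of that cross-check can be done with the standard JDK tools. Again a sketch, with the same assumptions (jstat on the PATH, pgrep -f knime matching only the KNIME process):

jstat -gcutil $(pgrep -f knime) 5s   # heap generation usage and GC counts, sampled every 5 seconds
top -p $(pgrep -f knime)             # resident set size (RES) of the whole process

If the jstat columns stay flat while the RES figure in top keeps climbing, the growth is happening outside the heap.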

The alternative is the classic programmer approach of deconstructing your workflow and testing each part piece by piece until you identify possible causes of the problem.

DiaAzul

Hi @DiaAzul, thanks for your response.

I’ve tried digging into Java’s RAM usage and the results are bizarre. Thanks for pointing me towards VisualVM. It’s telling me that the heap space is fine and stable: the heap space, metaspace, number of threads and number of classes are all stable, while the Java memory consumption (as displayed in top) grows and grows.

I’ve also tried using jcmd to get an idea of the native memory consumption, and everything reported by jcmd looks fine; everything is well within manageable limits.
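For anyone reading along: jcmd only reports native memory if Native Memory Tracking was enabled when the JVM started. A minimal recipe (assuming, again, that pgrep -f knime matches only the KNIME process) is to add

-XX:NativeMemoryTracking=summary

to the -vmargs section of knime.ini, restart, and then, while the workflow runs:

jcmd $(pgrep -f knime) VM.native_memory baseline       # take a baseline snapshot
jcmd $(pgrep -f knime) VM.native_memory summary.diff   # later on, show the growth per JVM subsystem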

I’ve tried switching off the JIT compiler (-Xint), limiting direct memory allocation (-XX:MaxDirectMemorySize=1G) and limiting direct byte buffers (-Djdk.nio.maxCachedBufferSize=1000000), all with no success.

While browsing Stack Overflow, I found this question, which seems very similar to what I’m experiencing, but it’s beyond my ability to address. :cry:

Thanks
Steve.

@stevenwatterson

I would put that on the list of things less likely to be the cause, on the basis that in that case the issue was remedied when the garbage collector ran. As you have already used the garbage collector aggressively (as per your earlier posts), this militates against the Stack Overflow case.

Given what you have reported - the heap size is stable, but non-heap memory is increasing - I would suggest another avenue of exploration.

I have noticed in the past, when creating Java Snippet nodes, that KNIME slows down significantly when handling exceptions. This occurred when cells contained null data (question marks). To improve processing speed I had to check for nulls and handle them gracefully within the Java snippet code, to ensure that an exception was not thrown. Historically, KNIME nodes that had to process null cells were also slow, though there has been an improvement in some of the more recently updated nodes (e.g. the Joiner). Whether that improvement is due to changes in the way that exceptions are pre-empted and dealt with I do not know; only the KNIME software developers can say.

My suggestions are:
1/ Check the KNIME log file to see whether there are any error, warning or info statements. This may indicate a problem in the workflow that needs to be addressed (see the sketch after this list for a quick way to do this from a terminal).
2/ If there are a lot of info or warning statements that have no implication for the calculation, then consider changing the reporting option in the preferences (KNIME -> Log file level) to Error (I think the default is Warn). It could be that a lot of messages are being queued in a buffer whilst they wait to be written to disk. You may want to set the log level to Info to see if it provides additional information that helps with discovery, and then set it back to Warn or Error when doing the full processing.
3/ Check all of your data and replace any missing data (null values) with a value. This will reduce the chance of any null-value exceptions being raised.
4/ Sanitise your workflow against any data which may lead to any other exception in any of the nodes. This should be apparent from the log file.
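For point 1, a quick way to eyeball the log from a terminal is sketched below. It assumes the default workspace location; adjust the path if your workspace lives elsewhere.

tail -f ~/knime-workspace/.metadata/knime/knime.log                    # follow the log live while the workflow runs
grep -c -E "ERROR|WARN" ~/knime-workspace/.metadata/knime/knime.log   # rough count of problem lines so far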

The reason I am focused on exceptions is that when an exception is raised it is passed to the JVM to process. The JVM then scans the stack for an exception handler and passes the exception to that handler. If a lot of exceptions are being raised, and the exception handler is unable to accept them quickly enough (for instance, if they are being processed and an event written to the log file), then the exceptions will be queued, leading to increased memory usage over time.

I’m guessing, so usual caveats apply.

DiaAzul.

Instead of fixing the RAM issue, would it not make sense to question the actual procedure?
What model are you using for the classifiers?
Does it make sense to have more features than observations? (In general no, and even less so depending on the model.)
Does it make sense to do forward feature selection in this case?

Personally, I very much dislike any such type of feature selection, as it is computationally far too expensive and you are trying a gazillion possibilities, greatly increasing the chance of finding “random correlations”.

The features should first be reduced, either by domain knowledge or, in the case of calculated features (like chemical descriptors), by considering whether all of them are really needed. Filtering them by correlation and low variance can often reduce them to a more meaningful number. If interpretability is not relevant, then PCA or the like could also be used.

In essence, if 320GB of RAM doesn’t do it with 300 rows and 400 features, it’s the procedure that is the issue.

@stevenwatterson a question (just to be sure): is the new Columnar Table Backend activated? As the heap is stable etc., issues with garbage collection or any other Java-related memory allocation are out of the question, afaik.

Edit: just saw that you didn’t activate the Columnar Backend. The reason I’m asking is that the Columnar Backend actually does its memory allocation off-heap, which has all kinds of advantages (especially when you want to use a lot of memory, which in the case of heap memory might cause the garbage collection to go crazy). If you had switched on the Columnar Backend, I would have suggested playing with the caching parameters in the preferences and actually reducing -Xmx to 32GB at most while fixing the off-heap memory at 200GB++.

In your scenario, however, it seems that some KNIME node is contributing to off-heap memory and doesn’t clean it up. This could be some node allocating (and not releasing) off-heap memory, or a Python script leaking, or… The only way to figure out which node is actually leaking is to disable parts of the workflow and try to narrow down the specific part that leaks.

If there is no leak, it could be a problem with Linux itself: I once ran into a similarly weird (and annoying) memory-leak scenario when we initially developed the Columnar Table Backend. The solution was found here (Linux kernel bug): Troubleshooting Problems With Native (Off-Heap) Memory in Java Applications - DZone Java (see point 3). Can you try to set the env variable, either specifically for the KNIME process or globally, and test whether the off-heap usage keeps increasing?
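A minimal way to try this for the KNIME process only (MALLOC_ARENA_MAX is the glibc setting the linked article discusses; the value 4 is just a common starting point, not a tuned recommendation):

export MALLOC_ARENA_MAX=4   # cap the number of glibc malloc arenas for processes started from this shell
./knime                     # launch KNIME from the same shell so that it inherits the variable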

Hi @DiaAzul, thanks for that suggestion. There might be something to it, as I do get a lot of warnings from the ensemble of classifiers that is part of the workflow. I’ve tried switching the logging from Warn to Error as you suggested, but it hasn’t had any effect. However, I’m going to spend a little time curating the hyperparameter optimisation, as I do get a lot of warnings coming through the console, which must have consequences for memory management (and definitely for efficiency).

Hi @kienerj, thanks for getting in touch. Ordinarily, I’d agree with you that any analysis that places too great a demand on the system is a bad analysis. However, in this case, what I’m attempting shouldn’t place a great demand on the system. It’s reasonably pedestrian, just very iterative.

Hi @christian.birkhold, thanks for getting in touch. I had found that article before, but we’re at the limits of my knowledge of Linux/Java here, and most of the large allocations coming up in pmap didn’t seem to be labelled. Unfortunately, I’m finding that setting MALLOC_ARENA_MAX and MALLOC_CHECK isn’t having any effect either. I had removed the Python scripting node and replaced it with SMOTE (which led to a huge performance speed-up).
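For anyone who wants to repeat the pmap check, something like the following lists the largest mappings of the running process (a sketch, once more assuming pgrep -f knime matches only the KNIME process):

pmap -x $(pgrep -f knime) | sort -n -k3 | tail -20   # the 20 mappings with the largest resident size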

Anyone who’s interested can download the workflow here, and you can see that I’m not attempting anything crazy. It looks like it’ll have to be old-fashioned debugging.

Thanks for everyone’s suggestions.

@stevenwatterson is the data private, or would it be possible for you to share it? I’d like to run the workflow myself and see if I also get a leak. Also, I assume you experience the memory leak when running the top path with “SMOTE”, right?

Hi @christian.birkhold. Yes, the leak continues with SMOTE in the top pathway. I’ve a list of these sorts of analyses to run and I’m just trying to crack it in one analysis first. I can’t post the actual data here as it’s regulated, but I can make up some data to post that will have the same problems.