Quick ways to improve KNIME memory usage...

pwisneskey · February 28, 2020, 4:56pm

So I have been trying to improve the performance of several workflows that process massive amounts of data, and I have been running into memory limitations when they are executed in AWS Fargate containers. Along the way I’ve made a very interesting discovery: two potential ways to greatly reduce the memory usage of KNIME, particularly for string-heavy datasets.

Basically, while researching to try to better tune the garbage collector KNIME uses for our workload, I discovered that the G1GC collector that is configured in knime.ini supports an option to deduplicate strings in memory. By enabling the option -XX:+UseStringDeduplication in my knime.ini I greatly reduced the overall memory consumption because every full garbage collection resulted in the deduplication of a large number of strings in memory.

So this got me curious about doing that deduplication more aggressively and I wrote a custom node to stream through a buffered data table and replace every StringCell with a StringCell whose value had been interned with the String class’ intern() method. That did result in memory savings for me downstream but my upstream nodes still had the duplicate string copies and I also had the overhead of the extra memory consumed by my custom node.
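To illustrate the effect described above, here is a minimal standalone sketch (not the custom node itself, and not using any KNIME API). It builds a column of repeated values as separate String objects, then counts distinct object identities before and after interning. The class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

public class InternDemo {

    /** Counts distinct String objects by identity (==), not by equals(). */
    public static int distinctInstances(List<String> values) {
        Set<String> seen = Collections.newSetFromMap(new IdentityHashMap<String, Boolean>());
        seen.addAll(values);
        return seen.size();
    }

    /** Returns a new list in which every value is replaced by its interned copy. */
    public static List<String> internAll(List<String> values) {
        List<String> result = new ArrayList<>(values.size());
        for (String v : values) {
            result.add(v.intern());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> column = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            // new String(...) forces one object per row, similar to values parsed from a file
            column.add(new String(i % 2 == 0 ? "Germany" : "France"));
        }
        System.out.println("distinct objects before interning: " + distinctInstances(column));
        System.out.println("distinct objects after interning:  " + distinctInstances(internAll(column)));
    }
}
```

With 1000 rows and two distinct values, the first count is 1000 and the second is 2. The original list still holds its 1000 separate objects until it is released, which matches the observation that upstream nodes keep their duplicate copies.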

So I dug a little deeper. KNIME’s data cell design is very good because the data cell values are immutable and most nodes seem to take care to only replace/copy a data cell when they need to. So most extra duplicate allocations are avoided in branched paths, etc. However, the StringCell itself has its string value internally and this value is not deduplicated unless the garbage collector option is enabled.

So I’d like to propose that the constructor method of the org.knime.core.data.def.StringCell class be changed to intern() the string value parameter it is invoked with as follows:

public StringCell(final String str) {
    if (str == null) {
        throw new NullPointerException("String value can't be null.");
    }
    m_string = str.intern();
}

This would mean that any String cell’s internal value loaded into memory will only have one copy, regardless of whether it was created in memory or streamed from a cache, etc. There is a small performance penalty to pay for the intern(), but I believe the benefits to memory usage outweigh it, particularly if the data being processed has lots of repetitive string values.

If there is a concern about this performance and the size of the string pool behind String, the intern() could be made a configurable option for a workflow or KNIME environment. There is also the possibility of tuning the string pool size. I was referring to http://java-performance.info/string-intern-in-java-6-7-8/ for my initial explorations.
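A rough sketch of what a configurable variant could look like is below. This is purely illustrative: the class InternableStringValue and the system property example.stringcell.intern are hypothetical names invented for this example, and nothing like this switch exists in KNIME as far as I know.

```java
public final class InternableStringValue {

    // Hypothetical property name, assumed for illustration only.
    public static final String INTERN_PROPERTY = "example.stringcell.intern";

    private final String m_string;

    /** Interns the value only if the JVM was started with -Dexample.stringcell.intern=true. */
    public InternableStringValue(final String str) {
        this(str, Boolean.getBoolean(INTERN_PROPERTY));
    }

    /** Explicit variant, so the behaviour can be chosen per call. */
    public InternableStringValue(final String str, final boolean intern) {
        if (str == null) {
            throw new NullPointerException("String value can't be null.");
        }
        // Without interning, the caller's String instance is stored unchanged.
        m_string = intern ? str.intern() : str;
    }

    public String getStringValue() {
        return m_string;
    }
}
```

The string pool size itself is a separate JVM setting (-XX:StringTableSize), so a switch like this would only control whether the pool is used, not how large it is.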

pwisneskey · February 28, 2020, 7:55pm

I’ve been doing some deeper investigation as I continue to tune my workflows and I’m a little wary of my proposed intern() change for the StringCell constructor since the string pool behind that is of limited size and I have not been able to confirm what type of logic is used to decide what stays in the pool.

But the string deduplication option for the G1GC garbage collector is looking like something I would strongly recommend KNIME consider adding to its knime.ini permanently. The deduplication done by that is not limited by the size of the string pool and it specifically targets longer lived objects. By default, it only targets objects that have survived three prior garbage collections but that can be changed with configuration parameters to make it more or less aggressive.

I’ve watched a few runs of my big workflows with the deduplication statistics enabled (-XX:+PrintStringDeduplicationStatistics) and it really has been making a considerable impact on the memory usage of KNIME. For reference, we are processing many thousands of rows of records that have a column with one of 255 countries and 150 different measure names.
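For anyone who wants to confirm that the options are actually reaching the JVM, here is a small sketch that inspects the arguments of the JVM it is running in. It uses the standard RuntimeMXBean API; the class and helper names are made up for this example, and running it inside KNIME (for instance from a Java snippet) is an assumption I have not tested here.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

public class JvmFlagCheck {

    /** True if the exact flag (e.g. "-XX:+UseStringDeduplication") is among the JVM arguments. */
    public static boolean hasFlag(List<String> jvmArgs, String flag) {
        return jvmArgs.contains(flag);
    }

    /** Returns the value of a "name=value" style option, or null if it was not set. */
    public static String optionValue(List<String> jvmArgs, String name) {
        String prefix = name + "=";
        for (String arg : jvmArgs) {
            if (arg.startsWith(prefix)) {
                return arg.substring(prefix.length());
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Arguments passed to the JVM this code runs in (not to other processes).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("G1 enabled explicitly:  " + hasFlag(jvmArgs, "-XX:+UseG1GC"));
        System.out.println("String deduplication:   " + hasFlag(jvmArgs, "-XX:+UseStringDeduplication"));
        System.out.println("Dedup age threshold:    " + optionValue(jvmArgs, "-XX:StringDeduplicationAgeThreshold"));
    }
}
```

A null age threshold just means the option was not set explicitly, in which case the JVM default applies.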

Also, here is a good reference article on the pros and cons: https://dzone.com/articles/usestringdeduplication

pwisneskey · February 28, 2020, 8:36pm

Further reading about how to make String.intern() more effective by tuning the String pool size: http://java-performance.info/string-intern-in-java-6-7-8/

marc-bux · March 2, 2020, 10:40am

Hi @pwisneskey,

First of all, thanks a lot for the in-depth post and analysis. I agree with your points on String deduplication being a promising lead in reducing the memory usage of KNIME AP. While it really depends on the workflow at hand, there definitely are many cases where StringCells in the same column hold many duplicate Strings.

From what I’ve read and what you found as well, I am reluctant about the String#intern approach, since (a) depending on how Strings are created, they might be interned already, (b) there is a performance cost for invoking the native String#intern method, and, finally, (c) the number of interned Strings is fixed (to some configurable number which defaults to 60,013), whereas the number of distinct String(Cells) can vary greatly between workflows.

Conversely, as you pointed out, G1’s -XX:+UseStringDeduplication looks like a good fit here, since it specifically targets long-lived objects, which StringCells usually are. I’ll go ahead and suggest that we evaluate / benchmark that option internally on a set of workflows, play around with the -XX:StringDeduplicationAgeThreshold option and consider adding it to our default. In the meantime, do let us know how using that option turns out for you.

Regards,

Marc

beginner · March 2, 2020, 11:46am

Not directly an answer, but this might be of interest to you as well with regard to memory.

pwisneskey · March 2, 2020, 3:37pm

Thanks Marc! I agree completely with your analysis and preference for adjusting the garbage collector settings rather than changing the StringCell code. We have definitely seen a solid improvement in the memory usage of string-heavy workflows.

I am also considering using Category to Number for some of our most repetitive string columns with fixed values (countries and categories for our modeling). That could reduce memory usage not just in RAM but also when the data is streamed to disk, though I am assuming that the compression of data written to disk is already quite efficient.
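The idea behind mapping categories to numbers can be sketched in a few lines of plain Java. This is only an illustration of dictionary encoding, not how the Category to Number node is implemented; the class CategoryEncoder is a made-up name.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CategoryEncoder {

    private final Map<String, Integer> codes = new HashMap<>();
    private final List<String> labels = new ArrayList<>();

    /** Returns the code for a category, assigning the next free code on first sight. */
    public int encode(String category) {
        Integer code = codes.get(category);
        if (code == null) {
            code = labels.size();
            codes.put(category, code);
            labels.add(category);
        }
        return code;
    }

    /** Maps a code back to its original category string. */
    public String decode(int code) {
        return labels.get(code);
    }

    /** Encodes a whole column; the result stores one int per row instead of one String reference. */
    public int[] encodeAll(List<String> column) {
        int[] result = new int[column.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = encode(column.get(i));
        }
        return result;
    }
}
```

Each distinct string is then kept once in the dictionary, and the column itself holds only small integers, which is why a fixed set of countries or measure names is a good candidate.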

We are currently chasing down a much more nefarious memory/resource leak that appears to occur when there are some combinations of nested recursive and parallel chunk nodes. It is occurring in our complex workflows, but it also just arose in a self-contained workflow that I wrote for my next blog posting on genetic algorithms. In this workflow, each iteration of the outer loop is a new generation and each generation is exactly the same size, so I would expect a consistent memory usage pattern, but this is not the case. The first 5 iterations run very fast, then the workflow begins to slow to a crawl around iteration 8 and always fails with a heap error around iteration 10, regardless of my heap size settings.

We are working with Stephen Rauner on this and other issues and I can push my self-contained workflow to the KNIME hub if you all would like to try to replicate. However, I was going to see if I could reduce it to an even simpler workflow to show the issue without all the Monte Carlo simulations needed to do the genetic fitness scoring.

My gut instinct is that the issue is something with the recursive node not releasing resources as it should and that the parallel chunk loop inside of it just makes the issue occur faster. It seems like the issue occurs in both KNIME 3.7 and 4.1, so I don’t think it is a regression in the latest version. If I can create a simpler test workflow, I can verify this as well.

CarlWitt · April 23, 2020, 12:58pm

Hi @pwisneskey,

thanks again for the proposal. We evaluated -XX:+UseStringDeduplication on several workflows and found that, although it can save memory on certain workflows, other workflows take severe performance hits (15%-25% runtime degradation).

For the average case, the footprint reduction is rather small, as notable memory savings typically appear only when processing 1M rows or more. Thus we thought it’s probably best to leave enabling G1 String Deduplication to the more experienced users, who have studied the characteristics of their workflow.

However, thanks again for the valuable feedback!