Heap Space Error - Workarounds for GroupBy

Hi from Stuttgart,

at a "GROUP BY" node of a workflow (see pic; part of a elaborated JOIN metanode with 2 input and 3 output ports), I faced the famous "Java Heap Space" error.

I changed the settings in knime.ini to -Xmx16g (the machine has 16 GB of memory on board), but even though the process takes around 13 GB according to the Windows Task Manager, the error still occurs.
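
For reference, the memory option sits below the -vmargs line in knime.ini; the relevant part of mine now reads (other entries omitted):

```
-vmargs
-Xmx16g
```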

Do you know any workarounds to process the two data streams (ca. 40,000 rows from input port 0, 300,000 from input port 1) and join them without this error?

Best regards

Bernd

Hi Bernd,

could you please give us some more information about the settings of the GroupBy node, such as the number of group columns, the selected aggregation methods, and whether you have enabled the "Process in memory" option? It would also help to know more about the data that you process with the GroupBy node, such as the column types, e.g. large documents, images, or only numbers.

I'm asking these questions since they influence the amount of memory the GroupBy node requires for processing. Usually the node sorts the input table by the group column(s) and then processes each group separately, keeping only the information for one group in memory at a time. If your input table has only one group and you have selected "List" as the aggregation method for all remaining columns, the whole input table will be read into memory, which might cause a Java Heap Space error.
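
To illustrate the idea (just a rough sketch of the general sort-based strategy in plain Java, not KNIME's actual code): because the rows arrive sorted by the group key, a single buffer for the current group is enough, and each finished group can be written out before the next one starts.

```java
import java.util.List;

// Sketch of sort-based grouping: since the rows arrive sorted by the
// group key, only the rows of the current group are held in memory.
public class SortedGroupBy {

    record Row(String key, String value) {}

    public static void main(String[] args) {
        // Toy input, already sorted by key (the GroupBy node sorts internally).
        List<Row> sorted = List.of(
                new Row("A", "x"), new Row("A", "y"), new Row("B", "z"));

        String currentKey = null;
        StringBuilder concat = new StringBuilder();
        for (Row r : sorted) {
            if (!r.key().equals(currentKey)) {
                if (currentKey != null) {
                    emit(currentKey, concat.toString()); // flush the finished group
                }
                currentKey = r.key();
                concat.setLength(0); // reuse the buffer for the next group
            }
            if (concat.length() > 0) {
                concat.append(", ");
            }
            concat.append(r.value());
        }
        if (currentKey != null) {
            emit(currentKey, concat.toString()); // flush the last group
        }
    }

    static void emit(String key, String aggregate) {
        System.out.println(key + " -> " + aggregate);
    }
}
```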

Bye

Tobias

Hi Tobias,

thanks so far for your reply.

Here are the settings of the GroupBy node:
Group column: GLOBALID (Type: Number, ca. 300,000 distinct values)
Aggregation: SOURCE (Type: String, Method: Concatenate)

I enabled the "Process in memory" option and selected "Keep all in memory" in the "Memory Policy" tab.

Hope that helps.

Bernd

Hi Bernd,

I guess the problem is the enabled "Process in memory" option together with the Concatenate aggregation. This results in a lot of memory consumption, since all group values are kept in memory until the end. The group values themselves can also be quite large depending on the size of the strings in the SOURCE column, since they all get concatenated into a single string. So I would suggest disabling the "Process in memory" option and selecting the "Keep only small tables in memory" option in the "Memory Policy" tab; the node should then run through without any memory issues.
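
For comparison, here is a rough sketch of what the hash-based in-memory strategy boils down to (again plain Java for illustration, not KNIME's implementation): one growing buffer per group key, all of them kept alive until the very last row has been read. With ca. 300,000 groups and long SOURCE strings this adds up quickly.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of hash-based (in-memory) grouping: one growing buffer per group
// key, and ALL of them stay alive until the last input row has been read.
public class InMemoryGroupBy {
    public static void main(String[] args) {
        String[][] rows = {{"A", "x"}, {"B", "z"}, {"A", "y"}}; // unsorted input
        Map<String, StringBuilder> buffers = new HashMap<>();
        for (String[] row : rows) {
            StringBuilder sb = buffers.computeIfAbsent(row[0], k -> new StringBuilder());
            if (sb.length() > 0) {
                sb.append(", ");
            }
            sb.append(row[1]); // every buffer keeps growing until the end
        }
        buffers.forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```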

Bye

Tobias

... thanks, Tobias, it worked well.

Bernd