GroupBy: Execute failed: Java heap space

exo-kn · February 19, 2010, 4:55am

hi there!

i am using the latest version (2.1.1.0023926) and ran into problems while doing a market basket analysis (mba). my data consists of 200.000 single transactions which have to be grouped into individual baskets. to check the process i used a set of 500 transactions, and everything worked fine. now i tried to analyze 50.000 transactions. (btw, the process is File Reader, Domain Calc, One2Many, GroupBy, Bitvector Gen., AssRuleL)

now i get the following error message from the GroupBy node: “ERROR GroupBy Execute failed: Java heap space”, even though i selected the “Write tables to disc.” option in the memory policy. who can help? is knime maybe not capable of analyzing such an amount of data?

the system i am using is a 32 bit intel quadcore xp-prof machine, 4 gig ram.

tobias.koetter · February 19, 2010, 12:32pm

Hi,
could you please give some more information on the data and the aggregation method you use in the GroupBy node? The GroupBy node is implemented to use as little memory as possible. That is why it first sorts the input table by the group column and then performs the aggregation step. It processes the table group-wise and keeps in memory only the data needed for the aggregation of the current group. Some of the aggregation methods consume more memory since they keep all (distinct) values per group in memory (e.g. the unique methods, List, Set, Concatenate, …). If you have only one group or a few groups for a large table and choose one of the memory-intensive aggregation methods, the node consumes more memory.
The enable hiliting option also consumes memory since it maintains a map of the original and new row keys in memory.
Besides these effects, the GroupBy node normally does not consume a lot of memory.
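
For illustration, the sort-then-aggregate strategy described above can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not KNIME's implementation: the function name `sorted_group_max` and the tuple-based rows are made up for the example, and the sort here happens in memory for brevity.

```python
from itertools import groupby
from operator import itemgetter


def sorted_group_max(rows, key_index=0):
    """Sort rows by the group column, then aggregate one group at a time.

    Hypothetical helper for illustration; rows are plain tuples.
    """
    # Step 1: sort by the group column (done in memory here for brevity).
    ordered = sorted(rows, key=itemgetter(key_index))
    result = []
    # Step 2: walk the sorted rows group by group.
    for key, group in groupby(ordered, key=itemgetter(key_index)):
        maxima = None  # only the running maxima of the current group are kept
        for row in group:
            values = row[:key_index] + row[key_index + 1:]
            if maxima is None:
                maxima = list(values)
            else:
                maxima = [max(a, b) for a, b in zip(maxima, values)]
        result.append((key, maxima))
    return result


rows = [
    (1000002, 0, 1),
    (1000001, 1, 0),
    (1000001, 0, 0),
    (1000002, 1, 0),
]
print(sorted_group_max(rows))
```

With an aggregation such as maximum, only one value per aggregation column has to be held for the current group, which matches the low memory use described above. Methods like List or Set would instead have to collect every value of the group.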
Bye,
Tobias

exo-kn · February 19, 2010, 8:40pm

hi tobias,

the data i am using looks pretty much like that:

transactionID articleNR
1000001 1234
1000001 9876
1000001 4567
1000002 4231
1000002 7895
1000002 3210
1000002 1234

after using the one2many node i get
transactionID articleNR 1234 9876 4567 …
1000001 1234 1 0 0…
1000001 9876 0 1 0…
1000001 4567 0 0 1…
1000002 4231 …
…

then i use the groupby node with the aggregation method "maximum"
so i expect to get:

transactionID 1234 9876 4567 4231 7895 …
1000001 1 1 1 0 0 …
1000002 1 0 0 1 1 …

the group column is transactionID and all the other columns (about 2000 in total) are aggregated. maximum unique values per group is set to 80, hiliting is disabled, sorting in memory is disabled, retain order is disabled, and column naming is set to aggregation method (column name).
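
as a rough illustration of the transformation described above (one 0/1 column per article, then the maximum per transactionID), here is a small python sketch. it is not what the knime nodes do internally; the function names `one_to_many` and `baskets` are made up for the example, and the data is the sample from this post.

```python
def one_to_many(transactions):
    """Turn (transactionID, articleNR) pairs into rows with one 0/1 flag per article."""
    articles = sorted({article for _, article in transactions})
    rows = [
        (tid, [1 if article == col else 0 for col in articles])
        for tid, article in transactions
    ]
    return articles, rows


def baskets(transactions):
    """Collapse the flag rows to one row per transaction using the maximum."""
    articles, rows = one_to_many(transactions)
    out = {}
    for tid, flags in rows:
        current = out.get(tid)
        out[tid] = flags if current is None else [max(a, b) for a, b in zip(current, flags)]
    return articles, out


transactions = [
    (1000001, 1234), (1000001, 9876), (1000001, 4567),
    (1000002, 4231), (1000002, 7895), (1000002, 3210), (1000002, 1234),
]
articles, result = baskets(transactions)
print(articles)
for tid, flags in result.items():
    print(tid, flags)
```

each output row is then the bit vector of one basket. note that the columns in this sketch are ordered by article number, so the column order differs from the tables above.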

hope that helps! and thank you for your help!

tobias.koetter · February 22, 2010, 3:07pm

Hi,
sorry for the delayed answer.
The GroupBy node does not consume a lot of memory with your settings. It creates a maximum aggregation object for each aggregation column (2000) that keeps only the maximum value per group. So it keeps only 2000 DataValues in memory, which isn’t that much.
Had your application already consumed a lot of memory before the GroupBy node started, and is that why it crashes? Could you do me a favor and switch on the heap status in KNIME to monitor the memory actually consumed? To show the heap status, open the File menu in KNIME and go to Preferences. In the Preferences dialog go to the General section, tick the ‘Show heap status’ option and close the dialog by clicking the OK button. KNIME should then show a memory bar in the lower right corner that displays the used and remaining memory. Could you then please have a look at how the memory consumption changes while the GroupBy node is running?
Thanks,
Tobias

tobias.koetter · February 23, 2010, 8:50am

Hi,
I have performed some more tests and could force an out-of-memory exception for an example similar to yours by setting the available memory to 64MB and changing the Memory Policy of the GroupBy node to ‘Keep all in memory’. After changing the Memory Policy back to the default (‘Keep only small tables in memory’) the node executed successfully. Could you please check the Memory Policy of the GroupBy node? It should be set to ‘Keep only small tables in memory’. The Memory Policy is the rightmost tab of the node dialog.
Thanks,
Tobias

exo-kn · March 20, 2010, 4:58pm

hi tobias

thank you for your answers. please excuse my delay in answering back! i tried again what i already described above and switched on the heap status monitor. it shows 508 mb of available space and, before running the groupby node for the first time, 20 mb of used space. while the node is running, the used space rises (in waves), and when it reaches 508 mb knime cancels the process and gives the java heap space error. after cancelling, the used space goes back to about 16 mb. any idea?

exo-kn · March 20, 2010, 11:15pm

next thing i just tried is to save the results of the preceding nodes in a csv table. then i loaded only this table and attached the groupby node to it. same result! btw, the size of the table was 1,1 gb (it contained ~560 million values!). i really assume that knime cannot work with such an amount of data.

tobias.koetter · March 23, 2010, 7:28pm

Hi,
I could now reproduce your problem. The OutOfMemory exception occurs during the sorting of the input table by the selected group columns. We will have a deeper look into the sorting implementation and find a solution for the problem. Unfortunately I don’t have a workaround for the problem yet, but I will ask my colleagues whether they have an idea.
Thanks again for the comment.
Tobias

wiswedel · March 28, 2010, 5:32pm

We found the problem: The sorter currently uses a fixed threshold to determine the size of the chunks to be sorted in memory and doesn’t respect the memory requirement of the rows (numeric cells can be significantly different from string cells in terms of memory footprint). This can cause problems when you have many columns (and many rows).

We will enhance the sorting for v2.2 by determining the chunk size based on the available memory, not on a fixed cell count threshold. We will also improve the group-by node by not sorting the data for “trivial” aggregation methods such as max/min/mean.

Until v2.2 becomes available you can try to tweak the threshold parameter by adding the line -Dorg.knime.container.cellsinmemory=10000000 to the knime.ini file. This will increase the cell count threshold (which consequently reduces the number of temporary buffers). I was able to sort a (numeric) table with 1Mio rows x 2K columns (20GB data table) with 1G heap size, so I’m positive that you can also use the group-by node on your data set.
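
To make the role of a cell count threshold more concrete, here is a minimal Python sketch of a chunked (external) merge sort. It is an illustration under stated assumptions and not the KNIME sorter: the names `external_sort` and `max_cells_in_memory` are hypothetical, and the chunks are written as pickled rows to temporary files.

```python
import heapq
import pickle
import tempfile


def external_sort(rows, key, max_cells_in_memory):
    """Sort rows in chunks of at most roughly max_cells_in_memory cells.

    Each full chunk is sorted in memory, spilled to a temporary file,
    and the sorted chunks are merged lazily at the end.
    """
    chunk_files = []
    chunk, cells = [], 0

    def flush():
        nonlocal chunk, cells
        if not chunk:
            return
        chunk.sort(key=key)
        f = tempfile.TemporaryFile()
        for row in chunk:
            pickle.dump(row, f)
        f.seek(0)
        chunk_files.append(f)
        chunk, cells = [], 0

    for row in rows:
        chunk.append(row)
        cells += len(row)  # counts cells only, not their size in bytes
        if cells >= max_cells_in_memory:
            flush()
    flush()

    def read(f):
        # Stream the rows of one sorted chunk back from disk.
        try:
            while True:
                yield pickle.load(f)
        except EOFError:
            return
        finally:
            f.close()

    return heapq.merge(*(read(f) for f in chunk_files), key=key)


data = [(3, "c"), (1, "a"), (2, "b"), (5, "e"), (4, "d")]
print(list(external_sort(data, key=lambda r: r[0], max_cells_in_memory=4)))
```

A larger threshold means fewer but bigger chunks, and therefore fewer temporary files to merge. Because the sketch counts cells and not bytes, a chunk of wide string cells can need far more heap than a chunk of numeric cells with the same cell count.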

Thanks,
Bernd

acuevas · October 31, 2010, 2:57am

Hi! I'm trying to do the same thing, but I want to use the Apriori node. However, when I use the One2Many node nothing happens, so I don't know what to do. Can you help me, please? =)

acuevas · October 31, 2010, 4:05am

Now I understand the problem, and I fixed it :). However, I'm using the Apriori node to extract association rules, and when the process reaches 50% it fails with this error message: ERROR Apriori Execute failed: Weka associator can not work with given data. Reason: Apriori: Cannot handle numeric attributes! I then changed the columns to String values, but got another error: ERROR Apriori Execute failed: Weka associator can not work with given data. Reason: Apriori: Cannot handle string attributes!

So I don't understand the problem. Please help me, I need to get this done urgently :(

Thanks :)

ritika · May 15, 2013, 8:23am

Hi,

I am trying to concatenate 2 tables with ~90MM rows each. I keep getting the following error: Execute failed: Java heap space

I even increased the heap space up to 40 GB, but to no effect. Please suggest! Thanks.

wiswedel · May 15, 2013, 5:46pm

Are you using the GroupBy node to do that (as suggested by this thread's topic)? Otherwise, if you are using the "Concatenate" node it might be a problem related to the duplicate key handling. If you know the IDs of the tables are unique you should use the "Duplicate row id handling" -> "Fail execution" option as it otherwise needs to keep the row IDs in memory. If they are not unique you need to make them unique using the RowID node.
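
As a rough sketch of why duplicate handling costs memory: making IDs unique while concatenating means remembering every ID seen so far, while trusting the IDs to be unique allows plain streaming. The Python below is only an illustration, not the Concatenate node's code. The function names and the `_dup` suffix are hypothetical.

```python
from itertools import chain


def concat_uniquify(*tables, suffix="_dup"):
    """Concatenate tables of (row_id, row) pairs, renaming duplicate IDs."""
    seen = set()  # grows with the total number of rows
    for row_id, row in chain(*tables):
        new_id = row_id
        while new_id in seen:
            new_id += suffix
        seen.add(new_id)
        yield new_id, row


def concat_assume_unique(*tables):
    """Concatenate tables without tracking IDs; the caller guarantees uniqueness."""
    yield from chain(*tables)


a = [("Row0", 1), ("Row1", 2)]
b = [("Row0", 3), ("Row2", 4)]
print(list(concat_uniquify(a, b)))
```

With tens of millions of rows, the `seen` set in the first function is what would fill the heap. The second function holds nothing beyond the current row.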

Bernd

DavidC · May 23, 2014, 2:34pm

Hello,

I'm experiencing the same 'Java heap space' error when I try to merge two files (33 and 48 million records, same structure, duplicate IDs) using the Concatenate node (with the 'Write tables to disc' selected in the 'Memory Policy' tab). Any solution?

Thank you for your support,
David

Operating System: Windows 7 pro 64-bit
Installed memory (RAM): 8 GB
KNIME 2.9.1 (32-bit version)

wiswedel · May 25, 2014, 12:33pm

As per my previous comment it has to keep the keys in memory in order to de-duplicate:

Two solutions:

1. As per comment 13, make sure to have unique IDs when feeding data to the Concatenate node.
2. Increase the memory KNIME is allowed to use (the default is only 512MB for 32-bit KNIME). That is: get 64-bit KNIME and then increase the heap.

Hope this helps,
Bernd

DavidC · May 26, 2014, 2:08pm

Thank you for your support Bernd. I cannot use the 64-bit version because I have to access MS Office 32-bit files. However, the Concatenate node worked perfectly with unique IDs in the input files.

Thanks again!