Java Heap Space Issues/GC overhead limit exceeded

Hi,

I already described my problem in the following post:

http://tech.knime.org/node/55380/view

Maybe my description there was a bit too long and convoluted.

After trying to run my workflow with different sizes of data (decreasing the size from try to try), it still gives me either "GC overhead limit exceeded" or "Java heap space". The funny thing is, I ran the same amount of data through the same workflow with identical settings twice: the first time I got the "GC overhead limit exceeded" error, the second time simply the "Java heap space" error, without changing anything between the two executions. How is that possible? I tried to read up on the Java heap space topic, but it seems there's no clear answer on that.

Also, I read somewhere that the advantage of KNIME over other analytics tools is supposed to be that you can work with huge amounts of data, because you have the option to "write tables to disc", which takes the load off the memory. I am using this option for all my nodes. Apparently this option simply has no effect (placebo), or am I getting it wrong?

I attached a screenshot of the (part of the) workflow I want to run. I read in a table of 13.6 MB (19,997 rows, 31,255 columns). The Partitioning node splits the table 70:30 for learning and testing. My SVM Learner (overlapping penalty 2.0; polynomial 1.0/1.0/1.0) then throws said errors somewhere between 70% and 99% of its execution.

If someone could explain this to me I would be very thankful.

Manu

 

The two error messages basically describe the same situation: the program requires more memory but cannot get any more.
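Both are thrown as java.lang.OutOfMemoryError; which message you see depends on how the JVM happens to run out, not on two different problems. A minimal plain-Java sketch (nothing KNIME-specific; run it with a small heap such as -Xmx64m so it fails quickly):

    // Demo only: keep allocating until the heap is exhausted, then print
    // which OutOfMemoryError message the JVM chose. Whether it reports
    // "Java heap space" (an allocation could not be satisfied) or
    // "GC overhead limit exceeded" (the GC spent nearly all its time
    // reclaiming almost nothing) depends on how the run happens to fail.
    import java.util.ArrayList;
    import java.util.List;

    public class OutOfMemoryDemo {
        public static void main(String[] args) {
            List<int[]> hog = new ArrayList<>();
            try {
                while (true) {
                    hog.add(new int[1024]); // keep everything reachable
                }
            } catch (OutOfMemoryError e) {
                hog.clear(); // release memory so printing still works
                System.out.println("java.lang.OutOfMemoryError: " + e.getMessage());
            }
        }
    }

That is also why you can see either message on two identical runs: the underlying cause (a full heap) is the same, only the way the JVM detects it differs.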

It's true that KNIME can handle very large amounts of data. However, some operations or algorithms require the whole data set to be in memory. This is the case for most learner nodes, such as the SVM Learner. The model that is built also needs to fit completely into memory.

If you have enough memory in your computer, you can increase the memory available to KNIME, see https://tech.knime.org/faq#q4_2.
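If you want to double-check that the -Xmx value from knime.ini actually took effect, the running JVM reports its limit via Runtime.getRuntime().maxMemory(). A generic Java check (not a KNIME node):

    // Prints the maximum heap the JVM will try to use, which corresponds
    // roughly to the -Xmx value it was started with.
    public class MaxHeapCheck {
        public static void main(String[] args) {
            long maxBytes = Runtime.getRuntime().maxMemory();
            System.out.printf("Max heap: %.0f MB%n", maxBytes / (1024.0 * 1024.0));
        }
    }

If the printed value is far below what you configured, the JVM was started with different options than you think.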

Hi Thor,

thanks for your quick answer.

Interesting to know that most learner nodes need everything in memory. Do you have any further information on that (URLs to articles etc.)? Just out of curiosity... I want to approach this in more of a research-oriented way.

I already have my configuration set to -Xmx12288m. I really thought that should be enough...

Thanks,

Manu

As soon as an algorithm needs random access to the data (which the majority of machine learning algorithms do), the data is usually kept in memory. In principle you could work around this, but it would slow down the algorithm considerably.
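To give a rough feeling for why that is, here is a small self-contained Java sketch (the file, sizes and access pattern are made up for illustration) that reads the same values once from an in-memory array and once from a file via seeks:

    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.Random;

    public class RandomAccessSketch {
        public static void main(String[] args) throws IOException {
            final int rows = 200_000;

            // The same "column" of doubles, once in memory and once on disk.
            double[] inMemory = new double[rows];
            File tmp = File.createTempFile("rows", ".bin");
            tmp.deleteOnExit();
            try (RandomAccessFile onDisk = new RandomAccessFile(tmp, "rw")) {
                Random data = new Random(42);
                for (int i = 0; i < rows; i++) {
                    double v = data.nextDouble();
                    inMemory[i] = v;
                    onDisk.writeDouble(v);
                }

                // Random access pattern, as an iterative learner would produce.
                int[] order = new Random(7).ints(rows, 0, rows).toArray();

                long t0 = System.nanoTime();
                double sumMem = 0;
                for (int idx : order) sumMem += inMemory[idx];
                long memNanos = System.nanoTime() - t0;

                t0 = System.nanoTime();
                double sumDisk = 0;
                for (int idx : order) {
                    onDisk.seek(idx * 8L); // 8 bytes per double
                    sumDisk += onDisk.readDouble();
                }
                long diskNanos = System.nanoTime() - t0;

                System.out.printf("in memory: %d ms, via disk: %d ms (checks: %.2f / %.2f)%n",
                        memNanos / 1_000_000, diskNanos / 1_000_000, sumMem, sumDisk);
            }
        }
    }

Even with the operating system caching the file, every read in the second loop goes through a seek and a system call, so it is typically orders of magnitude slower than the plain array lookup. That is the price an algorithm would pay for not keeping the data in memory.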

I just looked at your numbers, and one reason for the high memory consumption is the number of columns you are using (31,255). While we don't keep the complete data in memory, we do keep metadata for every column in memory, and for such a large number of columns this can add up to quite a lot. By the way, having 31,255 features with only 19,997 rows doesn't make much sense, even less so if you are splitting the data and training on an even smaller set. If the number of features is greater than (or even close to) the number of rows, the learned models are usually pretty useless.
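To see that last point for yourself, here is a small illustration (plain Java, all numbers made up, nothing to do with your actual data): when there are far more completely uninformative features than rows, some feature will fit the training labels quite well purely by chance, which is exactly the kind of model that looks fine on the training partition and is useless afterwards.

    import java.util.Random;

    public class ManyFeaturesDemo {
        public static void main(String[] args) {
            int rows = 30;         // training examples
            int features = 30_000; // random, label-independent features
            Random rnd = new Random();

            boolean[] label = new boolean[rows];
            for (int i = 0; i < rows; i++) label[i] = rnd.nextBoolean();

            int bestMatches = 0;
            for (int f = 0; f < features; f++) {
                int matches = 0;
                for (int i = 0; i < rows; i++) {
                    boolean value = rnd.nextBoolean(); // feature is pure noise
                    if (value == label[i]) matches++;
                }
                // a feature can predict the label directly or inverted
                bestMatches = Math.max(bestMatches, Math.max(matches, rows - matches));
            }
            System.out.printf("Best noise feature fits %d of %d training labels (%.0f%%)%n",
                    bestMatches, rows, 100.0 * bestMatches / rows);
        }
    }

On a typical run the best noise feature matches more than 80% of the 30 training labels, even though by construction it carries no information at all; with 31,255 real columns and a small training partition the same kind of chance fitting happens inside the learner.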