Java Heap Space Issues / GC overhead limit exceeded

As soon as an algorithm needs random access to the data (which the majority of machine learning algorithms do), the data is usually kept in memory. In principle you could work around this, but it would slow the algorithm down considerably.
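If you keep hitting "GC overhead limit exceeded", a sensible first check is whether the JVM is actually allowed to use your machine's RAM. Here is a minimal sketch in plain Java (no product-specific API assumed); if maxMemory() reports far less than your physical memory, raising the heap limit at launch (e.g. java -Xmx8g ...) is the usual first step:

    public class HeapCheck {
        public static void main(String[] args) {
            // Query the JVM's own view of its heap limits.
            Runtime rt = Runtime.getRuntime();
            long mb = 1024L * 1024L;
            System.out.println("max heap  : " + rt.maxMemory() / mb + " MB");   // upper bound (-Xmx)
            System.out.println("total heap: " + rt.totalMemory() / mb + " MB"); // currently reserved
            System.out.println("free heap : " + rt.freeMemory() / mb + " MB");  // unused within total
        }
    }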

I just looked at your numbers, and one reason for the high memory consumption is the number of columns you are using (31,255). While we don't keep the complete data in memory, we do keep metadata for every column in memory, and for that many columns this adds up to quite a lot (see the rough estimate below).

BTW, having 31,255 features with only 19,997 rows doesn't make much sense, even less so if you are splitting the data and training on an even smaller subset. If the number of features is greater than (or even close to) the number of rows, the learned models are usually pretty useless.
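To make the metadata point above concrete, here is a hypothetical back-of-the-envelope sketch. All sizes in it are assumptions for illustration, not the real object layout; nominal columns tend to be the expensive ones because their metadata includes a mapping of every distinct string value:

    public class ColumnMetadataEstimate {
        public static void main(String[] args) {
            int columns = 31_255;             // column count from the question
            long baseBytes = 300;             // assumed: name, type, role, basic statistics
            long nominalMappingBytes = 4_000; // assumed: distinct-value map of a nominal column
            double nominalFraction = 0.5;     // assumed share of nominal columns

            double totalBytes = columns * (baseBytes + nominalFraction * nominalMappingBytes);
            System.out.printf("~%.0f MB of column metadata alone%n",
                    totalBytes / (1024 * 1024));
        }
    }

Even under these modest assumptions the metadata alone lands in the tens of megabytes, before a single cell value is loaded.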