GC overhead limit exceeded and/or Java Heap Space Issues

Hi,

it's me again, with a problem related to GC overhead limit exceeded / Java heap space issues.

I'm using Ubuntu 14.04 LTS on an Intel® Core™ i7-4790 CPU @ 3.60GHz × 8 machine with 16 GiB RAM and an SSD. So one would think the performance is good enough for more resource-consuming calculations.

knime.ini: -Xmx5120m, which should be more than enough?!
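For reference, the memory-related part of my knime.ini looks roughly like this (other entries omitted; the JVM options sit below the -vmargs line):

    -vmargs
    -Xmx5120m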

My question is basically whether I'm asking for too much when I'm trying to do one of the following:

...read in a table with 95,000 rows, each containing a document with quite a lot of text, plus approx. 60 columns used for categories (simple strings), send it through some preprocessing nodes and then let the Keygraph Keyword Extractor node extract about 10 keywords per document (so 95,000 × 10 keywords). This is the point where I get one of those two errors.

or

read in a table with a Document column and a single category column, put it through some preprocessing, then let the keyword extractor extract about 15 keywords per document (20,000 × 15), assign the category as the class (Category To Class node), prepare a 70:30 partition (Partitioning node) and finally send it to the Decision Tree Learner, SVM Learner and Naive Bayes Learner (and their predictor nodes). The partitioning is where I get the same errors (heap space or GC). Before that, with the node still running, I get warnings like "Potential deadlock in SWT Display thread detected. Full thread dump will follow as debug ouput".

Using fewer rows in the situations described above (e.g. only 30,000 in the first case), everything works fine up to the partitioning. But when it comes to the learner nodes, everything breaks down again with Java memory-related errors (like: "ERROR SVM Learner (deprecated) 2:42       Configure failed (OutOfMemoryError): GC overhead limit exceeded").

So I'm starting to think that KNIME's weak spot is that it's based on Java. I reckon such an analysis tool should be able to handle this amount of data, right?

I'd be really thankful for any information you can give me about that situation. Of course, if you need some more information or (parts of) the workflow(s), please let me know.

Cheers,

Manu

 

Hi Manu,

you are right, your machine is strong enough to handle that data. KNIME can handle many hundreds of thousands to millions of rows on a regular machine like yours. However, sometimes you can still run into problems like you did, for a few reasons:

a) Text processing is more expensive (CPU- and memory-wise) than processing "regular" numerical or string data, since quite a lot of heavy lifting is needed to handle the rather complex Document types/cells.

b) The Keygraph Keyword Extractor node is unfortunately not implemented in a memory-optimized way. This node should be avoided if you have larger amounts of documents. Instead, use tf-idf values and extract the terms with the highest tf-idf values as the most important ones. The workflow could look like this:

Parsing -> Preprocessing nodes ... -> Bag Of Words Creator -> TF -> IDF -> Math Formula (tf*idf) -> Rank (by tf-idf, group by Document) -> Row Filter (rank 1 to 10).

The trick here is the Rank node (rank by tf-idf, group by Document), which ranks the terms for each document so that you can filter by rank afterwards. This gives you the N most important (highest-ranked) terms per document. Both the Rank node and the Row Filter are memory-optimized and can handle bigger data tables.

Be aware that the rank operation might take a while, since grouping and therefore sorting is required. However, the node will not crash.
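If it helps to see the computation spelled out, here is a rough Python sketch of what that node chain does. The toy data and the exact tf/idf formulas are just assumptions on my side (the TF and IDF nodes offer several variants), so treat it as an illustration, not as the node implementation:

    import math
    from collections import Counter

    # Toy corpus standing in for the (already preprocessed and tokenized) Document column.
    docs = {
        "doc1": ["knime", "memory", "heap", "memory"],
        "doc2": ["knime", "workflow", "keyword", "keyword", "heap"],
    }

    n_docs = len(docs)
    tf = {d: Counter(terms) for d, terms in docs.items()}            # TF step: term frequency per document
    df = Counter(t for terms in docs.values() for t in set(terms))   # document frequency per term
    idf = {t: math.log(n_docs / df[t]) for t in df}                  # IDF step (one common variant)

    top_n = 10
    for d, counts in tf.items():
        # Math Formula step: tf * idf; Rank + Row Filter steps: keep the top_n highest-scoring terms per document
        ranked = sorted(counts, key=lambda t: counts[t] * idf[t], reverse=True)
        print(d, ranked[:top_n])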

c) Preprocessing nodes can and should always be applied before the bag of words is created (see the example above). Preprocessing is much faster and more memory-efficient if it is applied directly to the documents instead of to a bag of words, since a bag of words has one row per document-term pair and therefore many more rows to process.

d) About the SVM (or other learners): filter out the document column before training the model. The document column should not be part of the features; only keep numerical or categorical columns as features. Documents as a feature make no sense. How many feature columns do you have, and how many rows for training and testing?
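Outside of KNIME, the same idea as a small scikit-learn sketch (the column names and data are made up, purely for illustration): the raw document column is dropped and only the numeric feature columns go into the learner, analogous to a Column Filter in front of the SVM Learner.

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # Made-up table: a raw text column, numeric keyword features and a class column.
    data = pd.DataFrame({
        "document":  ["text a ...", "text b ...", "text c ...", "text d ...", "text e ...", "text f ..."],
        "kw_score1": [0.7, 0.1, 0.5, 0.9, 0.3, 0.6],
        "kw_score2": [0.2, 0.8, 0.4, 0.3, 0.9, 0.1],
        "category":  ["A", "B", "A", "B", "A", "B"],
    })

    X = data.drop(columns=["document", "category"])   # "Column Filter" step: drop the document column
    y = data["category"]

    # 70:30 partition, then train the SVM on the numeric features only
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)
    model = SVC().fit(X_train, y_train)
    print(model.score(X_test, y_test))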

For b) and c) I attached an example workflow.

I hope this helps.

Cheers, Kilian

Maybe one summarizing comment: the memory problem you are struggling with is a problem of specific nodes (Keygraph Keyword Extractor), not a problem of Java or KNIME.

Cheers, Kilian

Hello,

I'm experiencing the same problem with the GroupBy node, reading a table with more than 77 million rows and only three columns.

I'm running KNIME on Windows 7, 64-bit, with 4 GB RAM.

I've already increased the heap size with -Xmx3g in the knime.ini file, but the node still doesn't work.

Any suggestions?

Thanks in advance
