Hi Kilian,
I'm getting a 'GC overhead limit exceeded' error in a text classification problem with (from my point of view) not a very high number of documents of moderate size (7,000 text files with an average size of 5 kB).
I'm using KNIME 2.11.1 on a Windows 7 machine with only 4 GB of memory, and I've assigned 3 GB to KNIME in the knime.ini. My workflow looks like this:
{Read Files}->'String Manipulation'->'Rule-based Row Filter'->'Punctuation Erasure'->'N Chars Filter'->...
{Read Files} is a meta node in which I read 500 flat text files from each of 14 directories (500 * 14 = 7,000), using a series of 'Flat File Document Parser'->'Concatenate' nodes. In 'String Manipulation' I calculate and store the size of each of the 7,000 files. The 'Rule-based Row Filter' then removes all short documents (~200 are removed). After that, the remaining 6,800 files are passed to the preprocessing nodes, each of which has the flags 'Deep preprocessing' and 'Append unchanged ..' set.
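To make the size filtering concrete, the relevant settings look roughly like this (the column names 'text' and 'size' and the threshold of 1000 are placeholders, not my exact configuration). In 'String Manipulation', appending a 'size' column:

    length($text$)

and in the 'Rule-based Row Filter', set to exclude matching rows:

    $size$ < 1000 => TRUE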
Letting this workflow run results in 'GC overhead limit exceeded' while executing the 'N Chars Filter' node, after it has reached about 75% of its progress, with ~100% CPU load (shown in the Task Manager). The heap status in KNIME shows 2.3 GB of 2.6 GB reserved. I've tried the following things, without success:
- Reducing the 'File store chunk size' to 1 (or 1000)
- Changing/experimenting with the GC settings in knime.ini (see the example lines after this list)
- Changing/experimenting with -Xmx a bit (2 GB, 2.5 GB, 3.5 GB)
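For reference, the relevant knime.ini lines during these experiments looked roughly like this (the -Xmx value varied as listed above, and the G1 flag is just one example of the GC variations I tried):

    -Xmx3g
    -XX:MaxPermSize=256m
    -XX:+UseG1GC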
The only workaround I've found (without rebuilding the workflow) is to run each node that produces such a GC exception after a restart of KNIME, having saved the results of its predecessor beforehand!
But why does the 'N Chars Filter' node produce such GC load, and why can the memory not be reclaimed accordingly? To fulfill its task, each document can be processed separately. Furthermore, once a 'GC overhead limit exceeded' has occurred, even triggering the GC manually does not free any substantial amount of memory; my session got stuck at '1.8 GB of 2.6 GB'.
Of course, my machine has limited memory resources, but 7,000 * 5 kB ≈ 35 MB of raw data should be manageable.
Is there something wrong with my workflow or settings? Should I use chunk loops ('Chunk Loop Start'/'Loop End') to break the preprocessing down into several smaller data sets?
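That is, something along these lines (just a sketch; the chunk size of 500 is arbitrary):

    {Read Files}->'String Manipulation'->'Rule-based Row Filter'->'Chunk Loop Start' (500 rows/chunk)->'Punctuation Erasure'->'N Chars Filter'->...->'Loop End'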
Thanks in advance!
Erich