GC/Memory problems with 'N Chars Filter' within a text classification WF

Hi Kilian,

I'm getting a 'GC overhead limit exceeded' error in a text classification problem with what is, from my point of view, not a very high number of documents of moderate size (7000 text files with an avg. size of 5k).

I'm using KNIME 2.11.1 on a Win 7 machine with only 4G of memory. I've assigned 3G to KNIME in knime.ini. My WF looks like this:

{Read Files}->'String Manipulation'->'Rule-based Row Filter'->'Punctuation Erasure'->'N Chars Filter'->...

{Read Files} is a metanode in which I read 500 flat text files from each of 14 directories (500*14 = 7000) with a series of 'Flat File Document Parser'->'Concatenate' nodes. In 'String Manipulation' I calculate and store the size of each of the 7000 files. The 'Rule-based Row Filter' removes all short documents (~200 are removed). After that, the remaining 6,800 files are passed to the pre-processing nodes, each of which has the flags 'Deep processing' and 'Append unchanged ..' set.
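For readers who do not have the workflow at hand, the size calculation and the short-document filter described above can be sketched in plain Python. This is only an illustration of the logic, not what the KNIME nodes do internally; the function name `filter_short_documents`, the sample file names, and the threshold of 100 characters are all hypothetical, since the actual cut-off is not stated in this post.

```python
def filter_short_documents(docs, min_chars):
    """Return (name, text, size) rows for documents with at least min_chars characters.

    docs maps a file name to its text. min_chars is a hypothetical threshold.
    """
    # Step 1: compute and store the size of every document (the 'String Manipulation' step).
    rows = [(name, text, len(text)) for name, text in docs.items()]
    # Step 2: drop all rows whose size is below the threshold (the row filter step).
    return [row for row in rows if row[2] >= min_chars]


# Tiny made-up sample: one short document and one of roughly average size.
docs = {"a.txt": "short", "b.txt": "x" * 5000}
kept = filter_short_documents(docs, min_chars=100)
print([(name, size) for name, _, size in kept])
```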

Running this WF results in 'GC overhead limit exceeded' while executing the 'N Chars Filter' node, after it has reached 75% of its progress, with ~100% CPU load (shown in the Task Manager). The 'Heap status' of KNIME shows 2.3G of 2.6G reserved. I've tried the following things without success:

- Reducing the 'File store chunk size' to 1 (or 1000)

- Changing/experimenting with the GC settings in knime.ini

- Changing/experimenting with Xmx a bit (2G, 2.5G, 3.5G)

The only workaround I've found (without reprogramming the WF) is to restart KNIME before running each node that produces such a GC exception, having stored the results of its predecessor beforehand!

But why is the 'N Chars Filter' node producing such a GC load, and why can't the memory be cleaned up accordingly? To fulfill its task, each document can be parsed separately. Furthermore, once a 'GC overhead limit exceeded' has occurred, manually triggering the GC does not free any substantial amount of memory either - my sample got stuck at '1.8G of 2.6G'.

Of course, my machine has limited memory resources - but 7000*5k of data should be manageable.

Is there something wrong with my WF or settings? Should I use 'Chunk Loops' to break the pre-processing down into several smaller data sets?
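As a rough back-of-the-envelope check of the "7000*5k" figure, the raw text volume can be compared with the configured heap. The sketch below assumes "5k" means 5 KiB per file and uses the 3G Xmx value mentioned above. It only covers the raw text; the in-memory document objects will be larger, and this sketch does not try to estimate by how much.

```python
num_files = 7000
avg_size_bytes = 5 * 1024   # assumption: "5k" is read as 5 KiB per file
heap_bytes = 3 * 1024**3    # the 3G assigned to KNIME via Xmx

raw_bytes = num_files * avg_size_bytes
raw_mib = raw_bytes / 1024**2
share = raw_bytes / heap_bytes

# Raw text only; parsed documents in memory take more space than this.
print(f"raw text: {raw_mib:.1f} MiB, {share:.2%} of the heap")
# prints "raw text: 34.2 MiB, 1.11% of the heap"
```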

Thanks in advance!

Erich

Hi Erich,

That is strange. I cannot reproduce this behavior. With KNIME 2.11.1 and 3GB Xmx I can preprocess 25,000 documents without any problems. Parsing 10,000 files and creating documents with the Flat File Parser is no problem either. The memory is completely freed afterwards. My file store chunk size is set to 11,000. Parsing and preprocessing run even with only 2GB Xmx.

Are you creating a bag of words before you apply the preprocessing nodes, such as Punctuation Erasure or N Chars Filter? If so, it is better to apply the preprocessing nodes directly after the parser node.

Do you use any tagger node before preprocessing?

Another thing you could try is to write the documents (right after the Flat File Parser) to disk using the Table Writer node. Then close the workflow, create another workflow, read the data back in using the Table Reader node, and do the preprocessing in this second workflow.

Cheers, Kilian

Hi Kilian,

Thanks for the quick reply. First of all, it's good to hear that there is no basic problem and that the data volumes are manageable.

No, I'm not using a 'Bag of Words' node - I think it is obsolete (for preprocessing) now that there are the 'Deep processing' options. Right?

No taggers are used - the WF I described is the graph that produces the problem. There is another branch, parallel to the preprocessing and fed by the 'Rule-based Row Filter' node, which is NOT executed when the GC exception occurs.

I'll try your recommendations and will come back to you.

Best

Erich

Hi Erich,

Yes, you are right. The Bag of Words node is obsolete here.

I just tried to read and process 20,000 txt files ranging from 2KB to 30KB with the Flat File Document Parser (followed by Punctuation Erasure and N Chars Filter) with 2.5GB Xmx for KNIME. It works fine on my laptop.

Try to store the documents after creation as described above and process them in a separate workflow.

Cheers, Kilian

Great tip here, Kilian. The archive is a really valuable tool - this was exactly the problem I had and your fix worked perfectly. Just for my education, why does putting the BoW after the preprocessing make such a difference?

Regards

David

Hey @DATHXL,

Let’s say you have a table with one document containing 10 unique words (one row in total). After using the BoW, you would have a table with 10 rows, each containing the document and one of the words. The preprocessing nodes would now be applied to all 10 of these rows, although they would only have to process a single document if they had been used before the BoW. Additionally, the preprocessing nodes would not filter the terms in the Term column that was created by the BoW node. So, the Term column would no longer match the content of the documents.
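To make this concrete, here is a small plain-Python sketch of both orderings. It is a simplified model, not the KNIME implementation: `bag_of_words` and `n_chars_filter` are hypothetical stand-ins for the corresponding nodes, documents are plain strings, and the filter threshold of 3 characters is chosen only for illustration.

```python
def bag_of_words(documents):
    """Return one (document, term) row per unique term of each document."""
    rows = []
    for doc in documents:
        seen = []
        for term in doc.split():
            if term not in seen:  # keep first occurrence, preserve order
                seen.append(term)
        rows.extend((doc, term) for term in seen)
    return rows


def n_chars_filter(doc, n):
    """Drop all terms with fewer than n characters from a document."""
    return " ".join(t for t in doc.split() if len(t) >= n)


# One document with 10 unique words of length 1 to 10.
docs = ["a bb ccc dddd eeeee ffffff ggggggg hhhhhhhh iiiiiiiii jjjjjjjjjj"]

# Preprocessing first: the filter runs once per document.
calls_before = len(docs)
rows_before = bag_of_words([n_chars_filter(d, 3) for d in docs])

# BoW first: the filter runs on the document cell of every row,
# and the Term column is left untouched.
rows = bag_of_words(docs)
calls_after = len(rows)
rows_after = [(n_chars_filter(doc, 3), term) for doc, term in rows]

# Terms that are still listed although the filtered document no longer contains them.
stale = [term for doc, term in rows_after if term not in doc.split()]

print(calls_before, calls_after, len(rows_before), stale)
# prints "1 10 8 ['a', 'bb']"
```

In this model the filter does ten times the work when it comes after the BoW, and the two short terms remain in the Term column even though they were removed from the document.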

Best,

Julian
