Memory issues with document classification

Hi

I use the sample Document Classification workflow, nothing special so far. When I import multiple documents I start to get errors such as "Java heap space" or "GC overhead limit exceeded". I have already raised the corresponding parameters in knime.ini and set the Memory Policy of many nodes to "Write tables to disc".

My current workaround is to copy the workflows in order to parallelize them and to reduce the node workload the further the workflow proceeds. I then concatenate them at the latest before the Document Vector node, but the maximum number of files I can currently process is 97, and only with many Eclipse restarts because the memory runs out. My PC has 16 GB of RAM.

I just want to know if there is another way to import 100 or more documents (mostly .doc & .pdf); it takes a lot of time whenever there is a change.

thanks, hoky

Hi hoky,

Better not to use the "Write tables to disk" option; "Keep only small tables in memory" should work fine. "Write tables to disk" slows down the process since the complete data is buffered on disk.

Which version of KNIME / Textprocessing are you using? How many GB have you specified as Xmx in the knime.ini? How large are the PDF / Word files in terms of disk space and approximate word count?

At which point do you get the memory problems, i.e. at which node? The Keygraph Keyword Extractor can be expensive from a memory point of view. You could instead extract the important terms with the Frequency Filter node, based on a frequency you have computed beforehand (e.g. TF).

Make sure to use FileStore Cells in the preferences (File->Preferences->KNIME->Textprocessing->Storage).

Cheers, Kilian

Hi Kilian

ok, thanks. Is it possible that the Xmx value in eclipse.ini is relevant as well? I raised it from 512M to 3840M and it works better now. Here are my answers:

Which version of KNIME / Textprocessing are you using?

KNIME 2.9.4

How many GB have you specified as Xmx in the knime.ini?

16 GB, i.e. all my physical memory

How large are the pdf / word files i.t.o. disk space and ~word length?

Approx. 100 MB of disk space

At which point do you get the memory problems, which node ?

Many nodes had trouble, e.g. BoW Creator, Case Converter, Term Grouper, Keygraph Keyword Extractor, and Document Vector.

cheers, Holger

Hi Holger,

The Xmx parameter needs to be set in the knime.ini file. If you are using the KNIME SDK, you start KNIME out of the SDK; there you can specify in the Run Configuration how much memory (Xmx) the KNIME process may use.
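For reference, the memory setting belongs in the VM arguments section at the end of knime.ini, e.g. (the value below is only an illustration, not a recommendation):

-vmargs
-Xmx8192m

When starting KNIME from the SDK, the same -Xmx argument goes into the "VM arguments" field of the Run Configuration instead.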

Since KNIME 2.9 the Textprocessing extension supports direct preprocessing. This means that you do not need to apply the filtering and preprocessing nodes, such as the Case Converter, to a bag of words; you can apply these nodes directly to the list of documents.

For example:
PDF Parser -> Case Converter -> Stop Word Filter -> ... -> Keygraph Keyword Extractor -> Document Vector
PDF Parser -> Case Converter -> Stop Word Filter -> ... -> BoW Creator -> TF -> Frequency Filter -> Document Vector

Create the bag of words at the end, just before you apply the Document Vector node.

Cheers, Kilian


Hi Kilian
thx a lot.

Here is what happened: I used plain Eclipse with the KNIME plugin and not the KNIME workbench, which is why the changes in knime.ini had no effect but those in eclipse.ini did impact the performance.

I have another question about the document classification:
Do you have an idea how I can make sure that the one single document I want to classify is also part of the test set? Since the classification works well with 100 docs, I now want to test it with single documents against the training set.
Is there a way to tell the Partitioning node to put a certain, single record into the test set? If the method is random this might be impossible, or could I set a flow variable with the file name or use the Joiner?

thx again
Holger

Hi Holger,

I am not sure if I understood you correctly. To select one single document for the test set and iterate over all documents, you can use a cross-validation loop (X-Partitioner and X-Aggregator nodes). In the X-Partitioner node dialog, select "leave one out" for leave-one-out cross-validation.
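A rough sketch of such a loop (the Decision Tree nodes are only an example, any learner/predictor pair will do):

Document Vector -> X-Partitioner ("leave one out") -> Decision Tree Learner / Decision Tree Predictor -> X-Aggregator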

If you simply want to select one certain document as the test set, without looping/iterating over all documents, use the Row Splitter node instead of the Partitioner node. The Row Splitter works like the Row Filter but has two output tables. You can filter e.g. by the row ID to get one certain document as the test set and the remaining documents as the training set.
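Sketched in the same style (assuming you match on the row ID of the target document):

Document Vector -> Row Splitter (match the target row ID) -> first output: test document, second output: training documents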

If you want to use a document which is in your training set also in your test set, use the Row Filter node to extract this particular document and then the Concatenate node to append this row to the data table containing the other test documents.
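Roughly (the predictor at the end stands for whatever classifier you are using):

Training table -> Row Filter (include only the target document) -> Concatenate (with the test table) -> Predictor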

Cheers, Kilian

Hi Kilian

thanks, I will test it; I think you understood what I want.

cu, Holger

Hi Kilian

fyi: it worked, I used the Row Filter & Splitter - thanks.

Bye, Holger