Iteration through large table of collection cells

Hi,

I have a table of 7000 rows, each with a collection cell (list) of some 47 DNA sequences (a copy of the collection in each row). But iterating through the table consumes over 1 GB of memory with:

while (it.hasNext()) {
    DataRow r = it.next();
    // increases java heap space up to 1GB
}

 

in the NodeModel's execute() method. Is this normal? Is there any way to manage the iteration to avoid this behavior?
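For context, the loop sits in the node roughly like this (the column index, the cast and the pass-through return are placeholders for what my node actually does; the classes come from org.knime.core.data, org.knime.core.data.collection and org.knime.core.node):

@Override
protected BufferedDataTable[] execute(final BufferedDataTable[] inData,
        final ExecutionContext exec) throws Exception {
    BufferedDataTable table = inData[0];
    RowIterator it = table.iterator();
    while (it.hasNext()) {
        DataRow r = it.next();
        // read the list cell holding the 47 sequences; no reference to r is kept across iterations
        CollectionDataValue seqs = (CollectionDataValue) r.getCell(0);
        exec.checkCanceled();
    }
    return inData; // placeholder: the real node builds its own output table
}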

 

thanks in advance,

Do you get an OutOfMemoryError or do you just notice that the memory consumption goes up (using the heap status bar in the lower right corner of the KNIME GUI or some OS system tool such as the 'top' command)?

Java has its own internal memory management and often uses more memory than is actually needed. A lot of that memory is occupied by dead objects and is freed as soon as memory gets low or there is some idle time.
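A quick way to see the difference is to compare the heap the JVM has allocated with what is actually in use; this is plain Java, nothing KNIME-specific:

// dead objects still count as "used" here until the garbage collector runs
Runtime rt = Runtime.getRuntime();
long allocated = rt.totalMemory();        // heap currently claimed by the JVM
long used = allocated - rt.freeMemory();  // portion holding live and dead objects
System.out.printf("used: %d MB, allocated: %d MB, max: %d MB%n",
        used >> 20, allocated >> 20, rt.maxMemory() >> 20);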

However, if you get OutOfMemoryErrors, this would indicate a bug.

Please clarify! Thanks,

 Bernd

Hi,

 

An OutOfMemoryError is thrown, and the Java heap space (shown in the KNIME GUI) also increases up to the -Xmx limit before dying. I tried invoking System.gc() every few iterations, but that doesn't seem to free the dead objects.

 

thanks for your prompt reply,

Can you give details on how large the DNA sequence is? I suppose it's just a very long string but how long? Did you aggregate those using the "Create Collection Column" node?

I would like to reproduce this problem.

Thanks,
  Bernd

Hi Bernd,

 

The collection of Strings is introduced to each row via the Java Snippet node (returning a String array) rather than the collection cell nodes; it's easier to get the collection into each row that way. The box plot summary for the lengths of the 47 DNA strings is as follows:

Minimum         659
Smallest        659
Lower Quartile  1121
Median          1605
Upper Quartile  1845
Largest         2479
Maximum         2979
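By the way, if the list cell were built in node code rather than in the snippet, I understand it would look roughly like this (dnaSequences is a placeholder for the 47 strings; CollectionCellFactory and StringCell live in org.knime.core.data.collection and org.knime.core.data.def):

// wraps each sequence in a StringCell and aggregates them into a single list cell
List<DataCell> cells = new ArrayList<DataCell>();
for (String seq : dnaSequences) {
    cells.add(new StringCell(seq));
}
DataCell listCell = CollectionCellFactory.createListCell(cells);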

 

hope this helps,

I can reproduce it, but only if I cache the data during the iteration (or put it into a new DataContainer). If I then set the "Memory Policy" to "Write to disc", things are good.
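Roughly the pattern that reproduces it, with the container left at its default memory policy ('table' stands for the input table):

// copies each row into a new container; by default the container decides what to keep in
// memory by counting cells, not by their size, so the large collection cells pile up on the heap
BufferedDataContainer cont = exec.createDataContainer(table.getDataTableSpec());
for (DataRow r : table) {
    cont.addRowToTable(r);
    exec.checkCanceled();
}
cont.close();
BufferedDataTable copy = cont.getTable();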

Can you confirm this?

Hi,

 

After 20% of the rows, the heap status shows a heap size of 110 MB (seems pretty stable), so I can confirm that "Write to disc" makes memory usage better. Note that the node I am developing uses the three-argument form of exec.createDataContainer() with maxCellsInMemory set to zero.
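For reference, that is this call ('spec' stands for the node's output DataTableSpec):

// maxCellsInMemory = 0 tells the container to write every row to disc right away
BufferedDataContainer cont = exec.createDataContainer(spec,
        /* initDomain */ true, /* maxCellsInMemory */ 0);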

 

Is it expected that the KNIME user will know to set "Write to disc" and which node to set it on?

 

cheers

Thanks for clarifying. At least we now know the origin of the problem. It's a limitation of how KNIME determines the size of a table (by counting elements, not by looking at the size of the elements).

We are working on improving this but you need to live with it for the time being.

Just came across this old thread and want to close it in case others are reading old posts too: since KNIME 2.8 there is a memory watcher at work that takes care of large tables. If memory gets low, all tables are swapped to disc to avoid memory errors.