Iteration through large table of collection cells

Hi,

I have a table of 7000 rows, each with a collection cell (list) of some 47 DNA sequences (a copy of the collection in each row). But iterating through the table consumes over 1 GB of memory with:

while (it.hasNext()) {
    DataRow r = it.next();
    // increases java heap space up to 1GB
}

 

in the NodeModel's execute() method. Is this normal? Is there any way to manage the iteration to avoid this behavior?
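For context, the loop sits in the node roughly like this (the column index, the cast and the pass-through return are placeholders for what my node actually does; the classes come from org.knime.core.data, org.knime.core.data.collection and org.knime.core.node):

@Override
protected BufferedDataTable[] execute(final BufferedDataTable[] inData,
        final ExecutionContext exec) throws Exception {
    BufferedDataTable table = inData[0];
    RowIterator it = table.iterator();
    while (it.hasNext()) {
        DataRow r = it.next();
        // read the list cell holding the 47 sequences; no reference to r is kept across iterations
        CollectionDataValue seqs = (CollectionDataValue) r.getCell(0);
        exec.checkCanceled();
    }
    return inData; // placeholder: the real node builds its own output table
}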

 

thanks in advance,

Do you get an OutOfMemoryError or do you just notice that the memory consumption goes up (using the heap status bar in the lower right corner of the KNIME GUI or some OS system tool such as the 'top' command)?

Java has its own internal memory management and often uses more memory than is actually needed. A lot of that memory is occupied by dead objects and is freed as soon as memory gets low or there is some idle time.
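A quick way to see the difference is to compare the heap the JVM has allocated with what is actually in use; this is plain Java, nothing KNIME-specific:

// dead objects still count as "used" here until the garbage collector runs
Runtime rt = Runtime.getRuntime();
long allocated = rt.totalMemory();        // heap currently claimed by the JVM
long used = allocated - rt.freeMemory();  // portion holding live and dead objects
System.out.printf("used: %d MB, allocated: %d MB, max: %d MB%n",
        used >> 20, allocated >> 20, rt.maxMemory() >> 20);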

However, if you get OutOfMemoryErrors, this would indicate a bug.

Please clarify! Thanks,

 Bernd

Hi,

 

An OutOfMemoryError is thrown, and the Java heap space (shown in the KNIME GUI) also increases up to the -Xmx limit before dying. I tried invoking System.gc() every few iterations, but that doesn't seem to free the dead objects.

 

thanks for your prompt reply,

Can you give details on how large the DNA sequence is? I suppose it's just a very long string but how long? Did you aggregate those using the "Create Collection Column" node?

I would like to reproduce this problem.

Thanks,
  Bernd

Hi Bernd,

 

The collection of Strings is introduced to each row via the Java Snippet node (returning a String array) rather than the collection cell nodes; it's easier to get the collection into each row that way. The box plot summary for the lengths of the 47 DNA strings is as follows:

Minimum         659
Smallest        659
Lower Quartile  1121
Median          1605
Upper Quartile  1845
Largest         2479
Maximum         2979
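By the way, if the list cell were built in node code rather than in the snippet, I understand it would look roughly like this (dnaSequences is a placeholder for the 47 strings; CollectionCellFactory and StringCell live in org.knime.core.data.collection and org.knime.core.data.def):

// wraps each sequence in a StringCell and aggregates them into a single list cell
List<DataCell> cells = new ArrayList<DataCell>();
for (String seq : dnaSequences) {
    cells.add(new StringCell(seq));
}
DataCell listCell = CollectionCellFactory.createListCell(cells);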

 

hope this helps,

I can reproduce it, but only if I cache the data during the iteration (or put it into a new DataContainer). If I then set the "Memory Policy" to "Write to disc", things are good.
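Roughly the pattern that reproduces it, with the container left at its default memory policy ('table' stands for the input table):

// copies each row into a new container; by default the container decides what to keep in
// memory by counting cells, not by their size, so the large collection cells pile up on the heap
BufferedDataContainer cont = exec.createDataContainer(table.getDataTableSpec());
for (DataRow r : table) {
    cont.addRowToTable(r);
    exec.checkCanceled();
}
cont.close();
BufferedDataTable copy = cont.getTable();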

Can you confirm this?

Hi,

 

After 20% of the rows, the heap status shows a heap size of 110 MB (seems pretty stable), so I can confirm that "Write to disc" makes memory usage better. Note that the node I am developing uses the three-argument form of exec.createDataContainer() with maxCellsInMemory set to zero.
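For reference, that is this call ('spec' stands for the node's output DataTableSpec):

// maxCellsInMemory = 0 tells the container to write every row to disc right away
BufferedDataContainer cont = exec.createDataContainer(spec,
        /* initDomain */ true, /* maxCellsInMemory */ 0);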

 

Is it expected that the KNIME user will know to set "Write to disc" and which node to set it on?

 

cheers

Thanks for clarifying. At least we now know the origin of the problem. It's a limitation of how KNIME determines the size of a table (by counting elements, not by looking at the size of the elements).

We are working on improving this but you need to live with it for the time being.

Just came across this old thread and want to close it in case others are reading old posts too: since KNIME 2.8 there is a memory watcher at work that takes care of large tables. If memory gets low, all tables are swapped to disc to avoid memory errors.