Reducing memory load in KNIME - optional data storage in memory (checkpoints)?

Hello all, this is my first post :)

I've been using KNIME recently and have already found myself running out of memory. I'm fairly sure that the key culprit is the fact that every KNIME node stores its data after it has been executed.

 

Is there any way of making this storage of data optional (like checkpoints in Pipeline Pilot)? I can see myself at some stage needing to perform a memory-intensive calculation, such as mapping a few thousand compounds in multi-dimensional space. That would need memory for the pairwise distances of, say, 10000 compounds, i.e. 10000^2 distances to process, so a lot of memory is needed, probably at least a gigabyte. If I then have more nodes after that storing all this data, I can see real problems occurring.
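As a quick sanity check on that number, here is a rough back-of-envelope in Python. It assumes each distance is stored as an 8-byte double, which is just an assumption for illustration; the real per-value overhead in any given tool will differ.

```python
# Rough memory estimate for a pairwise distance matrix of 10000 compounds.
# Assumes one 8-byte double per distance (an assumption, not how any
# particular tool actually stores its tables).
n = 10_000
full_matrix_bytes = n * n * 8                 # every (i, j) pair, including duplicates
upper_triangle_bytes = n * (n - 1) // 2 * 8   # only the unique pairs

print(f"full matrix:    {full_matrix_bytes / 1e9:.1f} GB")    # ~0.8 GB
print(f"upper triangle: {upper_triangle_bytes / 1e9:.2f} GB") # ~0.40 GB
```

So even before any per-node copies, the raw distances alone sit in the high hundreds of megabytes.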

 

Any help would be welcome (and no, I already have enough memory so I'm not installing more).

Ed.

I can understand your point; I did just what you did: pairwise distances on 7000 molecules, and a host of other nodes afterwards. I ended up with my workflow storing 9 gigabytes of data and taking forever to load. It also crashed our in-house workflow server when colleagues tried to download the gigantic workflow.

Simon.

Doesn't the "Write tables to disc" option in the "Memory policy" tab help with this? I thought it serialised the data and cleared it from the heap space until you needed it?

If that doesn't work, then I guess one or more nodes are running out of memory before they can finish executing.

Interesting, I'll keep that in mind.

Still, I think it'd be handy to make the per-node storage of data in memory or on disc optional (like it is in, dare I say, Pipeline Pilot).

Ed.

Actually, the ability to inspect data coming out of every node at any time is a feature of KNIME that many users highly value - which, dare I say, seems to be the reason why some other tools suddenly have additional caching nodes ;-)

However, this discussion is related to two very different concepts:

a) KNIME giving access to the data coming out of all nodes in a workflow, which is not at all limited by available main memory (those tables are cached out to disc if they become too large);

b) KNIME running out of memory if a poor implementation of a particular node does not make proper use of these caching strategies (rsherhod correctly points that out).

It seems you, Ed, really ran into a case of (b). Simon is, of course, also right that when the intermediate data does become very large, quite some space on disc (NOT main memory) will be spent storing it. But that rarely seems to be a problem, and in those rare cases one can often help by using the chunk loop construct to make those intermediate tables smaller. For the (IMHO even rarer) remainder of cases where this is still a problem, we'd like to understand them better so we can address them. KNIME does not instantly load the table but really only does so upon access...
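To illustrate the chunking idea outside of KNIME, here is a minimal plain-Python sketch. The function, file names and chunk size are made up for illustration; this is not how the KNIME chunk loop nodes are implemented, just the general pattern of keeping only one chunk's intermediate data in memory at a time.

```python
# Generic chunked processing: only one chunk of intermediate results lives
# in memory at any time, and finished results go straight to disk.
def process_in_chunks(rows, chunk_size, transform, out_path):
    with open(out_path, "w") as out:
        chunk = []
        for row in rows:
            chunk.append(row)
            if len(chunk) == chunk_size:
                for result in transform(chunk):   # heavy per-chunk work
                    out.write(result + "\n")
                chunk = []                        # free the chunk's memory
        if chunk:                                 # leftover partial chunk
            for result in transform(chunk):
                out.write(result + "\n")

# Example usage with a made-up input file and a trivial transform:
with open("big_input.txt") as f:
    process_in_chunks(f, 10_000,
                      lambda c: [line.strip().upper() for line in c],
                      "out.txt")
```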

Cheers, Michael

 

I have installed the 64 bit version of KNIME and have 8GB of RAM.

I want to create an SDF from InChI names. I used the RDKit node (InChI to RDKit) and the Indigo node (Molecule to Indigo). I set the memory option to "Write tables to disc". This works fine for a small test, but I run out of memory very quickly with a large file. I guess the limit is somewhere around 40000-60000 InChI names; I have a million.

Best regards,
Alex

 

This sounds like a situation where you could use a chunk loop to reduce memory consumption. Have you considered this? If not, please have a look, or post an example workflow here and I'll try to modify it for you.
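For comparison, the same streaming idea written directly against the RDKit Python API might look roughly like the sketch below. This is only an illustration: the file names are placeholders, error handling is minimal, and it assumes an RDKit build with InChI support.

```python
from rdkit import Chem

# Stream a large list of InChIs into an SDF without holding all molecules
# in memory at once. "inchis.txt" (one InChI per line) and "output.sdf"
# are placeholder names.
writer = Chem.SDWriter("output.sdf")
with open("inchis.txt") as inchi_file:
    for i, line in enumerate(inchi_file, start=1):
        mol = Chem.MolFromInchi(line.strip())  # returns None if parsing fails
        if mol is None:
            continue                           # skip unparsable entries
        writer.write(mol)                      # written out to disk immediately
        if i % 100_000 == 0:
            print(f"processed {i} records")
writer.close()
```

In KNIME the chunk loop gives you the same effect at the workflow level, so each iteration only holds a slice of the million rows.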

Best Regards,

Aaron Hart
KNIME.com