Workflow could not be loaded. Java heap space. Can we optimize the workflow load? Lazy loading of node data?

calaba · June 14, 2014, 4:43am

Hello KNIME experts,

After playing with KNIME for a week, with approx. 6 prediction flows (some pre-processing, normalizers, learners, predictors and post-processing), I hit a roadblock after the last training run of one branch of my model: I couldn't load my workflow anymore. KNIME gives me the error "Workflow could not be loaded. Java heap space" (with Xmx=12GB in knime.ini).

So I had to work around this by setting Xmx=16GB to be able to load a workflow which I had created just a couple of minutes before with the 12GB setting. Plus, it now takes an AWFULLY LONG time to load the workflow! Another workaround to make the load faster is to drop some/all of the intermediate results to decrease memory requirements (or to split the workflow into multiple workflows). BUT you do not want to drop too much, e.g. the learners took around 1 day to compute, so you do not want to get rid of them for sure.
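To check which heap limit a JVM actually picked up after a settings change, a small stand-alone check can help. This is a minimal sketch in plain Java, not KNIME API; the class name `HeapCheck` is made up for illustration. In knime.ini the limit is normally written as a line such as `-Xmx12g` after `-vmargs`.

```java
/** Hypothetical helper (not part of KNIME): reports the heap limit the JVM runs with. */
final class HeapCheck {
    /** Maximum heap in MiB, as reported by the JVM itself. */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    /** Heap currently in use in MiB (allocated minus free). */
    static long usedHeapMiB() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("max heap:  " + maxHeapMiB() + " MiB");
        System.out.println("used heap: " + usedHeapMiB() + " MiB");
    }
}
```

Running it with different `-Xmx` values should show the maximum changing accordingly; the reported number can be slightly below the configured value.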

The size of the KNIME workflow on the filesystem is approx. 8.5GB (and I know the data stored there is compressed). It's clear why I am getting the out-of-memory error: it is obviously trying to reload all/some of the nodes, including the already calculated data (the gzipped "intermediate results"), which in my case are not tiny tables.

So far I have had a very pleasant experience with KNIME and its ability to execute nodes on a lot of data and in parallel, so I can utilize my CPU cores efficiently. Thus I was surprised to run into workflow load issues. It seems that there is no lazy loading implemented for the data portion of the nodes?

Maybe this problem is not generic but node class specific. In that case, to provide more details: I was using WEKA 3.7 learners/predictors, some standard KNIME normalizers and also the KNIME MLP learner/predictor.

Is there any chance to implement a "lazy load" logic in KNIME (ideally for all nodes, independently of who provided the code)? What is the point of loading all the data if no one is requesting it? I can understand that the data needs to be loaded for execution. Maybe it is also needed for previewing node output (it is open for discussion whether you need to load everything for a preview). Also, there should be automated logic to discard data that is no longer needed (it seems the GC still thinks the data is needed ...).

It seems some related problems were already reported in the past (I found the ones below), so this issue has already existed for a couple of years. Time to close the gap, huh? :-)

wiswedel · June 16, 2014, 11:54am

Hi calaba,

Thanks for your detailed analysis. I guess what really kills the workflow is the Weka model, which is in-memory.

Generally, KNIME keeps non-table objects in main memory, whereas table objects are held on disk and loaded lazily. Non-table objects are, e.g., PMML models (decision trees, ...), normalization models, color models, Weka objects, ... everything that doesn't have the little triangle data port. Tables are ... tables (triangle port, the standard data type). There is much more memory handling code involved, including swapping to disk at runtime when memory gets low and lazy loading when the workflow is opened (the data is read when it's first accessed). The metadata (column names, types and domain) is always loaded, as it is small.
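The "read when it's first accessed" idea can be illustrated with a few lines of plain Java. This is only a sketch of the general lazy-loading pattern, not KNIME's actual implementation; `LazyValue` and the simulated table are hypothetical names.

```java
import java.util.function.Supplier;

/** Hypothetical holder: loads its payload on first access, then caches it. */
final class LazyValue<T> {
    private final Supplier<T> loader; // stands in for "read from disk"
    private T value;
    private boolean loaded;

    LazyValue(Supplier<T> loader) {
        this.loader = loader;
    }

    /** Triggers the load only on the first call; later calls reuse the result. */
    synchronized T get() {
        if (!loaded) {
            value = loader.get();
            loaded = true;
        }
        return value;
    }

    synchronized boolean isLoaded() {
        return loaded;
    }

    public static void main(String[] args) {
        LazyValue<int[]> table = new LazyValue<>(() -> {
            System.out.println("reading table from disk (simulated)");
            return new int[1_000_000];
        });
        // Opening the "workflow" costs nothing until someone asks for the rows.
        System.out.println("loaded after open: " + table.isLoaded());
        System.out.println("rows: " + table.get().length);
        System.out.println("loaded after access: " + table.isLoaded());
    }
}
```

An object that is kept in memory as a whole, as described above for non-table ports, has no such deferral: its full size is paid as soon as it is restored.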

You say you have trained some Weka predictor. That is certainly kept in memory (in some Weka object). There is nothing we can do about it, as it's external code, except for erroring out earlier. What type of model is it? The model learners you find in "standard" KNIME are usually not that memory heavy and memory is not an issue (except maybe for the random forest ... which can get large depending on the settings ... and which is taken care of separately).

Not all that helpful but maybe clarifying a bit?

PS: As a workaround for the load problem you could just delete the large file from the workflow directory ... it should then load with errors but at least you can recover the rest of the nodes.

Ellert_van_Koperen · June 16, 2014, 3:36pm

Could a workaround be to separate the generation and the usage of the model into 2 workflows, by saving the model explicitly at the end of workflow 1 and loading it again in workflow 2?
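The split suggested here boils down to "persist the trained object once, load it only where it is needed". Below is a rough sketch of that idea using plain Java serialization. It does not use KNIME or Weka classes; `ModelStore` is a hypothetical name and the `double[]` merely stands in for a trained model.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: writes a trained object to a file and reads it back. */
final class ModelStore {
    /** "Workflow 1": persist the model explicitly after training. */
    static void save(Serializable model, Path file) {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            out.writeObject(model);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** "Workflow 2": load only the model, without the training data. */
    static Object load(Path file) {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("model", ".bin");
        save(new double[] {0.25, 0.75}, file);     // stand-in for a trained model
        double[] restored = (double[]) load(file); // later, in a separate run
        System.out.println("restored values: " + restored.length);
        Files.delete(file);
    }
}
```

The model still has to fit into memory when it is loaded, so this does not shrink the model itself; it only keeps the training workflow and its intermediate results out of the workflow that applies the model.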

calaba · June 22, 2014, 3:02am

Yeah, confirmed, it is the Weka nodes (either Learner or Predictor). They are so heavy-duty that using them in KNIME is almost impossible.

Even deleting the big file in the 'internal' directory doesn't help, as it keeps the trained object zipped in the port_1 directory ... and who wants to delete a trained model if its training ran for several hours, right? And loading these objects takes sooooooooooooooo long and needs sooooooooo much memory ...

Yes, I am using Random Forest from Weka, the only Random Forest implementation for KNIME I found. The MLP from KNIME DataMining, for example, is pretty good: reasonable speed and memory management. But this Weka integration sucks, sorry to say that. At first I thought it was great, as it offers a lot of Weka's functions, but I have to say that this way the whole concept is unusable for serious stuff; it only makes sense for fooling around with small models ... :(

wiswedel · June 22, 2014, 1:30pm

There is an RF implementation hidden in KNIME Labs. It's called Tree Ensemble Learner/Predictor. The node description should have the details on how to get the RF setup.

In 2.9 this model is also memory expensive, specifically if you have many values in the target attribute (assuming you want to build a classifier?). This will improve a lot in 2.10.

How large is the data, what are the attributes (all numeric?), how many trees do you want to build?

calaba · June 23, 2014, 1:27am

Thanks, found the Ensemble Learner and am trying without Weka; looks good so far. My RF is 1500 trees, approx. 30 features, binary classifier only.