These things come to mind:
- would it be an option to use a Cache node at the end of some of the transformations, in order to bring all the data together in one place before moving on in the loop and make the temporary storage easier for the system?
- depending on the size of your data, it might also be an option to tell all the preceding nodes in a Metanode to do their processing in memory and then add a Cache node that you force to write to disk
- then there is the option to run the heavy garbage collector *1) in order to reduce ‘clutter’ (I have no idea how this would affect your outcome, just something that came to my mind) - see the sketch right after this list
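To make *1) a bit more concrete, here is a generic sketch of what asking the JVM for a full garbage collection means - nothing KNIME-specific, the class and variable names are just placeholders. In KNIME the two relevant lines (the `System.gc()` call and the memory report) could go into a Java Snippet node if you want to trigger it inside the workflow.

```java
// Standalone sketch: explicitly request a full garbage collection and
// report heap usage before and after. Names are placeholders; in KNIME
// the same two lines could sit in a Java Snippet node.
public class GcSketch {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long usedBefore = rt.totalMemory() - rt.freeMemory();

        // This is only a request to the JVM, and it can pause the
        // workflow while the collection runs.
        System.gc();

        long usedAfter = rt.totalMemory() - rt.freeMemory();
        System.out.printf("Heap used: %d MB -> %d MB%n",
                usedBefore / (1024 * 1024), usedAfter / (1024 * 1024));
    }
}
```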
It could very well be that you would have to resort to a combination of measures. Maybe something like:
- go back to the original internal storage format (zip instead of Parquet)
- do the jobs within the Metanodes in memory
- (force the Metanodes to run one after the other and not in parallel)
- cache to disk at the end of every iteration
- move on
- (try some garbage collection in between - which of course might slow down the system or have some other unintended consequences)
I ran some large data processing jobs on a Windows server, and my impression was that strategic use of Cache nodes before Joins could make the process more stable - although I have no ‘scientific’ evidence for that.
The last thing I can think of is to store the result of every iteration (or every nth iteration) as a Parquet file in the HDFS folder of your local KNIME Big Data environment and then address them all via a Hive external table (so you would hopefully save yourself the loop). Admittedly, I have never actually tried this with a very large number of files on a local machine.
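To illustrate that last idea, here is a minimal sketch of the Hive side, assuming the loop drops one Parquet file per iteration into a single HDFS folder. In KNIME you would more likely issue the DDL through the Big Data / DB nodes; the plain JDBC version below (which needs the Hive JDBC driver on the classpath) is only meant to show the statement itself, and the connection URL, table name, columns and folder path are all placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Sketch: one Hive external table that covers every Parquet file the loop
// has written into a single HDFS folder. URL, table name, columns and
// path are placeholders for illustration only.
public class HiveExternalTableSketch {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:hive2://localhost:10000/default";
        String ddl =
            "CREATE EXTERNAL TABLE IF NOT EXISTS iteration_results (" +
            "  id BIGINT, " +
            "  value DOUBLE" +
            ") STORED AS PARQUET " +
            "LOCATION 'hdfs:///user/knime/iteration_results/'";

        try (Connection con = DriverManager.getConnection(url, "", "");
             Statement stmt = con.createStatement()) {
            // Hive only registers the folder; the Parquet files stay where
            // the loop wrote them, and new files in the folder show up in
            // queries without rebuilding anything.
            stmt.execute(ddl);
        }
    }
}
```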
*1)