I have a workflow that handles hundreds of thousands of rows, each containing long strings. This makes saving and loading very slow and takes up a lot of space on the hard drive. This seems to be a big data problem.
Is there a way to avoid loading the whole dataset into memory, and instead read it from the hard drive only when it is needed? I changed the memory management setting in my nodes to “Keep only small tables in memory”.
I also edited my knime.ini file to keep only 1000 cells in memory, i.e. -Dorg.knime.container.cellsinmemory=1000
I am aware that KNIME is becoming good at handling big data.
Are there special nodes or data formats to restructure and manage my big table of large strings (i.e. something that is free for academic use)?
Which KNIME version are you running? What is your -Xmx setting in knime.ini? What is the source/origin of your data (database, local files, etc.)?
Hi there @zizoo,
have you seen this blog post about optimizing KNIME workflows? It is really good, so take a look. It might help you understand how KNIME works and hopefully optimize it for your needs.
I am using version 3.7.2
and in my knime.ini:
I get the problem when I try to save a big table containing long strings to the hard drive using the PDB Saver node from MOE.
The node starts properly, but after 10% (saving 23 GB of data) it crashes with this error:
ERROR PDB Saver 0:9 Execute failed: java.lang.OutOfMemoryError: Java heap space
@ipazin, Indeed I followed the advice from that link.
Can I suggest you try the PDB Saver node from the Vernalis community contribution:
Also, it might be worth wrapping some of your workflow, including the PDB Saver, in a wrapped metanode (a component if you are on KNIME 4.0) and using streaming execution. See
or search the forum for Streaming.
Basically, this reduces the amount of time spent writing intermediate tables to disc, at the expense of not being able to view their contents until the end of the wrapped metanode. It might also be worth filtering out the PDB column once you have finished with it, using a Column Filter node (i.e. after saving the PDB files, and before the wrapped metanode output).
I think you have a couple of different questions. So let me first answer the question stated in the title:
Basic steps to speed up the execution of your workflow:
- Install KNIME 4.0
- Increase -Xmx in your knime.ini to whatever you have at your disposal (the more the better)
- Make use of the streaming nodes
- Optimize your workflow execution plan
- Read through the blog post linked by @ipazin
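For reference, the relevant knime.ini entries might look something like this (the values are only placeholders: pick -Xmx according to the RAM your machine actually has, and double-check the exact property names against the documentation for your KNIME version):

```ini
# Heap space for the JVM (example value; "the more the better")
-Xmx8g
# Keep only a small number of cells in memory (the setting mentioned above)
-Dorg.knime.container.cellsinmemory=1000
# Optionally disable table compression: larger files on disk, but faster I/O
-Dknime.compress.io=NONE
```

Each option goes on its own line after the -vmargs line in knime.ini.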
What I understand from your explanation is that you have a 23 GB data set and want to process it with KNIME, but you only have 2 GB of RAM (heap space) available (your -Xmx value). Since with 2 GB of heap space KNIME can never keep the whole table in memory (23 GB vs. 2 GB), every node has to read its input from disk and immediately write its output back to disk, causing long execution/saving times.
How to tackle this “problem”:
- Increase your Heap Space, i.e., set -Xmx large enough
- You can disable the compression via the knime.ini. This results in larger files written to your disk, but faster reading/writing. Note that with KNIME 4.0 we have a new default compression which is way faster than the previous default while still achieving decent compression ratios.
- Use streaming nodes, as they don’t write intermediate tables to disk. Note that not everything can be handled in a pure streaming fashion, e.g., joining tables.
- If possible, store the data in a (local) database and pre-process it using the DB nodes
- If possible split your data into smaller chunks, process them individually and concatenate the results at the end
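To illustrate the last point, here is a minimal sketch of the "split into chunks, process individually, concatenate at the end" pattern in plain Python (outside KNIME). The per-chunk uppercase step is a made-up placeholder for the real processing; inside KNIME this pattern corresponds to a Chunk Loop Start / Loop End pair.

```python
import csv
import io

def process_chunk(rows):
    # Placeholder for the real per-chunk work (e.g. transforming long strings).
    return [[cell.upper() for cell in row] for row in rows]

def process_in_chunks(reader, chunk_size=1000):
    """Process rows chunk_size at a time so only one chunk is in memory."""
    results = []
    chunk = []
    for row in reader:
        chunk.append(row)
        if len(chunk) == chunk_size:
            results.extend(process_chunk(chunk))
            chunk = []
    if chunk:  # leftover rows smaller than a full chunk
        results.extend(process_chunk(chunk))
    return results

# Tiny demo with an in-memory "file" instead of a 23 GB table.
data = io.StringIO("a,b\nc,d\ne,f\n")
print(process_in_chunks(csv.reader(data), chunk_size=2))
# → [['A', 'B'], ['C', 'D'], ['E', 'F']]
```

The key property is that only one chunk plus the accumulated results lives in memory at a time; if the results are also large, they could be appended to an output file instead of a list.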
In general, even with your settings the workflow should be executable, though it will require a lot of time due to the heavy I/O and data size. Based on your previous post, however, I assume that you have a node that builds some kind of data structure requiring more heap space than is offered to KNIME, causing the out-of-memory error.
- Increase your heap-space, i.e., set -Xmx large enough
- Check if that node has an out-of-memory option (I’m not talking about the Memory Policy tab here). See, for example, the GroupBy node, which has such an option.
If neither solution works, we’d have to look into the implementation of that node. Could you please tell me the name of that particular node? I strongly suspect that the heap-space exception is caused by one particular node, while the others work perfectly fine.
Looking forward to hearing from you
I think my problem was solved by increasing -Xmx and cutting my data into reasonably sized chunks, then running them through a loop.
I am thinking of moving to Hive, as I read that it deals well with big data. But I am not sure whether I will be limited in case I need to use certain nodes with regular KNIME tables.