I have a workflow that handles hundreds of thousands of rows, each containing long strings. This makes saving and loading very slow and takes up a lot of space on the hard drive. This seems to be a big data problem.
Is there a way to avoid loading the whole dataset into memory, and instead read it from the hard drive only when it is needed? I changed the memory management setting in my nodes to “Keep only small tables in memory”.
I also edited my knime.ini file to keep only 1000 cells in memory, i.e. -Dorg.knime.container.cellsinmemory=1000
I am aware that KNIME is becoming good at handling big data.
Are there special nodes or data formats to restructure and manage my big table of large strings (i.e. something that is free for academic use)?
Which KNIME version are you running? What is your -Xmx setting in knime.ini? What is the source/origin of your data (database, local files, etc.)?
Hi there @zizoo,
have you seen this blog post about optimizing KNIME workflows? It is really good, so take a look. It might help you understand how KNIME works and hopefully optimize it for your needs.
I am using version 3.7.2
and in my knime.ini:
I get the problem when I try to save a big table containing long strings to the hard drive using the PDB Saver node from MOE.
The node starts properly, but after 10% (saving 23 GB of data) it crashes with this error:
ERROR PDB Saver 0:9 Execute failed: java.lang.OutOfMemoryError: Java heap space
@ipazin, Indeed I followed the advice from that link.
Can I suggest you try the PDB Saver node from the Vernalis community contribution:
Also, it might be worth wrapping some of your workflow, including the PDB Saver, in a wrapped metanode (a component if you are on KNIME 4.0) and using streaming execution. See
or search the forum for Streaming.
Basically, this reduces the amount of time spent writing intermediate tables to disc, at the expense of not being able to view their contents until the end of the wrapped metanode. It might also be worth filtering out the PDB column once you have finished with it, using a Column Filter node (i.e. after saving the PDB files, and before the wrapped metanode output).
I think you have a couple of different questions. So let me first answer the question stated in the title:
Basic steps to speed up the execution of your workflow:
- Install KNIME 4.0
- Increase -Xmx in your knime.ini to whatever you have at your disposal (the more the better)
- Make use of the streaming nodes
- Optimize your workflow execution plan
- Read through the blog post linked by @ipazin
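For reference, the relevant knime.ini entries might look something like this (the values are only placeholders: pick -Xmx according to the RAM your machine actually has, and double-check the exact property names against the documentation for your KNIME version):

```ini
# Heap space for the JVM (example value; "the more the better")
-Xmx8g
# Keep only a small number of cells in memory (the setting mentioned above)
-Dorg.knime.container.cellsinmemory=1000
# Optionally disable table compression: larger files on disk, but faster I/O
-Dknime.compress.io=NONE
```

Each option goes on its own line after the -vmargs line in knime.ini.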
What I understand from your explanation is that you have a 23 GB data set and want to process it with KNIME, but you only have 2 GB of RAM (heap space) available (your -Xmx value). Since with 2 GB of heap space KNIME can never keep the whole table in memory (23 GB vs. 2 GB), every node has to read its input from disk and immediately write its output back to disk, causing long execution/saving times.
How to tackle this “problem”:
- Increase your Heap Space, i.e., set -Xmx large enough
- You can disable the compression via the knime.ini. This results in larger files written to your disk, but faster reading/writing. Note that with KNIME 4.0 we have a new default compression which is way faster than the previous default while still achieving decent compression ratios.
- Use streaming nodes, as they don’t write intermediate tables to disk. Note that not everything can be handled in a pure streaming fashion, e.g., joining tables.
- If possible, store the data in a (local) database and pre-process it using the DB nodes
- If possible split your data into smaller chunks, process them individually and concatenate the results at the end
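To illustrate the last point, here is a minimal sketch of the "split into chunks, process individually, concatenate at the end" pattern in plain Python (outside KNIME). The per-chunk uppercase step is a made-up placeholder for the real processing; inside KNIME this pattern corresponds to a Chunk Loop Start / Loop End pair.

```python
import csv
import io

def process_chunk(rows):
    # Placeholder for the real per-chunk work (e.g. transforming long strings).
    return [[cell.upper() for cell in row] for row in rows]

def process_in_chunks(reader, chunk_size=1000):
    """Process rows chunk_size at a time so only one chunk is in memory."""
    results = []
    chunk = []
    for row in reader:
        chunk.append(row)
        if len(chunk) == chunk_size:
            results.extend(process_chunk(chunk))
            chunk = []
    if chunk:  # leftover rows smaller than a full chunk
        results.extend(process_chunk(chunk))
    return results

# Tiny demo with an in-memory "file" instead of a 23 GB table.
data = io.StringIO("a,b\nc,d\ne,f\n")
print(process_in_chunks(csv.reader(data), chunk_size=2))
# → [['A', 'B'], ['C', 'D'], ['E', 'F']]
```

The key property is that only one chunk plus the accumulated results lives in memory at a time; if the results are also large, they could be appended to an output file instead of a list.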
In general, even with your settings the workflow should be executable, though it will require a lot of time due to the heavy I/O and data size. Based on your previous post, however, I assume that you have a node that builds some kind of data structure requiring more heap space than is offered to KNIME, causing the out-of-memory error.
- Increase your heap-space, i.e., set -Xmx large enough
- Check if that node has an out-of-memory option (I’m not talking about the Memory Policy tab here). See, for example, the GroupBy node, which has such an option.
If neither solution works, we’d have to look into the implementation of that node. Could you please tell me the name of that particular node? I strongly suspect that the heap-space exception is caused by one particular node, while the others work perfectly fine.
Looking forward to hearing from you
I think my problem was solved by increasing -Xmx and cutting my data into reasonably sized chunks, then running them through a loop.
I am thinking of moving to Hive, as I read that it deals well with big data. But I am not sure whether I will be limited in case I need to use certain nodes with regular KNIME tables.