Hello, I’m running and testing some workflows using the KNIME Streaming Executor (beta).
What I wonder is: what is the best number of chunks for a given volume of data?
In a workflow with more than 32 million rows it performed very well with 50 chunks, but with lower volumes the workflow ran slower than without streaming.
The description says that for large volumes higher values give better performance, but how much higher? From 50 to 1,000, or from 50 to 100,000?
Any tips on how to use this feature as optimally as possible?
I know the feature is in beta, but it has been available for a few years. What precautions would you take when using it on KNIME Server?
A lower chunk size means more synchronization overhead. A higher chunk size means it takes longer until the second, third, etc. nodes start working. So it depends on the actual nodes you use and how heavy the calculation is. Trial and error. Personally I think 50 is very low; I would try 1,000 and 10,000 and compare.
Streaming is most useful with I/O in combination with simple manipulators like row filters, string manipulation, etc. When you read a large file, the first rows can already be processed while the file continues to be read. On top of that, there is no need to save the state after each node, which saves a lot of time with lots of rows.
In contrast, for nodes that do complex calculations in a multi-threaded manner, streaming will likely reduce performance!
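To make the trade-off concrete, here is a toy sketch in plain Python (not KNIME's actual executor, and the numbers are illustrative only): a "reader" node yields the table in chunks, and a downstream "row filter" node consumes them. A smaller chunk size produces more chunk hand-offs (each one a synchronization point), while a larger chunk size delays the moment the downstream node sees its first rows.

```python
def stream_rows(n_rows, chunk_size):
    """Reader 'node': yields the table in chunks of chunk_size rows."""
    for start in range(0, n_rows, chunk_size):
        yield list(range(start, min(start + chunk_size, n_rows)))

def row_filter(chunks, predicate):
    """Simple manipulator 'node': filters each chunk as it arrives."""
    for chunk in chunks:
        yield [row for row in chunk if predicate(row)]

def run_pipeline(n_rows, chunk_size):
    handoffs = 0            # chunk hand-offs between nodes = sync overhead
    first_output_after = None  # rows read before the filter saw any data
    kept = 0
    for chunk in row_filter(stream_rows(n_rows, chunk_size),
                            lambda r: r % 2 == 0):
        handoffs += 1
        if first_output_after is None:
            first_output_after = min(chunk_size, n_rows)
        kept += len(chunk)
    return handoffs, first_output_after, kept

# 1M rows: chunk size 50 → 20,000 hand-offs, downstream starts after 50 rows;
# chunk size 10,000 → only 100 hand-offs, but downstream waits for 10,000 rows.
print(run_pipeline(1_000_000, 50))      # (20000, 50, 500000)
print(run_pipeline(1_000_000, 10_000))  # (100, 10000, 500000)
```

Both runs keep the same 500,000 rows; only the overhead/latency balance changes, which is why the best value depends on how heavy each node's work is.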
Thanks for the answer, @kienerj.
Cool, I’ll test with higher values. Yes, I noticed that it works better with I/O and simple nodes, as you mentioned, especially when bringing data from the database through DB Reader.
Let us know your results.
Just keep in mind that if it’s possible to do it directly in the database, that will always be faster.
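To illustrate the point, here is a small hedged sketch using Python's built-in sqlite3 as a stand-in for whatever database you read from (table and column names are made up): pushing the filter into the SQL query means only the matching rows ever leave the database, whereas a DB Reader followed by a Row Filter pulls every row across first.

```python
import sqlite3

# Build a throwaway in-memory table with 10,000 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(i, i * 1.5) for i in range(10_000)])

# Option A: filter inside the database -- only the count crosses the wire.
in_db = conn.execute(
    "SELECT COUNT(*) FROM sales WHERE amount > 10000").fetchone()[0]

# Option B: pull all 10,000 rows, then filter client-side
# (roughly what DB Reader + a downstream Row Filter does).
all_rows = conn.execute("SELECT id, amount FROM sales").fetchall()
client_side = sum(1 for _, amount in all_rows if amount > 10000)

assert in_db == client_side  # same answer, very different data volume
print(in_db, len(all_rows))  # 3333 matches vs 10000 rows transferred
```

Same result either way; the difference is the 10,000 rows that never needed to leave the database in option A, which is exactly why in-database processing tends to win.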