I’m running a workflow with a simple join that takes c. 20 minutes on a sample of the actual data I will be using (only 45 seconds in Alteryx), and every node after it also takes c. 20 minutes. The data is c. 90 million rows. How can I reduce the run time? I’m aware this will take hours once I run the full dataset. Is there a way to stop the nodes from caching each time so the workflow runs more quickly?
Then: which version of KNIME are you using? The Joiner node has been redesigned and offers configuration options such as doing the merge in memory (or not), cf. Options / Performance. Another idea could be to use the Cache node right before the join.
Then: actually there is a way to avoid having the data saved with the workflow, although this may not help you that much.
Another thing to try is the columnar storage format, which might also improve performance:
Then: I can offer my collection of articles and links about KNIME and performance. One other culprit in a corporate environment has been an aggressive virus scanner, which one might be able to tame.
Then: in general, when it comes to performance, streaming is an option (operations are sent through a pipeline row by row instead of executing everything node by node). This will not help you much with joins, though, which need the whole dataset.
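To make the streaming idea concrete, here is a minimal plain-Python sketch (not KNIME itself; the function names are illustrative stand-ins for nodes): rows flow through each step one at a time, so no intermediate table is ever materialized or cached between steps.

```python
# Sketch of streaming execution: each generator is a stand-in for a row-wise node.
def read_rows(n):
    # stand-in for a data source node
    for i in range(n):
        yield {"id": i, "value": i * 2}

def filter_rows(rows):
    # stand-in for a row-wise node such as Row Filter
    for row in rows:
        if row["value"] % 4 == 0:
            yield row

def add_column(rows):
    # stand-in for another row-wise node such as Math Formula
    for row in rows:
        row["half"] = row["value"] // 2
        yield row

# Rows pass through all steps one by one; nothing is cached in between.
result = list(add_column(filter_rows(read_rows(10))))
```

A join cannot run this way because it has to see every row of at least one input before it can emit output, which is why streaming helps row-wise nodes but not the join itself.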
Also keep in mind: if your data sources are databases, then do the join using the DB nodes.
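The point of the DB nodes is that the join is pushed down into the database as SQL, so only the joined result crosses the database boundary instead of 90 million raw rows. A minimal sketch of the same idea with Python's built-in sqlite3 (table and column names are made up for illustration):

```python
import sqlite3

# Push the join into the database rather than pulling both tables out first.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
    CREATE TABLE customers (customer_id INTEGER, name TEXT);
    INSERT INTO orders VALUES (1, 10), (2, 20);
    INSERT INTO customers VALUES (10, 'Ada'), (20, 'Bob');
""")
# Only the joined result is transferred to the client.
rows = conn.execute("""
    SELECT o.order_id, c.name
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    ORDER BY o.order_id
""").fetchall()
# rows == [(1, 'Ada'), (2, 'Bob')]
```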
And is the join done on matching data types? I would also check that.
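A type mismatch on the join key (e.g. one side read as string, the other as integer) typically produces no matches rather than an error, which is easy to miss. A small plain-Python sketch of the effect, using a dictionary lookup as a stand-in for a hash join:

```python
# Join keys with mismatched types silently fail to match.
left = [{"id": "10", "x": 1}, {"id": "20", "x": 2}]   # keys read in as strings
right = {10: "Ada", 20: "Bob"}                        # keys are integers

mismatched = [(row["x"], right.get(row["id"])) for row in left]
# -> [(1, None), (2, None)]  because "10" != 10

# Converting the key types to match fixes the join.
fixed = [(row["x"], right.get(int(row["id"]))) for row in left]
# -> [(1, 'Ada'), (2, 'Bob')]
```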