@knightyboy welcome to the KNIME forum
These things come to my mind:
- joins are costly so it makes sense to plan ahead
- KNIME seems to be reworking the join nodes: i cannot find the post right now maybe @ScottF can help
- then it makes sense to try and do one join/step at a time
- i tried to use the cache node before a join on both sides in order to bring the data together and avoid additional internal copying (Execute failed: Cannot read file "knime_container_20200301_1905559478659870280.bin.snappy" - #19 by mlauber71)
- then you could try and force the join node to do everything in memory and see if this is getting any faster
- then if a workflow becomes very large it tends to use more RAM. It might help to split it into several workflows and maybe call them from a parent workflow (no guarantee there - and of course the effort to rework the whole thing) - you could try with you two largest join files to see
- use the garbage collector to free memory
- you mentioned Azure: is this some virtual server? I sometimes had the impression that virtual or shared drives over networks make for poor KNIME performance. Local SSD fare much better. If this is the case caching again might be an option
- you would have to see if caching is better to be done on disk or in memory. Memory might be faster - but there is always a price to pay
- then you could try and change the internal storage from zip to parquet (cf. Execute failed: Cannot read file "knime_container_20200301_1905559478659870280.bin.snappy" - #2 by mlauber71). Although you might be careful with that
Then concerning your data
- make sure you do not join large numbers of missing (or dummy IDs or something) against each other. That would also take more time
- complex strings could also be a problem
- I have thought about but never tried to build a reference file with numbers to my string IDs. Then join the numbers to each file and then join by numbers (that might be more costly than just living with the longer time) but combined with other approaches it might be a measure of last resort
And well if everything else fails. You could try and employ a larger system or a database. Or you could try and see if the SQL server could handle some joins.
You asked for speed but one more experiment could be to use Hive and the local big data environment. In theory Hive should be able to at least finish the job. But that would be more of an issue if your job would not finish at all.
OK since we are at it here a historical collection of entries about performance and KNIME