Big Data join

mlauber71 · May 13, 2021, 10:35am

@alexandruradu welcome to the KNIME forum.

A few remarks:

the local big data environment in my opinion is more for educational purposes than for real production tasks
then you might want to try Hive instead of Spark (Being Lazy is Useful — Lazy Evaluation in Spark | by Lakshay Arora | Analytics Vidhya | Medium) since Spark will just try to load everything into RAM and process it there while Hive has the ability to use the hard drive, that might work better
you might want to try the (new) KNIME Joiner node that might also be suitable for this tasks
You could try and combine this with the (new) columnar storage (1|2) feature along with other performance enhancement measures.

If you could use some sort of loop very much depends on the nature of your data. If the table that has to be joined is significantly smaller that the left a (chunk) loop might be an option. But this depends on the data. What kind of duplicates would you expect to occur and how would you handle them?