Best practices for speeding up nested loops with huge data

Hi colleagues,

I'm working on a workflow with two nested loops. The outer loop iterates over a list of categories; the inner loop iterates over each user who has bought products in that category and calculates the similarity between that single user and the entire list of users who bought products in the category. Here's a screenshot of what I'm doing:

So, for example, if I've got 50 categories and each category has something like 7,000 users, well... the execution time is far too high.

What I have done so far to keep the runtime down:

- I have tried to use database nodes to filter the data whenever possible

- I have added a Parallel Chunk Loop node to parallelize operations

So far these measures have not been enough to execute the workflow in less than 15 hours...


Any suggestions? Thanks in advance.

- Giulio

Hi Giulio,

Just a couple of thoughts:

- Are you filtering out all users who didn't buy a product *before* you enter the second loop?

- The Step_4 metanode sends data to the inner loop that didn't go through the Parallel Chunk Start. Maybe this slows things down a bit?

- What type of operations are performed inside the metanodes? Is it possible to use Streaming execution here? If yes, you could speed things up significantly. To learn more about streaming execution in KNIME, please have a look here:

- A bunch of other tips can be found here:

- If streaming is not possible for the whole workflow, you could try to arrange your workflow so that you have as many streaming-enabled nodes together as possible.
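To illustrate the first point above (filtering before the loop), here is a minimal sketch, e.g. inside a KNIME Python Script node. The table layout and column names (`user_id`, `category`, `n_bought`) are assumptions for illustration, not your actual schema:

```python
import pandas as pd

# Hypothetical purchase table; in the real workflow this would come
# from the database / input port.
purchases = pd.DataFrame({
    "user_id":  [1, 2, 3, 4],
    "category": ["books", "books", "music", "books"],
    "n_bought": [3, 0, 2, 1],
})

# Keep only users who actually bought something in the category at hand,
# BEFORE the inner loop ever sees them.
active = purchases[(purchases["category"] == "books") & (purchases["n_bought"] > 0)]
print(active["user_id"].tolist())
```

The point is simply that every user you drop here is one fewer inner-loop iteration per category, so the filter pays off multiplied by the number of categories.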

Lastly, nesting two loops of that size will always result in a gigantic workload: 50 categories with 7,000 users each means 350,000 loop iterations. So another possibility to consider is whether there is any way to rearrange your workflow to get the desired result with only one loop.
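One common way to collapse the inner per-user loop entirely is to compute all pairwise similarities for a category in a single matrix operation (e.g. in a Python Script node). This is a sketch under the assumption that your similarity is cosine similarity over a user × product purchase matrix; the matrix here is random and scaled down from the 7,000 users in your example:

```python
import numpy as np

# Hypothetical user x product purchase matrix for ONE category
# (1,000 users, 50 products, binary bought/not-bought).
rng = np.random.default_rng(0)
users_x_products = rng.integers(0, 2, size=(1000, 50)).astype(float)

# L2-normalize each user's row...
norms = np.linalg.norm(users_x_products, axis=1, keepdims=True)
norms[norms == 0] = 1.0                 # guard against all-zero rows
normalized = users_x_products / norms

# ...then ONE matrix product yields every pairwise cosine similarity
# for the category at once, replacing the whole inner loop.
similarity = normalized @ normalized.T  # shape (1000, 1000)
```

With this shape, the only remaining loop is the outer one over the 50 categories, and each iteration does one (well-optimized) matrix multiplication instead of thousands of row-by-row comparisons.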

I hope some of that helps! :-)