Parallelized Chunk Loops - Improvement Ideas


A common use case is to run a model on chunks of data (e.g. one model per country / product group), and we often have a large number of such chunks to handle.  The easy implementation is with Group Loops - which split the data into chunks and run the model on each one sequentially.
We have now tried using the Parallel Chunk Loop to parallelize the execution.

That seems to work - but here are a few observations:

- RowIDs are not made unique in the Parallel Chunk Loop End:
I have branches where I add new rows to my dataset, and this results in RowIDs being used multiple times - making the Parallel Chunk Loop End node fail when combining the data.  This can be fixed by using the RowID node and appending the chunk_index to the RowID.  However, it would be nice if the Chunk Loop End node were smart enough to do that by default...
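KNIME loop nodes are configured rather than coded, but the de-duplication the loop end could do by default can be sketched in plain Python. The `make_unique_row_ids` helper below is hypothetical (not part of any KNIME API) and just illustrates appending the chunk index to each RowID before the chunks are concatenated:

```python
def make_unique_row_ids(chunks):
    """chunks: one list of RowIDs per parallel chunk.

    Returns the concatenated RowIDs with the chunk index appended,
    so IDs that repeat across chunks (e.g. after adding rows in a
    branch) no longer collide at the loop end.
    """
    combined = []
    for chunk_index, row_ids in enumerate(chunks):
        combined.extend(f"{rid}#{chunk_index}" for rid in row_ids)
    return combined

# Two chunks that both produced a "Row0" after adding rows:
print(make_unique_row_ids([["Row0", "Row1"], ["Row0"]]))
# → ['Row0#0', 'Row1#0', 'Row0#1']
```

This is essentially what the manual workaround with the RowID node does today; doing it inside the loop end would only need the chunk index that the loop already tracks.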

- Parallelization of R executions:  Is that working?
We often run models in R Snippets - and these can be quite time-consuming.
Using the Parallel Chunk Loop, however, I do not get the expected benefits in this case.  I would have expected that when running R Snippet nodes in the branches, I would see multiple RTerm / R processes being launched.  But I can only see one (on Windows), running on one core.  Hence it feels as if parallelization is not working with the R integration.  Is there a way to improve that?


As an option in the loop end, concatenating the chunk_index would indeed be an interesting addition.  Doing that by default would be less than helpful, however.

In addition to Lorenz's excellent suggestions, I'd like to suggest another improvement:

A major drawback of the current implementation of parallel chunks is that the number of chunks is determined at the loop start. This is problematic if the input table contains a few rows that take significantly longer to process than the others: you often end up in a situation where most chunks are finished and the loop end is only waiting for the last chunk to complete. So once the number of still-running chunks drops below the number of available cores, CPU usage is suboptimal.

A solution to this could be to create many more chunks than there are processors, and to always run only n of those at a time. Whenever one chunk is finished processing, a new one can be started.

There can still be cases where a single chunk is left processing at the end, but I think a higher chunk count together with some thread management would improve the situation.

Please let me know what you think about that.