I have had a look through the forum, and this same issue crops up occasionally, but I have yet to see a solution.
I have a table that I am passing to a Python node to do some ‘stuff’. The ‘stuff’ only runs on a single thread inside the Python node, so I am running the node in parallel using the Parallel Chunk Start and End nodes. For the most part this works fine. However, for some rows the ‘stuff’ may take a long time, and because the parallel nodes pre-chunk the data, rows can end up waiting in, say, chunk 1, which is slowly churning through one slow row, while chunks 2–10 have already finished.
So, rather than pre-chunking the data, is there a way to pass rows from the initial table to the next available parallel chunk? I thought about using variables to pass new RowIDs into a parallel loop, but I couldn’t get it to work…
I am not sure if I understand this correctly. I assume you still want a fixed set of virtual branches (as created by the Parallel Chunk Start node), each of which automatically pulls some data and, as soon as it finishes, pulls new data. Unfortunately, this is not possible. The idea of having a number of “worker” branches that process data independently of each other is a nice one, though.
If the Python code is not too complex, it might be worth the effort to recreate the transformation with native KNIME nodes and then use the Streaming extension.
If you want to stick with Python, it might make sense to look at the multiprocessing library instead of the Parallel Chunk nodes. It creates n processes (each with at least one thread attached) that run your code, and each process picks up the next piece of work as soon as it is free, so a slow row does not hold up the others. My colleague Davin has provided an example of how to set this up here:
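In case the linked example is not handy, here is a minimal, untested sketch of the idea for a plain Python script (not Davin’s exact code). `do_stuff`, the column name `value`, and the pool size are placeholders you would replace with your own logic:

```python
import multiprocessing as mp

import pandas as pd

def do_stuff(row):
    # Placeholder for the per-row work; `row` is an (index, Series)
    # tuple as produced by DataFrame.iterrows()
    idx, data = row
    return idx, data["value"] * 2  # hypothetical column "value"

if __name__ == "__main__":
    df = pd.DataFrame({"value": range(100)})  # stand-in for the input table
    with mp.Pool(processes=4) as pool:
        # imap_unordered hands tasks to whichever worker is free next and
        # yields results as they finish, so fast rows are not stuck behind
        # a slow row in the same pre-assigned chunk
        results = dict(pool.imap_unordered(do_stuff, df.iterrows()))
```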
The other forum user also mentioned joblib, but I haven’t tried that one myself.
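For completeness, the equivalent with joblib would look roughly like this (again untested by me; `slow_or_fast` and the worker count are placeholders). Its default backend also dispatches tasks dynamically to whichever worker is free:

```python
from joblib import Parallel, delayed

def slow_or_fast(value):
    return value * 2  # placeholder for the real per-row work

values = list(range(100))  # stand-in for the rows of the input table
# n_jobs=4 starts four worker processes; tasks are handed out as workers free up
results = Parallel(n_jobs=4)(delayed(slow_or_fast)(v) for v in values)
```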