Streaming API and multi-threading: Tree Ensemble Predictor

I'm creating a workflow where I want to predict multiple classes on a rather large dataset. I thought the streaming API could be very handy here. However, I soon realized there seems to be an issue with at least some nodes, in my case the Tree Ensemble Predictor node.

The issue is that the Tree Ensemble Predictor node runs single-threaded in streaming mode (that's an assumption based on CPU usage), and hence streaming is actually a lot slower than normal execution.

Is this a general issue with the streaming API? I can see it makes some sense, as you don't want one node to take up 100% of resources, but it's also a huge drawback. Ideally the executor would detect on its own which node needs more resources (CPU), but as a simple solution some sort of setting could be provided (allow multi-threading).


Hi,

Yes, completely correct. The streaming executor isn't parallelized yet (although the API allows for it). This is one of the reasons why the Streaming Executor is still in Labs.

The big benefit of the streaming executor comes when you have a long chain of nodes and stream data through them as it goes (this is then "pipeline parallel"). Data parallelization is something that comes on top of that; it's on the roadmap but not currently available.
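
To make the distinction concrete, here is a minimal sketch in plain Python. It is not the KNIME streaming API; the names run_pipeline, run_data_parallel and stage are made up for illustration. In the pipeline-parallel version each "node" gets exactly one thread and rows flow from node to node through queues, so a single expensive node (such as a predictor) still processes its rows one at a time. In the data-parallel version the rows of one node are split across several workers. Note that CPython threads only illustrate the structure here; they do not give real CPU speed-ups for pure-Python work.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

_DONE = object()  # sentinel marking the end of the row stream


def stage(func, inbox, outbox):
    """One hypothetical 'node': apply func to each row as it arrives (single thread)."""
    while True:
        row = inbox.get()
        if row is _DONE:
            outbox.put(_DONE)  # pass the end-of-stream marker downstream
            return
        outbox.put(func(row))


def run_pipeline(rows, funcs):
    """Pipeline parallelism: one thread per stage, rows streamed through bounded queues."""
    queues = [queue.Queue(maxsize=16) for _ in range(len(funcs) + 1)]

    def feed():
        # Source: push rows into the first queue, then signal the end.
        for row in rows:
            queues[0].put(row)
        queues[0].put(_DONE)

    threads = [threading.Thread(target=feed)]
    threads += [
        threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(funcs)
    ]
    for t in threads:
        t.start()

    # Sink: collect rows from the last queue while the stages are still running.
    results = []
    while True:
        item = queues[-1].get()
        if item is _DONE:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


def run_data_parallel(rows, func, workers=4):
    """Data parallelism: the rows of a single stage are spread over several workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, rows))  # map() preserves the input order


if __name__ == "__main__":
    out = run_pipeline(range(5), [lambda x: x + 1, lambda x: x * 2])
    print(out)
    print(run_data_parallel(range(5), lambda x: x * x))
```

With a long chain of cheap nodes the first form keeps every stage busy at once; with one dominant node, as in the question, only the second form would help.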

- Bernd