I've tried to put together a workflow that builds a 'balanced random forest'. I have an imbalanced data set, let's say 80% class A and 20% class B.
I want a random undersampling approach where I take 70% of the class B rows and then randomly sample the same number of class A rows. So if I had 100 class B entries, I would sample 70 of class B and then 70 of class A.
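Outside of KNIME, that undersampling step could be sketched in plain Python like this (a minimal sketch for illustration only; the function and parameter names are made up, and the real workflow does this with sampling nodes):

```python
import random

def balanced_sample(rows, label_of, minority, majority, frac=0.7, seed=None):
    """Take `frac` of the minority class, then an equally sized random
    sample of the majority class (random undersampling)."""
    rng = random.Random(seed)
    minority_rows = [r for r in rows if label_of(r) == minority]
    majority_rows = [r for r in rows if label_of(r) == majority]
    picked_minority = rng.sample(minority_rows, int(frac * len(minority_rows)))
    # sample exactly as many majority rows as minority rows were taken
    picked_majority = rng.sample(majority_rows, len(picked_minority))
    sample = picked_minority + picked_majority
    rng.shuffle(sample)
    return sample

# e.g. 400 class-A rows and 100 class-B rows -> 70 of B plus 70 of A
rows = [("A", i) for i in range(400)] + [("B", i) for i in range(100)]
balanced = balanced_sample(rows, label_of=lambda r: r[0],
                           minority="B", majority="A", frac=0.7, seed=42)
```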
I have a Counting Loop Start set to 100 iterations, then I do my sampling. I use the Tree Ensemble Learner to build one random tree, extract the model, convert it with Cell to PMML, and end the loop with the PMML Ensemble Loop End.
The PMML Ensemble Loop End node is taking a significant amount of time to finish the last iteration (> 20 minutes so far). Is this normal behaviour? The progress dialog states "Executing - Copying input object at port 1".
There are still 2.5 GB of heap space left unused.
Workflow screenshot attached.
On a related note: when processing a loop, the max iterations count starts at 1 while the current iteration starts at 0, so for a loop where maxIterations is 10 the final iteration is 9. This is confusing to some of the users here, but not really a problem.
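In other words, the iteration counter behaves like a zero-based loop index (a tiny Python analogy, not the actual KNIME implementation):

```python
max_iterations = 10
iterations_seen = []
for current_iteration in range(max_iterations):  # 0, 1, ..., 9
    iterations_seen.append(current_iteration)
# The loop body runs 10 times, but the last value of
# current_iteration is 9, not 10.
```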
This is definitely an interesting combination of nodes.
First, the Cell to PMML node is not necessary here. You can collect the outputs of the Tree Ensemble Model Extract directly with a standard Loop End node and then use the Table to PMML Ensemble node to create the PMML model for the predictor.
On the other hand, you configure the Tree Ensemble Learner to output only one tree? Is the output then different from a normal Decision Tree?
Thank you for reporting the other problems; we will try to fix them as soon as possible.
Ah, magic. I clearly wasn't paying enough attention. I've updated the workflow to use the Table to PMML Ensemble node after the loop.
What I'm after is a random forest that uses a balanced bagging technique. So if the sample size per tree in the forest was 70%, instead of taking a random 70% of all rows it should take, for example, 70% of the class A structures (let's say 30 rows) and then take another 30 rows from class B. Build a tree as per the random forest approach (a random descriptor sample at each node), rinse and repeat.
I'm effectively using the Tree Ensemble Learner as a "random tree" builder and not a decision tree builder.
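The balanced bagging loop described above could be sketched like this (a minimal sketch, assuming scikit-learn is available; the function names are made up, samples are drawn without replacement rather than bootstrapped, and the per-split descriptor sampling is delegated to `max_features="sqrt"`):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_balanced_forest(X, y, minority_label, n_trees=100, frac=0.7, seed=0):
    """Train n_trees trees, each on a class-balanced subsample:
    frac of the minority class plus an equal number of majority rows."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    n = int(frac * len(minority_idx))
    trees = []
    for _ in range(n_trees):
        sel = np.concatenate([
            rng.choice(minority_idx, size=n, replace=False),
            rng.choice(majority_idx, size=n, replace=False),
        ])
        # max_features="sqrt" gives the random descriptor sample per split
        tree = DecisionTreeClassifier(max_features="sqrt",
                                      random_state=int(rng.integers(2**31)))
        tree.fit(X[sel], y[sel])
        trees.append(tree)
    return trees

def predict_forest(trees, X):
    """Majority vote over the trees (assumes integer class labels)."""
    votes = np.stack([t.predict(X) for t in trees])
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])
```

Each tree sees a 50/50 class mix even though the full data set is 80/20, which is the balancing effect the loop-based KNIME workflow achieves with its sampling nodes.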
With the 2.8.1 release we implemented some fixes for these nodes. The nodes are now cancelable and report their progress.
We were not able to reproduce the long runtime of the PMML Ensemble Loop End node. I tried it with 1000 models and it still finished in a reasonable time, only minutes.