I've tried to put together a workflow that builds a 'balanced random forest'. I have an imbalanced data set, let's say 80% class A and 20% class B.
I want a random undersampling approach where I take 70% of the class B rows and then randomly sample the same number of class A rows. So if I had 100 class B entries, I would sample 70 of class B and then 70 of class A.
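Outside of KNIME, that undersampling step could be sketched in plain Python like this (a minimal sketch for illustration only; the function and parameter names are made up, and the real workflow does this with sampling nodes):

```python
import random

def balanced_sample(rows, label_of, minority, majority, frac=0.7, seed=None):
    """Take `frac` of the minority class, then an equally sized random
    sample of the majority class (random undersampling)."""
    rng = random.Random(seed)
    minority_rows = [r for r in rows if label_of(r) == minority]
    majority_rows = [r for r in rows if label_of(r) == majority]
    picked_minority = rng.sample(minority_rows, int(frac * len(minority_rows)))
    # sample exactly as many majority rows as minority rows were taken
    picked_majority = rng.sample(majority_rows, len(picked_minority))
    sample = picked_minority + picked_majority
    rng.shuffle(sample)
    return sample

# e.g. 400 class-A rows and 100 class-B rows -> 70 of B plus 70 of A
rows = [("A", i) for i in range(400)] + [("B", i) for i in range(100)]
balanced = balanced_sample(rows, label_of=lambda r: r[0],
                           minority="B", majority="A", frac=0.7, seed=42)
```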
I have a Counting Loop Start set to 100 iterations, then I do my sampling. I use the Tree Ensemble Learner to build one random tree, extract the model, convert it with Cell to PMML, and end the loop with the PMML Ensemble Loop End.
The PMML Ensemble Loop End node is taking a significant amount of time to finish the last iteration (> 20 minutes so far). Is this normal behaviour? The progress dialog states "Executing - Copying input object at port 1".
There are still 2.5 GB of heap space left unused.
Workflow screenshot attached.
On a related note: when processing a loop, the max iterations count starts at 1 while the current iteration starts at 0, so for a loop where maxIterations is 10 the final iteration is 9. This is confusing to some of the users here, but not really a problem.
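In other words, the iteration counter behaves like a zero-based loop index (a tiny Python analogy, not the actual KNIME implementation):

```python
max_iterations = 10
iterations_seen = []
for current_iteration in range(max_iterations):  # 0, 1, ..., 9
    iterations_seen.append(current_iteration)
# The loop body runs 10 times, but the last value of
# current_iteration is 9, not 10.
```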
This is definitely an interesting combination of nodes.
First, the Cell to PMML node is not necessary here. You can collect the outputs of the Tree Ensemble Model Extract directly with a standard Loop End node and then use the Table to PMML Ensemble node to create the PMML model for the predictor.
On the other hand, you configure the Tree Ensemble Learner to output only one tree? Is the output then different from a normal Decision Tree?
Thank you for reporting the other problems; we will try to fix them as soon as possible.
Ah, magic. I clearly wasn't paying enough attention. I've updated the workflow to use the Table to PMML Ensemble node after the loop.
What I'm after is a random forest that uses a balanced bagging technique. So if the sample size per tree in the forest was 70%, instead of taking a random 70% of all rows it should take, for example, 70% of the class A structures (let's say 30 rows) and then take another 30 rows from class B. Build a tree as per the random forest approach (a random descriptor sample at each node), rinse and repeat.
I'm effectively using the Tree Ensemble Learner as a "random tree" builder and not a decision tree builder.
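The balanced bagging loop described above could be sketched like this (a minimal sketch, assuming scikit-learn is available; the function names are made up, samples are drawn without replacement rather than bootstrapped, and the per-split descriptor sampling is delegated to `max_features="sqrt"`):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_balanced_forest(X, y, minority_label, n_trees=100, frac=0.7, seed=0):
    """Train n_trees trees, each on a class-balanced subsample:
    frac of the minority class plus an equal number of majority rows."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    n = int(frac * len(minority_idx))
    trees = []
    for _ in range(n_trees):
        sel = np.concatenate([
            rng.choice(minority_idx, size=n, replace=False),
            rng.choice(majority_idx, size=n, replace=False),
        ])
        # max_features="sqrt" gives the random descriptor sample per split
        tree = DecisionTreeClassifier(max_features="sqrt",
                                      random_state=int(rng.integers(2**31)))
        tree.fit(X[sel], y[sel])
        trees.append(tree)
    return trees

def predict_forest(trees, X):
    """Majority vote over the trees (assumes integer class labels)."""
    votes = np.stack([t.predict(X) for t in trees])
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])
```

Each tree sees a 50/50 class mix even though the full data set is 80/20, which is the balancing effect the loop-based KNIME workflow achieves with its sampling nodes.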
With the 2.8.1 release we implemented some fixes for these nodes. The nodes are now cancelable and report their progress.
We were not able to reproduce the long runtime of the PMML Ensemble Loop End node. I tried it with 1000 models and it still finished in a reasonable time, only minutes.