Random Forests Questions

Hi, I'm new to both machine learning and KNIME. I'm trying to work with the random forest nodes, but I don't quite understand the difference between the Tree Ensemble Learner, the Random Forest Learner, and RandomForest 3.7 (Weka). Why should I use one and not the others?

Also, the dataset I'm working with is extremely unbalanced (94% of the records belong to one class and only 6% to the other). When I run the Random Forest Learner after Equal Size Sampling, I get the following results:

| Row ID | TruePositives | FalsePositives | TrueNegatives | FalseNegatives | Recall | Precision |
|--------|---------------|----------------|---------------|----------------|--------|-----------|
| 1      | 52            | 11             | 75            | 24             | 0.684  | 0.825     |
| 0      | 75            | 24             | 52            | 11             | 0.872  | 0.758     |


When I run the model without equal size sampling, this is what I get:

| Row ID | TruePositives | FalsePositives | TrueNegatives | FalseNegatives | Recall | Precision |
|--------|---------------|----------------|---------------|----------------|--------|-----------|
| 1      | 7             | 8              | 1562          | 73             | 0.087  | 0.467     |
| 0      | 1562          | 73             | 7             | 8              | 0.995  | 0.955     |


So how come the results with equal size sampling are better than the results without it? My understanding was that random forests can deal with unbalanced data.


Appreciate your help, thanks


Hello Sereen,

to answer your first question: the Random Forest Learner provides a simpler interface to the same code as the Tree Ensemble Learner, which has a rather complex dialog with many options for tweaking your final model. KNIME not only provides its own algorithms but also integrates other machine learning packages, for example Weka, which in turn ships its own implementation of the random forest algorithm.

If you are just starting out with KNIME and machine learning in general, we would recommend the Random Forest Learner, because it is less complex than the Tree Ensemble Learner and still provides more options than the Random Forest 3.7 (Weka) node (e.g. you can choose split criteria other than the Gini index).

How a random forest handles unbalanced data depends on the implementation of the algorithm. Our implementation currently does not include any measures to counteract unbalanced classes, and I cannot promise that this will change in the near future. But KNIME provides other means to deal with such data, one of which you already used: equal size sampling. Another option is the SMOTE node, which oversamples the minority class by adding artificial rows.
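If it helps to see the idea outside KNIME, here is a minimal NumPy sketch of what equal size sampling does: every class is undersampled down to the size of the smallest class. The data and function names are illustrative, not anything from KNIME's code.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy unbalanced dataset, roughly like the one in the question:
# 94% of rows in class 0, 6% in class 1.
y = np.array([0] * 940 + [1] * 60)
X = rng.normal(size=(len(y), 3))

def equal_size_sample(X, y, rng):
    """Undersample every class down to the size of the smallest class."""
    classes, counts = np.unique(y, return_counts=True)
    n = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n, replace=False)
        for c in classes
    ])
    rng.shuffle(keep)
    return X[keep], y[keep]

Xb, yb = equal_size_sample(X, y, rng)
print(np.bincount(yb))  # both classes now have 60 rows
```

With the classes balanced, each tree in the forest sees the minority class often enough to learn a meaningful split for it, which is why the recall for class 1 jumps in your first confusion matrix.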




You can also build a balanced forest yourself by setting up a loop that draws a balanced sample, builds a single tree, extracts the PMML model, and adds it to a PMML Ensemble.
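The loop amounts to balanced bagging: each round trains one weak learner on an equal-size sample and the ensemble votes. Here is a self-contained NumPy sketch of that idea on toy data, with one-feature decision stumps standing in for the trees (in KNIME the Decision Tree Learner and PMML Ensemble nodes would do this part):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy unbalanced data: class 1 is rare but separable on the single feature.
n_maj, n_min = 500, 30
X = np.concatenate([rng.normal(0.0, 1.0, size=(n_maj, 1)),
                    rng.normal(3.0, 1.0, size=(n_min, 1))])
y = np.concatenate([np.zeros(n_maj, int), np.ones(n_min, int)])

def fit_stump(X, y):
    """Pick the threshold on feature 0 that maximizes balanced accuracy."""
    best = None
    for t in np.unique(X[:, 0]):
        pred = (X[:, 0] > t).astype(int)
        acc = 0.5 * ((pred[y == 0] == 0).mean() + (pred[y == 1] == 1).mean())
        if best is None or acc > best[0]:
            best = (acc, t)
    return best[1]

# The balanced-forest loop: draw an equal-size sample of both classes,
# fit one weak learner, and collect it into the ensemble.
thresholds = []
for _ in range(25):
    minority = np.flatnonzero(y == 1)
    majority = rng.choice(np.flatnonzero(y == 0),
                          size=len(minority), replace=False)
    idx = np.concatenate([minority, majority])
    thresholds.append(fit_stump(X[idx], y[idx]))

def predict(X):
    """Majority vote over all stumps in the ensemble."""
    votes = np.stack([(X[:, 0] > t).astype(int) for t in thresholds])
    return (votes.mean(axis=0) > 0.5).astype(int)

pred = predict(X)
recall_min = (pred[y == 1] == 1).mean()
print(f"minority recall: {recall_min:.2f}")
```

Because every learner in the ensemble was trained on balanced data, no single draw of the majority class dominates, and the minority class keeps a reasonable recall without throwing away most of the majority rows in any one round.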

You may find some more detailed inspiration here: https://tech.knime.org/forum/knime-general/pmml-ensemble