Random Forests Questions

Hi, I'm new to both machine learning and KNIME. I'm trying to work with the random forest nodes, but I don't quite understand the difference between the Tree Ensemble Learner, the Random Forest Learner, and RandomForest 3.7 (Weka). Why should I use one and not the others?

Also, the dataset I'm working with is extremely unbalanced (94% of the records belong to one class and only 6% to the other). When I run the Random Forest Learner after Equal Size Sampling, I get the following results:

| Row ID | TruePositives | FalsePositives | TrueNegatives | FalseNegatives | Recall | Precision |
|--------|---------------|----------------|---------------|----------------|--------|-----------|
| 1      | 52            | 11             | 75            | 24             | 0.684  | 0.825     |
| 0      | 75            | 24             | 52            | 11             | 0.872  | 0.758     |


When I run the model without equal size sampling, this is what I get:

| Row ID | TruePositives | FalsePositives | TrueNegatives | FalseNegatives | Recall | Precision |
|--------|---------------|----------------|---------------|----------------|--------|-----------|
| 1      | 7             | 8              | 1562          | 73             | 0.087  | 0.467     |
| 0      | 1562          | 73             | 7             | 8              | 0.995  | 0.955     |


So how come the results with equal size sampling are better than the results without it? My understanding was that random forests can deal with unbalanced data.


Appreciate your help, thanks


Hello Sereen,

to answer your first question: the Random Forest Learner provides a simpler interface to the same code as the Tree Ensemble Learner, which has a rather complex dialog with many options for tweaking your final model. KNIME not only provides its own algorithms but also integrates other machine learning packages, for example Weka, which in turn ships its own implementation of the random forest algorithm.

If you are just starting out with KNIME and machine learning in general, we would recommend the Random Forest Learner, because it is less complex than the Tree Ensemble Learner and still provides more options than the Random Forest 3.7 (Weka) node (e.g. you can choose split criteria other than the Gini index).

How a random forest handles unbalanced data depends on the implementation of the algorithm. Our implementation currently does not include any measures to counteract unbalanced classes, and I cannot promise that this will change in the near future. But KNIME provides other means to deal with such data, one of which you already used: equal size sampling. Another option is the SMOTE node, which oversamples the minority class by adding artificial rows.
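If it helps to see the idea outside KNIME, here is a minimal NumPy sketch of what equal size sampling does: every class is undersampled down to the size of the smallest class. The data and function names are illustrative, not anything from KNIME's code.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy unbalanced dataset, roughly like the one in the question:
# 94% of rows in class 0, 6% in class 1.
y = np.array([0] * 940 + [1] * 60)
X = rng.normal(size=(len(y), 3))

def equal_size_sample(X, y, rng):
    """Undersample every class down to the size of the smallest class."""
    classes, counts = np.unique(y, return_counts=True)
    n = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n, replace=False)
        for c in classes
    ])
    rng.shuffle(keep)
    return X[keep], y[keep]

Xb, yb = equal_size_sample(X, y, rng)
print(np.bincount(yb))  # both classes now have 60 rows
```

With the classes balanced, each tree in the forest sees the minority class often enough to learn a meaningful split for it, which is why the recall for class 1 jumps in your first confusion matrix.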




You can also build a balanced forest yourself by setting up a loop that draws a balanced sample, builds a single tree, extracts the PMML model, and adds it to a PMML Ensemble.
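The loop amounts to balanced bagging: each round trains one weak learner on an equal-size sample and the ensemble votes. Here is a self-contained NumPy sketch of that idea on toy data, with one-feature decision stumps standing in for the trees (in KNIME the Decision Tree Learner and PMML Ensemble nodes would do this part):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy unbalanced data: class 1 is rare but separable on the single feature.
n_maj, n_min = 500, 30
X = np.concatenate([rng.normal(0.0, 1.0, size=(n_maj, 1)),
                    rng.normal(3.0, 1.0, size=(n_min, 1))])
y = np.concatenate([np.zeros(n_maj, int), np.ones(n_min, int)])

def fit_stump(X, y):
    """Pick the threshold on feature 0 that maximizes balanced accuracy."""
    best = None
    for t in np.unique(X[:, 0]):
        pred = (X[:, 0] > t).astype(int)
        acc = 0.5 * ((pred[y == 0] == 0).mean() + (pred[y == 1] == 1).mean())
        if best is None or acc > best[0]:
            best = (acc, t)
    return best[1]

# The balanced-forest loop: draw an equal-size sample of both classes,
# fit one weak learner, and collect it into the ensemble.
thresholds = []
for _ in range(25):
    minority = np.flatnonzero(y == 1)
    majority = rng.choice(np.flatnonzero(y == 0),
                          size=len(minority), replace=False)
    idx = np.concatenate([minority, majority])
    thresholds.append(fit_stump(X[idx], y[idx]))

def predict(X):
    """Majority vote over all stumps in the ensemble."""
    votes = np.stack([(X[:, 0] > t).astype(int) for t in thresholds])
    return (votes.mean(axis=0) > 0.5).astype(int)

pred = predict(X)
recall_min = (pred[y == 1] == 1).mean()
print(f"minority recall: {recall_min:.2f}")
```

Because every learner in the ensemble was trained on balanced data, no single draw of the majority class dominates, and the minority class keeps a reasonable recall without throwing away most of the majority rows in any one round.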

You may find some more detailed inspiration here: https://tech.knime.org/forum/knime-general/pmml-ensemble