Random Forests Questions

Hi, I'm new to both machine learning and KNIME. I'm trying to work with the random forest nodes, but I don't quite understand the difference between the Tree Ensemble Learner, the Random Forest Learner, and RandomForest 3.7 (Weka). Why should I use one and not the others?

Also, the dataset I'm working with is extremely unbalanced (94% of the records belong to one class and only 6% to the other). When I run the Random Forest Learner after Equal Size Sampling, I get the following results:

| Row ID | TruePositives | FalsePositives | TrueNegatives | FalseNegatives | Recall | Precision |
|--------|---------------|----------------|---------------|----------------|--------|-----------|
| 1      | 52            | 11             | 75            | 24             | 0.684  | 0.825     |
| 0      | 75            | 24             | 52            | 11             | 0.872  | 0.758     |


When I run the model without equal size sampling, this is what I get:

| Row ID | TruePositives | FalsePositives | TrueNegatives | FalseNegatives | Recall | Precision |
|--------|---------------|----------------|---------------|----------------|--------|-----------|
| 1      | 7             | 8              | 1562          | 73             | 0.087  | 0.467     |
| 0      | 1562          | 73             | 7             | 8              | 0.995  | 0.955     |


So how come the results with equal size sampling are better than the results without it? My understanding was that random forests can deal with unbalanced data.


Appreciate your help, thanks


Hello Sereen,

to answer your first question: the Random Forest Learner provides a simpler interface to the same code as the Tree Ensemble Learner, which has a rather complex dialog with many options for tweaking your final model. KNIME not only provides its own algorithms but also integrates other machine learning packages, for example Weka, which in turn ships its own implementation of the random forest algorithm.

If you are just starting out with KNIME and machine learning in general, we would recommend the Random Forest Learner, because it is less complex than the Tree Ensemble Learner and still provides more options than the Random Forest 3.7 (Weka) node (e.g. you can choose split criteria other than the Gini index).

How a random forest handles unbalanced data depends on the implementation of the algorithm. Our implementation currently does not include any measures to counteract unbalanced classes, and I cannot promise that this will change in the near future. But KNIME provides other means to deal with such data, one of which you already used: equal size sampling. Another option is the SMOTE node, which oversamples the minority class by adding artificial rows.
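If it helps to see the idea outside KNIME, here is a minimal NumPy sketch of what equal size sampling does: every class is undersampled down to the size of the smallest class. The data and function names are illustrative, not anything from KNIME's code.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy unbalanced dataset, roughly like the one in the question:
# 94% of rows in class 0, 6% in class 1.
y = np.array([0] * 940 + [1] * 60)
X = rng.normal(size=(len(y), 3))

def equal_size_sample(X, y, rng):
    """Undersample every class down to the size of the smallest class."""
    classes, counts = np.unique(y, return_counts=True)
    n = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n, replace=False)
        for c in classes
    ])
    rng.shuffle(keep)
    return X[keep], y[keep]

Xb, yb = equal_size_sample(X, y, rng)
print(np.bincount(yb))  # both classes now have 60 rows
```

With the classes balanced, each tree in the forest sees the minority class often enough to learn a meaningful split for it, which is why the recall for class 1 jumps in your first confusion matrix.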




You can also build a balanced forest yourself by setting up a loop that draws a balanced sample, builds a single tree, extracts the PMML model, and adds it to a PMML Ensemble.
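The loop amounts to balanced bagging: each round trains one weak learner on an equal-size sample and the ensemble votes. Here is a self-contained NumPy sketch of that idea on toy data, with one-feature decision stumps standing in for the trees (in KNIME the Decision Tree Learner and PMML Ensemble nodes would do this part):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy unbalanced data: class 1 is rare but separable on the single feature.
n_maj, n_min = 500, 30
X = np.concatenate([rng.normal(0.0, 1.0, size=(n_maj, 1)),
                    rng.normal(3.0, 1.0, size=(n_min, 1))])
y = np.concatenate([np.zeros(n_maj, int), np.ones(n_min, int)])

def fit_stump(X, y):
    """Pick the threshold on feature 0 that maximizes balanced accuracy."""
    best = None
    for t in np.unique(X[:, 0]):
        pred = (X[:, 0] > t).astype(int)
        acc = 0.5 * ((pred[y == 0] == 0).mean() + (pred[y == 1] == 1).mean())
        if best is None or acc > best[0]:
            best = (acc, t)
    return best[1]

# The balanced-forest loop: draw an equal-size sample of both classes,
# fit one weak learner, and collect it into the ensemble.
thresholds = []
for _ in range(25):
    minority = np.flatnonzero(y == 1)
    majority = rng.choice(np.flatnonzero(y == 0),
                          size=len(minority), replace=False)
    idx = np.concatenate([minority, majority])
    thresholds.append(fit_stump(X[idx], y[idx]))

def predict(X):
    """Majority vote over all stumps in the ensemble."""
    votes = np.stack([(X[:, 0] > t).astype(int) for t in thresholds])
    return (votes.mean(axis=0) > 0.5).astype(int)

pred = predict(X)
recall_min = (pred[y == 1] == 1).mean()
print(f"minority recall: {recall_min:.2f}")
```

Because every learner in the ensemble was trained on balanced data, no single draw of the majority class dominates, and the minority class keeps a reasonable recall without throwing away most of the majority rows in any one round.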

You may find some more detailed inspiration here: https://tech.knime.org/forum/knime-general/pmml-ensemble