For some reason, when running a parameter-optimization loop for both a random forest and a single decision tree, the best result for the random forest is significantly worse than that for the decision tree (AUC = 0.506 vs. 0.789). Is it possible that my dataset has so few predictive variables that the random forest's random feature subsampling picks unpredictive variables most of the time?
If you say the random forest is worse, do you mean it has a worse AUC when predicting the validation set? Which parameters are you optimizing for each algorithm? Could you just be overfitting?
For both algorithms I optimised the minimum node size on a validation set, and for the random forest I also optimised the number of trees. The AUCs I mentioned are from a separate test set, using the optimal parameters for each technique.
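The setup described above — tuning hyperparameters on a validation split and reporting AUC on a held-out test set — can be sketched with scikit-learn. This is a hedged sketch, not the original code: the dataset, split sizes, and parameter grids are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real dataset (an assumption for illustration)
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           random_state=0)

# train / validation / test: tune on validation, report AUC on test only
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5,
                                            random_state=0)

def val_auc(model):
    """Fit on the training split and score AUC on the validation split."""
    model.fit(X_tr, y_tr)
    return roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

# Decision tree: tune only min node size (min_samples_leaf)
best_tree = max((DecisionTreeClassifier(min_samples_leaf=m, random_state=0)
                 for m in (1, 5, 10, 25, 50)), key=val_auc)

# Random forest: tune number of trees and min node size
best_rf = max((RandomForestClassifier(n_estimators=n, min_samples_leaf=m,
                                      random_state=0)
               for n in (50, 200) for m in (1, 5, 10)), key=val_auc)

# Final comparison on the untouched test set
print("tree test AUC:  ", roc_auc_score(y_te, best_tree.predict_proba(X_te)[:, 1]))
print("forest test AUC:", roc_auc_score(y_te, best_rf.predict_proba(X_te)[:, 1]))
```

With a sane setup like this, the forest should not land near AUC = 0.5; if it does, that points to a pipeline issue (wrong data file, label leakage into one model only, or a crippling restriction on the forest) rather than the algorithm itself.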
This sounds pretty strange. One guess is that you have some restriction on the random forest, or some overfitting in the decision tree. You should also check that you are really using the same data files in both cases.
Maybe you could try to benchmark your case against this AutoML workflow: you could force H2O to consider only tree-based algorithms and see what the result is.
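Restricting H2O AutoML to tree-based algorithms can be done with the `include_algos` parameter. A minimal sketch, assuming the Python `h2o` client, a running H2O cluster, and a placeholder training file with a binary target column named `label` (both are assumptions, not from the original post):

```python
import h2o
from h2o.automl import H2OAutoML

h2o.init()  # connects to (or starts) a local H2O cluster

# "train.csv" and "label" are hypothetical placeholders for your data
train = h2o.import_file("train.csv")
train["label"] = train["label"].asfactor()  # binary classification target

# Limit the search to tree-based algorithms only
aml = H2OAutoML(max_models=10,
                include_algos=["DRF", "GBM", "XGBoost"],
                seed=1)
aml.train(y="label", training_frame=train)

print(aml.leaderboard)  # compare AUCs of the tree-based models
```

If H2O's random forest (DRF) also lands near AUC = 0.5 on your data while a boosted or single tree does not, that would support a data or setup issue rather than a shortage of predictive variables.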