I am using Random Forest (RF) for classification and I have some questions:
How do I use the Loop nodes to optimize the hyperparameter “number of variables used at each split”?
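For reference, here is what that parameter-optimization loop does, sketched in scikit-learn rather than KNIME (the dataset and candidate grid below are made up for illustration; in KNIME the same idea is a Parameter Optimization Loop around the learner):

```python
# Sketch of a hyperparameter loop over "number of variables per split",
# using scikit-learn as a stand-in. The data and grid are synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# "Number of variables used at each split" is `max_features` in
# scikit-learn (mtry in the R literature).
grid = {"max_features": [2, 4, 6, 8, 10]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=200, random_state=0),
    grid, cv=5, scoring="roc_auc",
)
search.fit(X, y)
print(search.best_params_["max_features"], round(search.best_score_, 3))
```

Each grid value is evaluated by cross-validated AUC, and the best one is kept, which mirrors what the KNIME loop-start/loop-end pair collects.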
How do I obtain the variable importance table/graph? I came across this link (How to get the variable importance from the Random Forest model?), but it’s from 2016, uses the Tree Ensemble Learner instead of the Random Forest, and is a bit confusing to me.
How do I get 95% confidence intervals for the final AUC?
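One common recipe for a 95% CI on AUC is to bootstrap the held-out predictions. A sketch with synthetic scores (in practice you would feed in the class probabilities collected from cross-validation):

```python
# Bootstrap 95% confidence interval for AUC. Scores here are synthetic;
# substitute your own true labels and predicted probabilities.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
y_score = np.clip(y_true * 0.3 + rng.normal(0.5, 0.25, size=500), 0, 1)

aucs = []
for _ in range(2000):
    idx = rng.integers(0, len(y_true), size=len(y_true))
    if len(np.unique(y_true[idx])) < 2:  # a resample needs both classes
        continue
    aucs.append(roc_auc_score(y_true[idx], y_score[idx]))

lo, hi = np.percentile(aucs, [2.5, 97.5])
print(f"AUC {roc_auc_score(y_true, y_score):.3f}, "
      f"95% CI [{lo:.3f}, {hi:.3f}]")
```

The 2.5th and 97.5th percentiles of the resampled AUCs give the interval; other constructions (e.g. DeLong's method) exist, but the percentile bootstrap is the simplest to reproduce.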
The output of Random Forest and Tree Ensemble nodes is identical, as the RF nodes are just a simplified version of the TE nodes. That means you can use the solution in the forum link you posted (How to get the variable importance from the Random Forest model? — I also updated the link to the slides there).
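The variable-importance table the linked thread produces corresponds, conceptually, to the per-feature importance a forest implementation exposes. A minimal scikit-learn sketch of the same idea (synthetic data; feature names are placeholders):

```python
# Variable-importance table: each feature's mean impurity decrease
# across the trees of the forest, sorted descending.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

for rank, i in enumerate(np.argsort(rf.feature_importances_)[::-1], start=1):
    print(f"{rank}. feature_{i}: {rf.feature_importances_[i]:.3f}")
```

The importances are normalized to sum to 1, so the printed column can be read directly as each variable's share of the total impurity reduction.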
Thank you very much, @Alice_Krebs. In the case of high-dimensional data (p>n), do you know of an example that incorporates Feature Selection in the cross-validation process?
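The key point with p > n is that the feature-selection step has to be re-run inside each cross-validation fold; selecting once on the full data and then cross-validating inflates the AUC. A sketch of the correct arrangement in scikit-learn terms (synthetic data; the filter and k are arbitrary choices for illustration):

```python
# Feature selection nested inside cross-validation: the Pipeline refits
# the univariate filter on each training fold only, so the test fold
# never influences which features are kept.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=60, n_features=500, n_informative=5,
                           random_state=0)  # p >> n

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),  # fitted per training fold
    ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
])
scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
print(scores.mean().round(3))
```

In KNIME the analogous structure is putting the feature-selection nodes between the X-Partitioner and the learner, so they only ever see the training partition.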
The example workflow you provided earlier didn’t include the data and I couldn’t run it, so it’s always a bit challenging to give concrete advice, sorry. There are some examples that might help you:
I attach a workflow to predict class A in a binary classification problem using Random Forests. Leaving aside the poor performance, which may be due to errors in the data, I do not understand why the AUC is so poor (<0.15!) while the Scorer shows good metrics from the confusion matrix (accuracy 0.80, etc.). Is there anything incorrectly specified from the X-Aggregator onwards?
Since there isn’t data included in the workflow, it’s a little tricky to diagnose this. But I used one of your previously posted small datasets to investigate.
In short, I think the problem is that you’re not plotting the ROC curve correctly. First you need to check the option to append individual class probabilities in your predictor node. Then, make sure you’re setting the appropriate class column (not prediction) and are plotting the probabilities (and not the confidence, or the predictors themselves). Something like this:
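To illustrate the same point outside KNIME: the ROC curve must be built from the positive-class *probabilities*, not from the 0/1 predictions. A scikit-learn sketch on synthetic data:

```python
# ROC input: class probabilities (right) vs. hard predictions (wrong).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

proba = rf.predict_proba(X_te)[:, 1]  # P(class = 1): the correct ROC input
labels = rf.predict(X_te)             # hard 0/1 labels: the wrong one

print(round(roc_auc_score(y_te, proba), 3))   # smooth, meaningful curve
print(round(roc_auc_score(y_te, labels), 3))  # degenerate two-point curve
```

This is the scikit-learn analogue of ticking “Append individual class probabilities” in the predictor node and then pointing the ROC node at the probability column rather than the prediction column.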
Thank you very much, @ScottF. I was not checking “Append individual class probabilities”; the results now look much more reasonable, and the ROC curve and Scorer nodes are in agreement.