Decision tree variable selection

Hi everybody, I have a question regarding variable selection in a decision tree. I know beforehand that when using the "Decision Tree Learner" the user can modify the parameters and control to some extent the variables that build the tree. However, I want to know why the decision tree discarded or dropped some variables.

For instance, I want to predict Y and the independent variables are X1, X2, X3, X4... The tree that I built selected only the X1 and X4 variables and dropped X2 and X3.

One idea is to make a correlation matrix, but X2 is a string variable, so a standard correlation matrix won't work for it.

I wonder if I can use the workflow from Iris's post "Variable Importance in Prediction (Classification or Regression) Models" to see the importance of each variable.

Thank you 

 

Hi,

in the simplest terms, the reason a decision tree "discards" a variable is that the variable has little relevance to the decision itself. In other words, having or not having that variable makes very little or no difference to the decision to be taken. It is more like "background noise" that is removed so the tree can focus on the variables that are stronger discriminants for the decision.

Take for example the famous Iris data set. If you run it through a KNIME Decision Tree Learner with the standard configuration, you will notice that only Petal.Width and Petal.Length are used in the resulting decision tree. The other two factors, Sepal.Width and Sepal.Length, have been discarded by the algorithm.
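You can reproduce this outside KNIME too. Here is a minimal sketch using Python and scikit-learn (not the KNIME node itself, and with illustrative parameters rather than KNIME's exact defaults) that trains a tree on Iris and lists which features actually appear at split nodes:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

# tree_.feature holds the feature index used at each internal node;
# leaves are marked with a negative value, so we filter those out.
used = {iris.feature_names[i] for i in tree.tree_.feature if i >= 0}
print(used)  # typically only the petal length/width features appear
```

The sepal features never show up in the split nodes, which is the algorithm's way of saying they add nothing once the petal measurements are known.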

If you untick the Reduced Error Pruning option and re-run the Learner, you will see that Sepal.Length now appears deep in the tree, but that doesn't dramatically improve the accuracy of your decision model (0.96 before, 0.98 now). Your decision model was already very good without the Sepal.Length variable, and most of the accuracy still comes from the original Petal.Width and Petal.Length variables.
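As a rough sketch of the same effect in scikit-learn: it has no Reduced Error Pruning option, so cost-complexity pruning (ccp_alpha) stands in as an analogue here, and the exact accuracies will differ from the 0.96/0.98 above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Compare a pruned tree (higher ccp_alpha) against an unpruned one.
for alpha in (0.02, 0.0):
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
    acc = cross_val_score(tree, X, y, cv=5).mean()
    print(f"ccp_alpha={alpha}: mean CV accuracy = {acc:.3f}")
```

The pattern is the same: the unpruned tree grows extra splits on weaker variables for only a marginal accuracy gain.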

As always, there is a trade-off between having to deal with an extra variable and the increase in accuracy you get from it. For a simple model with little data this may not be an issue, but for complex models with very large data sets, being able to "discard" some uninformative variables can make a substantial difference.

Another element to consider is the risk of overfitting, to which pruning is one of the possible answers. While this is a very important topic, it is a bit too technical to cover here.

That said, if you want to evaluate the importance of a variable for your decision, you can remove that variable from the input to the Decision Tree Learner and see what effect this has on the model's accuracy when it is used for prediction.
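This "drop one column and retrain" idea is easy to sketch in Python/scikit-learn; the Iris columns below are just stand-ins for your own X1...X4 and Y:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
baseline = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean()

# Retrain without each column in turn and measure the accuracy drop.
for col in range(X.shape[1]):
    X_drop = np.delete(X, col, axis=1)
    acc = cross_val_score(DecisionTreeClassifier(random_state=0), X_drop, y, cv=5).mean()
    print(f"without column {col}: accuracy drop = {baseline - acc:.3f}")
```

A large drop means the model relied heavily on that variable; a drop near zero means the variable was doing little work.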

The example from Iris (the KNIME forum user, not the data set) you refer to uses a slightly different approach to gain the same understanding of how important a variable is to a decision. Instead of removing a variable, you re-shuffle it: you apply random permutations within the column that contains that variable, so as to "break" the link between that variable and the decision outcome.

In the case of the Iris data set, you can try re-shuffling the Petal.Width column, leaving the rest unchanged, and see what happens to the accuracy of the decision. Then you do the same with the Petal.Length column, and so on. The variable with the largest (negative) impact on accuracy is the most important one, the second largest the second most important, and so on.
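scikit-learn has this re-shuffling technique built in as permutation importance, which automates exactly the manual procedure described above: each column is permuted several times and the mean accuracy drop is reported.

```python
from sklearn.datasets import load_iris
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(iris.data, iris.target, random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Shuffle each column 20 times on the held-out data and average the accuracy drop.
result = permutation_importance(tree, X_te, y_te, n_repeats=20, random_state=0)
for name, imp in zip(iris.feature_names, result.importances_mean):
    print(f"{name}: mean accuracy drop = {imp:.3f}")
```

You should see the petal features dominate, matching what the tree's own variable selection already told us.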

Sorry for the lengthy answer to a simple question, I hope it helps!

Cheers,
Marco.

 

Thank you, Marco, for the great explanation.