I would like to use the Ensemble Tree node for variable selection. May I know the difference between the following two options?
"Use same set of attributes for each tree": the attributes are sampled once for each tree, and this sample is then used to construct the tree.
"Use different set of attributes for each tree node": a different set of candidate attributes is sampled in each tree node, and the optimal one among them is chosen to perform the split.
What are the implications of these two options? Thanks.
Maybe a little example can illustrate the difference.
Let’s assume we want to decide whether the weather is suitable for playing tennis, and we have three variables: temperature, sunny, and windy.
The first option draws one sample from those (e.g. [temperature, windy]) and uses only this sample to build the whole decision tree, ignoring all other variables.
The second option draws such a sample for each split inside of an individual tree, so the first split may be calculated using temperature and windy, while the second split may be calculated using sunny and windy.
This technique is used in random forests to increase the diversity of the individual trees.
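To make the difference concrete, here is a minimal Python sketch of the two sampling strategies. It is only an illustration of the idea, not the node's actual implementation: the function names, the subset size of 2, and the number of splits are all assumptions made up for this example.

```python
import random

# The three variables from the tennis example.
FEATURES = ["temperature", "sunny", "windy"]


def sample_per_tree(features, k, n_splits, rng):
    """Option 1 (hypothetical helper): draw one subset for the whole tree
    and reuse it as the candidate set at every split."""
    subset = sorted(rng.sample(features, k))
    return [subset for _ in range(n_splits)]


def sample_per_node(features, k, n_splits, rng):
    """Option 2 (hypothetical helper): draw a fresh candidate subset
    at every split."""
    return [sorted(rng.sample(features, k)) for _ in range(n_splits)]


rng = random.Random(0)  # fixed seed so repeated runs give the same draws
per_tree = sample_per_tree(FEATURES, k=2, n_splits=3, rng=rng)
per_node = sample_per_node(FEATURES, k=2, n_splits=3, rng=rng)

print("Option 1, candidates at each split:", per_tree)
print("Option 2, candidates at each split:", per_node)
```

With Option 1 every split of a given tree sees the same candidates, so a variable left out of the sample can never appear in that tree. With Option 2 the candidates can change from split to split, so every variable has a chance to be used somewhere in the tree.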
Thanks for the clarification.
However, I am still not clear about the prediction power of these two methods, i.e. which one selects the best variables. Are there any rules for picking Option 1 or Option 2, or do I need to try both to see which one gives better results?
Option 2 usually gives the best results in terms of prediction power (e.g. accuracy) because it results in more diverse trees. In a way this is the secret ingredient that makes random forests work so well.
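If you want to check this on your own data, trying both is cheap. Below is a rough sketch using scikit-learn rather than the node itself, so treat the mapping as an assumption: BaggingClassifier with max_features draws one feature subset per tree (similar to Option 1), while RandomForestClassifier with max_features draws candidates at every split (similar to Option 2). The synthetic dataset and all parameter values are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic placeholder data; replace with your own table.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0
)

# Roughly Option 1: each tree is trained on one fixed random half of the features.
per_tree_model = BaggingClassifier(
    DecisionTreeClassifier(random_state=0),
    n_estimators=50,
    max_features=0.5,
    random_state=0,
)

# Roughly Option 2: a random half of the features is considered at every split.
per_node_model = RandomForestClassifier(
    n_estimators=50, max_features=0.5, random_state=0
)

# 5-fold cross-validated accuracy for each variant.
per_tree_acc = cross_val_score(per_tree_model, X, y, cv=5).mean()
per_node_acc = cross_val_score(per_node_model, X, y, cv=5).mean()

print(f"one sample per tree: {per_tree_acc:.3f}")
print(f"one sample per node: {per_node_acc:.3f}")
```

Which variant comes out ahead will depend on the data, so a comparison like this only tells you something about the dataset you run it on.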
Many thanks for your explanation.
Does Option 2 have any drawbacks? Does it require more resources?
Are there any advantages to using Option 1? Is it good for small sample sizes?
Thanks in advance.
No, Option 2 has no drawbacks, or at least none that are noticeable.
As for advantages of Option 1: not that I am aware of.