Random Forest column split and candidate counts

Hi,

In a trained Random Forest, the Attribute Statistics include #splits (level 0), #splits (level 1), #splits (level 2), #candidates (level 0), #candidates (level 1), #candidates (level 2).

It is easy to understand #splits (level 0) and #candidates (level 0).

Does anybody know how the following numbers are determined/calculated?
#splits (level 1) and #candidates (level 1)
#splits (level 2) and #candidates (level 2)

Thank you!

Hi,
Conceptually, those numbers are calculated in the same way as those for the root split, just for the second and third levels of splits in the tree.
The candidate count indicates how often the attribute was in the attribute sample drawn to find the split, and the split count is the number of times the attribute actually won the split.
Note that the numbers roughly double with each level, because the second level contains up to two splits and the third level up to four.
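
To make that concrete, here is a rough sketch of the bookkeeping (this is not KNIME's actual implementation; the attribute names and the node list are invented for illustration):

```python
from collections import defaultdict

# Hypothetical record of every split node in every tree of the forest:
# (level, attribute sample drawn at that node, attribute that won the split)
nodes = [
    (0, ["price", "age", "region"], "price"),
    (1, ["price", "income"], "income"),
    (1, ["age", "region"], "age"),
    (2, ["price", "age"], "price"),
    # ... one entry per split node of every tree
]

candidates = defaultdict(lambda: defaultdict(int))  # attribute -> level -> count
wins = defaultdict(lambda: defaultdict(int))        # attribute -> level -> count

for level, sample, winner in nodes:
    for attr in sample:
        candidates[attr][level] += 1   # contributes to "#candidates (level k)"
    wins[winner][level] += 1           # contributes to "#splits (level k)"

print(candidates["price"][0], wins["price"][0])
```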

Cheers, nemad


Hi Nemad,

A follow-up to the same question. With the Random Forest Learner node, we can derive variable importance by comparing the # of times an attribute "was a candidate" with how many times it "won the split", but we can't see where the split happens.

For example, if a continuous variable "price" is, at level 0, the most important attribute, how can I know at which price the split is happening (if at all)? I know Random Forests are ensembles of trees, so even if price "wins" 10/10 splits, odds are it "wins" those splits at different price points; are these split points also averaged?

Thanks a lot in advance.

Best,
Joel

Hello @JoelMenendez,

I don't see how the split point would affect the variable importance, but if you want to see the split points, you can open the node's view, where the splits are displayed.
You can also extract the individual decision trees as PMML, which is a special kind of XML, and process it further to get this kind of information.
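
For example, a minimal sketch of pulling every split point on one attribute out of an exported PMML tree with Python's standard library could look like this (the file name and field name are placeholders):

```python
import xml.etree.ElementTree as ET

PMML_FILE = "tree_0.pmml"  # hypothetical file exported from the learner node
FIELD = "price"            # attribute whose split points we want

tree = ET.parse(PMML_FILE)

# PMML versions use different namespaces, so match on the local tag name and
# collect the threshold of every numeric predicate on FIELD.
split_points = [
    float(pred.get("value"))
    for pred in tree.iter()
    if pred.tag.endswith("SimplePredicate")
    and pred.get("field") == FIELD
    and pred.get("operator") in ("lessThan", "lessOrEqual",
                                 "greaterThan", "greaterOrEqual")
]

print(sorted(split_points))
```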

Best,
Adrian

Hi again. I know that the split point is not related to the variable weight; it was really two questions in one: which variable carries the biggest weight in the RF model, and where is that split (if the notion even applies to Random Forests; I don't mean the split points of each individual tree, but whether those split points average out to an "average split point")?

As I mentioned above, if I want to know, for example, the price point above which the probability of a "no buy" result is higher than at lower price points, is that something you can get from a Random Forest? Thanks again!

There might be cases where the same split point is used over and over, e.g. if you only have very few values that are far apart, but in general that seems unlikely.
If you are interested in this kind of rule, then you are probably better off with a single decision tree, because the way random forests are built (row and attribute sampling) makes them hard to interpret in this way.
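
To illustrate how much those split points can spread, here is a small sketch outside of KNIME, using scikit-learn as a stand-in for the same idea (the data and feature index are made up):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data standing in for a "price -> buy / no buy" problem.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
feature_idx = 0  # pretend column 0 is "price"

rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Collect every threshold at which some tree in the forest split on that feature.
thresholds = np.concatenate([
    est.tree_.threshold[est.tree_.feature == feature_idx]
    for est in rf.estimators_
])

print(f"{len(thresholds)} splits on feature {feature_idx}, "
      f"ranging from {thresholds.min():.2f} to {thresholds.max():.2f}")
```

The spread of those thresholds is usually wide enough that a single "average split point" would not describe any individual tree well, which is why a single decision tree is the easier model to read rules from.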