Tree Ensemble Learner - variable importance?

In the Tree Ensemble Learner nodes, is there a good way to get variable importance?

I've used permutation tests to do this in the past...

I'm not clear on what the attribute stats mean: levels 0, 1 and 2? I set the max tree depth to 10...



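You want to look at the attribute statistics table. It tells you how often an individual attribute was used for a split at the different levels of the trees. Only the top three levels (0, 1 and 2) are reported, which is why you don't see anything deeper even with a max depth of 10. See this output table for the "Spam" dataset.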

There are 57 attributes. The table is sorted by "#splits (level 0)". The attributes 'char_freq_$' and 'char_freq_!' are the ones most often used in the root node of the respective trees. There are a total of 25k trees learned here (I overdid it, yes!) with classic/default random forest settings: a different attribute set for each tree node, and the square root of the number of attributes to pick from. That means about 57^(1/2) = ~8 attributes to choose from at each tree node. A given attribute therefore has a chance of sqrt(57)/57 = 1/sqrt(57), roughly 1 in 8, of being in any one candidate set, so each of the 57 attributes was in the attribute set for the root node about 25k/8 = ~3100 times on average. The actual count is in the column "#candidates (level 0)".
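If you want to redo that back-of-the-envelope estimate for your own data, here is a small Python sketch. It only reproduces the arithmetic above. The variable names are mine, and how the node actually rounds the square root is an assumption I have not checked, so treat both estimates as rough.

```python
import math

n_attributes = 57      # attributes in the Spam dataset
n_trees = 25_000       # trees in the ensemble

# Square-root rule: number of candidate attributes drawn per tree node.
m_exact = math.sqrt(n_attributes)     # about 7.55
m_rounded = round(m_exact)            # 8, the figure used in the post

# Chance that one given attribute is in a node's candidate set: m / p = 1 / sqrt(p).
p_in_set = m_exact / n_attributes

# Expected number of root-node candidate sets containing a given attribute.
expected_exact = n_trees * p_in_set   # about 3311 (unrounded square root)
expected_rough = n_trees / m_rounded  # 3125, the "~3100" quoted above

print(f"candidates per node: ~{m_rounded}")
print(f"expected '#candidates (level 0)': {expected_rough:.0f} to {expected_exact:.0f}")
```

Either way you land in the low 3000s, which is the range the observed counts in "#candidates (level 0)" should fall into.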

The attribute 'char_freq_$' was in this candidate set 3049 times, and in 2984 of those cases it was picked as the best ... so it is quite discriminative. Btw, the number of times an attribute is in the candidate set for levels 1, 2, ... (at most) doubles with each level for binary splits, because each level has at most twice as many nodes as the one above it. So "#candidates (level 1)" is on average 2 * 25k/8 = ~6200.
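One way to read the two columns together is the share of candidate appearances that ended in a split. The sketch below is plain Python using the numbers quoted in this thread; the function names are made up for illustration and are not part of KNIME.

```python
def split_ratio(n_splits, n_candidates):
    """Share of candidate appearances in which the attribute won the split."""
    return n_splits / n_candidates if n_candidates else 0.0


def max_expected_candidates(level, root_expected):
    """Upper bound on expected candidate appearances at a tree level.

    With binary splits a level has at most 2**level nodes per tree,
    so the root-level expectation at most doubles with each level.
    """
    return (2 ** level) * root_expected


# 'char_freq_$' at level 0: 2984 splits out of 3049 candidate appearances.
ratio = split_ratio(2984, 3049)
print(f"char_freq_$ wins {ratio:.1%} of the root nodes where it is a candidate")

# Rough upper bounds for levels 0, 1 and 2, starting from 25k/8 = 3125.
for level in range(3):
    print(level, max_expected_candidates(level, 25_000 / 8))
```

A ratio close to 1 means the attribute almost always beats the other candidates it is drawn with.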

So the answer to the question of how to get the discriminative attributes is: sort by "#splits (level 0)" and take the top n. Be aware that attributes might not be independent. A linear correlation filter node could help here!
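If you export the attribute statistics table and want to do the "sort and take the top n" step outside the workflow, a minimal Python sketch could look like this. Only the 'char_freq_$' row uses counts from this thread. The two placeholder rows and their counts are invented purely so the example runs, and the helper name is my own.

```python
def top_attributes(rows, n, column="#splits (level 0)"):
    """Return the names of the n attributes with the highest value in `column`."""
    ranked = sorted(rows, key=lambda row: row[column], reverse=True)
    return [row["attribute"] for row in ranked[:n]]


rows = [
    # Counts quoted in the thread:
    {"attribute": "char_freq_$", "#splits (level 0)": 2984, "#candidates (level 0)": 3049},
    # Invented placeholder rows, NOT real Spam results:
    {"attribute": "placeholder_b", "#splits (level 0)": 15, "#candidates (level 0)": 3080},
    {"attribute": "placeholder_a", "#splits (level 0)": 1200, "#candidates (level 0)": 3100},
]

print(top_attributes(rows, 2))
```

The `column` argument lets you rank by a different column of the same table if you prefer. Correlated attributes will still show up next to each other in such a ranking, so filtering them beforehand remains a good idea.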

Hope that helps,

Thanks, that's very helpful.



Hi all,

Is there a way to obtain the entire list of variables used in the trees, and not just those in the first 3 levels?

Thank you,