# How to get the variable importance from the Random Forest model?

How can I get the variable importance from the Random Forest algorithm? Being able to calculate variable importance is a key merit of Random Forest, but this critical function seems to be missing in KNIME.

Hi,

The KNIME Tree Ensemble Learner node offers a second outport that gives you details about variable importance. There you can see how often a variable was used as a split when building a decision tree at the first, second, or third level. As a measure of variable importance, divide the number of splits by the number of candidates at each level and sum the three ratios.
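As a rough sketch of that calculation (not KNIME's own code): for each attribute, divide the splits by the candidates at each of the first three levels and add up the ratios. The attribute names and counts below are invented for illustration, and the statistics table in your workflow may label its columns differently.

```python
# Hypothetical per-attribute statistics: (splits, candidates) at tree levels 0, 1 and 2.
# All names and numbers are invented for illustration only.
stats = {
    "age":    [(40, 50), (30, 60), (20, 80)],
    "income": [(10, 50), (15, 60), (20, 80)],
    "zip":    [(0, 50), (5, 50), (10, 100)],
}


def importance(levels):
    """Sum splits/candidates over the levels; a level with no candidates counts as 0."""
    return sum(s / c if c else 0.0 for s, c in levels)


scores = {name: importance(levels) for name, levels in stats.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(f"{name}: {scores[name]:.3f}")
```

An attribute that is chosen nearly every time it is offered as a candidate gets a ratio close to 1 at that level, so the summed score ranges from 0 to 3.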

You can also take a look at the white paper Seven Techniques for Data Dimensionality Reduction (https://www.knime.org/white-papers) where this technique is explained as well.

Best, Iris

Hi,

I am in the same situation regarding this issue. I tried to extract the most valuable variables from the "Random Forest Learner" node, but there seems to be no trivial way to do so. As suggested by Iris, I tried the "Tree ensemble Model Extract" and "Tree ensemble statistics" nodes, but neither of them shows a list of variables ranked by their weight in the model.

I would appreciate any help,

Cheers!

Sergi

Hi Sergi,

I believe Iris explained it properly above. The importance of the variables can be derived directly from the Tree Ensemble Learner node. There is no need to use other nodes like the ones you mention.

The attached workflow should help clarify how this can be done.

Cheers,
Marco.

Hi,

I guess you are referring to the variable importance measure suggested by the authors of the Random Forest algorithm, and you are right: the Random Forest implementation in the KNIME AP does not currently support this feature.

But one of the beautiful sides of the KNIME AP is that you can quite easily build a workflow that does the same. At this year's KNIME Summit, Dean Abbott gave a great talk about how to do exactly that using a KNIME workflow.

The randomization he speaks of is very similar to what the authors of the Random Forest use in their variable importance measure.
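To make the idea concrete, here is a minimal sketch of a permutation-based importance in plain Python. It is not the workflow from the talk and not the exact Random Forest procedure (which, as I understand it, permutes values in each tree's out-of-bag samples); it uses a single data set and a stand-in `predict` function, both of which are assumptions made for the example.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(predict, rows, labels, n_repeats=10, seed=0):
    """Average drop in accuracy after shuffling one feature column at a time."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops = []
    for j in range(len(rows[0])):
        total = 0.0
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between feature j and the target
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            total += baseline - accuracy(predict, shuffled, labels)
        drops.append(total / n_repeats)
    return drops


# Toy data: the label depends only on the first feature.
data_rng = random.Random(42)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
labels = [r[0] > 0.5 for r in rows]


def predict(row):
    """Stand-in for a trained model; it only looks at the first feature."""
    return row[0] > 0.5


importances = permutation_importance(predict, rows, labels)
print(importances)
```

Shuffling the first column destroys the model's accuracy, while shuffling the second changes nothing, so only the first feature receives a non-zero importance.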

Cheers,

Great Marco!
Please share this workflow on the KNIME Hub! We need it there!
Cheers
Paolo

Saw that the link is gone, these are the slides that nemad is talking about:

Hoping there is an update. Is there a node that displays Mean Decrease Gini to understand variable importance?
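For anyone unfamiliar with the term: Mean Decrease Gini is usually described as the Gini impurity reduction of every split made on a variable, accumulated per tree and averaged over the forest. Below is a minimal sketch of the per-split quantity only, with made-up labels; it is not tied to any KNIME node, and implementations differ in how they weight and aggregate the splits.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of the two children."""
    n = len(parent)
    return gini(parent) - len(left) / n * gini(left) - len(right) / n * gini(right)


# Made-up example: a split that separates two classes fairly well.
parent = ["a"] * 5 + ["b"] * 5
left = ["a"] * 4 + ["b"]
right = ["a"] + ["b"] * 4

decrease = gini_decrease(parent, left, right)
print(round(decrease, 4))
```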

Hi,
A question: if some of the input variables are heavily correlated, won't the calculated importance values be misleading? As an example, imagine two variables that are fully correlated (correlation of “1”); the importance of each of them will then be halved. If so, wouldn't it make sense to correct for this by taking the correlation into account?
Thanks
Paco

You should filter correlated variables before creating the model.
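A minimal sketch of such a filter in plain Python, assuming a simple greedy rule: keep a column only if its absolute Pearson correlation with every column kept so far is below a threshold. The column names, the values, and the 0.9 threshold are all made up for the example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def drop_correlated(columns, threshold=0.9):
    """Greedily keep columns that are not strongly correlated with any kept column."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Made-up data: height_in is (almost exactly) a rescaled copy of height_cm.
columns = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "shoe_size": [40, 38, 44, 41, 39],
}

kept = drop_correlated(columns)
print(kept)
```

Which member of a correlated pair survives depends on column order here, so the result is only one of several equally valid choices.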

Unfortunately, I do not want to lose information.

If the correlation is 1, you are losing 0 information. Anyway, correlated features mess with decision trees as well as other algorithms, so it’s good practice, if not a requirement, to remove them before building a model.

1. The objective is to understand the variable importance ranking independently of the correlation.
2. It is possible to have quite correlated inputs with different effects on the output, so it is better not to remove them.
3. Trees and Random Forests are very insensitive to correlated inputs compared with other methods.

Correlated features inherently mess with the importance, as you pointed out. Either you remove them or you find a way to correct for that (not sure whether that is even possible, due to the inherent randomness of RF).

3. Trees and Random Forests are very insensitive to correlated inputs compared with other methods.

Compared to other methods, yes. But it depends on how many correlated features there are and how strongly they are correlated. If you have too many, the splits will happen too often on essentially the same “data”, which adds no value and reduces the effect of other, possibly more important features. It depends on your goal too. For prediction it is better to remove as many as possible, for the sake of speed and model simplicity (= less overfitting).

If you are into interpretability, look into the SHAP nodes, or SHAP in general. It is certainly a better way than feature importance, IMHO.

Hi!

Any update on this?

Best regards.

Sometimes it helps to talk about the specifics. For example, if I include variables that are derivative measurements of another variable, including them seems to exaggerate the importance, as they are just derivative measurements. You have to be careful that the variables aren’t so correlated that they are both really influenced by the same thing.

Unfortunately, I have not seen anything new that helps