How to get the variable importance from the Random Forest model?

How to get the variable importance from the Random Forest algorithm? Being able to calculate variable importance is one of the merits of Random Forests, but this critical function seems to be missing in KNIME.


Hi,

the KNIME Tree Ensemble node offers a second outport which gives you details about the variable importance. Here you can see how often a variable was used for building a decision tree at the first, second, or third level. As a measure of variable importance, divide the number of splits by the number of split candidates at each level and sum the three ratios.
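A minimal sketch of that computation in Python. The attribute names and the split/candidate counts below are made up for illustration; in practice they would come from the learner's attribute-statistics output table:

```python
# Hypothetical attribute-statistics rows, mimicking the Tree Ensemble
# Learner's second outport: split and candidate counts at levels 0-2.
stats = {
    "age":    {"splits": [40, 25, 10], "candidates": [60, 55, 50]},
    "income": {"splits": [10, 15, 20], "candidates": [58, 52, 48]},
}

def importance(row):
    # Divide #splits by #candidates at each level and sum the three ratios.
    return sum(s / c for s, c in zip(row["splits"], row["candidates"]))

# Rank attributes by the summed ratio, highest first.
ranking = sorted(stats, key=lambda a: importance(stats[a]), reverse=True)
for attr in ranking:
    print(attr, round(importance(stats[attr]), 3))
```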

You can also take a look at the white paper Seven Techniques for Data Dimensionality Reduction (https://www.knime.org/white-papers) where this technique is explained as well.

Best, Iris


Hi,

Regarding this issue I am in the same situation. I tried to extract the most valuable variables from the "Random Forest Learner" node, but there seems to be no trivial way to do so. I tried to use, as suggested by Iris, the "Tree Ensemble Model Extract" node and the "Tree Ensemble Statistics" node, but neither of them shows any list of variables ranked by their weight in the model.

I would appreciate any help,

Cheers!

Sergi
 

Hi Sergi,

I believe Iris explained it properly above. The importance of the variables can be derived directly from the Tree Ensemble Learner node. There is no need to use other nodes like the ones you mention.

The attached workflow should help clarify how this can be done.

Cheers,
Marco.


Hi,

I guess you are referring to the variable importance measure suggested by the authors of the Random Forest algorithm, and you are right: the Random Forest implementation in the KNIME AP does not currently support this feature.

But one of the beautiful sides of the KNIME AP is that you can quite easily build a workflow that does the same. At this year's KNIME Summit, Dean Abbott gave a great talk about how to do exactly that using a KNIME workflow.

The randomization he speaks of is very similar to what the authors of the Random Forest use in their variable importance measure.
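That randomization idea, permutation importance, can be sketched in plain Python. This is a toy example, not the workflow from the talk: the dataset, the decision rule, and the stand-in "model" are made up for illustration. The importance of a feature is measured as the drop in accuracy when that feature's column is shuffled:

```python
import random

random.seed(0)

# Toy dataset: y depends strongly on x0, weakly on x1, and not at all on x2.
n = 500
X = [[random.random(), random.random(), random.random()] for _ in range(n)]
y = [1 if x[0] + 0.3 * x[1] > 0.65 else 0 for x in X]

def predict(x):
    # Stand-in for a trained model; here it simply mirrors the true rule.
    return 1 if x[0] + 0.3 * x[1] > 0.65 else 0

def accuracy(X, y):
    return sum(predict(x) == t for x, t in zip(X, y)) / len(y)

def permutation_importance(X, y, feature):
    """Drop in accuracy after shuffling one feature's column."""
    baseline = accuracy(X, y)
    col = [x[feature] for x in X]
    random.shuffle(col)
    X_perm = [x[:feature] + [v] + x[feature + 1:] for x, v in zip(X, col)]
    return baseline - accuracy(X_perm, y)

for f in range(3):
    print(f"feature {f}: importance = {permutation_importance(X, y, f):.3f}")
```

Shuffling the irrelevant feature leaves accuracy unchanged (importance near zero), while shuffling the dominant feature causes a large drop.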

Cheers,

nemad

Great Marco!
Please share this workflow on the KNIME Hub! We need it there!
Cheers
Paolo

I saw that the link is gone; these are the slides that nemad is talking about:

Hoping there is an update… Is there a node that displays Mean Decrease Gini… to help understand variable importance?
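For reference, Mean Decrease Gini accumulates the Gini impurity reduction of every split made on a feature across all trees, then averages over the forest. A toy computation of the impurity decrease for one split (the class counts are made up for illustration):

```python
def gini(counts):
    """Gini impurity of a node given its per-class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Parent node with a 50/50 class mix, split into (40, 10) and (10, 40).
parent = [50, 50]
left, right = [40, 10], [10, 40]
n = sum(parent)

# Weighted impurity decrease of this split; Mean Decrease Gini averages
# these values over all splits on the feature across the whole forest.
decrease = (gini(parent)
            - (sum(left) / n) * gini(left)
            - (sum(right) / n) * gini(right))
print(round(decrease, 3))
```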


Hi,
A question: if some of the input variables are heavily correlated, won't the calculated importance values be misleading? As an example, imagine two variables that are fully correlated (correlation of 1); then the importance of each of those variables will be halved. If so, wouldn't it make sense to correct for this by taking correlation into account?
Thanks
Paco

You should filter correlated variables before creating the model.
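A minimal sketch of such a filter in plain Python (the data and the 0.9 threshold are illustrative; in KNIME the Linear Correlation and Correlation Filter nodes serve the same purpose). It greedily keeps a feature only if it is not strongly correlated with any feature already kept:

```python
import random

random.seed(1)

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

cols = {"a": [random.random() for _ in range(100)]}
cols["b"] = [2 * v + 0.01 for v in cols["a"]]   # perfectly correlated with "a"
cols["c"] = [random.random() for _ in range(100)]

def filter_correlated(cols, threshold=0.9):
    # Keep a column only if |r| with every already-kept column is below threshold.
    kept = []
    for name in cols:
        if all(abs(pearson(cols[name], cols[k])) < threshold for k in kept):
            kept.append(name)
    return kept

print(filter_correlated(cols))
```

Here "b" is a linear transform of "a", so it is dropped while the independent column "c" survives.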

Unfortunately, I do not want to lose information.

If the correlation is 1 you are losing 0 information. Anyway, correlated features mess with decision trees as well as with other algorithms, so removing them before building a model is more or less good practice, if not a requirement.

Some comments:
1. The objective is understanding the variable importance ranking independently of the correlation.
2. It is possible to have quite correlated inputs with different effects on the output, so it is better not to remove them.
3. Trees and Random Forests are algorithms that are very insensitive to correlated inputs compared with other methods.

Correlated features inherently mess with the importance, as you pointed out. Either you remove them or you find a way to correct for that (not sure if that is even possible due to the inherent randomness of RF).

3. Trees and Random Forests are algorithms that are very insensitive to correlated inputs compared with other methods.

Compared to other methods, yes. But it depends on how many correlated features there are and how strongly they are correlated. If you have too many, the splits will just happen too often on essentially the same "data", adding no value and reducing the effect of other, possibly more important, features. It depends on your goal too: for prediction it is better to remove as many as possible, for the sake of speed and model simplicity (= less overfitting).

If you are into interpretability, look into the SHAP nodes or SHAP in general. It is definitely a better approach than feature importance, IMHO.
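For intuition, SHAP is based on Shapley values: a feature's attribution is its average marginal contribution over all subsets of the other features. An exact toy computation for a hypothetical 3-feature additive model (real SHAP implementations use fast approximations, e.g. for tree ensembles, rather than this exponential enumeration):

```python
from itertools import combinations
from math import factorial

features = [0, 1, 2]
x = [2.0, 1.0, 0.0]          # the instance being explained

def model(subset):
    # Hypothetical additive model f = 3*x0 + 1*x1 + 0*x2, evaluated with
    # only the features in `subset` "present" (absent features contribute 0).
    w = [3.0, 1.0, 0.0]
    return sum(w[i] * x[i] for i in subset)

def shapley(i):
    """Exact Shapley value of feature i for this model and instance."""
    n = len(features)
    others = [f for f in features if f != i]
    total = 0.0
    for k in range(len(others) + 1):
        for S in combinations(others, k):
            weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            total += weight * (model(S + (i,)) - model(S))
    return total

for i in features:
    print(i, shapley(i))
```

For an additive model the Shapley value of each feature reduces to its own term (here 6.0, 1.0, and 0.0), and the attributions always sum to the full model output.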

Hi!

Any update on this?

Best regards.

Sometimes it helps to talk about the specifics… for example, if I include variables that are derivative measurements of another variable, those variables seem to exaggerate the importance, since they are just derivative measurements. You have to be careful that the variables aren't so correlated that they are both really influenced by the same thing.

Unfortunately, I have not seen anything new that helps :frowning: