Feature Selection for Random Forest Classification

Haroon_954 · December 25, 2021, 4:54am

Hello Everyone,

I have used random forest classification to predict accident severity (the class variable) on the road traffic accident data I have available. Now I want to find out which features are important for predicting accident severity.

I want to know how I should find the features that are important (scoring of variables)?

I am attaching the data set and KNIME workflow for reference.

Any help will be highly appreciated,

Thanks in Advance

M-2 Data KNIME 1.xlsx (132.6 KB)

RF Workflow.knwf (17.0 KB)

How to get the variable importance from the Random Forest algorithm?

Iris · December 25, 2021, 12:56pm

Hi

Did you check out this topic? There is a nice example workflow for your problem attached: How to get the variable importance from the Random Forest model?

Haroon_954 · December 25, 2021, 3:11pm

Thanks Iris, I studied it but didn’t get it. Need some more explanation or workflow.

Waiting for your kind response

Thank You Iris

mlauber71 · December 25, 2021, 10:11pm

@Haroon_954 this article about automated machine learning also has a passage about variable importance. Currently, if you want the power of feature importance with H2O AutoML, you would have to use R or Python.

Also most H2O.ai model nodes have a generic feature importance output (like GBM)

If you want more advanced explanations you could look at examples like this, using a special global feature importance component.

If you go into Python, you could use the powerful XGBoost with feature importance (the generic KNIME integration does not have that).
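
As a rough illustration of what reading feature importance in Python can look like, here is a minimal sketch. It uses scikit-learn's random forest as a stand-in rather than XGBoost or H2O, and synthetic data from `make_classification` instead of the attached accident data; the feature names are placeholders. It reports both the impurity-based importance stored on the fitted forest and a permutation importance measured on held-out rows.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the accident data (assumption): 20 features, 5 informative.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholder names

# 70% / 30% split, as in the thread.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# Impurity-based importance: computed from the training data, sums to 1.
impurity_importance = rf.feature_importances_

# Permutation importance: drop in test accuracy when one column is shuffled.
perm = permutation_importance(rf, X_test, y_test, n_repeats=5, random_state=0)

# Rank features by impurity-based importance, highest first.
ranking = sorted(
    zip(feature_names, impurity_importance, perm.importances_mean),
    key=lambda row: row[1],
    reverse=True,
)
for name, imp, pimp in ranking[:10]:
    print(f"{name:12s} impurity={imp:.3f} permutation={pimp:.3f}")
```

The two scores do not always agree. Impurity-based importance tends to favour features with many distinct values, so it is worth checking the permutation column before dropping a variable.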

Haroon_954 · December 29, 2021, 6:50am

Thank you so much, mlauber71, for your kind and detailed reply to my query. I will definitely look into the solution you mentioned; hopefully it will solve my problem.

aworker · December 29, 2021, 10:25am

Dear @Haroon_954

Complementary to the information provided by @Iris and @mlauber71, I have added to your workflow a simple way of doing variable selection based on a Decision Tree. I guess what you also need is to understand how a DT in particular, and an RF in general, do variable selection. For this, I believe it is good to start with a Decision Tree, since RFs are made of Decision Trees.

In the workflow I have:

Taken your data and trained a DT using your data splitting (70% / 30%)
Extracted the variables from the DT rules using the -Decision Tree to Ruleset- node
Counted how many times each variable was used by the DT. Usually people determine DT variable importance from the branch level at which a variable is used: the higher the level, the more important the variable. It turns out that there is a strong correlation between a variable's branch level and its number of occurrences in the DT rule set, so you can determine variable importance from occurrence in the DT rule set too
Filtered out variables with fewer than 15 occurrences. This threshold is set arbitrarily here but could be estimated too. I’m not adding threshold estimation here, so that you have a first simple solution that is easy to understand to begin with.
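
For readers who prefer code, the steps above can be sketched in Python with scikit-learn. This is an approximation of the idea, not the KNIME workflow itself: the data is synthetic, the feature names are placeholders, and `count_rule_occurrences` is a hypothetical helper that treats every root-to-leaf path as one rule and counts in how many rules each variable appears.

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the accident data (assumption): 20 candidate variables.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, n_redundant=2, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholder names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
dt = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)


def count_rule_occurrences(tree, names):
    """Count, per variable, the number of root-to-leaf rules that test it."""
    counts = Counter()
    stack = [(0, [])]  # (node id, variables tested on the path so far)
    while stack:
        node, path = stack.pop()
        left, right = tree.children_left[node], tree.children_right[node]
        if left == -1:  # scikit-learn marks leaves with child id -1
            counts.update(set(path))  # one rule ends here
            continue
        here = path + [names[tree.feature[node]]]
        stack.append((left, here))
        stack.append((right, here))
    return counts


counts = count_rule_occurrences(dt.tree_, feature_names)

MIN_OCCURRENCES = 15  # arbitrary threshold, as in the post
selected = [name for name, n in counts.most_common() if n >= MIN_OCCURRENCES]

print(counts.most_common())
print("selected:", selected)
```

Variables tested near the root appear in many rules, which is why the occurrence count tracks branch level. The variable at the root is part of every rule.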

Using this variable selection, you end up with the 11 most important variables in terms of rule set occurrence:

In a second workflow, I filter in only the selected variables and use them to train your RF classifier.

From the Scorer results, you can see that using only these 11 variables, you get the same performance statistics as when you use all 20.
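
The same before/after check can be sketched in Python. Again this is only an illustration on synthetic data: the 11 columns are picked by how often a scikit-learn decision tree splits on them (a simpler proxy than the rule set count), and `rf_accuracy` is a hypothetical helper. Whether accuracy stays the same on real data has to be checked case by case.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the accident data (assumption): 20 candidate variables.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, n_redundant=2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Count how often the decision tree splits on each column (leaves are negative).
dt = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
used = dt.tree_.feature[dt.tree_.feature >= 0]
split_counts = np.bincount(used, minlength=X.shape[1])

# Keep the 11 most-used columns, mirroring the number kept in the post.
selected = np.argsort(split_counts)[::-1][:11]


def rf_accuracy(columns):
    """Train an RF on the given columns and return its test accuracy."""
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(X_train[:, columns], y_train)
    return accuracy_score(y_test, rf.predict(X_test[:, columns]))


acc_all = rf_accuracy(np.arange(X.shape[1]))
acc_selected = rf_accuracy(selected)
print(f"all 20 variables:      {acc_all:.3f}")
print(f"11 selected variables: {acc_selected:.3f}")
```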

The statistics obtained by the RF remain similar before:

and after variable selection:

This approach can be extended to do variable selection using a Random Forest instead of a DT, but my aim here is to provide you with an example of how to easily achieve variable selection based on a DT, how to extract this information from the DT and how to reuse it in an RF.

The whole workflow is here below:

20211229 Pikairos Feature Selection for Random Forest Classification.knwf (877.1 KB)

Hope this minimalist example helps you to understand Variable Selection by DT.

Best

Ael

Haroon_954 · December 31, 2021, 6:49am

Dear aworker, I have no words to thank you for such a detailed reply. This is exactly what I need to perform on my data.

Thanks again, aworker. Have a beautiful year ahead, wishing you more and more success.

aworker · December 31, 2021, 8:36am

Dear @Haroon_954

Thanks a lot for your kind words and glad to help anytime

Plenty of success too and best wishes for 2022 !

Ael

devrajr · June 4, 2022, 10:07am

Hey @aworker, thanks for this detailed explanation of how to get the important features for the Decision Tree. It was really helpful. I am trying to use and compare different models on the same dataset; I have used the decision tree and got the features that are important for that model. I am also trying to use SVM and Naive Bayes. Could you let me know if there is a way to find the important features for Naive Bayes and SVM?

aworker · June 4, 2022, 1:49pm

Hi @devrajr

Thanks for your comments and compliments.
It is good you found this old thread with the solution I posted for RF variable selection and that you tried it.
However, you are posting here in parallel to your more recent post related to the same question:

I will hence reply to your question in your more recent post to avoid thread leakage in this one

My answers in the other one will follow soon

Best
Ael

Haroon_954 · July 24, 2022, 7:04am

Thanks @aworker for the explanation of my question

system · January 22, 2023, 7:04pm

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.