I’m a new KNIME user and I’m looking for a way to perform permutation feature importance for a linear regression model, similar to what is done for classification models in the Global Feature Importance node using the Permutation Feature Importance option.
The performance score I use is RMSE, and I would like to be able to run the permutation feature importance on a trained model but use the validation data. This capability was available in Azure Machine Learning Studio classic, but I can’t find anything similar for linear regression in KNIME.
But maybe I’m missing the obvious somewhere!
Thanks in advance for any suggestions.
Hi @dfbenton and welcome to KNIME Community Forum,
The concept of permuting features is straightforward, and you can implement it yourself. Shuffle the values of each feature in turn inside a loop, score the already trained model on the shuffled data, and compare the result with the unshuffled baseline. The larger the increase in error (RMSE in your case), the more important the feature is.
Here is an example:
41179.knwf (56.8 KB)
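For readers who want to see the same loop outside KNIME, here is a minimal Python sketch of the idea. It is not a transcription of the attached workflow. The synthetic data, the NumPy least-squares fit, and the helper names (`fit_linear`, `predict`, `rmse`, `permutation_importance`) are all assumptions made for illustration. The model is trained once on the training rows, and only the validation rows are shuffled.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with an intercept (stand-in for a regression learner)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    """Apply the fitted coefficients: intercept plus weighted sum of features."""
    return coef[0] + X @ coef[1:]

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def permutation_importance(coef, X_val, y_val, n_repeats=10, seed=0):
    """Mean increase in validation RMSE when each feature is shuffled in turn."""
    rng = np.random.default_rng(seed)
    baseline = rmse(y_val, predict(coef, X_val))
    importances = np.zeros(X_val.shape[1])
    for j in range(X_val.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X_val.copy()
            # Shuffle only column j; the model itself is not retrained.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            increases.append(rmse(y_val, predict(coef, X_perm)) - baseline)
        importances[j] = np.mean(increases)
    return baseline, importances

# Hypothetical data: feature 0 matters most, feature 1 a little, feature 2 not at all.
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(400, 3))
y = 5.0 * X[:, 0] + 1.0 * X[:, 1] + data_rng.normal(scale=0.5, size=400)

X_train, X_val = X[:300], X[300:]
y_train, y_val = y[:300], y[300:]

coef = fit_linear(X_train, y_train)
baseline, importances = permutation_importance(coef, X_val, y_val)
print("baseline validation RMSE:", baseline)
print("RMSE increase per feature:", importances)
```

Repeating the shuffle several times and averaging (`n_repeats`) is a design choice to reduce the noise of a single random permutation. A single shuffle per feature also works, but gives less stable rankings.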
Maybe a silly question, but why use feature importance when linear regression supplies the coefficients and statistical likelihood of features being impactful to the model?
Thank you, @armingrudd for the workflow! The nodes beyond the Regression Predictor are new to me, but I think I can follow it.
Referencing your example, is there a way to shuffle the 3 predictors before the Linear Regression Learner node? This would allow a new model to be built for each subset of 2 predictors. Then the performance metrics computed on the validation data can be compared.
Thanks again for sharing the workflow!
You can use Column Splitter and Shuffle to do so. Then you can use Column Appender to append the split columns.
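As a rough Python analogue of that split, shuffle, and append sequence, the sketch below shuffles one predictor in the training data, refits the model, and scores it on the untouched validation data. The data and the helper names (`fit_linear`, `rmse`, `retrain_with_shuffled_column`) are hypothetical and only meant to show the flow. A shuffled predictor carries no information about the target, so each refit behaves roughly like a model built on the other two predictors.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with an intercept (stand-in for a regression learner)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def rmse(coef, X, y):
    """RMSE of the fitted model on the given rows."""
    pred = coef[0] + X @ coef[1:]
    return float(np.sqrt(np.mean((y - pred) ** 2)))

def retrain_with_shuffled_column(X_train, y_train, X_val, y_val, col, rng):
    """Shuffle one training column, refit, and return the validation RMSE."""
    X_shuffled = X_train.copy()
    # Split off the column, shuffle it, and put it back in place.
    X_shuffled[:, col] = rng.permutation(X_train[:, col])
    coef = fit_linear(X_shuffled, y_train)
    # The validation data is left untouched.
    return rmse(coef, X_val, y_val)

# Hypothetical data with 3 predictors of decreasing relevance.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = 5.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.5, size=400)

X_train, X_val = X[:300], X[300:]
y_train, y_val = y[:300], y[300:]

full_model_rmse = rmse(fit_linear(X_train, y_train), X_val, y_val)
rmse_increase = [
    retrain_with_shuffled_column(X_train, y_train, X_val, y_val, col, rng) - full_model_rmse
    for col in range(X.shape[1])
]
print("validation RMSE, all predictors intact:", full_model_rmse)
print("RMSE increase after shuffling each predictor before training:", rmse_increase)
```

Note that this retrains the model for every predictor, so it measures something slightly different from shuffling the validation data against one fixed model.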
@victor_palacios, in predictive modeling (as opposed to explanatory modeling), I build a linear regression model using the training data. Then I measure the predictive power of that model by running the validation data (the hold-out set) through it and calculating the resulting RMSE (or other performance measure).
Since I’m more interested in being able to predict future values, I want to do permutation feature importance to see how the different variables contribute to lowering the RMSE from the validation data.
The p-values given in the Linear Regression Learner node are based on the training data (this is great - Azure ML Studio did not give these in the output). I’m looking for additional metrics that can give me insight into how the regression model will perform on the validation data, which the model has never seen.
Hope this makes sense.
That’s a great point. I forgot that like the Random Forest out-of-bag estimates, LR’s statistical properties come from the training data.