Feature Importance: H2O RFR, Feature Selection, Permutation, XGBoost, SHAP

I’m trying to identify the qualitative features with the highest impact on home values, based on feature analyses of past sales. I will do this sort of analysis separately with quantitative features (square feet, etc.) and then once more with location features (named geocoding plus additional location features), in the end pulling all of this together for the final model.

Please take a look at what I’ve done with each of five models and tell me where I went wrong. My biggest questions are in red near the end of each model.

Because the current models have only a portion of the final model data (qualitative), the R2 is low for the time being.

Oh, and while each feature has been assigned a unique token identifier, I converted each to either a “1” if present for a house or a “0” if not. I was trying to avoid skewing the feature importance with an arbitrary numbering system. Tell me if that was a stupid idea. Each individual house is identified with a “ListingId” in the data.
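For what it’s worth, that presence/absence conversion is a standard one-hot-style encoding. A minimal plain-Python sketch of the idea, using hypothetical listing IDs and feature tokens (not the real data):

```python
# Minimal sketch of the 0/1 presence encoding described above.
# Listing IDs and feature tokens here are hypothetical, not from the data.
listings = {
    "L001": {"granite_counters", "pool"},
    "L002": {"pool"},
    "L003": {"fireplace", "granite_counters"},
}
vocabulary = sorted(set().union(*listings.values()))

def encode(tokens_present, vocabulary):
    """Return a 0/1 vector: 1 if the token is present for the house."""
    return [1 if tok in tokens_present else 0 for tok in vocabulary]

matrix = {lid: encode(toks, vocabulary) for lid, toks in listings.items()}
# vocabulary -> ['fireplace', 'granite_counters', 'pool']
# matrix['L001'] -> [0, 1, 1]
```

Using 0/1 indicators instead of the raw token numbers is sound: it prevents the model from reading an ordinal relationship into identifiers that are really just labels.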

I used and attached here the five feature importance models after spending days reading on the topic and staring at KNIME WFs online.

As I’d read would happen, the methods often came up with substantially different answers on feature importance from the exact same data.

The final research I read stated that SHAP was the magic bullet for the faults of all the other methods. I tried it, but I cannot figure out what to do with the results, which seem to fly in all directions.

Thank you. I think this discussion may help many of us out here doing real-life feature importance analysis in the marketplace.

I’m attaching WF and Data.

The working data has been split into three groups (you have only one) based on a Clustering Algorithm prior to any of these analyses.

I’m using KNIME 5.4.0.

Sold Prices by ListingID.xlsx (60.5 KB)

temp for shap testing.xlsx (1016.9 KB)

Feature Importance.knwf (194.8 KB)

The SHAP results from the attached workflow and data seem erratic, and I cannot discern any pattern indicating which features have the highest or lowest impact on the dependent variable.

Would someone either: 1) explain how I can use the results to rank the independent variables’ impact on the dependent variable, or 2) explain what I did wrong and how to correct the workflow?
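On question 1: per-row SHAP values do fan out in both directions, because each value says how much a feature pushed that one prediction up or down. A common convention for turning them into a single global ranking is the mean absolute SHAP value per feature. A minimal sketch with made-up numbers (in the workflow, the real values would come from the SHAP loop output, one column per feature):

```python
# Hypothetical per-row SHAP values (one list per feature); these numbers
# are illustrative, not from the attached data.
shap_values = {
    "featA": [0.5, -0.7, 0.2],
    "featB": [0.1, 0.05, -0.02],
    "featC": [-1.2, 0.9, 1.1],
}

def rank_by_mean_abs_shap(values):
    """Global importance = mean of |SHAP| across rows, sorted descending.
    Signs only matter per prediction; for a ranking we use magnitudes."""
    scores = {
        feat: sum(abs(v) for v in col) / len(col)
        for feat, col in values.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranking = rank_by_mean_abs_shap(shap_values)
# ranking -> [('featC', ...), ('featA', ...), ('featB', ...)]
```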

Thank you


I completed an H2O Random Forest Regression; the table on the left shows its internal ranked importance measures. I also attempted a permutation and target shuffling workflow, with the results in the right table. (I found the permutation/target shuffling workflow in the KNIME forum and attempted a customization for my specific purposes: Permutation Feature Importance for Linear Regression.)

I believe I must have done something wrong with the permutation workflow, as the results appear to be nearly the exact opposite of the RFR results (tables above).
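For context, permutation importance works by shuffling one feature column at a time and measuring how much the model’s error rises; a feature the model never uses shows roughly zero rise. A toy plain-Python sketch of the mechanism (the “model” here is a hypothetical stand-in, not the H2O forest from the workflow):

```python
import random

# Toy sketch of permutation importance. The stand-in "model" uses only
# feature 0, so shuffling feature 1 should yield ~zero importance.
random.seed(0)
X = [[float(i), random.random()] for i in range(50)]
y = [row[0] for row in X]  # target depends only on feature 0

def predict(row):
    return row[0]  # pretend-trained model

def mse(rows, targets):
    return sum((predict(r) - t) ** 2 for r, t in zip(rows, targets)) / len(targets)

baseline = mse(X, y)
importances = []
for j in range(2):
    col = [row[j] for row in X]
    random.shuffle(col)  # break the link between feature j and the target
    X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
    importances.append(mse(X_perm, y) - baseline)  # bigger rise = more important
```

One thing worth checking in the workflow: if the shuffled score is reported directly (rather than the *drop* relative to the baseline), or if a “lower is better” metric is being sorted as “higher is better”, the ranking would come out inverted, which would match the near-opposite results you describe.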

Would someone please look at my permutation/target shuffling workflow and tell me where I may have screwed up?

Thank you



Hi @creedsmith, these all seem to be variations on the same question, so I have combined them into a single topic to keep the forum tidy. The holiday break is a bit of a slow time, but hopefully someone will have some insight into your question soon.


Hi @ScottF any updates to my questions? Thanks

@creedsmith I took a short look, and here are a few things I want to mention initially:

  • you might want to increase the number of models to be calculated to 1,000
  • I tend to use RMSE as a metric, where the lowest number is best
  • The models without the SHAP values seem to basically give back the same features

You could try this collection of methods to see if they can improve on your model. The target variable is assumed to be called “Target”.

You can also check out this about feature importance in general

Thank you @mlauber71 I will take a look at all