Variable Importance: KNIME vs. H2O Random Forest

Comparing the KNIME Random Forest Learner and the H2O Random Forest Learner, the statistics are very similar (R² 0.863 vs. 0.864, and RMSLE 0.063 vs. 0.096), so I would choose the KNIME RFR for its lower RMSLE.

But the VARIABLE IMPORTANCE stats for the two Random Forest Regressions are nearly opposite in their ratings (please see attached below).

For the KNIME RFR I used the expression I’ve seen in the forum for my calculations: ($#splits (level 0)$/$#candidates (level 0)$)+($#splits (level 1)$/$#candidates (level 1)$)+($#splits (level 2)$/$#candidates (level 2)$). H2O has a built-in function.
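For anyone who wants to check the arithmetic outside of KNIME, here is a minimal Python sketch of the same expression. It assumes each variable's attribute statistics are available as a dictionary keyed by the column names used in the expression above; the function name `knime_importance` and the example numbers are made up for illustration and are not taken from temp.xlsx.

```python
def knime_importance(row, levels=(0, 1, 2)):
    """Sum of splits/candidates ratios over the given tree levels."""
    total = 0.0
    for level in levels:
        splits = row[f"#splits (level {level})"]
        candidates = row[f"#candidates (level {level})"]
        # A variable never offered as a candidate at this level adds nothing.
        if candidates > 0:
            total += splits / candidates
    return total


# Hypothetical statistics for one variable (illustration only).
example_row = {
    "#splits (level 0)": 10, "#candidates (level 0)": 40,
    "#splits (level 1)": 15, "#candidates (level 1)": 60,
    "#splits (level 2)": 20, "#candidates (level 2)": 80,
}
score = knime_importance(example_row)
print(score)
```

Each level contributes its ratio with the same weight, so the raw score ranges from 0 to 3 rather than 0 to 100.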

Both regressions have static random seeds set and used the exact same dependent and independent variables.

Any ideas why the Variable Importance weights are nearly reversed, or at least which result you would personally trust more?

Attached is the data set. Ignore the variable “PSFAboveGrade” and do not include it. “ClosePrice” is the dependent variable and columns C through G are the independent variables.

temp.xlsx (81.9 KB)

thanks

Hi @smithcreed -

You can read a bit about how H2O calculates variable importance at this link. In particular:

Variable importance is determined by calculating the relative influence of each variable: whether that variable was selected to split on during the tree building process, and how much the squared error (over all trees) improved (decreased) as a result.

This goes further than the manual calculation you mention above. (That manual calculation also equally weights splits at the first, second, and third levels, which is questionable.) IMO the H2O method is a bit more robust, but that’s only my opinion - I would be interested to hear what others have to say.
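To make the "squared error improved" part concrete, here is a rough Python sketch of the quantity a single split contributes under that kind of scheme. It is only an illustration of the idea, not H2O's actual implementation; the helper names `sse` and `split_gain` and the toy data are hypothetical.

```python
def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def split_gain(x, y, threshold):
    """Reduction in squared error from splitting y on x <= threshold."""
    left = [yi for xi, yi in zip(x, y) if xi <= threshold]
    right = [yi for xi, yi in zip(x, y) if xi > threshold]
    return sse(y) - (sse(left) + sse(right))


# Toy data (illustration only): the target jumps between x = 2 and x = 3.
x = [1, 2, 3, 4]
y = [10.0, 12.0, 30.0, 32.0]

good_gain = split_gain(x, y, 2)  # separates the low and high targets
poor_gain = split_gain(x, y, 1)  # isolates only the first row
print(good_gain, poor_gain)
```

A count-based score treats both of these splits as one split each, while an error-based score credits the first far more than the second. That difference alone can change how the variables rank.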


Thanks for the input. I will read the information at your link, but it sounds like you would certainly place more trust in the H2O Variable Importance, so that’s the one I will use in my work here.

Hi Scott, and anyone else reading this topic. I think I solved the question.

  1. I converted a string IV to a numeric, integer-coded variable, just to be sure this was not a problem.
  2. I reran the models, then used a simple Excel worksheet to convert the KNIME formula-based variable importance to a 100% scale so I could compare the KNIME RFR and H2O RFR Variable Importance more evenly.
  3. If you look at the Excel output you’ll see the KNIME RFR and H2O RFR Variable Importance stats are much more in line. There is a lot of similarity between the two.
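The rescaling in step 2 can also be done outside of Excel. The sketch below assumes "100% scale" means each variable's share of the total raw score, so that the shares sum to 100; the function name `to_percent` and the raw scores are hypothetical, not the values from my worksheet.

```python
def to_percent(scores):
    """Rescale raw importance scores so that they sum to 100."""
    total = sum(scores.values())
    if total == 0:
        raise ValueError("all importance scores are zero")
    return {name: 100.0 * value / total for name, value in scores.items()}


# Hypothetical raw scores from the splits/candidates formula.
raw = {"var_a": 0.75, "var_b": 0.50, "var_c": 0.25}
pct = to_percent(raw)

# Print variables from most to least important.
for name, share in sorted(pct.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {share:.1f}%")
```

Rescaling does not change the ranking of the variables. It only puts both sets of scores on the same scale so they can be read side by side.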

Thanks again, and I hope this helps someone else out.

