Stepwise regression

I am trying to prepare an example for teaching with KNIME in which I compare various feature selection methods in regression. The first model I built used the GLM node from H2O and that works fine.

I next tried backward selection with the feature selection loop. The problem I am having is that every performance measure the Numeric Scorer offers for optimization improves as more variables are entered. It therefore seems that the best model will always be the one with all of the predictors.

I might try to compute AIC or BIC from the regression results, but the calculation requires n (the number of cases), k (the number of predictors) and the total sum of squared residuals. I don’t see how to easily extract these values for each iteration in the loop. (A similar problem exists with forward selection.)
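For reference, here is a minimal sketch of the calculation I have in mind, written in plain Python. It assumes the usual least-squares (Gaussian) forms, AIC = n·ln(RSS/n) + 2k and BIC = n·ln(RSS/n) + k·ln(n), with additive constants dropped. The function names and the toy numbers are my own for illustration and do not come from the workflow, and whether k should count the intercept is a convention that only has to be applied the same way to every candidate model.

```python
import math


def residual_sum_of_squares(actual, predicted):
    """Sum of squared differences between observed and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


def aic_bic(n, k, rss):
    """Return (AIC, BIC) for a least-squares fit, up to an additive constant.

    n   : number of cases
    k   : number of estimated parameters (assumption: the same counting
          convention is used for every model being compared)
    rss : residual sum of squares
    """
    log_term = n * math.log(rss / n)
    aic = log_term + 2 * k
    bic = log_term + k * math.log(n)
    return aic, bic


# Toy values for illustration only (not taken from insurance.xlsx).
actual = [10.0, 12.0, 15.0, 19.0, 25.0]
predicted = [11.0, 12.5, 14.0, 19.5, 24.0]

rss = residual_sum_of_squares(actual, predicted)
aic, bic = aic_bic(n=len(actual), k=2, rss=rss)
print(f"RSS={rss:.3f}  AIC={aic:.3f}  BIC={bic:.3f}")
```

Lower values are better for both criteria, so unlike the scorer measures they can get worse when a predictor adds little. The open question remains how to get n, k and the residuals out of each loop iteration.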

Am I missing something or is there another way to proceed?

It would probably be best if you also shared your workflow here (if possible) so that others can help.
best regards

BackwardRegressionOnExpenses.knwf (42.4 KB)
Here is an example of what I am trying to do. I had to convert the data file from .csv to Excel format.

insurance.xlsx (63.4 KB)

