I am trying to prepare an example for teaching with KNIME in which I compare various feature selection methods in regression. The first model I built used the GLM node from H2O and that works fine.
I next tried backward selection using the feature selection loop. The problem I am having is that the performance measures available to optimize with the Numeric Scorer all improve as more variables are entered. Therefore, it seems that the best model will always be the one with all of the predictors.
I might try to compute AIC or BIC from the regression results, but the calculation requires n (the number of cases), k (the number of predictors), and the total sum of squared residuals, and I don't see an easy way to extract these values on each iteration of the loop. (A similar problem exists with forward selection.)
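For reference, if n, k, and the sum of squared residuals could be extracted (e.g. into flow variables fed to a Python or Math Formula node), the Gaussian-likelihood versions of AIC and BIC are simple to compute. This is just a sketch of the formulas, not a KNIME workflow; the function name and example numbers are my own:

```python
import math

def aic_bic(n, k, ssr):
    """AIC and BIC for a least-squares regression (Gaussian likelihood),
    up to an additive constant that is identical for every candidate model.
    n: number of cases, k: number of estimated parameters
    (predictors plus intercept), ssr: sum of squared residuals."""
    aic = n * math.log(ssr / n) + 2 * k
    bic = n * math.log(ssr / n) + k * math.log(n)
    return aic, bic

# With the same SSR, the model with more predictors scores worse,
# which is exactly the penalty missing from the R^2-style measures.
print(aic_bic(100, 3, 250.0))
print(aic_bic(100, 8, 250.0))
```

Because only differences between models matter, the dropped constants are harmless when the criterion is minimized inside the loop.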
Am I missing something or is there another way to proceed?