Forward Feature Selection: How Does it Work?

How does the forward feature selection process work in KNIME?
For example, if I have 10 features and 1 variable that I need to predict, does the forward feature selection node take 1 feature at a time, check the accuracy of the prediction, then add the next feature to the list and see how the accuracy changes? Would that mean there are 10! (factorial) combinations of features that KNIME checks before it finds the best combination for predicting that variable?

Not exactly: KNIME does not try every possible combination. When it has selected the first feature, it leaves that feature in place for the rest of the process, then it selects the second feature, and so on. So instead of 10!, the number of models KNIME tries is 10 + 9 + 8 + … + 1 = 10*(10+1)/2 = 55. But I think forward feature selection is mostly used when one wants a model with relatively few features, so the process is stopped at, for example, 4 features. Then KNIME will try 10 + 9 + 8 + 7 = 34 models.
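To make the greedy mechanics concrete, here is a minimal R sketch of forward selection outside KNIME. It assumes a hypothetical data frame `df` with numeric predictors and a numeric target column `y`, and scores candidates with the adjusted R² of a linear model; the column name, the scoring metric, and the stop at 4 features are illustrative choices, not what the KNIME node does internally.

```r
# Greedy forward selection: at each step keep the single feature that most
# improves adjusted R^2, never revisiting earlier picks. With p features and
# a cap of k steps this fits at most p + (p-1) + ... models, not every subset.
forward_select <- function(df, target = "y", max_features = 4) {
  candidates <- setdiff(names(df), target)
  selected   <- character(0)
  best_score <- -Inf

  for (step in seq_len(max_features)) {
    step_best <- NULL
    for (feat in candidates) {
      form  <- reformulate(c(selected, feat), response = target)
      score <- summary(lm(form, data = df))$adj.r.squared
      if (score > best_score) {
        best_score <- score
        step_best  <- feat
      }
    }
    if (is.null(step_best)) break          # no candidate improved the fit
    selected   <- c(selected, step_best)   # keep it for all later steps
    candidates <- setdiff(candidates, step_best)
  }
  selected
}

# forward_select(df)  # returns up to 4 column names, in the order they were chosen
```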

If you want to do an exhaustive search of all possible combinations, you can use an R snippet and the regsubsets function from the “leaps” library in R.
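For reference, a minimal sketch of that exhaustive approach, assuming the `leaps` package is installed and the same placeholder data frame `df` with target column `y` as above. `regsubsets()` evaluates every subset up to `nvmax` features and reports the best model of each size.

```r
library(leaps)

# Exhaustive best-subset search over models with 1 to 10 features
fit  <- regsubsets(y ~ ., data = df, nvmax = 10, method = "exhaustive")
summ <- summary(fit)

summ$which                          # which features enter the best model of each size
best_size <- which.max(summ$adjr2)  # e.g. pick the size with the highest adjusted R^2
coef(fit, best_size)                # coefficients of that model
```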


Aswin explained the process, and now I’m going to tell you why you shouldn’t use it, especially not the suggested exhaustive search.

Exhaustive search is nothing more than a fancy word for p-hacking. You simply try every possible combination (2^p - 1 subsets for p features, so already over a thousand models for 10 features and millions beyond 20) until you find one that works well, with no guarantee that this isn’t simply pure chance. So backward elimination or genetic algorithms for selection are, in my opinion, simply the wrong methodology, while also being extremely compute-intensive.

Forward selection is a little less bad. At each step it simply keeps the best new feature; it’s pretty similar to how decision trees split on the best differentiating feature, which is essentially the same thing, right? Going further with that, though, it’s much simpler and faster to just use random forest feature importance to select features. You can then either keep the best N features or define an importance cut-off (I prefer a relative importance cut-off, e.g. importance compared to the most important feature).
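As an illustration of that relative cut-off, here is a sketch in R using the `randomForest` package (in KNIME you would use the corresponding nodes instead). The data frame `df`, the target column `y`, and the 10% threshold are assumptions made for the example.

```r
library(randomForest)

# Fit a random forest and compute permutation-based feature importance
rf  <- randomForest(y ~ ., data = df, importance = TRUE, ntree = 500)
imp <- importance(rf, type = 1)[, 1]   # mean decrease in accuracy (%IncMSE for regression)

# Relative cut-off: keep features scoring at least 10% of the top feature's importance
keep       <- names(imp)[imp >= 0.10 * max(imp)]
df_reduced <- df[, c(keep, "y")]
```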

However, before you apply a random forest you need to remove correlated features (Correlation Filter) and, if you wish, also apply the Low Variance Filter (low variance essentially means a feature will get a low importance anyway, so it is not critical to remove such features beforehand; correlated features, on the other hand, can negatively impact decision trees, including RF).
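For completeness, the same pre-filtering steps sketched in R with the `caret` package (the KNIME workflow would use the Correlation Filter and Low Variance Filter nodes). The 0.9 correlation cut-off is an arbitrary choice for the example, and the predictors are assumed to be numeric.

```r
library(caret)

x <- df[, setdiff(names(df), "y")]          # predictors only, assumed numeric

# Correlation filter: drop one feature out of each highly correlated pair
drop_corr <- findCorrelation(cor(x), cutoff = 0.9)

# Low variance filter: drop (near-)constant columns
drop_var  <- nearZeroVar(x)

drop_all   <- union(drop_corr, drop_var)
x_filtered <- if (length(drop_all) > 0) x[, -drop_all] else x
```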

What’s the advantage here? Correlated features need to be removed anyway, and the RF part is very fast, simple, and well understood. Removing features this way even helps if you use a tree-based model afterwards, especially XGBoost, because once most features are removed, anything downstream will run faster.

