Feature Selection Filter misbehaving

Hi there,

I encountered a problem with the Feature Selection Filter: it selects different columns than the Feature Selection Loop End would suggest…
In my workflow, I iterate over a dataset with a feature selection loop (using the genetic algorithm) to find the best adjusted R² for a linear regression. (If someone is looking for a way to get the adjusted R² instead of the raw R² out of a regression, you can use this part of the workflow as a template.) In the end, I want to apply the best feature set to a linear regression node again, yet the selected columns do not match the corresponding features.
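For readers who only need the number rather than the workflow, the adjusted R² mentioned above can be derived from the raw R² with the standard textbook formula. This is a minimal sketch, not the logic of the attached workflow; the function name `adjusted_r2` and its parameter names are made up for illustration.

```python
def adjusted_r2(r2: float, n_rows: int, n_features: int) -> float:
    """Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - p - 1).

    n_rows is the number of observations (n), n_features the number of
    predictors in the regression (p), not counting the intercept.
    """
    degrees_of_freedom = n_rows - n_features - 1
    if degrees_of_freedom <= 0:
        raise ValueError("need more rows than features + 1")
    return 1.0 - (1.0 - r2) * (n_rows - 1) / degrees_of_freedom


# Illustrative numbers only: a raw R² of 0.5 with 11 rows and 2 features.
example = adjusted_r2(0.5, n_rows=11, n_features=2)
```

Because the penalty grows with the number of features, a larger feature set only wins if it improves the raw R² enough to offset it, which is why it is a reasonable score for a feature selection loop.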
If you look at the list of features, hard and bar should be the first columns. But if you look at the filtered table, peanutyalmondy, crispedricewafer and other columns not in the feature list appear. Maybe I got a setting in the filter node wrong, but I really don’t know where… Any hint will be much appreciated!

All the best,
Alec
Candy_Data_v4.knwf (414.5 KB)

By the way, the dataset comes from here: https://github.com/fivethirtyeight/data/tree/master/candy-power-ranking

Hi @Alec,

I looked into your workflow. The issue is that you have selected “Select features manually” in the node dialog of the Feature Selection Filter. With this setting, the node always keeps exactly the features that are marked in the node dialog, regardless of what the loop found. You can use the other option to set a score threshold instead, so that the smallest feature set that achieves this threshold is selected automatically.
If you always want the feature set with the best R² score, there are two ways to achieve this. I have added both to your workflow and marked them with green annotations: Candy_Data_v4_solution.knwf (544.0 KB)

One is called “Solution 1” and does not use the Feature Selection Filter at all. This one needs several nodes and you need to manually select the static columns.
The other one is called “Solution 2” and uses flow variables to set the proper threshold in the Feature Selection Filter. It’s the more elegant way. If you’re not familiar with flow variables, take a look at https://www.knime.com/knime-introductory-course/chapter7/section1/creation-and-usage-of-flow-variables
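The idea behind Solution 2 can be sketched outside of KNIME as follows: take the best score from the loop results, use it as the threshold, and pick the smallest feature set that reaches it. This is only an illustration of the selection logic, not the node's actual implementation. It assumes a score where higher is better (such as adjusted R²); the `FeatureSet` type, the function names, and the scores are made up, and only the column names come from the workflow.

```python
from typing import NamedTuple, Sequence, Tuple


class FeatureSet(NamedTuple):
    features: Tuple[str, ...]
    score: float  # assumed: higher is better, e.g. adjusted R²


def best_score(results: Sequence[FeatureSet]) -> float:
    """Plays the role of the flow variable that carries the threshold."""
    return max(r.score for r in results)


def select_by_threshold(results: Sequence[FeatureSet], threshold: float) -> FeatureSet:
    """Return the smallest feature set whose score reaches the threshold."""
    candidates = [r for r in results if r.score >= threshold]
    if not candidates:
        raise ValueError("no feature set reaches the threshold")
    return min(candidates, key=lambda r: len(r.features))


# Hypothetical loop results; the scores are invented for illustration.
results = [
    FeatureSet(("hard", "bar"), 0.41),
    FeatureSet(("hard", "bar", "peanutyalmondy"), 0.47),
    FeatureSet(("hard", "bar", "peanutyalmondy", "crispedricewafer"), 0.45),
]

threshold = best_score(results)
chosen = select_by_threshold(results, threshold)
print(chosen.features)
```

Setting the threshold to the best score means the selection follows whatever the loop found on each run, whereas a manual selection stays fixed to the columns ticked in the dialog.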

I hope this helps you.

Cheers,
Simon
